
As seen in a recent XKCD strip, a recent blog post from Peter Norvig (and a Slashdot story featuring the latter), "regex golf" (which might better be called the regular expression separation problem) is the puzzle of finding the shortest possible regular expression that accepts every word in a set $A$ and no word in a set $B$. Norvig's post includes an algorithm for generating a reasonably short candidate, and he notes that his approach involves solving an NP-complete Set Cover problem. He is careful to point out, however, that his approach does not consider every possible regular expression, and that his is not necessarily the only algorithm, so his solutions are not guaranteed to be optimal; it is also conceivable that some polynomial-time algorithm could find equivalent or better solutions.

For concreteness' sake, and to avoid having to solve the optimization question, I think the most natural formulation of Regular Expression Separation would be:

Given two (finite) sets $A$ and $B$ of strings over some alphabet $\Sigma$, is there a regular expression of length $\leq k$ that accepts every string in $A$ and rejects every string in $B$?

Is anything known about the complexity of this particular separation problem? (Note that since I've specified $A$ and $B$ as finite sets of strings, the natural notion of size for the problem is the total length of all strings in $A$ and $B$; this swamps any contribution from $k$.) It seems highly likely to me that it is NP-complete (and in fact, I would expect a reduction from some sort of cover problem), but a few searches haven't turned up anything particularly useful.

Steven Stadnicki
    Is it even in NP? Given a regular expression, how do you check whether a word is in the described language in polynomial time? The standard approach -- transform to NFA, then DFA and check -- takes exponential time in $k$ (?). – Raphael Jan 13 '14 at 09:38
  • yeah... this seems outside NP. But, what about regular expressions in the mathematical sense rather than RegExps in the programming sense? – John Dvorak Jan 13 '14 at 09:55
    should be PSPACE-complete; see (Gramlich, Schnitger, Minimizing NFAs and Regular Expressions, 2005) at http://ggramlich.github.io/Publications/approximationSTACS05Pres.pdf and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.5056&rep=rep1&type=pdf (PS: I'm posting this as a comment, because an answer should explain why, but I don't have time to do so at the moment; perhaps someone else can use the reference and explain how it works) – rgrig Jan 13 '14 at 12:09
    For regular expressions as understood in TCS, the problem is in NP (A certificate of polynomial size and verifiable in polynomial time would be the regular expression itself). It (probably) isn't in NP if we use e.g. PCREs for regular expressions, because even testing membership is NP-hard (http://perl.plover.com/NPC/NPC-3SAT.html). – Mike B. Jan 13 '14 at 12:20
    @MikeB.: And how exactly do you check in polynomial time? Did you see the comment by @Raphael? – rgrig Jan 13 '14 at 12:43
  • @rgrig: you can transform the regular expression into an NFA (e.g. with Glushkov) in polynomial time, and whether a word is matched by an NFA can likewise be tested in polynomial time - the expensive transformation to a DFA is not necessary. – Mike B. Jan 13 '14 at 12:47
  • @rgrig The part of the RE to NFA to DFA transformation which takes exponential time is the NFA to DFA part, since the minimal DFA for some REs is exponential in the size of the RE. However constructing an NFA (no epsilon-transitions) from an RE in linear time is a well-known construction. Since an NFA is also a nondeterministic TM, checking is trivially in NP. – Pseudonym Jan 13 '14 at 12:48
  • @rgrig Also, it's not clear that the Meyer-Stockmeyer result is relevant here. Their result is for a general NFA/RE. It hasn't been established that the types of REs which "solve" the regex golf problem are a sufficiently general class. – Pseudonym Jan 13 '14 at 12:53
  • @Pseudonym I thought about deterministic verification in polynomial time (that's how it's usually explained) which does not work for NFAs. Nondeterministic verification is fine, of course -- nice catch! – Raphael Jan 13 '14 at 12:57
  • (1) OK, for strings in $A$ you can provide witness runs of the NFA. What about $B$? (2) I don't see any restrictions on the regular expression in the original question. Of course, MS73 might still be irrelevant, because the sets $A$ and $B$ make the problem rather different. – rgrig Jan 13 '14 at 14:41
    (1) You can run a deterministic algorithm in P to test membership of NFAs (start at start-state, and remember all the states you can be in after consuming a symbol of the word. Reach the end, check if you reached at least one final state.) (2) It depends on the definition of "regular expression" - do we use the one of computer scientists, or the one of programmers? Do we allow only regular languages, or (a subset of) context sensitive languages (hence PCREs)? – Mike B. Jan 13 '14 at 15:48
    @rgrig (2) Little-known fact: given a RE, you can generate a (definitely non-minimal) NFA which accepts the complement of the RE in linear time. The proof is straightforward using Brzozowski derivatives, but this comment field is too small to contain it. – Pseudonym Jan 14 '14 at 05:11
  • One more random thought: the separating NFA decision problem may be easier to show as NP-hard. Given an integer $k$, does there exist a $k$-state NFA which separates the two sets? – Pseudonym Jan 14 '14 at 06:58
  • @MikeB.: Oh, dear! I will now go and shoot myself. I must have been deranged yesterday. Thanks for your patience. – rgrig Jan 14 '14 at 09:00
  • @rgrig Not to pile on, but the key difference between the Gramlich/Schnitger result and the problem as I've posed it is that their paper talks about finding the minimal regular expression equivalent to a given regex, whereas I'm talking about (essentially) finding minimal regexes for an explicitly given set. Their input regex is effectively a succinct representation of the set to be expressed, and those are well-known to lead to higher-complexity problems (for instance, lots of graph-theory problems become harder in the binary-circuit model). – Steven Stadnicki Jan 14 '14 at 17:49
  • So, people, I think we might have an open problem on our hands. Anyone up for a polymath-style brainstorm? The comment fields here seems like the wrong place for it. – Pseudonym Jan 15 '14 at 00:14
  • @Pseudonym If it's open, wait a week or two and then migrate to [cstheory.SE] if no answer has come up. – Raphael Jan 15 '14 at 08:09
  • @MikeB. Could you help with a similar problem, please? https://math.stackexchange.com/questions/2264231/prove-that-problem-is-np-complete-given-alphabet-and-regular-expression-check-i –  May 05 '17 at 15:53
  • @Pseudonym you also seem to be able to explain these things –  May 05 '17 at 15:54

1 Answer


Assuming the TCS-variant of regex, the problem is indeed NP-complete.

We assume that our regexes contain

  • letters from $\Sigma$, matching themselves,
  • $+$, denoting union,
  • $\cdot$, denoting concatenation,
  • $*$, denoting Kleene-Star,
  • $\lambda$, matching the empty string

and nothing else. The length of a regex is defined as the number of characters from $\Sigma$ it contains. As in the comic strip, we consider a regex to match a word if it matches a substring of the word. (Changing any of these assumptions should only influence the complexity of the construction below, not the general result.)
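These conventions can be illustrated with Python's `re` module (my own translation, not part of the answer: the TCS union $+$ becomes Python's `|`, and the substring-matching semantics corresponds to `re.search` rather than `re.fullmatch`):

```python
import re

# TCS-style regex a b* + c has length 3 (count only the letters a, b, c).
tcs_regex = "ab*+c"
# The same language in Python's syntax, where union is written "|".
py_regex = "ab*|c"

# Substring semantics: re.search looks for a match anywhere in the word.
print(bool(re.search(py_regex, "xaz")))   # True: the substring "a" matches
print(bool(re.search(py_regex, "xxz")))   # False: no substring matches
```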

That the problem is in NP is straightforward, as explained in the comments: a candidate RE can be verified by translating it into an NFA and running that NFA on all words from $A$ and $B$.
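The polynomial-time NFA simulation mentioned in the comments can be sketched directly (a minimal sketch under my own assumptions: an epsilon-free NFA given as a transition table mapping (state, symbol) pairs to successor-state sets):

```python
def nfa_accepts(delta, start, finals, word):
    """Simulate an epsilon-free NFA by tracking the set of reachable
    states; runs in time polynomial in |word| and the number of states,
    with no NFA-to-DFA conversion."""
    current = {start}
    for symbol in word:
        current = {q for p in current for q in delta.get((p, symbol), set())}
        if not current:          # no run can continue
            return False
    return bool(current & finals)

# Toy NFA for (a+b)*b: state 0 is the start, state 1 the only final state.
delta = {
    (0, 'a'): {0},
    (0, 'b'): {0, 1},
}
print(nfa_accepts(delta, 0, {1}, "aab"))   # True
print(nfa_accepts(delta, 0, {1}, "aba"))   # False
```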

In order to show NP-hardness, we reduce from Set Cover:

Given a universe $U$ and a collection $C$ of subsets of $U$, is there a set $C' \subseteq C$ of size $\leq k$ so that $\bigcup_{S \in C'} S = U$?

We translate an input for Set cover into one for regex golf as follows:

  • $\Sigma$ contains one character for each subset in $C$ and one additional character (denoted $x$ in the following).
  • $A$ contains one word for each element $e$ of $U$. The word consists of exactly the characters representing subsets in $C$ that contain $e$ (in arbitrary order).
  • $B$ contains the single word $x$.
  • $k$ is simply carried over.
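The construction above can be sketched in code (the concrete letter assignment is my own choice: one lowercase letter per subset, with `x` reserved for $B$, assuming fewer than 24 subsets):

```python
def set_cover_to_regex_golf(universe, subsets, k):
    """Map a Set Cover instance (U, C, k) to a regex-golf instance (A, B, k)."""
    letters = "abcdefghijklmnopqrstuvw"   # one fresh letter per subset; "x" is reserved
    # One word per element e of U: the letters of exactly those subsets containing e.
    A = ["".join(letters[i] for i, S in enumerate(subsets) if e in S)
         for e in sorted(universe)]
    B = ["x"]                             # the single forbidden word
    return A, B, k                        # k is carried over unchanged

A, B, k = set_cover_to_regex_golf({1, 2, 3, 4},
                                  [{1, 2}, {2, 3}, {3, 4}, {1, 4}], 2)
print(A, B, k)   # ['ad', 'ab', 'bc', 'cd'] ['x'] 2
```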

This reduction is obviously in P and equivalence is also quite simple to see:

  • If $c_1, \ldots, c_k$ is a solution for the set cover instance, the regex $c_1 + \cdots + c_k$ is a solution to regex golf.
  • A regex matching the empty subword would match $x$. Thus, any regex solving the golf problem has to contain at least one letter from each of the words in $A$. Hence, if the golf instance is solvable, there is a set of at most $k$ letters from $\Sigma$ so that each word in $A$ is covered by this set of letters. By construction, the corresponding set of subsets from $C$ is a solution to the set cover instance.
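The first direction of the equivalence can be checked mechanically on a toy instance (my own encoding: since a union of single letters under substring semantics matches a word iff the word contains one of those letters, membership testing reduces to a containment check):

```python
def golf_solution_from_cover(cover_letters):
    """Turn a cover {c_1, ..., c_k} into the regex c_1 + ... + c_k."""
    return "+".join(cover_letters)

def union_of_letters_matches(letters, word):
    """Regex l1+...+lk, substring semantics: matches iff some l_i occurs in word."""
    return any(l in word for l in letters)

# Instance from U = {1,2,3,4}, C = {{1,2}, {2,3}, {3,4}, {1,4}}:
A = ["ad", "ab", "bc", "cd"]
cover = ["a", "c"]   # the subsets {1,2} and {3,4} cover U

assert all(union_of_letters_matches(cover, w) for w in A)   # accepts all of A
assert not union_of_letters_matches(cover, "x")             # rejects B
print(golf_solution_from_cover(cover))   # a+c
```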
FrankW
  • Very nice, let me add 2 points, for completeness: (1) As an additional assumption regarding the problem specification, $A$ and $B$ must be finite sets (with all elements enumerated explicitly?). (2) The RE candidate's size is in $O(n)$, since $a_1 + a_2 + \cdots$ with $a_i \in A$ is a valid candidate of size $O(n)$, so for every larger $k$ the answer is trivially true. – Mike B. Jan 22 '14 at 09:50
    @Mike B.: (1): Finiteness of $A$ and $B$ is given in the question. In complexity theory, exhaustive listing is the default way of representing finite sets. (2) is indeed a required argument, if one wants to make the "in NP" part rigorous. – FrankW Jan 22 '14 at 11:39