Algorithm to test whether a language is regular

Question

Is there an algorithm/systematic procedure to test whether a language is regular?

In other words, given a language specified in algebraic form (think of something like $L=\{a^n b^n : n \in \mathbb{N}\}$), test whether the language is regular or not. Imagine we are writing a web service to help students with all their homework; the user specifies the language, and the web service responds with "regular", "not regular", or "I don't know". (We'd like the web service to answer "I don't know" as infrequently as possible.) Is there any good approach to automating this? Is this tractable? Is it decidable (i.e., is it possible to guarantee that we never need to answer "I don't know")? Are there reasonably efficient algorithms that can provide an answer other than "I don't know" for many/most languages that are likely to arise in practice?

The classic method for proving that a language is not regular is the pumping lemma. However, it looks like it requires manual insight at some point (e.g., to choose the word to pump), so I'm not clear on whether this can be turned into something algorithmic.

A classic method for proving that a language is regular would be to use the Myhill–Nerode theorem to derive a finite-state automaton. This looks like a promising approach, but it does require the ability to perform basic operations on languages in algebraic form. It's not clear to me whether there's a systematic way to symbolically perform all of the operations that may be needed on languages in algebraic form.
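Short of a symbolic procedure, the Myhill–Nerode idea can at least be explored by brute force. The following is only a rough sketch under assumptions that are not part of the question: the language is available as a membership predicate (`member`, a hypothetical Python callable), and only prefixes and suffixes up to a fixed length are tried. The count it returns is a lower bound on the number of Myhill–Nerode classes; a count that keeps growing with the bounds is evidence of non-regularity, not a proof, and a count that stays flat proves nothing either.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def nerode_class_lower_bound(member, alphabet, prefix_len, suffix_len):
    """Count prefixes that are told apart by some suffix of bounded length.

    Prefixes u, v are distinguishable when member(u + w) != member(v + w)
    for some suffix w. Prefixes with different rows below are pairwise
    distinguishable, so the result is a lower bound on the number of
    Myhill-Nerode classes (never an upper bound).
    """
    suffixes = list(words(alphabet, suffix_len))
    rows = {
        tuple(member(u + w) for w in suffixes)
        for u in words(alphabet, prefix_len)
    }
    return len(rows)


def is_anbn(w):
    """Membership in {a^n b^n : n >= 0} (illustrative example language)."""
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n


def has_even_as(w):
    """Membership in the regular language of words with an even number of a's."""
    return w.count("a") % 2 == 0


if __name__ == "__main__":
    for k in (2, 4, 6):
        print(k,
              nerode_class_lower_bound(is_anbn, "ab", k, k),
              nerode_class_lower_bound(has_even_as, "ab", k, k))
```

For $a^n b^n$ the count increases each time the bound is raised, while for the even-number-of-$a$'s language it stays at two. A web service could use such a trend only as a hint for choosing between "probably not regular" and "I don't know".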

To make this question well-posed, we need to decide how the user will specify the language. I'm open to suggestions, but I'm thinking something like this:

$$L = \{E : S\}$$

where $E$ is a word-expression and $S$ is a system of linear inequalities over the length-variables, with the following definitions:

Each of $x,y,z,\dots$ is a word-expression. (These represent variables that can take on any word in $\Sigma^*$.)
Each of $x^r,y^r,z^r,\dots$ is a word-expression. (Here $x^r$ represents the reverse of the string $x$.)
Each of $a,b,c,\dots$ is a word-expression. (Implicitly, $\Sigma=\{a,b,c,\dots\}$, so $a,b,c,\dots$ represent a single symbol in the underlying alphabet.)
Each of $a^\eta,b^\eta,c^\eta,\dots$ is a word-expression, if $\eta$ is a length-variable.
The concatenation of word-expressions is a word-expression.
Each of $m,n,p,q,\dots$ is a length-variable. (These represent variables that can take on any natural number.)
Each of $|x|,|y|,|z|,\dots$ is a length-variable. (These represent the length of a corresponding word.)

This seems broad enough to handle many of the cases we see in textbook exercises. Of course, you can substitute any other textual method of specifying a language in algebraic form, if you have a better suggestion.
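To make the format concrete, here is a minimal sketch of how such a specification might be represented and explored. Everything named here (`Sym`, `Power`, `WordVar`, `LinearInequality`, `bounded_words`) is hypothetical and chosen for illustration only. Each inequality is assumed to be normalized to the form $\sum_i c_i v_i \le d$, and every constraint on some $|x|$ is assumed to refer to a word variable $x$ that occurs in $E$. The function enumerates only those words obtained when every variable is bounded by `limit`, so it yields a finite sample of the language rather than a decision procedure.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Set, Union


@dataclass(frozen=True)
class Sym:
    """A single alphabet symbol such as a."""
    char: str


@dataclass(frozen=True)
class Power:
    """a^eta, where eta names a length-variable such as "n" or "|x|"."""
    char: str
    var: str


@dataclass(frozen=True)
class WordVar:
    """A word variable x, or its reverse x^r when rev is True."""
    name: str
    rev: bool = False


@dataclass
class LinearInequality:
    """Encodes sum(coeffs[v] * v) <= rhs over length-variables."""
    coeffs: Dict[str, int]
    rhs: int

    def holds(self, env: Dict[str, int]) -> bool:
        return sum(c * env[v] for v, c in self.coeffs.items()) <= self.rhs


Term = Union[Sym, Power, WordVar]


def render(term: Term, word_env: Dict[str, str], env: Dict[str, int]) -> str:
    """Turn one term of E into a concrete string under the given assignment."""
    if isinstance(term, Sym):
        return term.char
    if isinstance(term, Power):
        return term.char * env[term.var]
    word = word_env[term.name]
    return word[::-1] if term.rev else word


def bounded_words(expr: List[Term], constraints: List[LinearInequality],
                  alphabet: str, limit: int) -> Set[str]:
    """Words of {E : S} reachable with every variable bounded by `limit`."""
    word_vars = sorted({t.name for t in expr if isinstance(t, WordVar)})
    names = {t.var for t in expr if isinstance(t, Power)}
    for c in constraints:
        names.update(c.coeffs)
    # Names of the form |x| are derived from word variables, not enumerated.
    num_vars = sorted(v for v in names if not v.startswith("|"))
    candidates = ["".join(p) for n in range(limit + 1)
                  for p in product(alphabet, repeat=n)]
    found = set()
    for ws in product(candidates, repeat=len(word_vars)):
        word_env = dict(zip(word_vars, ws))
        lengths = {f"|{x}|": len(w) for x, w in word_env.items()}
        for ns in product(range(limit + 1), repeat=len(num_vars)):
            env = {**dict(zip(num_vars, ns)), **lengths}
            if all(c.holds(env) for c in constraints):
                found.add("".join(render(t, word_env, env) for t in expr))
    return found
```

For instance, `bounded_words([Power("a", "m"), Power("b", "n")], [LinearInequality({"m": 1, "n": -1}, 0)], "ab", 2)` samples $\{a^m b^n : m \le n\}$. Keeping $E$ as a flat list of terms mirrors the rule that concatenation is the only way to combine word-expressions.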

I haven't yet had time to think much about your choice of language expressions. Roughly what kinds of languages does it cover? If you add the constraint that a word variable occurs only once, are all such languages context-free? — Gilles 'SO- stop being evil', Nov 14 '13 at 07:59
Maybe you can try to write $E$ itself with a grammar? Like $E ::= c^\eta \mid x \mid EE \mid E^r$ and $\eta ::= n \mid |x|$, is it succinctly what you describe? — jmad, Nov 14 '13 at 08:27
You can express $\{a^nb^nc^n \mid n\in\mathbb{N}\}$ so this goes well beyond context-free languages. Still, I suspect the problem is at least as hard as deciding whether a context-free grammar defines a regular language. — Gilles 'SO- stop being evil', Nov 14 '13 at 09:58
@jmad, yes, that'd be perfectly reasonable. I'm not wedded to this choice of language expressions: feel free to choose something else, if you see something else more appropriate. Gilles, great angle of attack! (For onlookers, there are known results showing that testing whether an arbitrary context-free grammar defines a regular language is undecidable.) If the problem is undecidable, I'd suggest we tweak the problem to allow the web service to respond "I don't know", and then ask for an algorithm that answers "I don't know" as rarely as possible. — D.W., Nov 14 '13 at 16:42
This class isn't closed under Kleene star, is it? Can you express balanced parentheses? — Gilles 'SO- stop being evil', Nov 15 '13 at 10:00
Interesting; it may be possible to specify criteria on the inequalities that enforce/break regularity. Do you have an example of a) a regular language that requires inequalities with more than one length variable and b) a non-regular language that can be described using only inequalities with one length variable (each)? Assuming the word-expression may contain each length variable only once. — Raphael, Nov 15 '13 at 16:06
@Raphael, interesting approach! Here are some examples of the sort you asked for. a) $\{a^m a^n : m \le n\}$ is regular. So is $\{x y a^m b^n z y^r x^r : m \le 1, n \le m, |x| \le m, |y| \le m\}$ (to pick a crazy example). (Note that "requires" isn't quite the right requirement; the right question is whether it "can be expressed using" inequalities with more than one length variable.) b) $\{x x^r\}$ is non-regular, as is $\{x a^n x^r : n \ge 0\}$. — D.W., Nov 15 '13 at 19:09
@Gilles, agreed. It doesn't look likely to be closed under Kleene star as far as I can see, and I can't see how to express the language of balanced parentheses. — D.W., Nov 15 '13 at 19:11
@D.W. Oh, I completely missed reversal. Your comment regarding a) is spot-on, of course; I can describe even the simplest languages in very complicated ways. Maybe we can define reasonable "minimality" criteria; for instance, "every subword denoted by a letter $x,y,z,\dots$ has to be non-empty" or "there can be no sub-expression of the form $a^{\eta}a^{\kappa}$". — Raphael, Nov 15 '13 at 22:08
Maybe it would help to use logic for the expression $E$. Since it is known that finite automata can be expressed in monadic second-order logic and vice versa, I think your algebraic description should finally look like MSO or something close to that. — Parham, Nov 16 '13 at 13:26
The specification does not allow finite union, which is a bit strange since you cannot express finite languages. — J.-E. Pin, Nov 17 '13 at 19:22
@J.-E.Pin, good point; from a theoretical perspective (e.g., closure properties), that is pretty odd. Do you have any suggestions on a better way of specifying languages? I'd suggest the following requirements/goals: (a) expressive enough to capture many of the textbook exercises of this sort, (b) limited enough that there's some hope for algorithms that will be reasonably effective at solving this problem, but I'm open to different advice. — D.W., Nov 17 '13 at 19:25
My understanding is that there exist language classes larger than the class of regular languages for which this problem is decidable (itself an interesting question: what are they?). A major part of this question is the nonstandard language class used to describe the "larger" language, so half the trick is precisely characterizing the "larger" class according to the basic known hierarchy. In other words, it would help the question if you phrased the larger class in terms of standard known language classes. Reminds me of Fortnow's new counting languages. — vzn, Nov 18 '13 at 02:14

score 13 · Answer 1 · edited Apr 22 '22 at 07:24

The answer is no. Deciding whether a given context-free grammar generates a regular language is an undecidable problem.

Update. I gave this negative answer to the general question

Given a language specified in algebraic form, test whether the language is regular or not

since context-free languages are solutions of algebraic equations in languages: see Chapter II, Theorems 1.4 and 1.5 in the book of J. Berstel Transductions and Context-Free Languages.

However, the same question is decidable for deterministic context-free languages, a nontrivial result due to Stearns [1] and improved by Valiant [2]:
[1] R. E. Stearns, A Regularity Test for Pushdown Machines, Information and Control 11 323-340 (1967). DOI:10.1016/S0019-9958(67)90591-8.
[2] L. G. Valiant, Regularity and related problems for deterministic pushdown automata, J. ACM 22 (1975), pp. 1–10.

There is another positive result, closer to the specifications given in the second part of the question. Recall that the semilinear subsets of $\mathbb{N}^k$ are exactly the sets definable in Presburger arithmetic. They are also exactly the rational subsets of $\mathbb{N}^k$. In particular, a subset of $\mathbb{N}^k$ defined by linear inequalities is rational. Now, given a rational subset $R$ of $\mathbb{N}^k$, it is decidable whether the language $$ L(R) = \{ u_1^{n_1} \dotsm u_k^{n_k} \mid (n_1, \dots, n_k) \in R \} $$ is regular. Indeed, it is known [Ginsburg-Spanier] that $L(R)$ is regular if and only if $R$ is a recognizable subset of $\mathbb{N}^k$, and it is decidable [Ginsburg-Spanier] whether a given rational subset of $\mathbb{N}^k$ is recognizable.

S. Ginsburg and E. H. Spanier, Semigroups, Presburger formulas, and languages, Pacific J. Math. 16 (1966), 285–296.

S. Ginsburg and E. H. Spanier, Bounded regular sets, Proc. of the American Math. Soc. 17 (1966), 1043–1049.

This does not solve the second part of the question, which might be undecidable because of the word variables, but it gives a reasonable fragment to start with.
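
As a small illustration of this criterion (a worked example, assuming $k = 2$ and that $u_1 = a$ and $u_2 = b$ are distinct letters, and using Mezei's characterization of the recognizable subsets of $\mathbb{N}^2$ as the finite unions of products $X \times Y$ with $X, Y \subseteq \mathbb{N}$ ultimately periodic), compare

$$R_1 = \{(m,n) : m \ge 1,\ n \ge 2\} = (1 + \mathbb{N}) \times (2 + \mathbb{N}), \qquad R_2 = \{(n,n) : n \in \mathbb{N}\}.$$

$R_1$ is a single product of ultimately periodic sets, hence recognizable, and $L(R_1) = a a^* b b b^*$ is regular. $R_2$ is rational (it is generated by $(1,1)$) but not recognizable: if it were a finite union of products $X_i \times Y_i$, two distinct points $(m,m)$ and $(n,n)$ would fall into the same product, which would then also contain $(m,n) \notin R_2$. Accordingly, $L(R_2) = \{a^n b^n : n \in \mathbb{N}\}$ is not regular.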

edited Apr 22 '22 at 07:24

Martin


answered Nov 14 '13 at 22:42

J.-E. Pin


(a) Pedantic nit: It's not clear to me whether the algebraic syntax above is general enough to express all context-free grammars (as Gilles and I hinted at in the comments), so it's not entirely clear whether that particular result applies here. (b) More important: please consider the problem statement suitably tweaked so that the web service is allowed to respond "I don't know", and we'd like to find an algorithm that answers "I don't know" as rarely as possible. I previously suggested this in the comments; I'll edit the question to make this clearer in the question itself. – D.W. Nov 14 '13 at 22:52
I suspect that you can adapt the proof, but the result does not follow. I think there are context-free languages that can't be expressed in this formalism: for example, how do you express balanced parentheses? The class of languages isn't closed under Kleene star, is it? – Gilles 'SO- stop being evil' Nov 15 '13 at 09:59
@Gilles, yeah, I thought about that. It's not immediately clear to me how to adapt the proof. The standard proof that it's undecidable to tell whether a context-free grammar is regular is via Greibach's theorem. However it does not look to me like this class of languages satisfies the premises of Greibach's theorem (it doesn't look likely to be closed under concatenation with regular sets and closed under union). Maybe there's some other proof approach that I'm not familiar with. I agree, it's not clear how to express the language of balanced parentheses in this algebraic form. – D.W. Nov 15 '13 at 18:58
Just added the references. – J.-E. Pin Nov 17 '13 at 19:42
Your post does not answer the question, because it addresses a different class of languages. The algebraic forms allowed here (with a single word expression) are (as far as we can tell) not as general as the algebraic forms needed to express arbitrary context-free languages. It could be the case that for the intersection of the two, the problem is decidable. – Gilles 'SO- stop being evil' Nov 18 '13 at 13:07
@Gilles I made it clear that my post does not answer the question, see my last sentence: "This does not solve the second part of the question". On the other hand, if you look at the comments by D.W., the spirit of the question is to find a reasonable algebraic specification for which the regularity question becomes decidable. I just indicated some positive and negative results in this direction, but there is certainly much more to say. At this stage, maybe a variation of this question would be appropriate for TCS. – J.-E. Pin Nov 19 '13 at 13:53
