Proving that L = {x ∈ {a, b}∗ | na(x) < nb(x) < 2na(x)} is not a context free language

Question

I've been working on proving that this language

L = {x ∈ {a, b}∗ | na(x) < nb(x) < 2na(x)}

is not Context Free. "na(x)" stands for "number of a's in the string x".

However, I can't find a string for which every possible decomposition can be pumped out of the language. For example, I tried a^(p+1) b^(2p+1), but I can pick the last a and the first two b's and pump them however I like, without the string exiting the language.

I attached an image with the full text of the exercise and why I think that the provided solution is wrong.

What am I missing here? Also, if the language actually is context-free, what would be a context-free grammar for it?

Hendrik Jan · Answer 1 · 2024-02-14T21:07:09.540

3

The language is context-free. A slightly different language is treated in the question "Context free grammar construction $\{ a^mb^n \mid m≤n≤2m \}$".

For strings where the symbols $a,b$ may be in any order, you might find inspiration in the language $\{w∈\{a,b\}^∗:\#_a(w)=2\#_b(w)\}$.

edited Feb 14 '24 at 21:07

answered Feb 14 '24 at 14:27

Hendrik Jan


I was able to find the grammar for the language in the post you linked (S -> aSb | aSbb | epsilon), but I'm still really struggling with the one in my original post. I simply can't think of a grammar that could produce weird strings like "abbbababba" (which the language in my original post would allow). – Librapulpfiction Feb 14 '24 at 19:42

@Librapulpfiction You are right that the language from your question is more complicated than the one I linked in my answer. Note that D.W. in a recent other answer suggests an approach using a pushdown for your language. The trick is comparable to that for the language where all strings are of the form $a^m b^n$. There is a standard construction from pushdown automaton into context-free grammar. – Hendrik Jan Feb 14 '24 at 20:55

score 2 · Answer 2 · answered Feb 14 '24 at 20:37

Suppose that instead of a stack, you had a single counter, $x$, that you could either increment or decrement, under the control of a non-deterministic finite-state machine. (Each transition of the finite-state machine either increments $x$, decrements $x$, or leaves it unchanged.)

Can you think of how to recognize the language with such a machine?

Hint:

When you see a b, decrement $x$. When you see an a, non-deterministically either ____ or _____. You fill in the blanks.

Can you think about how to convert such a machine into a pushdown automaton? (It is known that they can be converted -- these machines are called one-counter machines, which are known to be equivalent to pushdown automata.)

Hint:

Somehow encode the value of $x$ on the stack. What are reasonable options for how you might represent an integer?

You might take a look at the question "Context free grammar construction" for inspiration and ideas.

You'll have to fill in the details, as it is your exercise.

score 2 · Answer 3 · answered Feb 25 '24 at 00:03

As the other answers indicate, the language is indeed context-free.

Writing a grammar for it is a bit tricky, but we can do it as follows.

$N_a(x) = N_b(x)$

First, let's consider the (much) simpler language where every string has the same number of a's as b's.

We don't need a very methodical approach to find that it can be written as this grammar (using Q (for "equal") as its start symbol, rather than the traditional S, to avoid confusion when I discuss other languages below):

$Q \rightarrow \epsilon$
$Q \rightarrow QQ$
$Q \rightarrow aQb$
$Q \rightarrow bQa$

. . . but it may help to think through why this works.

Taking our cue from D.W.'s answer, we can imagine iterating over the string while keeping a counter, adding 1 to the counter every time we see an a and subtracting 1 from the counter every time we see a b. At the end, the counter shows $N_a(x) - N_b(x)$, which is 0 if and only if there are the same number of a's as b's, that is, if and only if the string is in the language.

The reason that the above grammar covers all possibilities is that there are only four "paths" the counter can take from 0 back to 0:

1. It might never leave 0. This happens only if the string is empty.
2. It might hit 0 at one or more points along the way. In this case, the string can be split at the first such point, giving two non-empty substrings that are also members of the language.
3. It might remain above 0 the whole way. In this case, the string must start with a (bringing the counter to 1) and end with b (bringing it down from 1 to 0), and the substring in between must also have the same number of a's as b's (because the counter ends up at the same value where it started, namely 1).
4. It might remain below 0 the whole way; this case is symmetric to #3.

(Note that the production rules "overlap" in a way that these cases don't — it would be valid to apply $Q \rightarrow QQ$ and $Q \rightarrow \epsilon$ a whole bunch of times for no reason — but these cases demonstrate why each production rule is necessary, and why they're collectively sufficient, to cover all valid strings. Also note that I'm not demonstrating that these production rules only generate valid strings; I think that's clear enough without further explanation.)
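The case analysis above doubles as a parsing procedure. Here is a minimal Python sketch of it, assuming the input contains only a's and b's; the names `parse_q` and `unparse` are my own, not from any library. It splits at the first point where the counter returns to 0, and otherwise peels off the outer pair of characters.

```python
def parse_q(s):
    """Derivation tree for s under Q -> eps | QQ | aQb | bQa.

    Assumes s contains only 'a' and 'b'. Raises ValueError when the
    numbers of a's and b's differ.
    """
    if s == "":
        return ("eps",)                      # case 1: counter never leaves 0
    counter = 0
    first_zero = None
    for i, ch in enumerate(s, start=1):
        counter += 1 if ch == "a" else -1    # +1 for a, -1 for b
        if counter == 0:
            first_zero = i                   # first return to 0
            break
    if first_zero is None:
        raise ValueError("different numbers of a's and b's")
    if first_zero < len(s):                  # case 2: split at the first 0
        return ("QQ", parse_q(s[:first_zero]), parse_q(s[first_zero:]))
    inner = parse_q(s[1:-1])                 # cases 3 and 4: peel both ends
    return ("aQb", inner) if s[0] == "a" else ("bQa", inner)


def unparse(tree):
    """Rebuild the string that a derivation tree generates."""
    rule = tree[0]
    if rule == "eps":
        return ""
    if rule == "QQ":
        return unparse(tree[1]) + unparse(tree[2])
    if rule == "aQb":
        return "a" + unparse(tree[1]) + "b"
    return "b" + unparse(tree[1]) + "a"      # rule == "bQa"
```

For example, `parse_q("abba")` returns `("QQ", ("aQb", ("eps",)), ("bQa", ("eps",)))`: the counter first returns to 0 after `ab`, so the string is split there, and each half is handled by peeling off its two end characters.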

$N_a(x) \le N_b(x) \le 2N_a(x)$

OK, so, let's kick it up a notch, and consider a language that's a bit closer to yours. Here, every string has at least as many b's as a's, and at most twice as many b's as a's. (That's still simpler than your language, which uses strict inequalities instead of weak ones, but obviously it's more complex than the first language.)

We can apply similar reasoning as above, except that now, each a is allowed to balance out either 1 or 2 b's; so when we iterate over the string with a counter, we can add either 1 or 2 when we see an a, and a string is in the language if there's any way that this could result in the counter ending up back at 0. (This is what D.W. mentioned in his/her first hint.)

That puts a bit of a wrench in our case analysis, because it's now possible to cross 0 by "jumping" from −1 to +1, which means that the counter might register 0 at the end of a string of the form b…b even if it never registered 0 at any point in the middle. More generally, it's now possible to "jump" from x to x+2. (But the reverse is not true: the counter can't jump downwards from x to x−2.)
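To make the non-deterministic counter concrete, here is a small Python sketch that tracks every value the counter could hold after each prefix. The function names are mine, and it assumes the input contains only a's and b's.

```python
def reachable_counters(s):
    """Sets of counter values reachable after each prefix of s.

    A 'b' subtracts 1; an 'a' non-deterministically adds 1 or 2.
    Assumes s contains only 'a' and 'b'.
    """
    values = {0}
    history = [values]
    for ch in s:
        if ch == "a":
            values = {v + 1 for v in values} | {v + 2 for v in values}
        else:
            values = {v - 1 for v in values}
        history.append(values)
    return history


def accepts_weak(s):
    """True if some run of the counter ends at 0."""
    return 0 in reachable_counters(s)[-1]
```

For the string bab, `reachable_counters("bab")` gives `[{0}, {-1}, {0, 1}, {-1, 0}]`. The only run that ends at 0 goes 0, −1, +1, 0, so it never sits at 0 in the middle: this is exactly the "jump" described above.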

But, we can still take the same approach, just taking that into account. So, these are the cases:

1. The value of the counter might never leave 0. This happens only if the string is empty.
2. It might take the value 0 at one or more points along the way. In this case, the string can be split at the first such point, giving two non-empty substrings that are also members of the language.
3. It might never take the value 0, but jump from −1 to +1 at some point along the way. In this case, the string must start with a b that lowers the counter from 0 to −1, then later have an a that raises the counter from −1 to +1, and end with a b that lowers the counter from +1 to 0. The substrings between those three characters leave the counter where they found it, so they are members of the language.
4. It might remain above 0 the whole way. In this case it must start with an a and end with a b. If the first a adds 1, the substring between those two characters takes the counter from 1 back to 1, so it is a member of the language. If the first a adds 2, the string must end with bb (if the second-to-last character were an a, the counter would have to be at 0 or below just before it), and the substring between the first a and the final bb takes the counter from 2 back to 2.
5. It might remain below 0 the whole way. This case mirrors #4: the string must start with a b and end with an a, and it has the form b…a if the last a adds 1, or bb…a if the last a adds 2.

which gives this grammar (using W, for "weak inequality", as the start symbol):

$W \rightarrow \epsilon$
$W \rightarrow WW$
$W \rightarrow bWaWb$
$W \rightarrow aWb$
$W \rightarrow aWbb$
$W \rightarrow bWa$
$W \rightarrow bbWa$

(Note: the rules $W \rightarrow aWb$ and $W \rightarrow bWa$ can't be shortened to $W \rightarrow ab$ and $W \rightarrow ba$. The string aabb, for example, is in the language, but no other rule produces it: it can't be split into two non-empty members, and $W \rightarrow aWbb$ would need the inner string a, which is not a member.)

$N_a(x) \lt N_b(x) \lt 2N_a(x)$

OK, finally, the language we're actually interested in.

Observe that we must have $N_a(x) \ge 2$, otherwise there's no valid value for $N_b(x)$. So $N_a(x) \ge 2$ and $N_b(x) \ge 3$, and the condition $N_a(x) < N_b(x) < 2N_a(x)$ can be written as $N_a(x) - 2 \le N_b(x) - 3 \le 2(N_a(x) - 2)$. This means that an element of this language can be thought of as an element of the prior language plus 2 extra a's and 3 extra b's. (Please take a moment to convince yourself of this.)
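The rewriting works because the counts are integers, so each strict inequality can be tightened by 1. Spelled out, with nothing assumed beyond that:

```latex
\begin{align*}
N_a(x) < N_b(x)
  &\iff N_a(x) + 1 \le N_b(x)
   \iff N_a(x) - 2 \le N_b(x) - 3, \\
N_b(x) < 2N_a(x)
  &\iff N_b(x) \le 2N_a(x) - 1
   \iff N_b(x) - 3 \le 2N_a(x) - 4 = 2\,(N_a(x) - 2).
\end{align*}
```

So removing 2 a's and 3 b's from a string that satisfies the strict inequalities always leaves counts that satisfy the weak ones, and adding them back goes the other way.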

What's more, with the exception of bbaaabb, any string in this language must have a substring of length 5 that contains exactly 2 a's and 3 b's. This is because:

As we've seen, it must have length at least 5, which means it must have at least one substring of length 5.
If it has a length-5 substring with fewer than 2 a's and a length-5 substring with more than 2 a's, then it must have a length-5 substring with exactly 2 a's, because consecutive same-length substrings overlap in all but one character, so their number of a's can differ by at most one. (For example, aaaaabbbbb contains the substring aabbb on its way from aaaaa to bbbbb.)
If all of its length-5 substrings have fewer than 2 a's, then it has at least 4 b's between any two a's, so $N_b(x) \ge 4(N_a(x) - 1) \ge 2N_a(x)$ (using $N_a(x) \ge 2$), which is not allowed.
If all of its length-5 substrings have more than 2 a's, then . . . this case is a bit trickier. Let's write its length as 5m + k, where k is in {0,1,2,3,4}. The first 5m characters split into m length-5 substrings, so they contain at least 3m a's and at most 2m b's, a deficit of at least m b's; since the whole string needs more b's than a's, the last k characters need at least m + 1 more b's than a's. Since m is at least 1, and the last k characters can have at most 2 b's (they fit inside a single length-5 substring), the only solution is m = 1, k = 2 (for a total of 7 characters), with both of those last two characters being b's. Applying the same logic from the right end of the string, we find that the first two characters must also be b's, and the overall string must be bbaaabb.

This means that (1) any string from this language can be formed by inserting either bbaaabb or a permutation of aabbb somewhere in a string from the prior language (W), and (2) conversely, a string formed by doing that will be a string from this language.
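Both the length-5 substring argument and claims (1) and (2) can be sanity-checked by brute force on short strings. The sketch below uses only the counting definitions of the two languages, not the grammars; the helper names and the bound `MAX_LEN = 10` are my own choices, and a check up to a fixed length is evidence rather than a proof.

```python
from itertools import product


def all_strings(max_len):
    """Every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            yield "".join(chars)


def in_weak(s):
    a, b = s.count("a"), s.count("b")
    return a <= b <= 2 * a


def in_strict(s):
    a, b = s.count("a"), s.count("b")
    return a < b < 2 * a


def has_2a_window(s):
    """True if some length-5 substring has exactly 2 a's (and so 3 b's)."""
    return any(s[i:i + 5].count("a") == 2 for i in range(len(s) - 4))


# bbaaabb plus the ten arrangements of 2 a's and 3 b's
PIECES = {"bbaaabb"} | {"".join(p) for p in product("ab", repeat=5)
                        if p.count("a") == 2}


def insertions(max_len):
    """Strings of length <= max_len made by inserting a piece into a weak string."""
    result = set()
    for w in all_strings(max_len):
        if not in_weak(w):
            continue
        for piece in PIECES:
            if len(w) + len(piece) <= max_len:
                for i in range(len(w) + 1):
                    result.add(w[:i] + piece + w[i:])
    return result


MAX_LEN = 10
strict = {s for s in all_strings(MAX_LEN) if in_strict(s)}
exceptions = sorted(s for s in strict if not has_2a_window(s))
print(exceptions)                      # strict strings with no such window
print(insertions(MAX_LEN) == strict)   # claims (1) and (2) up to MAX_LEN
```

If the argument above is right, the first line printed should list only bbaaabb and the second should be `True`.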

Let's label those eleven possible insertions (bbaaabb and the ten permutations of aabbb) with the nonterminal I:

$I \rightarrow bbaaabb$
$I \rightarrow aabbb$
$I \rightarrow ababb$
$I \rightarrow abbab$
$I \rightarrow abbba$
$I \rightarrow baabb$
$I \rightarrow babab$
$I \rightarrow babba$
$I \rightarrow bbaab$
$I \rightarrow bbaba$
$I \rightarrow bbbaa$

And our production rules for this language's start symbol S are just copies of those for W (which we carry forward), except that they also ensure that an I gets inserted in there somewhere:

$S \rightarrow I$
$S \rightarrow WSW$
$S \rightarrow bSaWb$
$S \rightarrow bWaSb$
$S \rightarrow aSb$
$S \rightarrow aSbb$
$S \rightarrow aWbSb$
$S \rightarrow bSa$
$S \rightarrow bbSa$
$S \rightarrow bSbWa$

(Actually, some of those might be redundant — I'm not sure if $S \rightarrow aWbSb$, for example, enables any strings that would otherwise be impossible — but that's OK, redundant production rules don't make the grammar invalid!)
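For anyone who wants to check the grammars mechanically, here is a brute-force Python sketch that computes every string of length at most 8 that each nonterminal derives and compares the results with the counting definitions. The `generate` helper is my own, W is taken in the form with $W \rightarrow aWb$ and $W \rightarrow bWa$, and agreement up to a fixed length is a sanity check, not a proof.

```python
from itertools import product

# Upper-case letters are nonterminals; "" is the empty right-hand side.
RULES = {
    "W": ["", "WW", "bWaWb", "aWb", "aWbb", "bWa", "bbWa"],
    "I": ["bbaaabb", "aabbb", "ababb", "abbab", "abbba", "baabb",
          "babab", "babba", "bbaab", "bbaba", "bbbaa"],
    "S": ["I", "WSW", "bSaWb", "bWaSb", "aSb", "aSbb", "aWbSb",
          "bSa", "bbSa", "bSbWa"],
}


def generate(rules, max_len):
    """For each nonterminal, all terminal strings of length <= max_len it derives.

    Computed as a least fixed point: keep applying every rule to the
    strings found so far until nothing new appears.
    """
    lang = {nt: set() for nt in rules}
    changed = True
    while changed:
        changed = False
        for nt, bodies in rules.items():
            for body in bodies:
                partial = {""}
                for sym in body:
                    # a nonterminal contributes its known strings,
                    # a terminal contributes itself
                    choices = lang[sym] if sym in rules else {sym}
                    partial = {p + c for p in partial for c in choices
                               if len(p) + len(c) <= max_len}
                if not partial <= lang[nt]:
                    lang[nt] |= partial
                    changed = True
    return lang


MAX_LEN = 8
strings = ["".join(p) for n in range(MAX_LEN + 1)
           for p in product("ab", repeat=n)]
weak = {s for s in strings if s.count("a") <= s.count("b") <= 2 * s.count("a")}
strict = {s for s in strings if s.count("a") < s.count("b") < 2 * s.count("a")}

lang = generate(RULES, MAX_LEN)
print(lang["W"] == weak)     # does W match N_a <= N_b <= 2 N_a ?
print(lang["S"] == strict)   # does S match N_a <  N_b <  2 N_a ?
```

Cutting off at `max_len` is safe because every rule only concatenates, so a string of length at most 8 can only be built from pieces that are themselves at most that long. To see which strings a variant of the grammar misses, print `sorted(weak - lang["W"])` or `sorted(strict - lang["S"])`.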

(I hope it's OK to post this. I waited a week and a half to do so, figuring that that's long enough that it wouldn't be spoiling the OP's opportunity to figure it out on their own.) — ruakh, Feb 25 '24 at 00:04
