Is the language of pairs of words of equal length whose Hamming distance is 2 or greater context-free?

Question

Is the following language context free? $$L = \{ uxvy \mid u,v,x,y \in \{ 0,1 \}^+, |u| = |v|, u \neq v, |x| = |y|, x \neq y\} $$

As pointed out by sdcvvc, a word in this language can also be described as the concatenation of two words of the same length whose Hamming distance is 2 or greater.
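To make the two descriptions concrete, here is a minimal brute-force sketch that compares them on short binary strings. The function names (`in_L_by_definition`, `in_L_by_hamming`, `agree_up_to`) are made up for this illustration, and the check is only an exhaustive test on small lengths, not a proof.

```python
from itertools import product


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def in_L_by_definition(w: str) -> bool:
    """Search for a split w = u x v y with |u| = |v| >= 1, |x| = |y| >= 1,
    u != v and x != y (the definition from the question)."""
    n = len(w)
    if n % 2 != 0:
        return False
    half = n // 2
    for a in range(1, half):  # a = |u| = |v|, so |x| = |y| = half - a
        u, x = w[:a], w[a:half]
        v, y = w[half:half + a], w[half + a:]
        if u != v and x != y:
            return True
    return False


def in_L_by_hamming(w: str) -> bool:
    """sdcvvc's description: two halves at Hamming distance >= 2."""
    n = len(w)
    if n % 2 != 0:
        return False
    half = n // 2
    return hamming(w[:half], w[half:]) >= 2


def agree_up_to(max_len: int) -> bool:
    """Compare both membership tests on every binary string up to max_len."""
    return all(
        in_L_by_definition(w) == in_L_by_hamming(w)
        for n in range(max_len + 1)
        for w in map("".join, product("01", repeat=n))
    )
```

Calling `agree_up_to(10)` compares the two tests on every binary string of length at most 10. The reason they agree is that a split point between two differing positions of the halves puts one difference in $u$ versus $v$ and the other in $x$ versus $y$.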

I think it's not context-free, but I'm having a hard time proving it. I tried intersecting this language with a regular language (for example $0^*1^*0^*1^*$) and then using the pumping lemma and/or homomorphisms, but I always get a language that is too complicated to characterize and write down.

Yes, but I've failed to pump this string out of the language (it doesn't mean that it's not possible, just that I've failed to do so). — Robert777, Apr 27 '13 at 15:45
@PålGD, you'd probably need a way to "mark" the pieces, like $1^u 0 1^x 0 1^u 0 1^x 0$ — vonbrand, Apr 27 '13 at 21:44
In this question Yuval Filmus points to this paper: A strong pumping lemma for context-free languages. I haven't gone through it yet, but it seems to prove that some languages are not context-free even though they can still be pumped (which is the case here, I believe). You can read his answer for a quick overview of the paper. — wece, May 16 '13 at 23:04
It looks very much like what linguists call cross-serial dependency, which is known not to be context-free. — babou, Jun 10 '13 at 00:18
This language can be written as $\{uv : |u|=|v|, d(u,v) \geq 2\}$ where $d$ is the Hamming distance. Note that if we replace 2 by 1, it is context-free (http://cs.stackexchange.com/questions/307/) but the trick used there will not work. Personally I'm betting it is not context-free. — sdcvvc, Jun 10 '13 at 22:19
@dscvvc: your language includes $uvuw$ where $d(v,w)\ge 2$, whereas this is not in the original language. — András Salamon, Jun 12 '13 at 20:19
@András Salamon: I believe my characterization is correct: we do not require that every split of the word into $u x v y$ satisfies $u \neq v, x \neq y$, merely that there exists one. If you disagree, please give me a word from $\{u v : |u|=|v|, d(u,v)\geq 2\}$ and I'll try to find $u,x,v,y$. — sdcvvc, Jun 12 '13 at 20:55
@András: I believe @sdcvvc is correct. It gives a clearer view of the problem ... but so far without a result. — babou, Jun 12 '13 at 21:19
@sdcvvc: You are right, one partitions the $u$ into $u'x$ so that one of the differing bits is in $u'$ and the other in $x$. I stand corrected. — András Salamon, Jun 12 '13 at 21:20
Based on @sdcvvc's characterization, it would seem to be enough to prove that $\{\Sigma^i 0 \Sigma^j 0 \Sigma^i \Sigma^k 1 \Sigma^j 1 \Sigma^k \mid i,j,k \ge 0\}$ is (or is not) context-free. — András Salamon, Jun 18 '13 at 17:02
@AndrásSalamon I have the beginning of a proof with the interchange lemma. I do not know whether it can be pulled off. So far I have only tried one case, which fell through, but it is possible that I was just in a case where the interchange lemma mimics the structure of the language too closely. I am trying to change my parameters, but the combinatorics is beyond me, at least within reasonable time. Should I put it up as an incomplete answer? — babou, Jun 19 '13 at 10:05
@sdcvvc Sorry, I did not realize you are the same as on my question at http://cstheory.stackexchange.com/questions/18057 asked in connection with this problem. — babou, Jun 19 '13 at 10:11
@AndrásSalamon How do you get that equivalence ? Your language misses all the strings where the differing symbols alternate. But it is an interesting construction as you are using the same trick that allows showing the language is CF when the Hamming distance is $\geq 1$, plus a rotation of the string. Why should it be equivalent to do the proof on this language ? — babou, Jun 19 '13 at 11:40
@babou: It is not an equivalence, just a subproblem. If it is context-free then so are all the others of the same form, hence so is their union. On the other hand, if it is not context-free, then the technique may be adaptable to the general case. — András Salamon, Jun 19 '13 at 13:05

Vor · Answer 1 · 2019-07-30T06:58:38.833

Note [2019-07-30] The proof is wrong ... the question is more complicated than it sounds.

After a failed attempt, here is another idea.

If we intersect $L$ with the regular language $L_{reg} = 0^*10^*10^*10^*$ we get a CF language.

Perhaps we can have more luck if we use $L_{reg}' = 0^*10^*10^*10^*10^*$ (a string with exactly 4 1s).

Let $L_1 = L \cap L_{reg}'$. Informally, $w \in L_1$ if it can be split into two halves such that either one half contains 0, 1, 3 or 4 of the $1$s, or both halves contain two $1$s but their positions don't match.
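The informal description of $L_1$ can be checked against the Hamming-distance definition on small inputs. The following is a minimal sketch; the helper names (`in_L1`, `in_L1_informal`, `strings_with_four_ones`, `descriptions_agree`) are invented for the illustration, and it assumes membership in $L$ is taken via the characterization $d(u,v) \geq 2$ on the two halves.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def in_L1(w: str) -> bool:
    """Definition: exactly four 1s, even length, halves at distance >= 2."""
    if len(w) % 2 != 0 or w.count("1") != 4:
        return False
    half = len(w) // 2
    return hamming(w[:half], w[half:]) >= 2


def in_L1_informal(w: str) -> bool:
    """Informal description: one half holds 0, 1, 3 or 4 of the 1s,
    or both halves hold two 1s at non-matching positions."""
    if len(w) % 2 != 0 or w.count("1") != 4:
        return False
    half = len(w) // 2
    u, v = w[:half], w[half:]
    if u.count("1") != 2:
        return True
    return u != v  # two 1s each: the positions match exactly when u == v


def strings_with_four_ones(n: int):
    """All binary strings of length n containing exactly four 1s."""
    for positions in combinations(range(n), 4):
        chars = ["0"] * n
        for p in positions:
            chars[p] = "1"
        yield "".join(chars)


def descriptions_agree(max_len: int) -> bool:
    """Compare both tests on every string with four 1s up to max_len."""
    return all(
        in_L1(w) == in_L1_informal(w)
        for n in range(4, max_len + 1)
        for w in strings_with_four_ones(n)
    )
```

The agreement follows from $d(u,v) \geq |\#_1(u) - \#_1(v)|$: an uneven split of the four 1s already forces distance at least 2.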

Suppose that $L_1$ is CF and let $G$ be its grammar in Chomsky normal form, and let

$$w = uv = 0^a 1 0^b 1 0^c 1 0^d 1 0^e \in L_1$$

We have $|u|=|v|$ (even length) and $d(u,v) \geq 2$

If we restrict our attention to the ways in which the four 1s of $w$ can be generated, we have the three cases shown at the top of Figure 1. The central part of Figure 1 shows the first case (the others are similar).

Figure 1 (the full picture can be downloaded here)

If we pick $a=e$, $c=2a$ and $b,d \gg a$, we see that the zeros between the two pairs of 1s must be independently pumpable (red nodes in the figure): in particular, for large enough $b \gg a$, we get a duplicate nonterminal node on an internal subtree (node $X$ in Figure 2) or a repeated subsequence in the path towards the first or the second 1 (node $Y$ in Figure 2). Note that Figure 2 is slightly simplified: there can be more nonterminal nodes between the two $X$s, and also between the two $Y$s ($Y \to \dots \to Z_i \to \dots \to Y$, but with $Z_i$ producing only 0s to the right of the first 1).

Figure 2

So we can fix an arbitrary $a = e = k, c = 2a$, then pick large enough $b$ to get an independently pumpable node on the sequence of zeros between first and second $1$. For the sequence of zeros between the third and fourth 1, we can choose $d = b! +b$.
But $0^b$ is independently pumpable, so there is a pumpable substring $y$ of length $p$ with $1 \leq p \leq b$, i.e. $0^b = xyz$ with $|y|=p$, $|x|\geq 0$, $|z|\geq 0$. Since $p$ divides $b!$, we can choose $i$ such that $|xy^iz| = b!+b$. The string we get is:

$$w' = 0^k 1 0^{b!+b} 1 0^{2k} 1 0^{b!+b} 1 0^k$$

but $w' \notin L_1$. Thus $L_1$ is not CF and finally $L$ is not CF.
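The arithmetic behind the $b!+b$ step can be checked numerically. The sketch below only verifies two things under stated assumptions: that for every pump length $p$ with $1 \leq p \leq b$ some iteration count reaches length $b + b!$, and that the resulting $w'$ has two identical halves. It does not validate the parse-tree argument, which the note at the top of this answer flags as wrong. The function names are hypothetical.

```python
from math import factorial


def pump_count(b: int, p: int) -> int:
    """Exponent i with |x y^i z| = b + b!, given |xyz| = b and |y| = p."""
    assert 1 <= p <= b
    return factorial(b) // p + 1  # p divides b! because p <= b


def pumped_length(b: int, p: int, i: int) -> int:
    """Length of x y^i z when |xyz| = b and |y| = p."""
    return b + p * (i - 1)


def w_prime(k: int, b: int) -> str:
    """The pumped string 0^k 1 0^(b!+b) 1 0^(2k) 1 0^(b!+b) 1 0^k."""
    n = factorial(b) + b
    return "0" * k + "1" + "0" * n + "1" + "0" * (2 * k) + "1" + "0" * n + "1" + "0" * k


def halves_equal(w: str) -> bool:
    """True when w has even length and its two halves coincide."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w[:half] == w[half:]
```

Since both halves of $w'$ equal $0^k 1 0^{b!+b} 1 0^k$, their Hamming distance is 0, which is why $w' \notin L_1$.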

If the proof is correct (???) it can be extended to every language $L_k = \{ uv : |u|=|v|, d(u,v)\geq k\}, k\geq 2$

I'm afraid the bounty will expire before we can actually verify this proof, so unless any drastic information arises in the next 4 hours, this gets the points for being the best attempt so far. — Joey Eremondi, Jun 20 '13 at 16:09
@jmite: don't worry there are high chances that it is a wrong attempt like the previous one (which lasted for about 30 mins before discovering a trivial error) :-) :-) — Vor, Jun 20 '13 at 16:17
Why the case distinction? The branches in the grammar have no relation with the halves of the word. But I think it does not matter; if the proof works, this case distinction is not needed. Looking at an assumed grammar and using the proof of the Pumping lemma instead of the lemma itself is a nice trick (one should do this more often). I have one (real) concern: if you pump a substring of $0^b$, you get $0^{b+p(i-1)}$; I don't see how you get to $b+b!$. Don't think that should harm the proof, but better check. Also, you might want to straighten out some notation (and typos). — Raphael, Jun 21 '13 at 06:13
@Raphael: thanks for the comments. Perhaps I'm wrong, but if you pick $b+b!$ as the target length, then for every pumping length $p$ the string $0^b$ can be decomposed as $xyz$ ($|xyz|=b$, $|y|=p \leq b$) and pumped to $|xy^iz| = b + b!$: indeed, in your example $p$ surely divides $b!$, so there is an $(i-1)$ for which $p(i-1)=b!$, and since the original string length is $b$, the total pumped length is $|xy^iz| = b+b!$. I remember it from a couple of exercises that use Ogden's lemma ... now I'll double check them. — Vor, Jun 21 '13 at 06:40
@Raphael: ... I didn't find the proof anywhere but only a paper by Zach Tomaszewski that proves that the complement of $L_{dup} = \{ ww \}$ is CF (see question ), so perhaps it is a new result (though simple); and a pumping-lemma-style theorem can be derived for languages with strings that contain a finite number of a particular symbol and substrings of arbitrary length between them. — Vor, Jun 21 '13 at 06:53
Ah, you are right about $b+b!$. It is indeed a "standard" trick I had forgotten about. (We have a question here about $\overline{L_{dup}}$; very nice one indeed.) — Raphael, Jun 21 '13 at 07:06

babou · Answer 2 · 2013-06-18T00:27:45.087

After 2 failed attempts, which were disproved by @Hendrik Jan (thank you), here is another one, which is no more successful. @Vor found an example of a deterministic CF language to which the same construction would apply, if it were correct. This allowed identifying an error in the anchoring of the $y$ string in the application of the lemma. The lemma itself does not seem to be at fault. This is clearly too simplistic a construction. See more details in the comments.

The language $L = \{ uxvy \mid u,v,x,y \in \{ 0,1 \}^* \setminus \{ \epsilon \},\ |u| = |v|,\ u \neq v,\ |x| = |y|,\ x \neq y \}$ is not context-free.

It is helpful to keep in mind the characterization $L= \{uv:|u|=|v|,d(u,v) \geq 2\}$ where d is the Hamming distance, proposed by @sdcvvc. What one needs to think about are 2 selected positions in each half string such that the corresponding symbols differ.

Then you consider a string $10^i10^j$ such that $i \lt j$ and $i+j$ is even. It is clearly in the language L, by cutting $u$ and $x$ anywhere between the two 1's. We want to pump that string on the first part between the 1's, so that it will become $10^j10^j$ which is not supposed to be in the language.
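The two membership claims in this step are easy to confirm on small parameters. A minimal sketch, assuming membership in $L$ is tested via the Hamming characterization; `witness` and `in_L` are names chosen for the illustration:

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def in_L(w: str) -> bool:
    """Even length and the two halves at Hamming distance >= 2."""
    if len(w) % 2 != 0:
        return False
    half = len(w) // 2
    return hamming(w[:half], w[half:]) >= 2


def witness(i: int, j: int) -> str:
    """The string 1 0^i 1 0^j used in the argument."""
    return "1" + "0" * i + "1" + "0" * j


def check_claims(max_i: int) -> bool:
    """1 0^i 1 0^j is in L when i < j and i + j is even; 1 0^j 1 0^j is not."""
    for i in range(max_i + 1):
        for j in range(i + 2, max_i + 10, 2):  # i < j with i + j even
            if not in_L(witness(i, j)) or in_L(witness(j, j)):
                return False
    return True
```

When $i < j$ and $i+j$ is even we have $j \geq i+2$, so both 1s fall in the first half and the second half is all 0s, giving distance exactly 2.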

We first try to use Ogden's lemma, which is like the pumping lemma, but applies to $p$ or more distinguished symbols that are marked on the string, $p$ being the pumping length for marked symbols (but the lemma can pump more because it can pump also unmarked symbols). The pumping marked-length $p$ depends only on the language. This attempt will fail, but the failure will be a hint.

We can then choose $i=p$ and mark the symbols of the first sequence of $i$ 0's. We know that neither of the two 1's will be in the pump, because the lemma also allows pumping out once (exponent 0) instead of pumping in, and pumping out a 1 would take us out of the language.

However, we could be pumping on both sides of the second 1 as fast or even faster on the right side, so that the second 1 would never get across the middle of the string. Also Ogden's lemma does not fix an upper limit to the size of what is being pumped, so that it is not possible to organize the pumping to get the rightmost 1 exactly across the middle of the string.

We use a modified version of the lemma, here called Nash's Lemma, which can handle these difficulties.

We first need a definition (it probably has another name in the literature, but I do not know which - help is welcome). A string $u$ is said to be an erasure of a string $v$ iff it is obtained from $v$ by erasing symbols in $v$. We will note $u \prec v$.
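For what it is worth, the relation defined here is commonly called the subsequence (or scattered subword) relation. A minimal sketch of a test for $u \prec v$, with `is_erasure` as a name chosen for this illustration:

```python
def is_erasure(u: str, v: str) -> bool:
    """True when u can be obtained from v by erasing symbols of v,
    i.e. u is a (not necessarily contiguous) subsequence of v."""
    remaining = iter(v)
    # `c in remaining` advances the iterator until it finds c, so the
    # symbols of u must be matched in v from left to right.
    return all(c in remaining for c in u)
```

For example, `is_erasure("ace", "abcde")` holds, while `is_erasure("aec", "abcde")` does not because the order of symbols must be preserved.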

Nash's Lemma: If $L$ is a context-free language, then there exist two numbers $p\gt0$ and $q\gt 0$ such that for any string $w$ of length at least $p$ in $L$, and every way of “marking” $p$ or more of the positions in $w$, $w$ can be written as $w=uxyzv$ with strings $u$, $x$, $y$, $z$, $v$, such that

$xz$ has at least one marked position,
$xyz$ has at most $p$ marked positions, and
there are 3 strings $\hat x$, $\hat y$, $\hat z$ such that
1. $\hat x \prec x$, $\hat y \prec y$, $\hat z \prec z$,
2. $1 \leq \mid \hat x \hat z \mid \leq q$, $1 \leq \mid \hat y \mid \leq q$, and
3. $ux^j\hat x^i\hat y\hat z^iz^jv$ is in $L$ for every $i \geq 0$ and for every $j \geq 0$.

Proof: Similar to the proof of Ogden's lemma, but the subtrees corresponding to the strings $y$ and $xz$ are pruned so that they do not contain any path on which the same non-terminal occurs twice (except for the roots of these two subtrees). This necessarily limits the size of the generated strings $\hat x\hat z$ and $\hat y$ by a constant $q$. The strings $x^j$ and $z^j$, for $j \geq 0$, corresponding to an unpruned version of the tree, are used mainly with $j=1$ to simplify the accounting when the lemma is applied.

We modify the above proof attempt by marking the $p$ leftmost symbols 0, but they are followed by $2q$ symbols 0 to make sure that we pump in the left part of the string, between the two 1's. That makes a total of $i = p + 2q$ 0's between the 1's (actually $i = p + q$ would be sufficient, since the rightmost 1 cannot be in $\hat z$, as it could then simply be removed).

What is left is to choose $j$ so that we can pump exactly the right number of 0's to make the two sequences equal. So far, the only constraint on $j$ is that it be greater than $i$. We also know that the number of 0's added at each pumping step is between 1 and $q$. So let $h$ be the product of the first $q$ integers. We choose $j=i+h$.

Hence, since the pumping increment $d$ - whatever it is - is in $[1,q]$, it divides $h$. Let $k$ be the quotient. If we pump exactly $k$ times, we get a string $10^j10^j$ which is not in the language. Hence L is not context-free.


I think that I shall never see
A string lovely as a tree.
For if it does not have a parse,
The string is naught but a farce

Note however that the pass over the second half reads the stack in reverse. That seems to mean that the two positions are in the same position in both halves, but in reverse? — Hendrik Jan, Jun 11 '13 at 13:36
you are correct ... I goofed ... now I know what was nagging me at the back of my head. — babou, Jun 11 '13 at 13:45
I recognized the argument (because I could not make it work when I tried myself). — Hendrik Jan, Jun 11 '13 at 14:00
Should I leave this wrong answer? It is somehow helping, I think, as it makes the problem suspiciously similar to $\{a^ib^jc^ka^ib^jc^k\}$. The problem is that the rules of the site are not intended to encourage wrong results for discussion (I mean, I do not enjoy downvotes more than anyone else). — babou, Jun 11 '13 at 14:04
@HendrikJan Did I goof again ? (BTW, thanks for making it a discussion) — babou, Jun 11 '13 at 17:58
While $L$ might be the image of $M$, $M$ is not the inverse image of $L$, and I doubt this can easily be fixed. Also I would remove the clearly wrong part, the mistake is elementary and it's doubtful it will help. — sdcvvc, Jun 11 '13 at 18:06
I guess i'm getting old ... and I am doing administrative work at the same time. — babou, Jun 11 '13 at 19:48
@babou: I'm trying to understand the proof; $L' = \{ w \mid w = uv = 10^i10^j, |u|=|v|, d(u,v)\geq 2 \}$ is context-free (guess the middle, check that the two halves have the same length, accept if $w$ begins with a 1, contains exactly one other 1, and the second half begins with a zero). Why can't your proof be applied to $L'$? — Vor, Jun 17 '13 at 10:05
@Vor Actually, your language $L'$ is even deterministic. And I do not see offhand why the proof should not apply. Indeed, I use only strings in $L'$. So your example is very appropriate. But offhand I do not see where I erred. My first bet is that I lost something in the anchoring of the substrings on marked symbols because of the erasure. But I am really not sure. The other point that bothered me is that such an extension to Ogden's lemma should be known if correct. But I found no trace. I have no one to exchange ideas with right now, and my books are in boxes. It does not help. — babou, Jun 17 '13 at 12:14
@Vor Thinking about it, I am pretty sure it is the anchoring that failed. There is a simple grammar for your language that will allow the $y$ only on the second $1$. Hence the anchoring must be wrong. The problem is that I get only one marked symbol in $xz$. It can be in $x$ but I have no bound on the length of $x$. I think a version of the lemma also has a marked symbol in $y$, but again without a bound on the size of $y$. Hence, I am pumping on the wrong side. If this explains it, it would leave the lemma extension OK. But it must exist somewhere. — babou, Jun 17 '13 at 12:33
@babou: if you want take a look at my answer (mutual proof shoot-down :-) — Vor, Jun 20 '13 at 13:29

score -1 · Answer 3 · edited Apr 13 '17 at 12:48


Based on this question, I think $L$ is context-free and generated by the following grammar: $$\begin{aligned} S &\to AXBY \mid BYAX \\ A &\to 0 \mid 0A0 \mid 0A1 \mid 1A0 \mid 1A1 \\ B &\to 1 \mid 0B0 \mid 0B1 \mid 1B0 \mid 1B1 \\ X &\to 0 \mid 0X0 \mid 0X1 \mid 1X0 \mid 1X1 \\ Y &\to 1 \mid 0Y0 \mid 0Y1 \mid 1Y0 \mid 1Y1 \\ \end{aligned}$$

answered Jun 15 '13 at 20:11 by M.K. Dadsetani

This is incorrect; nothing in the grammar guarantees that the length of AX equals the length of BY. For example, your grammar generates S -> AXBY -> A011 -> 0A1011 -> 001011, which is not in the original language. Also, your symbols A and X generate the same language, as do B and Y; they can be merged. – sdcvvc Jun 15 '13 at 20:41
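sdcvvc's counterexample can be reproduced mechanically. In the grammar above, $A$ and $X$ generate exactly the odd-length strings whose middle symbol is 0, and $B$ and $Y$ those whose middle symbol is 1, so a word is generated iff it splits into four such pieces in the order allowed by $S$. The sketch below relies on that observation; the function names are chosen for the illustration.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def in_L(w: str) -> bool:
    """Even length and the two halves at Hamming distance >= 2."""
    if len(w) % 2 != 0:
        return False
    half = len(w) // 2
    return hamming(w[:half], w[half:]) >= 2


def middle(s: str):
    """Middle symbol of an odd-length string, None for even length."""
    return s[len(s) // 2] if len(s) % 2 == 1 else None


def grammar_generates(w: str) -> bool:
    """Try every split of w into four nonempty pieces and check the
    middle symbols against S -> AXBY (0,0,1,1) or S -> BYAX (1,1,0,0)."""
    n = len(w)
    allowed = {("0", "0", "1", "1"), ("1", "1", "0", "0")}
    for a in range(1, n):
        for b in range(a + 1, n):
            for c in range(b + 1, n):
                parts = (w[:a], w[a:b], w[b:c], w[c:])
                if tuple(middle(p) for p in parts) in allowed:
                    return True
    return False
```

With these definitions, `grammar_generates("001011")` holds while `in_L("001011")` does not, since the halves `001` and `011` differ in only one position.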
