Is it decidable if a language described by number of occurrences is regular?

Question

It is known that the language of words containing an equal number of occurrences of 0 and 1 is not regular, while the language of words containing an equal number of occurrences of 001 and 100 is regular (see here).

Given two words $w_1,w_2$, is it decidable whether the language of words containing an equal number of occurrences of $w_1$ and $w_2$ is regular?
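
To make the membership condition concrete, here is a minimal Python sketch. It assumes that occurrences are counted with overlaps (so 00 occurs three times in 0000), which the question does not state explicitly. The names `count_occurrences` and `in_language` are illustrative only.

```python
def count_occurrences(word: str, text: str) -> int:
    """Count occurrences of `word` in `text`, overlapping ones included."""
    # Try every start position at which `word` could still fit.
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))


def in_language(text: str, w1: str, w2: str) -> bool:
    """True if `text` contains as many occurrences of w1 as of w2."""
    return count_occurrences(w1, text) == count_occurrences(w2, text)


print(in_language("1001", "001", "100"))  # one occurrence of each
print(in_language("0011", "001", "100"))  # 001 occurs once, 100 never
```

This only decides membership of a single word; it says nothing by itself about whether the whole language is regular.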

Can you give other examples of regular languages so defined, other than $1^i0$ and $01^i$, or $0^i1$ and $10^i$? What about an example on a 3-symbol alphabet? — babou, Jun 19 '13 at 13:54
If $w_1$ is a strict subword of $w_2$, there is a good chance the language is empty, therefore regular. I don't know other examples. — sdcvvc, Jun 19 '13 at 14:39
I vaguely suspect that the above examples are the only ones, which would make the problem decidable. If you specify only two substrings, I would guess it is CF ... depending on what you can specify regarding occurrences. You do not make precise enough what you mean by "described by number of occurrences". — babou, Jun 19 '13 at 15:32
Sorry, you are right. Then that makes it a CF language. But I am not sure it buys anything. But then we may have another example of a non-CF language if you require an equal number for 3 words. — babou, Jun 19 '13 at 15:45
The solutions so far for special cases seem to hinge on the idea that occurrences of substrings of $w_1$ guarantee only single occurrences of intervening $w_2$. So, assuming the current answers are correct [it is not clear to me yet], it seems there is some relation between $w_1$ and $w_2$ guaranteeing that, in the middle of scanning the string, one can be in either an "equal" or an "unequal" state, but off by at most a bounded finite amount in the "unequal" case. — vzn, Jun 19 '13 at 22:16
@vzn Yes, that is the general idea. I added a more formal definition for that relation. Now, completely formal proofs would require a lot of machinery. It would probably be at least as complex as writing the program that actually produces the finite automaton. I suspect it requires some strange lemmas such as: "there is no string that can only occur at least twice". Possibly writing a complete proof and the program that builds the automaton (FA or PDA) could be an interesting project in developing formal machinery for some students, but I do not think I would want to do it. — babou, Jun 20 '13 at 10:21
@babou I do not ask for completely formal proofs that define transition functions, automata states etc. I ask for a rigorous proof, one which can be understood by someone with mathematical maturity. Perhaps you need weird lemmas, I don't know. But I will not accept handwaving. — sdcvvc, Jun 20 '13 at 14:13
@sd I did enough to convince myself. I also know I make mistakes and I never tried to hide it. So I am quite willing to be shown wrong with a counter example. However, though interesting, this stuff is technically complex without, i.m.o., being conceptually very hard. It is already over two pages and you are asking for a lot more work. Now, unless someone convinces me that this kind of result is worth publishing in a reasonable peer-reviewed professional forum, I do not see why I should go through that punishment. As I said, given the now existing outline, I consider it more a student project. — babou, Jun 20 '13 at 16:17
@sdcwc I must apologize. The last posted version of this document, which I tried to do at your request, is not as good as the previous one. I did rewrite it all with better presentation. However, due to a disagreement to which you are not party, I will not post it. If you wish to receive a copy, just let me know and I will find a way. But I really think that, whatever the quality of the presentation, the essential information is here. — babou, Jun 22 '13 at 00:33

score 5 · Accepted Answer · edited May 30 '22 at 10:56

I and several colleagues answered this question here with a necessary and sufficient criterion for when the language $L_{x=y}$ (all words having an equal number of occurrences of $x,y$) is regular. We also show the same for fewer $x$ than $y$, and more $x$ than $y$.

C.J. Colbourn, R.E. Dougherty, T.F. Lidbetter and J. Shallit. Counting Subwords and Regular Languages. Developments in Language Theory. DLT 2018. LNCS 11088. Springer. doi:10.1007/978-3-319-98654-8_19

score 4 · Answer 2 · edited Apr 13 '17 at 12:48

Given two words $w_1$, $w_2$, is it decidable if the language $L$ of words containing an equal number of occurrences of $w_1$ and $w_2$ is regular?

First some definitions:
They could be made more concise, and the notations could be improved if they are to be used in proofs. This is only a first draft.

Given two words $w_1$ and $w_2$, we say that:

$w_1$ always occurs with $w_2$, noted $w_1\triangleleft w_2$, iff
1. for any string $s$ such that $s=xw_2y$ with $\mid x\mid,\, \mid y\mid\ \geq \mid w_1\mid +\mid w_2\mid$ and $|x|_0,|x|_1,|y|_0,|y|_1 \geq 1$ there is another decomposition $s=x'w_1y'$.
  Note: The condition that $x$ and $y$ each contain at least a 0 and a 1 is required by a pathological case (found by @sdcvvc): $w_1=1^i0$, $w_2=v1^{i+j}$ and $y\in1^*$, and its symmetrical variants.
2. there is a string $s=xw_2y$ with $\mid x\mid,\, \mid y\mid\ \geq \mid w_1\mid +\mid w_2\mid$ such that there is at most one decomposition $s=x'w_1y'$
$w_1$ always co-occurs with $w_2$, noted $w_1\triangleleft \triangleright\,w_2$, iff each always occurs with the other,
$w_1$ and $w_2$ occur independently, noted $w_1\triangleright \triangleleft\,w_2$, iff neither one always occurs with the other,
$w_1$ always occurs at least $m$ times with $w_2$, noted $w_1\triangleleft_m w_2$, iff for any string $s$ such that $s=xw_2y$ with $\mid x\mid,\ \mid y\mid\ \geq \mid w_1\mid +\mid w_2\mid$ there are $m$ other decompositions $s=x_iw_1y_i$ for $i\in[1,m]$ such that $i\neq j$ implies $x_i\neq x_j$.

These definitions are constructed so that we can ignore what happens at the ends of the string where $w_1$ and $w_2$ are supposed to occur. Boundary effects at the end of the string have to be analyzed separately, but they represent a finite number of cases (actually I think I forgot one or two such boundary sub-cases in my first analysis below, but it does not really matter). The definitions are compatible with overlap of occurrences.

There are 4 main cases to consider (ignoring the symmetry between $w_1$ and $w_2$):

$w_1\triangleleft \triangleright\,w_2$
Both words necessarily come together, except possibly at the ends of the string. This concerns only pairs of the form $1^i0$ and $01^i$, or $0^i1$ and $10^i$. This is easily recognized by a finite automaton that only checks for lone occurrences at both ends of the string to be recognized, to make sure there is a lone occurrence at both ends or at neither end. There is also the degenerate case when $w_1=w_2$: then the language $L$ is obviously regular.
$w_1\triangleleft w_2$, but not $w_2\triangleleft w_1$
One of the 2 words cannot occur without the other, but the converse is not true (except possibly at the ends of the string). This happens when:
- $w_1$ is a substring of $w_2$: then a finite automaton can just check that $w_1$ does not occur outside an instance of $w_2$.
- $w_1=1^i0$ and $w_2=v1^j$ for some word $v\in\{0,1\}^*$, $v\neq01^i$: then a finite automaton checks, as in the previous case, that $w_1$ does not occur separated from $w_2$. However, the automaton allows counting one extra instance of $w_1$ that will allow acceptance if $w_2$ is a suffix of the string. There are three other symmetrical cases (1-0 symmetry and left-right symmetry).
$w_1\triangleleft_2 w_2$
One of the 2 words occurs twice in the other. That can be recognized by a finite automaton that checks that the smaller word never occurs in the string. There is also a slightly more complex variant that combines the two variations of case 2. In this case the automaton checks that the smaller string $1^i0$ never occurs, except possibly as part of $v$ in the larger one $v1^j$ coming as a suffix of the string (and 3 other cases by symmetry).
$w_1\triangleright \triangleleft\,w_2$
The 2 words can occur independently of each other. We build a generalized sequential machine (gsm) $G$ that outputs $a$ when it recognizes an occurrence of $w_1$ and $b$ when it recognizes an occurrence of $w_2$, and forgets everything else. The language $L$ is regular only if the language $G(L)$ is regular. But $G(L)=\{w\in\{a,b\}^*\mid\ \mid w\mid_a=\mid w\mid_b\}$, which is clearly context-free and not regular. Hence $L$ is not regular.
Actually we have $L=G^{-1}(G(L))$. Since regular languages and context-free languages are closed under gsm mapping and inverse gsm mapping, we know also that $L$ is context free.
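
The gsm $G$ of the last case can be sketched in Python as a sliding-window transducer. This is only an illustration under my own assumptions: both words are non-empty, the finite control is the last $\max(|w_1|,|w_2|)$ symbols read, and when both words end at the same position the machine emits $a$ before $b$ (an arbitrary choice). The name `gsm_output` is hypothetical.

```python
def gsm_output(text: str, w1: str, w2: str) -> str:
    """Emit 'a' at each position where an occurrence of w1 ends, 'b' for w2."""
    keep = max(len(w1), len(w2))  # how many trailing symbols the state remembers
    window = ""                   # finite control: the last `keep` symbols read
    out = []
    for symbol in text:
        window = (window + symbol)[-keep:]
        if window.endswith(w1):
            out.append("a")
        if window.endswith(w2):
            out.append("b")
    return "".join(out)


print(gsm_output("0110", "0", "1"))      # abba
print(gsm_output("1001", "001", "100"))  # ba
```

A word is in $L$ exactly when its image has as many $a$'s as $b$'s, which is the sense in which $L=G^{-1}(G(L))$ above.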

One way to organize a formal proof could be the following. First build a PDA that recognizes the language. Actually it can be done with a 1-counter machine, but it is easier to have two stack symbols to avoid duplicating the finite control. Then, for the cases where it should be a FA, show that the counter can be bounded by a constant that depends only on the two words. For the other cases show that the counter can reach any arbitrary value. Of course, the PDA should be organized so that the proofs are easy enough to carry out.

Representing the FA as a 2-stack-symbols PDA is probably the simplest representation for it. In the non-regular case, the finite control part of the PDA is the same as that of the GSM in the proof sketch above. Instead of outputting $a$'s and $b$'s like the GSM, the PDA counts the difference in number with the stack.
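
The counter idea can be explored experimentally. The Python sketch below simulates the counter (occurrences of $w_1$ minus occurrences of $w_2$ seen so far) and brute-forces the largest absolute value it reaches over all words of a given length. The function names and the choice of a binary alphabet are my assumptions, and an exhaustive check over short words is only evidence, not a proof of boundedness.

```python
from itertools import product


def counter_trace(text, w1, w2):
    """Yield the counter (#w1 - #w2 seen so far) after each input symbol."""
    keep = max(len(w1), len(w2))
    window, counter = "", 0
    for symbol in text:
        window = (window + symbol)[-keep:]
        # Booleans act as 0/1: +1 if w1 ends here, -1 if w2 ends here.
        counter += window.endswith(w1) - window.endswith(w2)
        yield counter


def max_counter(w1, w2, length, alphabet="01"):
    """Largest |counter| reached on any word of exactly `length` symbols."""
    return max(
        (abs(c)
         for letters in product(alphabet, repeat=length)
         for c in counter_trace("".join(letters), w1, w2)),
        default=0,
    )


print(max_counter("001", "100", 10))  # stays at 1: the counter looks bounded
print(max_counter("0", "1", 6))       # 6: grows with the length
```

For 001 and 100 the value stays at 1 because occurrences of the two words alternate, whereas for 0 and 1 it grows with the word length, matching the regular and non-regular examples from the question.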

I asked you about precise definitions of the terms mentioned in the comment. Thank you for writing them. Was I supposed to guess them previously? Anyway, you seem to claim that $0^i 1 \triangleleft \triangleright 1 0^i$. This does not satisfy condition 1. of the definition of "$w_1$ always occurs with $w_2$", since there is no occurrence of $1 0^i$ in $s=0^M 0^i 1 1^M$. — sdcvvc, Jun 20 '13 at 14:10
Sorry, I did not mean to make you guess. It only took me time to understand what exactly you wanted. My failing only. Regarding your counterexample, you are correct. But for me it only means that I have to be a little bit more careful about telomeres in the definition of the relations. I defined them too quickly, but $0^M$ or $1^M$ do not convey much information in this context. This is really a pathological boundary example within a pathological case, one that actually cannot occur when more than 2 symbols are used. I just do not believe it changes anything. — babou, Jun 20 '13 at 15:08
I am not asserting I have no bugs, only that they are not significant. And I do not believe that going into many formal details will prevent bugs. I was told that one of the fundamental results of lambda calculus, analyzed by a lot of very good people, took some ten years (more maybe) and several publications, to finally have a correct proof (by Barendregt, I think). You do not do much more formal than lambda calculus. I know only two ways to be almost sure. One is to have many readers (which I obviously cannot do alone). The other is to run the stuff thru Coq or a similar system. — babou, Jun 20 '13 at 15:09
