2

This is my first post here. We are currently studying regular expressions, and I have been asked to write a regular expression for the language of all words over the alphabet $\Sigma=\{a,b\}$ that do not contain the substring $aba$.

We were first asked to write a regular expression for all words that do contain the substring $aba$, and I came up with:

$$(a+b)^*aba(a+b)^*$$
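A quick way to sanity-check an expression like this is to compare it against a direct substring test on every short word. The sketch below assumes Python's `re` syntax, where the union $(a+b)$ is written as the character class `[ab]`; the helper names `words` and `mismatches` and the length bound of 10 are illustrative choices, not part of the exercise.

```python
import re
from itertools import product

# Python's `re` has no `+` for union, so (a+b)*aba(a+b)* is written
# with a character class: [ab]*aba[ab]*.
CONTAINS_ABA = re.compile(r"[ab]*aba[ab]*")

def words(max_len):
    """Yield every word over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

def mismatches(pattern, predicate, max_len=10):
    """Return the words on which the regex and the predicate disagree."""
    return [w for w in words(max_len)
            if (pattern.fullmatch(w) is not None) != predicate(w)]

print(mismatches(CONTAINS_ABA, lambda w: "aba" in w))  # prints []
```

An empty list only shows agreement on words up to the chosen length; it is a test, not a proof.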

However, I don't know how to write the second one, because I can't see how to express in a regular expression that something must not occur.

Yuval Filmus
john doe

3 Answers

2

A word doesn't contain $aba$ if and only if after every $ab$, the word either ends or continues with $b$. Imagine that you start reading your word from left to right. Denoting by $\newcommand{\eos}{\#}\eos$ the "end of string" symbol, one of the following must be a prefix of your string: $$ \eos \\ a\eos,aa\eos,aaa\eos,\ldots \\ ab\eos,aab\eos,aaab\eos,\ldots \\ abb,aabb,aaabb,\ldots \\ b $$

Furthermore, each of these prefixes $p$ not ending with $\eos$ satisfies the following: a word $w$ doesn't contain $aba$ iff $pw$ doesn't contain $aba$. So after reading such a prefix we are back in the starting situation, and can repeat. This leads to the following unambiguous regular expression: $$ (a^+bb + b)^*(\epsilon + a^+ + a^+b) $$

You can simplify it further if you're fine with ambiguous regular expressions; I leave such simplifications for you to ponder, if you are so inclined.
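The claim that $(a^+bb + b)^*(\epsilon + a^+ + a^+b)$ is both correct and unambiguous can be tested on short words by counting parses. The sketch below is an illustration under stated assumptions: it counts the ways to split a word into blocks from $a^+bb + b$ followed by one tail from $\epsilon + a^+ + a^+b$, and treats that count as the number of parses (reasonable here because the alternatives inside each part are disjoint). The name `count_parses` and the length bound are hypothetical choices.

```python
import re
from itertools import product

BLOCK = re.compile(r"a+bb|b")      # one iteration of (a^+bb + b)
TAIL = re.compile(r"(?:a+b?)?")    # epsilon + a^+ + a^+b, written compactly

def count_parses(w):
    """Count splittings of w into BLOCK* followed by one TAIL."""
    n = len(w)
    ways = [0] * (n + 1)           # ways[i]: splittings of w[:i] into blocks
    ways[0] = 1
    for i in range(1, n + 1):
        for j in range(i):
            if ways[j] and BLOCK.fullmatch(w, j, i):
                ways[i] += ways[j]
    # The remaining suffix w[j:] must be a valid tail.
    return sum(ways[j] for j in range(n + 1) if TAIL.fullmatch(w, j))

# Words where the parse count differs from the expected value:
# exactly 1 if the word avoids aba, 0 otherwise.
bad = [
    "".join(letters)
    for n in range(11)
    for letters in product("ab", repeat=n)
    if count_parses("".join(letters)) != (0 if "aba" in "".join(letters) else 1)
]
print(bad)
```

An empty list means that, up to length 10, every word avoiding $aba$ has exactly one parse and every other word has none.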

Yuval Filmus
1

Think about how a word can be built up without ever producing $aba$:

  • Whenever we get $ab$ we must either end the string or immediately add another $b$: $a^+bb$

  • If we start with $b$'s, we can follow them with any number of $a$'s, as long as we again close with $bb$: $b^+a^*bb$

  • Joining both together: $(a^+bb + b^+a^*bb)^* \, b^*a^*(\epsilon + b)$

  • The tail $b^*a^*(\epsilon + b)$ covers the words that do not end with a closing $bb$, for example all $a$'s, $ab$, $ba$ or $bab$.

1

In such a word, every run of $b$'s that lies between two $a$'s has length at least 2; only a run at the very beginning or the very end of the word may be a single $b$. We thus replace the language of all words made of $a$'s and $b$'s, represented by $(a^\ast + b^\ast)^\ast$, with the language of words made of $a$'s and runs of at least two $b$'s, which is $$(a^\ast+ bb b^\ast)^\ast$$ However, we can optionally have a single extra $b$ at the beginning or the end, so we add that as $$ (\varepsilon+b) \cdot (a^\ast+ bb b^\ast)^\ast \cdot (\varepsilon+b)$$
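The run-length argument above can be checked mechanically. The following sketch tests two things on all words up to a chosen length: that "every interior run of $b$'s has length at least 2" is equivalent to avoiding $aba$, and that the final expression matches exactly those words. The function names are hypothetical, and the Python pattern uses `a` instead of `a*` inside the starred group, which denotes the same language because $(a^\ast + X)^\ast = (a + X)^\ast$.

```python
import re
from itertools import groupby, product

# (eps + b) (a* + bbb*)* (eps + b), with a* simplified to a inside the star.
CANDIDATE = re.compile(r"b?(?:a|bbb*)*b?")

def interior_b_runs_ok(w):
    """True if every run of b's that is neither first nor last has length >= 2."""
    runs = [(ch, len(list(group))) for ch, group in groupby(w)]
    return all(length >= 2 for ch, length in runs[1:-1] if ch == "b")

def failures(max_len=10):
    """Words where the run condition or the regex disagrees with 'no aba'."""
    out = []
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            avoids = "aba" not in w
            matched = CANDIDATE.fullmatch(w) is not None
            if interior_b_runs_ok(w) != avoids or matched != avoids:
                out.append(w)
    return out

print(failures())
```

As with any bounded enumeration, an empty result supports the argument for short words but does not replace it.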