
I have the following regular expression $R=ab^*(\epsilon \cup c) \cup c^*a$ and I want to construct the NFA that accepts the language defined by that regular expression. I started by constructing the NFA that accepts the first part, $R_1 = ab^*(\epsilon \cup c)$:

NFA R1

Next I tried to continue with the concatenation part by extending the NFA as follows: NFA R2

Is that correct? Is there a quick approach to construct an NFA from a long regular expression?

abeer00
  • Note that $\varepsilon$ is not an ordinary letter, it represents the empty word. Commonly $\varepsilon$ is not written along edges (except for "extended" finite automata models). In your case there is a quick solution, just add $S1$ to the accepting states, so we may accept already after reading $ab^*$, the next $c$ is optional. – Hendrik Jan Feb 25 '21 at 21:18

2 Answers


Your solution is nearly correct; however, you need to remove the $c$-transition of $S_0$ and the $a$-transition from $S_0$ to $S_2$, and add a new state, say $S_3$, which is reached from $S_0$ via an $\varepsilon$-transition, has a $c$-transition to itself, and an $a$-transition to $S_2$ (so that the word $cab$ is not recognized anymore, see benrg's comment below).

As for a general algorithm, note that REs are built in a recursive manner by taking the symbols in your alphabet as "atoms" and then chaining them together using the basic operations $+$ (or $\cup$ as you denote it), $\ast$, and concatenation. An NFA accepting the same language can then be built in a similar manner:

  1. If the RE is just an atom $\alpha$, take the NFA with two states, an initial one and an accepting one, connected by a single $\alpha$-transition.
  2. Suppose our RE is of the form $XY$ where $X$ and $Y$ are REs themselves and let $\mathcal A_X$ and $\mathcal A_Y$ denote the NFAs representing $X$ and $Y$ respectively. An NFA for $XY$ can be constructed by taking the NFAs $\mathcal A_X$ and $\mathcal A_Y$, adding $\varepsilon$-transitions from all accepting states of $\mathcal A_X$ to the initial state of $\mathcal A_Y$, and making the accepting states of $\mathcal A_X$ non-accepting.
  3. Suppose our RE is of the form $X^\ast$ where $X$ is an RE. Take the associated NFA $\mathcal A_X$ and add $\varepsilon$-edges from its accepting states back to the initial one, which also needs to be made accepting.
  4. Suppose our RE is of the form $X + Y$ where again $X$ and $Y$ are REs. Take $\mathcal A_X$ and $\mathcal A_Y$ and add a new initial state that has $\varepsilon$-transitions to the initial states of $\mathcal A_X$ and $\mathcal A_Y$.

Of course, this algorithm does not produce "optimal" NFAs, but it is a simple and general procedure to generate an NFA which accepts the same language as some given RE.
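To make the recursion concrete, here is a minimal Python sketch of these rules (the encoding and function names are my own, purely for illustration). An NFA is a dict holding a start state, a set of accepting states, and a transition map, with the edge label `None` standing for an $\varepsilon$-transition; an extra `epsilon` case handles the $\varepsilon$ in your expression, and rule 3 is implemented with a fresh initial state instead of reusing the old one, which avoids over-accepting when the old initial state already has incoming edges.

```python
from itertools import count

_ids = count()

def _state():
    return f"s{next(_ids)}"

def _merge(*edge_maps):
    """Union of transition maps; values are sets of target states."""
    out = {}
    for m in edge_maps:
        for key, targets in m.items():
            out.setdefault(key, set()).update(targets)
    return out

def atom(a):
    """Rule 1: a single a-transition from a fresh initial state to a fresh accepting state."""
    p, q = _state(), _state()
    return {"start": p, "accept": {q}, "edges": {(p, a): {q}}}

def epsilon():
    """Extra case (not one of the four rules): accepts only the empty word."""
    p = _state()
    return {"start": p, "accept": {p}, "edges": {}}

def concat(A, B):
    """Rule 2: epsilon-edges from A's accepting states to B's start; A's accepting states are demoted."""
    links = {(f, None): {B["start"]} for f in A["accept"]}
    return {"start": A["start"], "accept": set(B["accept"]),
            "edges": _merge(A["edges"], B["edges"], links)}

def star(A):
    """Rule 3 (fresh-state variant): a new accepting start s with an epsilon-edge to A's start,
    and epsilon-edges from A's accepting states back to s."""
    s = _state()
    links = {(s, None): {A["start"]}}
    links.update({(f, None): {s} for f in A["accept"]})
    return {"start": s, "accept": {s}, "edges": _merge(A["edges"], links)}

def union(A, B):
    """Rule 4: a new initial state with epsilon-edges to both sub-NFAs' start states."""
    s = _state()
    links = {(s, None): {A["start"], B["start"]}}
    return {"start": s, "accept": A["accept"] | B["accept"],
            "edges": _merge(A["edges"], B["edges"], links)}

# The expression from the question, a b* (eps + c) + c* a:
nfa = union(concat(atom("a"), concat(star(atom("b")), union(epsilon(), atom("c")))),
            concat(star(atom("c")), atom("a")))
```

Applied to your expression this produces more states than a hand-built NFA, but, as said above, the point of the procedure is simplicity rather than minimality.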

Watercrystal
  • The solution in the question isn't correct: it accepts $cab$ which isn't in the language. In rule 3 you also need to make the initial state an accepting state (or the only one). – benrg Feb 25 '21 at 22:33
  • Very true, thanks for the correction :) – Watercrystal Feb 26 '21 at 09:46

This may not be computationally fast, but the most straightforward algorithm that I know for converting REs to NFAs is to use Brzozowski derivatives. This is so simple that it can be done by hand, and results in NFAs with a number of states that is linear (for the usual RE operators; more on this in a moment) in the number of terminal symbols.

To understand Brzozowski derivatives, we need to think of regular expressions as an algebra. The algebra in question is an idempotent semi-ring, with $0$ meaning the empty set, $1$ meaning the empty string, $+$ meaning set union and $\cdot$ meaning the concatenation product:

$$(A + B) + C = A + (B + C)$$ $$A + B = B + A$$ $$0 + A = A + 0 = A$$ $$0 \cdot A = A \cdot 0 = 0$$ $$1 \cdot A = A \cdot 1 = A$$ $$A \cdot (B + C) = (A \cdot B) + (A \cdot C)$$ $$(A + B) \cdot C = (A \cdot C) + (B \cdot C)$$ $$A + A = A$$

The last axiom states that addition is idempotent. Note that I'm going to drop the explicit $\cdot$ when it's clear (or when I forget to add it).

To this, we add the Kleene closure operator $A^*$, which I won't axiomatise here; see the Wikipedia page on Kleene algebra for details.

We also have terminal symbols drawn from an alphabet $\Sigma$. This is going to sound weird, but we're going to interpret these terminal symbols as variables. So, for example, we can define the "evaluation at zero" operator $A(0)$ as the regular expression $A$ with all of the terminal symbols replaced with $0$.

$$(ab^* (1 + c) + c^* a)(0) = 0\cdot 0^* (1 + 0) + 0^*\cdot 0 = 0$$

For any regular expression $A$, $A(0)$ simplifies to either $0$ or $1$.

And now, we can define a derivative operator:

$$\begin{eqnarray*} \frac{\partial}{\partial x} 0 & = & 0 \\ \frac{\partial}{\partial x} 1 & = & 0 \\ \frac{\partial}{\partial x} x & = & 1 \\ \frac{\partial}{\partial x} y & = & 0, \hbox{if } x \ne y \\ \frac{\partial}{\partial x} (A + B) & = & \frac{\partial}{\partial x} A + \frac{\partial}{\partial x} B \\ \frac{\partial}{\partial x} (A \cdot B) & = & A(0) \cdot \frac{\partial}{\partial x} B + \frac{\partial}{\partial x} A \cdot B \\ \frac{\partial}{\partial x} A^* & = & \frac{\partial}{\partial x} A \cdot A^*\end{eqnarray*}$$

Apart from one tweak to the product rule to deal with the fact that concatenation is not commutative, this looks exactly like a partial derivative operator. Note that the Kleene closure operator acts like exponentiation, where $e^0 = 1$ and $\frac{\partial e^{A}}{\partial x} = \frac{\partial A}{\partial x} e^A$.

The Brzozowski derivative has the following interpretation:

$$\frac{\partial E}{\partial a} = \left\{\,w\,|\,aw \in E\,\right\}$$

That is, it is the set of all strings in $E$ which start with $a$, with that $a$ removed.
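As a quick check against the expression from the question, taking the derivative with respect to $a$ gives

$$\begin{eqnarray*}\frac{\partial}{\partial a}\bigl(ab^* (1 + c) + c^* a\bigr) & = & a(0) \cdot \frac{\partial}{\partial a}\bigl(b^* (1 + c)\bigr) + \frac{\partial a}{\partial a} \cdot b^* (1 + c) + (c^*)(0) \cdot \frac{\partial a}{\partial a} + \frac{\partial c^*}{\partial a} \cdot a \\ & = & 0 + b^* (1 + c) + 1 + 0 \\ & = & b^* (1 + c) + 1\end{eqnarray*}$$

which is exactly the set of words that may follow an initial $a$ in the original language.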

This gives us the following identity:

$$E = E(0) + \sum_{x \in \Sigma} x \frac{\partial E}{\partial x}$$

Remembering that terminal symbols are analogous to variables, this is just the Taylor expansion of the regular expression around $0$. But this identity is also an algorithm for the construction of a DFA or NFA state, since $E(0)$ is $1$ if $E$ is nullable, otherwise $0$, and the sum is the transitions.
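Here is a rough Python sketch of that construction (the tuple encoding and the names `nullable`, `deriv` and `make_dfa` are mine, chosen for illustration): regular expressions are nested tuples, `nullable` computes $E(0)$, `deriv` applies the rules above, and a worklist loop creates one DFA state per distinct derivative. The light smart-constructor simplification shown is enough for this example to terminate with a handful of states; in general, derivatives are identified up to associativity, commutativity and idempotence of $+$ to guarantee a finite state set.

```python
# REs as nested tuples: ('0',), ('1',), ('sym', 'a'), ('+', A, B), ('.', A, B), ('*', A).
ZERO, ONE = ('0',), ('1',)

def plus(a, b):
    # smart constructor: 0 + E = E and A + A = A keep derivatives small
    if a == ZERO: return b
    if b == ZERO: return a
    if a == b:    return a
    return ('+', a, b)

def dot(a, b):
    # smart constructor: 0.E = E.0 = 0, 1.E = E, E.1 = E
    if a == ZERO or b == ZERO: return ZERO
    if a == ONE: return b
    if b == ONE: return a
    return ('.', a, b)

def nullable(e):
    """E(0): True iff the empty word belongs to E."""
    tag = e[0]
    if tag in ('1', '*'): return True
    if tag in ('0', 'sym'): return False
    if tag == '+': return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])       # tag == '.'

def deriv(e, x):
    """The Brzozowski derivative d/dx E, following the rules above."""
    tag = e[0]
    if tag in ('0', '1'): return ZERO
    if tag == 'sym':      return ONE if e[1] == x else ZERO
    if tag == '+':        return plus(deriv(e[1], x), deriv(e[2], x))
    if tag == '*':        return dot(deriv(e[1], x), e)
    a, b = e[1], e[2]                              # tag == '.'
    left = dot(deriv(a, x), b)
    return plus(left, deriv(b, x)) if nullable(a) else left

def make_dfa(e, alphabet):
    """One DFA state per distinct derivative; the final states are the nullable ones."""
    states, trans, todo = {e}, {}, [e]
    while todo:
        q = todo.pop()
        for x in alphabet:
            d = deriv(q, x)
            trans[(q, x)] = d
            if d not in states:
                states.add(d)
                todo.append(d)
    return states, trans, {q for q in states if nullable(q)}

# The question's expression  a b* (1 + c) + c* a:
R = ('+',
     ('.', ('sym', 'a'), ('.', ('*', ('sym', 'b')), ('+', ONE, ('sym', 'c')))),
     ('.', ('*', ('sym', 'c')), ('sym', 'a')))
states, trans, final = make_dfa(R, "abc")          # 6 states, including a dead state for 0
```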

You can do this using the definition of the derivative above, or use the following transformation $\mathbf{C}[]$:

$$\begin{eqnarray*}\mathbf{C}[0] & = & 0 \\ \mathbf{C}[1] & = & 1 \\ \mathbf{C}[a] & = & a\cdot 1 \\ \mathbf{C}[A + B] & = & \mathbf{C}[A] + \mathbf{C}[B] \\ \mathbf{C}[A^*] & = & 1 + \mathbf{C}[A\cdot A^*] \\ \mathbf{C}[0 \cdot B] & = & 0 \\ \mathbf{C}[1 \cdot B] & = & \mathbf{C}[B] \\ \mathbf{C}[a \cdot B] & = & a \cdot B \\ \mathbf{C}[(A + B) \cdot C] & = & \mathbf{C}[A \cdot C] + \mathbf{C} [B \cdot C] \\ \mathbf{C}[A^* \cdot B] & = & \mathbf{C}[B] + \mathbf{C}[A\cdot(A^* B)] \\ \mathbf{C}[(A \cdot B) \cdot C] & = & \mathbf{C}[A\cdot (B \cdot C)]\end{eqnarray*}$$

Note that $E$ and $\mathbf{C}[E]$ denote the same language. Then we can find an NFA by applying this transformation to a regular expression, then recursively applying it to the transition states.
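A possible implementation of $\mathbf{C}[]$ over the same tuple encoding as the sketch above (again just an illustration, with names of my choosing) returns the summands directly: a $1$ marks the state as final, and a pair $(x, B)$ is an $x$-transition to the state named by $B$.

```python
ONE = ('1',)   # same encoding as the earlier sketch

def C(e):
    """C[E] as a list of summands: ONE means the state is final,
    (x, B) is an x-transition to the state for the regular expression B."""
    tag = e[0]
    if tag == '0':   return []
    if tag == '1':   return [ONE]
    if tag == 'sym': return [(e[1], ONE)]
    if tag == '+':   return C(e[1]) + C(e[2])
    if tag == '*':   return [ONE] + C(('.', e[1], e))
    a, b = e[1], e[2]                              # tag == '.': dispatch on the head
    if a[0] == '0':   return []
    if a[0] == '1':   return C(b)
    if a[0] == 'sym': return [(a[1], b)]
    if a[0] == '+':   return C(('.', a[1], b)) + C(('.', a[2], b))
    if a[0] == '*':   return C(b) + C(('.', a[1], ('.', a, b)))
    return C(('.', a[1], ('.', a[2], b)))          # the head is itself a concatenation

# Applied to the question's expression it returns the three summands derived by hand below.
```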

In your case:

$$\begin{eqnarray*}q_0 & = & ab^* (1 + c) + c^* a \\ & = & \mathbf{C}[ab^* (1 + c)] + \mathbf{C}[c^* a] \\ & = & a \cdot (b^* (1 + c)) + a \cdot 1 + c\cdot (c^* a) \\ & = & a\cdot q_1 + a \cdot q_2 + c \cdot q_3\end{eqnarray*}$$

where:

$$\begin{eqnarray*}q_1 & = & b^* (1 + c) \\ q_2 & = & 1 \\ q_3 & = & c^* a\end{eqnarray*}$$

If a $1$ appears in the sum, then the state is final; so $q_2$, for example, is a final state with no outgoing transitions.

You only need to invent new states for regular expressions that don't already have names. So, for example:

$$\begin{eqnarray*}q_3 & = & c^* a \\ & = & \mathbf{C}[c^* a] \\ & = & a \cdot 1 + c \cdot(c^* a) \\ & = & a \cdot q_2 + c\cdot q_3\end{eqnarray*}$$
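The remaining state $q_1$ expands the same way:

$$\begin{eqnarray*}q_1 & = & b^* (1 + c) \\ & = & \mathbf{C}[b^* (1 + c)] \\ & = & \mathbf{C}[1 + c] + \mathbf{C}[b \cdot (b^* (1 + c))] \\ & = & 1 + c \cdot 1 + b \cdot (b^* (1 + c)) \\ & = & 1 + c \cdot q_2 + b \cdot q_1\end{eqnarray*}$$

so $q_1$ is a final state with a $b$-transition back to itself and a $c$-transition to $q_2$, which completes the automaton for this example.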

Note that this method never produces "epsilon transitions". Converting this into an algorithm to transform a regular expression to a DFA is straightforward, but I'll let you discover that for yourself.

One advantage of this algorithm is that it can be extended with other operators, such as set intersection:

$$\frac{\partial}{\partial x} (A \cap B) = \frac{\partial}{\partial x} A \cap \frac{\partial}{\partial x}B$$

or set difference:

$$\frac{\partial}{\partial x} (A - B) = \frac{\partial}{\partial x} A - \frac{\partial}{\partial x}B$$

However, introducing these "negation" operators does not preserve the property that the final NFA is linear in the size of the original regular expression.

EDIT

You may have noticed that the $\mathbf{C}[]$ transformation can infinitely loop on pathological input involving $0$, $1$, and the Kleene star; try $\mathbf{C}[1^*\cdot a]$ as an example. The simple fix is to simplify the regular expression first to remove as many literal mentions of $0$ and $1$ as possible, and second to remove redundant Kleene stars:

$$\begin{eqnarray*} 0^* & \mapsto & 1 \\ 1^* & \mapsto & 1 \\ 0 + E & \mapsto & E \\ (1 + E)^* & \mapsto & E^* \\ {E^{*}}^{*} & \mapsto & E^{*}\end{eqnarray*}$$

I can't remember the full set of simplifications, but they are fairly straightforward.
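In code, a single bottom-up pass over the same tuple encoding covers the rules above together with the $0$/$1$ identities (again a sketch with names of my own choosing):

```python
ZERO, ONE = ('0',), ('1',)   # same encoding as the earlier sketches

def simplify(e):
    """One bottom-up simplification pass."""
    tag = e[0]
    if tag in ('0', '1', 'sym'):
        return e
    if tag == '*':
        a = simplify(e[1])
        if a in (ZERO, ONE):                       # 0* -> 1,  1* -> 1
            return ONE
        if a[0] == '*':                            # E** -> E*
            return a
        if a[0] == '+' and ONE in (a[1], a[2]):    # (1 + E)* -> E*
            other = a[2] if a[1] == ONE else a[1]
            return simplify(('*', other))
        return ('*', a)
    a, b = simplify(e[1]), simplify(e[2])
    if tag == '+':
        if a == ZERO: return b                     # 0 + E -> E
        if b == ZERO: return a
        if a == b:    return a                     # A + A -> A (idempotence)
        return ('+', a, b)
    if a == ZERO or b == ZERO:                     # tag == '.': 0.E = E.0 = 0
        return ZERO
    if a == ONE: return b                          # 1.E -> E
    if b == ONE: return a                          # E.1 -> E
    return ('.', a, b)

# simplify(('.', ('*', ONE), ('sym', 'a'))) returns ('sym', 'a'),
# so the pathological C[1* . a] above no longer arises.
```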

Pseudonym