Left-Factoring a grammar into LL(1)

Question

I have a homework assignment where I need to convert a grammar into LL(1). I've already removed the left recursion, but I'm having trouble doing left-factoring. All of the examples I've found are simple, and look something like this:

A -> aX | aY
becomes:
A -> aZ
Z -> X | Y

I understand that. However, my grammar looks more like this:

X -> aE | IXE | (X)E
E -> IE | BXE | ϵ
I -> ++ | --
B -> + | - | ϵ

I'm not sure how to apply the simpler example to this. I've been trying for at least a couple of hours and I've lost track of all of the things I've tried. Generally, my attempts have looked something like this:

X  -> X' | IXE
X' -> aE | (X)E
E  -> IE | BIX'E | BX'E | ϵ

And I then try to convert the E rules into ones having only one production starting with + or -:

X  -> X' | IXE
X' -> aE | (X)E
B' -> + | -
E  -> IE | B'IX'E | IX'E | B'X'E | X'E | ϵ

And then...

X  -> X' | IXE
X' -> aE | (X)E
B' -> + | -
E  -> +P | -M | ϵ
P  -> +E | IX'E | +X'E | X'E
M  -> -E | IX'E | -X'E | X'E

And so on. But I continually end up with a lot of extra nonterminals, and some very long productions / chains of productions, without actually having left-factored it. I'm not sure how to approach this - I can't seem to eliminate some nonterminal having multiple productions starting with a + and with a -.

Welcome! 1) Have you looked at the formal definition of left-factoring? 2) Are you certain your language can be described by an LL(1) grammar? Not all can. See also here. 3) Check out other questions about left-factoring. — Raphael, Oct 04 '12 at 09:38
Thanks! 1) Yes, we had assigned reading in our textbook covering left factoring, as well as lecture slides on it. 2) I'm pretty sure it should be - converting the grammar to LL(1) is the first part of the homework, and then we need to write a recursive descent parser for the converted grammar. I know the same assignment has been used for this class in the past. 3) I did look on here and Google, but everything I found was an explanation of the purpose of left factoring and/or the simple example I included at the beginning of my question - I didn't find any more complex examples. — Kami's Aibou, Oct 05 '12 at 01:03
1+4) Given the definition, examples should be superfluous. 2) Long-standing "wrong" homework assignments are not unheard of, but I suspect your phrasing "more like this" now. What is the original grammar? Maybe you left out essential parts without realising. — Raphael, Oct 05 '12 at 09:33
I usually find that I learn better from examples than from definitions. Anyway, it turns out that the input will be tokenized first, so the grammar doesn't have to decide between a pair of + and a single +. (Cue headdesk moment...) — Kami's Aibou, Oct 08 '12 at 02:19
It uses a greedy scanner, so if we get 1+++2, that would be 1, ++, +, 2. It also does use whitespace, so if instead it was 1+ ++2 that would be 1, +, ++, 2. — Kami's Aibou, Oct 08 '12 at 17:07
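
A greedy (maximal-munch) scanner of the kind described in the last comment can be sketched as follows. This is only an illustration, not the course's actual scanner: the `TOKEN` pattern, the `tokenize` name, and the choice to treat operands as runs of word characters are all assumptions.

```python
import re

# Assumed token set: "++" and "--" are listed before "+" and "-" so that the
# two-character operators win whenever they are possible (greedy matching).
# Leading whitespace is skipped; operands are runs of word characters.
TOKEN = re.compile(r"\s*(\+\+|--|[-+()]|\w+)")


def tokenize(text):
    """Split text into tokens, always taking the longest operator available."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


print(tokenize("1+++2"))   # ['1', '++', '+', '2']
print(tokenize("1+ ++2"))  # ['1', '+', '++', '2']
```

With such a scanner in front of the parser, `++` and `+` arrive as distinct tokens, so the grammar no longer has to tell a pair of `+` from a single one.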

score 13 · Accepted Answer · answered Oct 04 '12 at 10:55

Let's have a look at your grammar:

$\qquad \begin{align} X &\to aE \mid IXE \mid (X)E \\ E &\to IE \mid BXE \mid \varepsilon \\ I &\to \text{++} \mid \text{--} \\ B &\to \text{+} \mid \text{-} \mid \varepsilon \end{align}$

Note that $X$ does not need left-factoring: all rules have disjoint FIRST sets¹. If you want to make this obvious, you can drop $I$ and inline it:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid BXE \mid \varepsilon \\ B &\to \text{+} \mid \text{-} \mid \varepsilon \end{align}$

Similarly, we can inline $B$:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid \text{+}XE \mid \text{-}XE \mid XE \mid \varepsilon \end{align}$

Now we see that we actually have to do left-factoring on $E$: we have obvious conflicts, and we get additional conflicts via $XE$. So, let's inline $X$ once at $XE$:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid \text{+}XE \mid \text{-}XE \mid aEE \mid \text{++}XEE \mid \text{--}XEE \mid (X)EE \mid \varepsilon \end{align}$

And now we can left-factor as easily as in your example:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{+}P \mid \text{-}M \mid aEE \mid (X)EE \mid \varepsilon \\ P &\to \text{+}E \mid XE \mid \text{+}XEE \\ M &\to \text{-}E \mid XE \mid \text{-}XEE \end{align}$
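
The factoring step itself is mechanical. Here is a minimal sketch, assuming every grammar symbol is a single character so that an alternative can be written as a plain string; the function name `left_factor_once` and the way fresh nonterminal names are passed in are hypothetical choices made for illustration.

```python
from collections import defaultdict


def left_factor_once(head, alternatives, fresh_names):
    """One left-factoring step on the alternatives of a single nonterminal.

    Alternatives that share their first symbol are replaced by one alternative
    'first symbol + new nonterminal'; the new nonterminal gets the remainders.
    The empty string stands for the epsilon alternative.
    """
    groups = defaultdict(list)            # first symbol -> alternatives
    for alt in alternatives:
        groups[alt[:1]].append(alt)       # alt[:1] is "" for epsilon

    fresh = iter(fresh_names)
    rules = {head: []}
    for first, alts in groups.items():
        if first == "" or len(alts) == 1:
            rules[head].extend(alts)      # nothing shared, keep unchanged
        else:
            new = next(fresh)
            rules[head].append(first + new)
            rules[new] = [alt[1:] for alt in alts]
    return rules


# The simple example from the question: A -> aX | aY
simple = left_factor_once("A", ["aX", "aY"], ["Z"])

# The rule for E after inlining I, B and one X
e_rules = left_factor_once(
    "E",
    ["++E", "--E", "+XE", "-XE", "aEE", "++XEE", "--XEE", "(X)EE", ""],
    ["P", "M"],
)
print(simple)
print(e_rules)
```

On the inlined rule for $E$, this yields the same $E$, $P$ and $M$ alternatives as written above.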

By now we can see that we are not getting anywhere: by factoring away $\text{+}$ or $\text{-}$ from the alternatives, we dig up another $X$ which again has both $\text{+}$ and $\text{-}$ in its FIRST set.
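
This can be checked mechanically. The sketch below computes FIRST sets for the factored grammar and reports every terminal that starts more than one alternative of the same nonterminal. It assumes single-character symbols, with `X`, `E`, `P`, `M` as the nonterminals; all function names are made up for illustration, and only FIRST/FIRST conflicts are checked (FOLLOW sets are ignored).

```python
EPS = ""  # the empty alternative

# The factored grammar from above, one character per symbol.
GRAMMAR = {
    "X": ["aE", "++XE", "--XE", "(X)E"],
    "E": ["+P", "-M", "aEE", "(X)EE", EPS],
    "P": ["+E", "XE", "+XEE"],
    "M": ["-E", "XE", "-XEE"],
}


def first_of_string(symbols, first):
    """FIRST set of a symbol string, given FIRST sets of the nonterminals."""
    out = set()
    for sym in symbols:
        if sym not in GRAMMAR:            # terminal: it is the first symbol
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:         # sym cannot vanish, stop here
            return out
    out.add(EPS)                          # every symbol can derive epsilon
    return out


def first_sets():
    """Fixed-point iteration for the FIRST sets of all nonterminals."""
    first = {name: set() for name in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for name, alts in GRAMMAR.items():
            for alt in alts:
                new = first_of_string(alt, first)
                if not new <= first[name]:
                    first[name] |= new
                    changed = True
    return first


def first_conflicts():
    """Terminals that begin more than one alternative of a nonterminal."""
    first = first_sets()
    conflicts = {}
    for name, alts in GRAMMAR.items():
        starts = {}
        for alt in alts:
            for terminal in first_of_string(alt, first) - {EPS}:
                starts.setdefault(terminal, []).append(alt)
        clashes = {t: a for t, a in starts.items() if len(a) > 1}
        if clashes:
            conflicts[name] = clashes
    return conflicts


print(first_conflicts())
```

The report lists $\text{+}$ as a possible start of all three alternatives of $P$, and $\text{-}$ likewise for $M$, so the conflict has only moved from $E$ to the new nonterminals.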

So let's have a look at your language. Via

$\qquad \displaystyle X \Rightarrow aE \Rightarrow^* aI^n E \Rightarrow aI^nBXE$

and

$\qquad \displaystyle X \Rightarrow aE \Rightarrow^* aI^n E \Rightarrow aI^nIE$

you have arbitrarily long prefixes of the form $\text{+}^+$ that have to be parsed differently depending on how they end: an LL(1) parser cannot decide whether any given (next) $\text{+}$ belongs to a pair -- which would mean choosing alternative $IE$ -- or comes alone -- which would mean choosing $BXE$.
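
As a small worked instance (reading the input symbol by symbol, as the grammar above does), consider these two derivations, which both pass through $a\text{++}E$:

$\qquad \begin{align} X &\Rightarrow aE \Rightarrow aIE \Rightarrow a\text{++}E \Rightarrow a\text{++}IE \Rightarrow a\text{++}\text{++}E \Rightarrow a\text{++}\text{++} \\ X &\Rightarrow aE \Rightarrow aIE \Rightarrow a\text{++}E \Rightarrow a\text{++}BXE \Rightarrow a\text{++}\text{+}XE \Rightarrow a\text{++}\text{+}aEE \Rightarrow^* a\text{++}\text{+}a \end{align}$

After reading $a\text{++}$, the parser has to expand $E$ and sees $\text{+}$ as the next symbol in both cases, yet the first derivation needs $IE$ and the second needs $BXE$.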

In consequence, it doesn't look like your language can be expressed by any LL(1) grammar, so trying to convert yours into one is futile.

It's even worse: as $BXE \Rightarrow BIXEE \Rightarrow^* BI^n X E^n E$, you cannot decide to choose $BXE$ with any finite look-ahead. This is not a formal proof, but it strongly suggests that your language is not even LL.

If you think about what you are doing -- mixing Polish notation with unary operators -- it is not very surprising that parsing should be hard. Basically, you have to count from the left and from the right to identify even a single $B$-$\text{+}$ in a long chain of $\text{+}$. If I think of multiple $B$-$\text{+}$ in a chain, I'm not even sure the language (with two semantically different but syntactically equal $\text{+}$) can be parsed deterministically (without backtracking) at all.

¹ That would be the sets of terminals that can come first in derivations of a non-terminal/rule alternative.

Thank you - that's a much clearer way of expressing the problem I was running into, and also helps me understand what exactly was happening - since no matter how many times I added new non-terminals, the new ones ended up having more than one choice for productions beginning with +/-. It seems I was creating a new non-terminal for each step in the X ⇒*, but that would result in having an infinite number of non-terminals without actually solving the problem, since like you said there can be arbitrarily many +/-. — Kami's Aibou, Oct 05 '12 at 01:24
Since I'm pretty sure that the grammar is supposed to be one that can be converted to LL(1), I think it's possible I may have misunderstood some part of the homework explanation - I'm going to ask my professor for clarification. (Before you answered, I figured I just didn't know how to do more complex left factoring and so was missing some obvious step.) — Kami's Aibou, Oct 05 '12 at 01:33
