Grammar rewriting
Left-recursive grammars can be automatically rewritten to eliminate the left-recursion. Any useful recursion must have some base case, so we can parse the base case first and then the repeated part after it. Assume this example grammar:
A ::= A B
    | A C
    | D
    | E
We can introduce a helper rule A_rest to group all left-recursive alternatives into a single alternative, and a helper rule A_init that contains all alternatives without left recursion:
A ::= A A_rest | A_init
A_rest ::= B | C
A_init ::= D | E
We can now rewrite A to
A ::= A_init | A_init A'
A' ::= A_rest | A_rest A'
and have exchanged left-recursion for right-recursion. This allows an LL parser to recognize left-recursive grammars – with one caveat: this rewrite changes the structure of the resulting parse tree!
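The rewrite above is mechanical enough to sketch in a few lines of Python. This is only a minimal illustration, assuming a rule is represented as a list of alternatives and each alternative as a list of symbol names; the function name `eliminate_left_recursion` is made up for this example, and the sketch handles direct left recursion only. The helper rules A_init and A_rest are inlined rather than emitted as separate rules.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite one directly left-recursive rule into right-recursive form.

    `alternatives` is a list of alternatives, each a list of symbol names.
    Returns a dict mapping rule names to their new alternatives.
    """
    # A_rest: whatever follows the leading self-reference in a recursive alternative.
    rest = [alt[1:] for alt in alternatives if len(alt) > 1 and alt[0] == name]
    # A_init: the base cases, i.e. alternatives that do not start with the rule itself.
    init = [alt for alt in alternatives if alt[:1] != [name]]

    if not rest:
        return {name: alternatives}  # not left-recursive, nothing to do
    if not init:
        raise ValueError(f"rule {name!r} has no base case")

    tail = name + "'"
    return {
        # A  ::= A_init | A_init A'
        name: [alt + suffix for alt in init for suffix in ([], [tail])],
        # A' ::= A_rest | A_rest A'
        tail: [alt + suffix for alt in rest for suffix in ([], [tail])],
    }


rules = eliminate_left_recursion("A", [["A", "B"], ["A", "C"], ["D"], ["E"]])
for lhs, alts in rules.items():
    print(lhs, "::=", " | ".join(" ".join(alt) for alt in alts))
# A ::= D | D A' | E | E A'
# A' ::= B | B A' | C | C A'
```

Indirect left recursion (A starts with B, and B starts with A) needs an extra substitution step first, which this sketch does not attempt.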
Assume the input 1 - 2 - 3, where - is the normal left-associative subtraction operator. Because it is left-associative, this has to be interpreted as (1 - 2) - 3, and the interpretation 1 - (2 - 3) is invalid. So the description of this operator is necessarily left-recursive:
Subtraction ::= Subtraction '-' Number | Number
This rule would give us the parse tree
            Subtraction
           /     |     \
    Subtraction '-'  Number=3
   /     |     \
Subtraction '-'  Number=2
     |
  Number=1
which corresponds to the grouping (1 - 2) - 3. But if we rewrite the rule to eliminate left-recursion, we get the rules:
Subtraction ::= Number | Number Subtraction'
Subtraction' ::= '-' Number | '-' Number Subtraction'
This rule would give us the parse tree
      Subtraction
      /         \
Number=1     Subtraction'
            /     |     \
          '-' Number=2  Subtraction'
                         /     \
                       '-'   Number=3
which corresponds to the grouping 1 - (2 - 3), which is wrong. Also, the parse tree looks severely mangled and no longer represents the logical structure of the input. This is bad, and unusable for practical applications. We could try to fix up the parse tree by juggling nodes around, but there is a better alternative.
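To make the mis-grouping concrete, here is a small recursive-descent sketch for the rewritten Subtraction rules. It assumes the input is already tokenized into a Python list of integers and '-' strings, and it represents tree nodes as plain tuples; the function names are hypothetical. Evaluating the tree by following its nesting gives the wrong answer, and the last function shows the kind of node juggling needed to recover the left-associative result.

```python
def parse_subtraction(tokens):
    # Subtraction ::= Number | Number Subtraction'
    number, rest = tokens[0], tokens[1:]
    if not rest:
        return ("Subtraction", number)
    return ("Subtraction", number, parse_tail(rest))


def parse_tail(tokens):
    # Subtraction' ::= '-' Number | '-' Number Subtraction'
    if len(tokens) < 2 or tokens[0] != "-":
        raise SyntaxError("expected '-' followed by a number")
    number, rest = tokens[1], tokens[2:]
    if not rest:
        return ("Subtraction'", "-", number)
    return ("Subtraction'", "-", number, parse_tail(rest))


def evaluate_by_nesting(tree):
    """Treat each node as `number - (everything nested below it)`."""
    if tree[0] == "Subtraction":
        number, tail = tree[1], tree[2:]
    else:  # Subtraction'
        number, tail = tree[2], tree[3:]
    return number - evaluate_by_nesting(tail[0]) if tail else number


def evaluate_left_to_right(tree):
    """Fix-up: walk the chain of Subtraction' nodes and fold from the left."""
    result, tail = tree[1], tree[2:]
    while tail:
        node = tail[0]
        result -= node[2]
        tail = node[3:]
    return result


tree = parse_subtraction([1, "-", 2, "-", 3])
print(evaluate_by_nesting(tree))     # 2, i.e. 1 - (2 - 3)
print(evaluate_left_to_right(tree))  # -4, i.e. (1 - 2) - 3
```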
Use a better algorithm: LR parsers
Top-down parsing algorithms of the kind you are probably using (recursive descent?) are limited in their expressiveness, their correctness, and their efficiency. The only thing they have going for them is that they use a very intuitive approach, and this is probably the only reason they are still commonly used. Other parsing algorithms such as LR or LALR can parse a broader range of grammars and can do so in linear time, but are less intuitive and difficult to write by hand. This is why we generally use parser generators to create a parser from a grammar, rather than writing the parser by hand.
The grammar file you linked to is a Yacc grammar. Yacc is a parser generator that uses the LALR algorithm: it has no problems with left recursion and is almost as general as LR, but needs smaller “state tables” than full LR. LALR has some annoying limitations, but it remains one of the most widely used parsing algorithms for implementing “real” programming languages.
Bottom-up algorithms such as LR can parse a broader range of grammars because they decide which alternative to apply after they have seen the appropriate input. Using the Subtraction example, assume we see the token stream
Number=1 '-' Number=2 '-' Number=3
and have the grammar
Subtraction → Subtraction '-' Number
Subtraction → Number
I will use a · to denote the current position in a rule.
Initialize the parser. We have an empty token stack, and are at the beginning of all alternatives:
Subtraction → · Subtraction '-' Number
Subtraction → · Number
Read token Number=1. The token stack of the parser is now:
| Number=1 |
This completes the rule Subtraction → Number ·, so we can reduce the right-hand side of the rule to the left-hand side by adding a parse tree fragment to the stack. The token stack is now:
| Subtraction |
|      |      |
|  Number=1   |
Read token '-'. The token stack is:
| Subtraction | '-' |
|      |      |     |
|  Number=1   |     |
We are now in the rule:
Subtraction → Subtraction '-' · Number
Read token Number=2. The token stack is:
| Subtraction | '-' | Number=2 |
|      |      |     |          |
|  Number=1   |     |          |
This completes the rule
Subtraction → Subtraction '-' Number ·
so we can reduce to
|         Subtraction          |
|        /     |     \         |
| Subtraction '-'  Number=2    |
|      |                       |
|  Number=1                    |
And so on for the remaining tokens, until no input tokens are left and the token stack consists of a single item: the final parse tree.
Recognizing where we are in which rule and whether a rule was completed by this token is generally done by a DFA state machine, which is why we need to calculate state tables from the grammar when constructing the parser.
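The shift/reduce steps walked through above can be sketched as a toy parser. This is a simplified illustration, not how Yacc works internally: instead of generated state tables it hard-codes the two reductions for the Subtraction grammar by inspecting the top of the stack. It assumes tokens are Python integers and '-' strings, and represents tree fragments as tuples; all names are hypothetical.

```python
def is_number(item):
    return isinstance(item, int)


def is_subtraction(item):
    return isinstance(item, tuple) and item[0] == "Subtraction"


def shift_reduce_parse(tokens):
    stack = []
    for token in tokens:
        stack.append(token)  # shift the next input token onto the stack
        if (len(stack) >= 3 and is_subtraction(stack[-3])
                and stack[-2] == "-" and is_number(stack[-1])):
            # reduce: Subtraction → Subtraction '-' Number ·
            number = stack.pop()
            operator = stack.pop()
            left = stack.pop()
            stack.append(("Subtraction", left, operator, number))
        elif len(stack) == 1 and is_number(stack[-1]):
            # reduce: Subtraction → Number ·
            stack.append(("Subtraction", stack.pop()))
    if len(stack) != 1 or not is_subtraction(stack[0]):
        raise SyntaxError(f"could not reduce input, stack is {stack!r}")
    return stack[0]  # the single remaining item is the parse tree


def evaluate(tree):
    """Evaluate the tree by following its (left-nested) structure."""
    if len(tree) == 2:
        return tree[1]
    return evaluate(tree[1]) - tree[3]


tree = shift_reduce_parse([1, "-", 2, "-", 3])
print(tree)
print(evaluate(tree))  # -4, i.e. (1 - 2) - 3
```

Because each reduction wraps what is already on the stack, the tree nests to the left without any fix-up, which is the point of deciding on an alternative only after its input has been seen.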