Why doesn't left recursion loop forever?

Question

I'm trying to read PHP's parser grammar, which contains:

top_statement_list:
        top_statement_list  top_statement
    |   /* empty */
;

This is left recursion, so why doesn't it loop infinitely?

I'm asking because I'm trying to build a parser as a learning exercise, but I don't know how to code this. Is it perhaps similar to the regex-like {top_statement}*? In that case, it would try to match as many top_statement items as it can.

Presumably, the parser generator eliminates the left recursion when it actually generates the code. — Telastyn, Jun 20 '15 at 03:09
So what does it turn into? How does a parser generator eliminate the left recursion? — David Rodrigues, Jun 20 '15 at 03:15

score 7 · Accepted Answer · answered Jun 20 '15 at 10:01

Grammar rewriting

Left-recursive grammars can be automatically rewritten to eliminate the left-recursion. Any recursion must have some base case, so we can substitute that instead. Assume this example grammar:

A ::= A B
    | A C
    | D
    | E

We can introduce a helper rule A_rest that groups the tails of all left-recursive alternatives into a single alternative, and a helper rule A_init that contains all alternatives without left recursion:

A ::= A A_rest | A_init
A_rest ::= B | C
A_init ::= D | E

We can now rewrite A to

A ::= A_init | A_init A'
A' ::= A_rest | A_rest A'

and have exchanged left-recursion for right-recursion. This allows an LL parser to recognize left-recursive grammars – with one caveat: this rewrite changes the structure of the resulting parse tree!
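The rewrite is mechanical enough to automate. The following is a minimal Python sketch, assuming a rule is stored as a list of alternatives and each alternative as a list of symbol names. The function name `eliminate_left_recursion` and this representation are illustrative assumptions, not the way any particular parser generator works. It only handles immediate left recursion, and it inlines the helper rules A_rest and A_init instead of emitting them separately.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite `name ::= alt | alt | ...` without immediate left recursion.

    Each alternative is a list of symbol names; [] is the empty alternative.
    Returns a dict mapping rule names to their new lists of alternatives.
    """
    # A_rest: the tails of the left-recursive alternatives (A B becomes B).
    rest = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # A_init: every alternative that does not start with A itself.
    init = [alt for alt in alternatives if not alt or alt[0] != name]

    if not rest:
        return {name: alternatives}  # no left recursion, nothing to rewrite
    if not init:
        raise ValueError(f"{name} has no non-recursive base case")
    if any(not tail for tail in rest):
        raise ValueError(f"{name} ::= {name} is a cycle, not a usable rule")

    prime = name + "'"
    return {
        # A  ::= A_init | A_init A'
        name: [alt for i in init for alt in (i, i + [prime])],
        # A' ::= A_rest | A_rest A'
        prime: [alt for r in rest for alt in (r, r + [prime])],
    }
```

For the example grammar, `eliminate_left_recursion("A", [["A", "B"], ["A", "C"], ["D"], ["E"]])` yields `A ::= D | D A' | E | E A'` and `A' ::= B | B A' | C | C A'`, which is the rewritten grammar with the two helper rules expanded.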

Assume the input 1 - 2 - 3, where - is the normal left-associative subtraction operator. Because it is left-associative, this has to be interpreted as (1 - 2) - 3, and the interpretation 1 - (2 - 3) is invalid. So the description of this operator is necessarily left-recursive:

Subtraction ::= Subtraction '-' Number | Number

This rule would give us the parse tree

              Subtraction
                 / | \
      Subtraction '-' Number=3
           / | \
Subtraction '-' Number=2
    |
Number=1

This tree corresponds to the grouping (1 - 2) - 3. But if we rewrite the rule to eliminate left-recursion, we get the rule

Subtraction ::= Number | Number Subtraction'
Subtraction' ::= '-' Number | '-' Number Subtraction'

This rule would give us the parse tree

    Subtraction
        / \
Number=1   Subtraction'
          /     |     \
      '-'    Number=2  Subtraction'
                          / \
                       '-'   Number=3

This tree corresponds to the grouping 1 - (2 - 3), which is wrong. The parse tree also looks severely mangled and no longer represents the logical structure of the input. This is bad, and unusable for practical applications. We could try to fix up the parse tree by juggling nodes around, but there is a better alternative:

Use a better algorithm: LR parsers

Top-down parsing algorithms of the kind you are probably using (recursive descent?) are limited in their expressiveness, their correctness, and their efficiency. The only thing they have going for them is that they use a very intuitive approach, and this is probably the only reason they are still commonly used. Other parsing algorithms such as LR or LALR can parse a broader range of grammars and can do so in linear time, but they are less intuitive and difficult to write by hand. This is why we generally use parser generators to create a parser from a grammar, rather than writing the parser by hand.

The grammar file you linked to is a Yacc grammar. Yacc is a parser generator tool that uses the LALR algorithm – no problems with left recursion, almost as general as LR, but needs smaller “state tables” than real LR. LALR has some annoying limitations, but it is currently the most common parsing algorithm in practical use for implementing “real” programming languages.

Bottom-up algorithms such as LR can parse a broader range of grammars because they decide which alternative to apply after they have seen the appropriate input. Using the Subtraction example, assume we see the token stream

Number=1 '-' Number=2 '-' Number=3

and have the grammar

Subtraction → Subtraction '-' Number
Subtraction → Number

I will use a · to denote the current position in a rule.

initialize the parser. We have an empty token stack, and are at the beginning of all alternatives:
```
Subtraction → · Subtraction '-' Number
Subtraction → · Number
```
read token Number=1. The token stack of the parser is now
```
| Number=1 |
```
this completes the rule Subtraction → Number ·, so we can reduce the right-hand side of the rule to the left-hand side by adding a parse tree fragment to the stack. The token stack is now:
```
| Subtraction |
|     |       |
|  Number=1   |
```

read token '-'. The token stack is

| Subtraction | '-' |
|     |       |     |
|  Number=1   |     |

we are now at this position in the rule

Subtraction → Subtraction '-' · Number

read token Number=2. The token stack is

| Subtraction | '-' | Number=2 |
|     |       |     |          |
|  Number=1   |     |          |

this completes the rule

Subtraction → Subtraction '-' Number ·

so we can reduce to

|         Subtraction      |
|            / | \         |
| Subtraction '-' Number=2 |
|     |                    |
| Number=1                 |

and so on for the remaining tokens, until no input tokens are left, and the token stack consists of a single item: the final parse tree.

Recognizing where we are in which rule and whether a rule was completed by this token is generally done by a DFA state machine, which is why we need to calculate state tables from the grammar when constructing the parser.
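The walk-through above can be sketched as a small hand-written shift-reduce loop in Python. This is a simplified illustration, not what Yacc generates: a real LR parser decides when to reduce by consulting DFA state tables, whereas this sketch hard-codes the decision by inspecting the top of the stack, which only works for this one grammar. The names `parse_subtraction`, `reduce_top`, `is_node` and `evaluate`, and the tuple-based tree nodes, are assumptions made for the example.

```python
def is_node(item, kind):
    """True if a stack item is a parse tree node of the given kind."""
    return isinstance(item, tuple) and item[0] == kind


def reduce_top(stack):
    """Replace a completed right-hand side on top of the stack by its left-hand side."""
    # Subtraction -> Subtraction '-' Number
    if (len(stack) >= 3 and is_node(stack[-3], "Subtraction")
            and stack[-2] == "-" and is_node(stack[-1], "Number")):
        number, minus, left = stack.pop(), stack.pop(), stack.pop()
        stack.append(("Subtraction", left, minus, number))
    # Subtraction -> Number (only possible for the very first token)
    elif len(stack) == 1 and is_node(stack[-1], "Number"):
        stack.append(("Subtraction", stack.pop()))


def parse_subtraction(tokens):
    """Parse a token list such as [1, '-', 2, '-', 3] into a parse tree."""
    stack = []
    for token in tokens:
        # Shift: push the next input token onto the stack.
        stack.append("-" if token == "-" else ("Number", token))
        # Reduce: if the top of the stack now completes a rule, fold it up.
        reduce_top(stack)
    if len(stack) != 1 or not is_node(stack[0], "Subtraction"):
        raise SyntaxError("input is not a valid Subtraction")
    return stack[0]


def evaluate(node):
    """Compute the value of a tree, following its (left-nested) structure."""
    if node[0] == "Number":
        return node[1]
    if len(node) == 2:            # Subtraction -> Number
        return evaluate(node[1])
    return evaluate(node[1]) - evaluate(node[3])
```

"Reducing" here just means popping the items that match a rule's right-hand side and pushing one new tree node that has them as children. Because each reduction happens as soon as `Subtraction '-' Number` is on the stack, the earlier subtraction ends up as the left child of the later one, so `evaluate(parse_subtraction([1, "-", 2, "-", 3]))` computes (1 - 2) - 3.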

Could you explain this part in more detail? "so we can reduce the right-hand side of the rule to the left-hand side by adding a parse tree fragment to the stack" — David Rodrigues, Jun 23 '15 at 14:49

score 2 · Answer 2 · answered Jun 20 '15 at 04:00

Left-recursive grammars can generally be rewritten by adding symbols and rearranging slightly. Wikipedia has a good example for a simple four-function calculator grammar. Since you know that the first terminal you see in a non-empty top_statement_list must be a terminal that can begin a top_statement, you can rewrite the rule to check for a top_statement first instead of a top_statement_list.
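In a hand-written recursive descent parser, that idea usually becomes a loop, much like the {top_statement}* reading from the question. Below is a minimal Python sketch under stated assumptions: `parse_top_statement` and `starts_top_statement` are toy stand-ins (a statement is just one word followed by `;`), not PHP's real rules, and the tuple-based tree is invented for the example. The loop builds the same left-nested tree that the left-recursive rule describes, without ever calling itself first.

```python
def starts_top_statement(tokens, pos):
    """Toy lookahead check: can a top_statement begin at this position?"""
    return pos < len(tokens) and tokens[pos].isidentifier()


def parse_top_statement(tokens, pos):
    """Toy stand-in for top_statement: one word followed by ';'."""
    if (pos + 1 < len(tokens) and tokens[pos].isidentifier()
            and tokens[pos + 1] == ";"):
        return ("top_statement", tokens[pos]), pos + 2
    raise SyntaxError(f"expected a statement at token {pos}")


def parse_top_statement_list(tokens, pos=0):
    """Parse zero or more statements; returns (tree, next position)."""
    # top_statement_list ::= /* empty */
    tree = ("top_statement_list",)
    # top_statement_list ::= top_statement_list top_statement
    while starts_top_statement(tokens, pos):
        statement, pos = parse_top_statement(tokens, pos)
        # The list parsed so far becomes the left child of the new list node.
        tree = ("top_statement_list", tree, statement)
    return tree, pos
```

Each pass through the loop consumes at least one token before going around again, which is why it terminates where a naive recursive call to `parse_top_statement_list` at the start of the function would not.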
