### Lookahead exclusions from LALR(1) sets

When dealing with grammars (eg., when writing parsers, or designing a programming language), it is not uncommon for the grammar to contain LR(0) conflicts. Oftentimes these comments will require some creative thought to be resolved. This usually results in grammars that are hard to understand, and/or code that is hard to maintain, etc.

For example, consider the following very simple grammar (italicized words are non-terminals, bold words are terminals):

$program:\\ \indent function \\ \indent|\;statement\\ \\ function:\\ \indent \textrm{\small\textbf{function}}\;Identifier\\ \\ statement:\\ \indent expression\\ \indent|\;\textrm{\small\textbf{\{}} expression \textrm{\small\textbf{\}}}\\ \indent|\;\epsilon\\ \\ expression:\\ \indent \textrm{\small\textbf{function}}\;Identifier_{\textrm{\tiny\textbf{opt}}}\\ \indent|\;Number\\ \indent|\;Identifier$

If we compute the LALR(1) sets for it, some of the generated states will be:

State 0:

program ::= * function
program ::= * statement
function ::= * FUNCTION ID
statement ::= * expression
statement ::= * LCURLY expression RCURLY
statement ::= *
expression ::= * FUNCTION maybe_id
expression ::= * NUMBER
expression ::= * ID

Actions for state 0:

  FUNCTION shift  3
ID shift  7
LCURLY shift  1
NUMBER shift  8
program accept
function shift  6
statement shift  5
expression shift  11
{default} reduce 5

State 3:

function ::= FUNCTION * ID
expression ::= FUNCTION * maybe_id
maybe_id ::= * ID
maybe_id ::= *

Actions for state 3:

        ID shift  13
maybe_id shift  12
{default} reduce 10

State 13:

function ::= FUNCTION ID *
maybe_id ::= ID *

Actions for state 13:

         \$ reduce 9
{default} reduce 2

In state 13 there’s a reduce/reduce conflict: once the parser has successfully gotten an ID from the scanner, it doesn’t know if it should reduce to state 9 or to state 2.

The reason for the conflict can be found by examining the states shown above:

1. from the start state (state 0), the parser expects a function or a statement
2. a function starts with the token function, followed by an Identifier

Let’s say we decide to get rid of the conflict by saying that if a statement is an expression then it cannot start with function. Effectively, we are removing item 3 from the list above. This is a very common solution in grammar specifications. The grammars for many popular languages, such as JavaScript, use this technique. I call it “lookahead exclusion”.

In practice, lookahead exclusion is not supported by parser generators. The practical solution for implementing grammars specified using this technique is to hand-write the parser. But hand-written parsers can be buggier, harder to maintain and are potentially slower, since they are usually top-down (which requires backtracking).

Another problem with top-down parsers is that they usually require modifying the original grammar (transforming the LR(k) grammar into an LL(1) grammar) by removing left recursion, ambiguities, etc.. I’ll leave the actual modification of the grammar above as an exercise to the reader.

Now consider the lookahead exclusion described above:

$statement:\\ \indent expression\;\textrm{\small{(lookahead\,}}\not\in \{\textrm{\small\textbf{function}}\}\textrm{\small{)}}\\ \indent|\;\textrm{\small\textbf{\{}} expression \textrm{\small\textbf{\}}}\\ \indent|\;\epsilon$

When generating the LALR(1) sets, we can see how the corresponding states have changed:

State 0:

program ::= * function
program ::= * statement
function ::= * FUNCTION ID
statement ::= * expression
statement ::= * LCURLY expression RCURLY
statement ::= *
expression ::= * NUMBER
expression ::= * ID

Actions for state 0:

  FUNCTION shift  4
ID shift  7
LCURLY shift  1
NUMBER shift  8
program accept
function shift  6
statement shift  5
expression shift  12
{default} reduce 5

State 4:

function ::= FUNCTION * ID

Actions for state 4:

ID shift  13

State 13:

function ::= FUNCTION ID *

Actions for state 13:

{default} reduce 2

The conflict is gone! Because we excluded function from the lookaheads of the production $statement \rightarrow expression$, when the keyword function is found at state 0, the parser is shifted to a state where it needs to find an Identifier in order to successfully reduce $function$, or the input is malformed.

I have made a change to the Lemon LALR(1) parser generator that allows one to specify exclusion sets for a given rule. The modifier grammar above can be described as:

program ::= function.
program ::= statement.
function ::= FUNCTION ID.
statement ::= expression - [ FUNCTION ].
statement ::= LCURLY expression RCURLY.
statement ::= .
expression ::= FUNCTION maybe_id.
expression ::= NUMBER.
expression ::= ID.
maybe_id ::= ID.
maybe_id ::= .

The Lemon syntax for the exclusions is:

exclusions ::= MINUS LBRACKET list_of_exclusions maybe_comma RBRACKET
list_of_exclusions ::= exclusion
list_of_exclusions ::= list_of_exclusions COMMA exclusion
exclusion ::= terminal
maybe_comma ::= COMMA.
maybe_comma ::= .

Exclusions must come after all right-hand side symbols in a grammar rule, immediately before the dot.

The source for the Lemon parser generator can be found below:
lemon.c
lemon.c.diff

Have fun!