Honalee LR(k) Algorithm

The Honalee LR(k) Algorithm
David R. Tribble (david@tribble.com)
Revision 1.0, 2006-04-27

Abstract

The Honalee algorithm produces the minimal collection of canonical LR item sets for a given LR(k) grammar. The algorithm starts from the initial item set derived from an augmented grammar rule, and then generates all of the other canonical LR item sets for the grammar. Item sets are merged where possible immediately after they are generated using a "merge as you go" strategy. The algorithm terminates when no more item sets can be generated. The resulting collection of item sets is the minimum number of sets possible for the given grammar. State tables for a bottom-up LR parser can then be constructed from the resulting item sets, providing the smallest tables necessary for an LR parser to recognize sentences from the grammar.

Background

The Honalee algorithm was invented as a result of trying to find the most efficient way to construct item sets for a canonical LR(k) parser for a given grammar. The original idea was to create all of the canonical LR(k) parser states and then merge them, so that the result would be the smallest possible collection of states capable of parsing the given grammar.

The traditional approach to canonical LR(k) parsers is to generate the entire set of parser states (item sets), but this is impractical since most real-world grammars produce several hundred to several thousand states. This is inherent in the nature of canonical LR(k) parser state construction, which builds states representing the most amount of information about the context recognized at any given point during a parse. There is a lot of redundancy in the LR states generated and many of the states are nearly identical, typically differing only by their look-ahead symbols.

These near-identical states can be merged to form a much smaller collection of states that still recognizes the same grammar, although with slightly different parsing behavior.

LALR parsing was invented as a variation on this idea, using the look-ahead symbols of each state to combine similar states. This technique combines most of the power of canonical LR parsing with the small state set size of SLR parsing. For many grammars (such as those used for most typical programming languages), LALR(1) parsing is sufficient and provides very practical parser implementations.

The drawback to LALR parsing is that it is not as powerful as LR parsing, and is not able to handle as wide a class of grammars. We desire the means to construct canonical LR parsing tables that are about the same size as LALR parsing tables, thus having the best of both worlds, i.e., efficient parsing of complex grammars provided by LR parsers, combined with the relatively small memory requirements of LALR parsers. A secondary goal is to find reasonably efficient algorithms for doing all of this, so that it is possible to implement a compiler generator that exhibits reasonable performance for real-world grammars.

To this end, we start with the notion of generating the entire set of canonical LR(k) item sets for a grammar and then merging nearly identical sets in order to reduce the total number of item sets needed to parse the grammar. While possible in theory, this approach is rather naive and would require rather large amounts of both memory and time. So instead the problem is approached with the idea of merging item sets as they are built, a "merge as you go" approach, if you will. This approach is predicated on the theory that as an item set is generated, it may be merged with another previously generated item set, assuming that the two sets are similar enough and can be merged without creating conflicts. If there does not exist a previously created set that can be merged with the newly created set, the new set is added to the collection of item sets for the grammar. Eventually, all of the item sets required for the grammar will be generated, and if all of the set merging goes well, the result should be a minimally sized collection of item sets that can be converted into parser states.

There are two criteria for merging two item sets:
a) both sets must have identical item cores, and
b) merging the two sets will not introduce a reduce/reduce conflict.

At this point, a merged LR, or MLR, parser set construction algorithm is introduced which operates in this merge as you go fashion. This algorithm is essentially a fairly simple modification of the standard canonical LR parser construction algorithm.

Theory

The Honalee LR parser construction algorithm is predicated on some assumptions, which are presented here as theoretical conjectures.

MLR Conjecture 1
Merging item sets as soon as possible results in a collection of item sets that is identical to the collection that results from merging item sets after all item sets have been generated.

MLR Corollary to Conjecture 1
If two item sets having identical LR(0) kernel item cores cannot be merged (because merging would introduce a reduce/reduce conflict), the two sets will not be merged regardless of whether an attempt is made to merge them as soon as possible (i.e., immediately afer one of the sets is generated) or as late as possible (i.e., waiting until after all distinct item sets are generated before).

MLR Conjecture 2
Merging an item set with another candidate item set having the the same LR(0) kernel item cores is a superoperation of replacing the item set with an identical item set having the same LR(k) kernel items. The merging operation will not allow a given set to be merged with a different set if there also exists another set identical to it.

The Honalee LR Algorithm

Two-Phase Algorithm

The algorithm consists of a short initialization step which creates the initial item set from the augmented grammar, followed by a main loop wherein all the other item sets for the grammar are generated. The key to the algorithm is the separation of the main loop into two phases, the first phase to generate new item sets and the second phase to merge the new sets into existing sets where possible.

The main loop employs an "incomplete" list, which holds item sets that were created in a previous iteration of the main loop but which have not yet been processed. As the algorithm progresses, new item sets are generated from existing sets previously created and are added to the "incomplete" list.

Each iteration of the main loop begins with phase 1, which takes the first item set from the "incomplete" list and processes it. This item set, called the "come-from" set, is processed, changing it from an "incomplete" state to a "completed" state. First, closure items are added to the set. Next, all of the items in the set calling for reduce actions are marked. Finally, new "transition" item sets are generated from the shift items in the set.

These new "transition" item sets are derived from items within the come-from set that shift a symbol, i.e., those that call for a shift action instead of a reduce action. All of the items shifting the same symbol are grouped together, and result in the creation of one new transition item set. This new set is the item set that the come-from set "transitions" to after shifting the symbol (i.e., the "goto" action that results when the the item set is converted into an LR parsing table entry). As each new transition item set is generated, it is added to a "to-do" list. At this point, each newly created transition item set consists of only its kernel items (none of its closure items have been added yet), and none of its shift or reduce actions have been marked.

When phase 1 completes, the come-from set is marked as "completed" and moved from the incomplete list to the "completed" list. At this point, the to-do list contains all of the new transition item sets generated from the come-from set. Assuming that the come-from set is not processed again (which can happen as a result of a subsequent merging operation), it is now in a "completed" state, containing all of its kernel and closure items, and having all of its reduce and shift actions marked appropriately.

Phase 2 of the main loop processes all the transition item sets that were generated from the come-from set in phase 1. Each new item set is taken from the to-do list and is merged, where possible, with a previously existing item set in the incomplete or completed lists. (The details of item set merging are described below.) Any new item sets that cannot be merged are moved to the incomplete list for later processing.

At the end of phase 2, the to-do list is completely empty, and all of the item sets on it have either moved to the incomplete list or have been merged with other item sets already in the done or incomplete list.

The main loop continues in this fashion, generating new transition item sets and merging them with existing sets as it can, popping item sets from the incomplete list in each iteration, until eventually the incomplete list is empty. At this point, all of the items sets that remain have been moved to the done list, and the algorithm terminates. The final collection of item sets in the done list is the resulting output of the algorithm.

Algorithm Pseudo-Code

The Honalee LR(k) item set generation algorithm is shown here in pseudo-code form.

The Honalee LR Algorithm [Simplified]

create initial item set and add it to toDoList; while (toDoList is not empty) or (incList is not empty), // Phase 1 pop first incomplete item set in incList; generate new transition item sets from shift items in set, adding them to the toDoList; mark set as complete and move set to done list; // Phase 2 for each set in the toDoList, generate closure items and mark all reduction items in set; if set can be merged with existing gSet, merge set into gSet; if gSet.complete and merging added shift items to gSet, reset shift actions in gSet mark gSet as incomplete and move gSet to incList; discard set; if set was not discarded, move set to incList; end while;

The Honalee LR Algorithm

// Initialize doneList = empty; incList = empty; toDoList = empty; comeFrom = null; setCount = 0; // Create initial item set (i0) create initial item set i0, containing the initial kernel item constructed from the augmented grammar rule: '$accept : $goal', [$accept -> . $goal, $end]*; append item set i0 to toDoList; // Construct all item sets mainLoop: while incList is not empty, // Phase 1 comeFrom = null; set = first (incomplete) item set in incList; comeFrom = set; // Generate shift transitions from the incomplete set transitionLoop: for each item in set, if item is a shift with no action, [A -> B . S X, u], // Create a new transition set by shifting S S = shifted symbol S in item; create new transition item set newSet; append newSet to toDoList; // Mark all items shifting S in set for each shift shItem in set, if marked symbol of shItem is S, [A -> H . S J, v], create new kernel item k, [A -> H S . J, v]*; add item k to newSet; mark shItem.action as (S, newSet); end if; end for; end if; end transitionLoop; // Item set is now complete mark set as complete; move set to end of doneList; // Phase 2 while toDoList is not empty, set = pop first incomplete item set from toDoList; // Prepare the transition set for possible merging generate closure items for set; mark all reduction items in set; // Attempt to merge set with an existing item set mergeLoop: for each gSet in doneList with identical kernel item cores as set, if no r/r conflicts arise from merging set and gSet, // Merge set into gSet merge set into gSet; if merging added items to gSet, increment mergedSetCount; // Fix previous transitions to the merged set for each item shItem in comeFrom if shItem.action is (S, set), change shItem.action to (S, gSet); // Fix merged set previously marked as complete if gSet.complete and merging added shift items to gSet, for each item shItem in gSet if gSet.action is (S, set), reset gSet.action; mark gSet as incomplete; move gSet from doneList to incList; end if; discard set; set = null; break mergeLoop; else the grammar is LR but not LALR; end if; end for; // Move the new (still incomplete) set to the incomplete list if set not null (was not discarded), increment setCount; set.number = setCount; append (incomplete) set to incList; end if; end while; end mainLoop;

Transition Item Set Generation

Given an incomplete item set to process, called the "come-from" set, phase 1 of the algorithm generates all of the "transition" sets resulting from the items in the come-from set with shift actions.

For example, consider the following come-from item set, which consists of four kernel items and two non-kernel closure items:

i3. shift('+') from i2
  1 [Expr -> Expr '+' . Term,  '+' ]k  (-, -)
  2 [Expr -> Expr '+' . Term,  '*' ]k  (-, -)
  3 [Term -> '+' . Term,       '+' ]k  (-, -)
  4 [Term -> '+' . Term,       '*' ]k  (-, -)
  5 [Term -> . Factor,         '+' ]   (-, -)
  6 [Term -> . Factor,         '*' ]   (-, -)

All six items in this item set have shift actions, and none are reduction items. Four of the items transition our of this set by shifting a Term symbol, and the other two shift a Factor symbol:

i3. shift('+') from i2
  1 [Expr -> Expr '+' . Term,  '+' ]k  (Term,   x6)  shift
  2 [Expr -> Expr '+' . Term,  '*' ]k  (Term,   x6)  shift
  3 [Term -> '+' . Term,       '+' ]k  (Term,   x6)  shift
  4 [Term -> '+' . Term,       '*' ]k  (Term,   x6)  shift
  5 [Term -> . Factor,         '+' ]   (Factor, x7)  shift
  6 [Term -> . Factor,         '*' ]   (Factor, x7)  shift

The two symbol shifts cause two new transition item sets to be generated from the come-from set:

x6. shift(Term) from i3
  1 [Expr -> Expr '+' Term .,  '+' ]k  (-, -)
  2 [Expr -> Expr '+' Term .,  '*' ]k  (-, -)
  3 [Term -> '+' Term .,       '+' ]k  (-, -)
  4 [Term -> '+' Term .,       '*' ]k  (-, -)

x7. shift(Factor) from i3
  1 [Term -> Factor .,         '+' ]k  (-, -)
  2 [Term -> Factor .,         '*' ]k  (-, -)

These new item sets are added to the to-do list as they are generated during phase 1. Note that the actions for the items in these sets are empty at this point. The reduction actions will be filled in during phase 2 of the same iteration of the main loop, and the shift actions will be filled in during phase 1 of later iterations.

Finding Candidate Item Sets For Merging

During phase 2 of a given iteration of the main loop, item sets are removed from the to-do list, having been added in phase 1 of the same loop iteration. As each set is removed, it is compared to the previously created sets that already reside in the done and incomplete lists to determine if it can be merged with any of them.

If an item set is found that contains the same core as the to-do set, it is considered as a candidate set for merging. The core of an item set are its LR(k) items stripped of their look-ahead symbols. For example, consider the following item set:

i3. shift('+') from i2
  1 [Expr -> Expr '+' . Term,  '+' ]k
  2 [Expr -> Expr '+' . Term,  '*' ]k
  3 [Term -> '+' . Term,       '+' ]k
  4 [Term -> '+' . Term,       '*' ]k
  5 [Term -> . Factor,         '+' ]
  6 [Term -> . Factor,         '*' ]

The core of this item set consists of the equivalent LR(0) items, which have no look-ahead symbols:

i3 core. shift('+') from i2
  1c [Expr -> Expr '+' . Term]k
  2c [Term -> '+' . Term     ]k
  3c [Term -> . Factor       ]

Other item sets that consist of the same core items are possible candidates for merging with this set. For example, the following set is such a candidate set:

i11. shift('+') from i5
  1 [Expr -> Expr '+' . Term,  '*' ]k
  2 [Expr -> Expr '+' . Term,  '(' ]k
  3 [Term -> '+' . Term,       '*' ]k
  4 [Term -> '+' . Term,       '(' ]k
  5 [Term -> . Factor,         '*' ]
  6 [Term -> . Factor,         '(' ]

Even though set i11 contains items with different look-ahead symbols than set i3, it consists of the same LR(0) core items:

i11 core. shift('+') from i5
  1c [Expr -> Expr '+' . Term]k
  2c [Term -> '+' . Term     ]k
  3c [Term -> . Factor       ]

As a point of efficiency, note that the cores of the two item sets have the same kernel items. As a matter of fact, only the kernel items need to be considered, i.e., the non-kernel items of a set can be ignored when comparing the cores of two item sets. This is due to the fact that the non-kernel items are closure items which are generated from the kernel items. A given set of kernel items will produce exactly the same closure items as any other set consisting of the same kernel items. Thus we can consider the non-kernel closure items as extraneous information for the purposes of comparing item set cores. Ignoring these items makes the comparison operation more efficient.

Merging Item Sets

Once a candidate item set has been found for possible merging with another set, it must be examined further to ensure that merging the two sets will not introduce a reduce/reduce conflict.

Consider the following item set i3, and the item set i11 which is a candidate for merging into i3:

i3. shift('+') from i2
  1 [Expr -> Expr '+' . Term,  '+' ]k  (Term,   x6)
  2 [Expr -> Expr '+' . Term,  '*' ]k  (Term,   x6)
  3 [Term -> '+' . Term,       '+' ]k  (Term,   x6)
  4 [Term -> '+' . Term,       '*' ]k  (Term,   x6)
  5 [Term -> . Factor,         '+' ]   (Factor, x7)
  6 [Term -> . Factor,         '*' ]   (Factor, x7)

i11. shift('+') from i5
  1 [Expr -> Expr '+' . Term,  '*' ]k  (Term,   x6)
  2 [Expr -> Expr '+' . Term,  '(' ]k  (Term,   x6)
  3 [Term -> '+' . Term,       '*' ]k  (Term,   x6)
  4 [Term -> '+' . Term,       '(' ]k  (Term,   x6)
  5 [Term -> . Factor,         '*' ]   (Factor, x7)
  6 [Term -> . Factor,         '(' ]   (Factor, x7)

Merging these two item sets results in the following combined items (duplicate items are discarded):

i3+i11 merged. shift('+') from i2,i5
  1 [Expr -> Expr '+' . Term,  '+' ]k  (Term,   x6)
  2 [Expr -> Expr '+' . Term,  '*' ]k  (Term,   x6)
  3 [Expr -> Expr '+' . Term,  '(' ]k  (Term,   x6)
  4 [Term -> '+' . Term,       '+' ]k  (Term,   x6)
  5 [Term -> '+' . Term,       '*' ]k  (Term,   x6)
  6 [Term -> '+' . Term,       '(' ]k  (Term,   x6)
  7 [Term -> . Factor,         '+' ]   (Factor, x7)
  8 [Term -> . Factor,         '*' ]   (Factor, x7)
  9 [Term -> . Factor,         '(' ]   (Factor, x7)

There are situations, however, where two item sets cannot be merged, because doing so would introduce a reduce/reduce conflict. Consider the following two item sets, which have the same core kernel items:

i7. shift(Factor) from i3
  1 [Term -> Factor .,            '+' ]k  ('+', r5)
  2 [Term -> Term '+' Factor .,   '*' ]k  ('*', r6)

i9. shift(Factor) from i5
  1 [Term -> Factor .,            '+' ]k  ('+', r6)
  2 [Term -> Term '+' Factor .,   '*' ]k  ('*', r5)

Merging these two item sets results in an item set with two reduce/reduce conflicts:

i7+i9. shift(Factor) from i3,i5
  1 [Term -> Factor .,            '+' ]k  ('+', r5)  r/r conflict
  2 [Term -> Factor .,            '+' ]k  ('+', r6)  r/r conflict
  3 [Term -> Term '+' Factor .,   '*' ]k  ('*', r6)  r/r conflict
  4 [Term -> Term '+' Factor .,   '*' ]k  ('*', r5)  r/r conflict

One of the reduce/reduce conflicts comes from the two items that reduce on look-ahead symbol '+', with one item calling for a reduction by rule 5 and the other for a reduction by rule 6. Similarly, the other conflict comes from the two items that reduce by rules 5 and 6 on look-ahead symbol '*'.

Such situations are handled by choosing not to merge the two item sets, and to instead leave them as separate item sets. The resulting collection of canonical LR(k) item sets will therefore contain extra sets, which are converted into extra parser states in an LR parser table. These extra states reflect the complexity of the grammar, indicating that it is a full LR grammar and not a simpler LALR or SLR grammar.

As a matter of fact, the resulting canonical LR item sets is the minimum collection of sets capable of recognizing the full LR grammmar, comprising the same item sets produced if the grammar is LALR or SLR, plus any additional sets to handle the additional parsing complexity if the grammar is not LALR or SLR.

Converting Item Sets to Parser States

The item sets generated by the algorithm are transformed into parser states for an LR parser. Each reduction item is transformed into a reduce action, each item that shifts a terminal symbol is transformed into a shift action, and each item that shifts a nonterminal symbol is transformed into a goto action.

Consider the following example item sets:

i3. shift('+') from i2,i5
  1 [Expr -> Expr '+' . Term,  '+' ]k  (Term,   i6)
  2 [Expr -> Expr '+' . Term,  '*' ]k  (Term,   i6)
  3 [Expr -> Expr '+' . Term,  '(' ]k  (Term,   i6)
  4 [Term -> '+' . Term,       '+' ]k  (Term,   i6)
  5 [Term -> '+' . Term,       '*' ]k  (Term,   i6)
  6 [Term -> '+' . Term,       '(' ]k  (Term,   i6)
  7 [Term -> . Factor,         '+' ]   (Factor, i7)
  8 [Term -> . Factor,         '*' ]   (Factor, i7)
  9 [Term -> . Factor,         '(' ]   (Factor, i7)

i4. shift(')') from i12,i14
  1 [Expr -> '(' Expr ')' .,   '+' ]k  ('+',    r8)
  2 [Expr -> '(' Expr ')' .,   '*' ]k  ('*',    r8)
  3 [Expr -> '(' Expr ')' .,   ')' ]k  ('*',    r8)
  4 [Expr -> '(' Expr ')' .,   $end]k  ($end,   r8)

These item sets are transformed into the following LR parser state table entries:

State Terminals Nonterminals Default

id '+' '*' '(' ')' $end Expr Term Factor

3 - - - - - - - s6 s7 -

4 - r8 r8 - r8 r8 - - - r8

State	Terminals	Nonterminals	Default
`id`	`'+'`	`'*'`	`'('`	`')'`	`$end`	`Expr`	`Term`	`Factor`
3	-	-	-	-	-	-	-	s6	s7	-
4	-	r8	r8	-	r8	r8	-	-	-	r8

Final Words

The Honalee Algorithm, as far as the author knows, is the first canonical LR(k) parser construction algorithm of its kind. It makes possible practical implementations of LR(k) parsers.

One last thing, concerning the name of the algorithm (with apologies to Peter Yarrow and Leonard Tipton):

Puff The Merging Dragon
Puff, the merging dragon lived by the sea
And frolicked in the item mist in a land called Honalee.
Little Jackie Parser loved that rascal Puff
And brought him strings and token streams and other lexer stuff.

Together they would travel on a boat with shifting stacks
Jackie kept a look-ahead for Puff's reduction acts.
Noble scripts and programs would bow when 'ere they ran
And semantic acts would set their flags when Puff derived his scan.

Oh, Puff, the merging dragon lived by the sea
And frolicked in the item mist in a land... called... Honalee.

(This should have some symbolic meaning to those familiar with the Red Dragon book [1], or its older edition, the Green Dragon book [2]. Okay, so it's not great poetry - the author admits that his expertise lies in the area of programming and not in the more sublime waters of poetastry.)

References

[1] Compilers - Principles, Techniques, and Tools: Alfred V. Aho, Ravi Sethi, and Jeffery D. Ullman,
Jan 1986, Addison-Wesley, ISBN 0-201-10088-6.

This is the reference for compiler theory and contruction. Affectionately known as "the dragon book" because of the red dragon on the cover.
[2] Principles of Compiler Design: Aho and Ullman,
Aug 1977, Addison-Wesly, ISBN 0-201-00022-9.

The first edition of "the dragon book", featuring a green dragon on the cover.
[3] The Theory of Parsing, Translation, and Compiling: Volume I: Parsing
Alfred V. Aho and Jeffrey D. Ullman,
Jun 1972, Prentice-Hall, ISBN 0-13-914556-7.

An earlier but more detailed and academic treatment of the subject.
[4] The Theory of Parsing, Translation, and Compiling: Volume II: Compiling
Alfred V. Aho and Jeffrey D. Ullman,
1973, Prentice-Hall, ISBN 0-13-914564-8.
Section 7.3, Transformations On Sets Of LR(k) Tables,
Section 7.4, Techniques For Constructing LR(k) Parsers, and
Section 7.5, Parsing Automata

The second half of their 1973 work, with sections covering LR(k) parsers in more detail.

The Honalee LR(k) Algorithm