David R. Tribble
Revision 1.0, 2006-01-16
The Honalee algorithm produces the minimal collection of canonical LR item sets for a given LR(k) grammar. This document details the evolution of the algorithm from its original conception to its final form.
The Honalee algorithm was invented as a result of trying to find the most efficient way to construct item sets for a canonical LR(k) parser for a given grammar. The original idea was to create all of the canonical LR(k) parser states and then merge them, so that the result would be the smallest possible collection of states capable of parsing the given grammar.
The traditional approach to canonical LR(k) parsers is to generate the entire set of parser states (item sets), but this is impractical since most real-world grammars produce several hundred to several thousand states. This is inherent in the nature of LR(k) parser state construction, which builds states representing the maximum amount of information about the context recognized at any given point during a parse. There is a great deal of redundancy in the generated LR states, and many of the states are nearly identical, typically differing only in their look-ahead symbols. These near-identical states can be merged to form a much smaller collection of states that still recognizes the same grammar, although with slightly different parsing behavior.
To this end, LALR parsing was invented, which combines most of the power of canonical LR parsing with the small state set size of SLR parsing. For many grammars (such as those used for most typical programming languages), LALR(1) parsing is sufficient and provides very practical parser implementations.
The drawback to LALR parsing is that it is not as powerful as LR parsing, and is not able to handle as wide a class of grammars. What is desired is the means to construct canonical LR parsing tables that are about the same size as LALR parsing tables, thus having the best of both worlds, i.e., efficient parsing of complex grammars provided by LR parsers, combined with the relatively small memory requirements of LALR parsers. A secondary goal is to find reasonably efficient algorithms for doing all of this, so that it is possible to implement a compiler generator that exhibits reasonable performance for real-world grammars.
To this end, we start with the notion of generating the entire set of canonical LR(k) item sets for a grammar and then merging nearly identical sets in order to reduce the total number of item sets needed to parse the grammar. While possible in theory, this approach is rather naïve and would require large amounts of both memory and time. So instead the problem is approached with the idea of merging item sets as they are built, a "merge as you go" approach, if you will. This approach is predicated on the theory that as an item set is generated, it may be merged with another previously generated item set, assuming that the two sets are similar enough and can be merged without creating conflicts. If no previously created set can be merged with the newly created set, the new set is added to the collection of item sets for the grammar. Eventually, all of the item sets required for the grammar will be generated, and if all of the set merging goes well, the result should be a minimally sized collection of item sets that can be converted into parser states.
There are two criteria for merging two item sets:
a) both sets must have identical item cores, and
b) merging the two sets will not introduce a reduce/reduce conflict.
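These two criteria can be sketched in code. The following Python fragment is illustrative only; the Item representation and the names cores, has_rr_conflict, and can_merge are inventions for this sketch, not part of the algorithm as presented here.

```python
from typing import NamedTuple

class Item(NamedTuple):
    core: str        # rule with marker, e.g. "Expr -> Expr '+' Factor ."
    lookahead: str   # e.g. "'+'"

def cores(items):
    # The item cores: marked rules with look-ahead symbols ignored.
    return frozenset(it.core for it in items)

def has_rr_conflict(a, b):
    # A reduce/reduce conflict: two different rules reducible
    # on the same look-ahead symbol in the merged set.
    reducing = {}
    for it in list(a) + list(b):
        if it.core.endswith('.'):        # marker at far right => reduction
            if reducing.setdefault(it.lookahead, it.core) != it.core:
                return True
    return False

def can_merge(a, b):
    # Criterion (a): identical item cores; (b): no new r/r conflict.
    return cores(a) == cores(b) and not has_rr_conflict(a, b)
```

For example, two sets whose items differ only in their look-ahead symbols satisfy both criteria, while two sets that would reduce different rules on the same look-ahead do not.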
At this point, a merged LR, or MLR, parser set construction algorithm is introduced which operates in this "merge as you go" fashion. This algorithm is essentially a fairly simple modification of the standard canonical LR parser construction algorithm.
The Honalee LR parser construction algorithm is predicated on some assumptions, which are presented below as theoretical conjectures.
MLR Conjecture 1
Merging item sets as soon as possible results in a collection of item sets that is identical to the collection that results from merging item sets after all item sets have been generated.
MLR Corollary to Conjecture 1
If two item sets having identical LR(0) kernel item cores cannot be merged (because merging would introduce a reduce/reduce conflict), the two sets will not be merged regardless of whether an attempt is made to merge them as soon as possible (i.e., immediately after one of the sets is generated) or as late as possible (i.e., waiting until after all distinct item sets have been generated).
MLR Conjecture 2
Merging an item set with another candidate item set having the same LR(0) kernel item cores subsumes the operation of replacing the item set with an identical item set having the same LR(k) kernel items. The merging operation will not allow a given set to be merged with a different set if there also exists another set identical to it.
Canonical LR(k) item sets are generated for a given grammar, or set of syntax rules. From these item sets, parser states can be constructed to implement a parser that is capable of recognizing input sentences for the grammar.
The term "LR(k)" refers to this particular form of parsing, which is a "Left-to-right" scan with "Right-most" derivations requiring up to "k" look-ahead symbols. This means simply that the input sentences are scanned from left to right, looking ahead up to k symbols in advance, recognizing the rules of the grammar and deriving the resulting parse tree in a right-most fashion. Such a derivation is also called a "bottom-up" parse.
For the purposes of this document, a canonical LR(k) item set is depicted using a certain notation, as this example shows:
i6. goto ('+') from i2
 1  [Expr -> Expr '+' . Factor, '+']*    (Factor, i10)
 2  [Expr -> Expr '+' . Factor, '*']*    (Factor, i10)
 3  [Factor -> '+' . Factor, '+']*       (Factor, i10)
 4  [Factor -> '+' . Factor, '*']*       (Factor, i10)
 5  [Factor -> . '+' Factor, '+']        ('+', i11)
 6  [Factor -> . '+' Factor, '*']        ('+', i11)
 7  [Factor -> . Term, '+']              (Term, i12)
 8  [Factor -> . Term, '*']              (Term, i12)
 9  [Term -> . id, '+']                  (id, i13)
10  [Term -> . id, '*']                  (id, i13)
11  [Term -> . '(' Expr ')', '+']        ('(', i14)
12  [Term -> . '(' Expr ')', '*']        ('(', i14)
The notation used depicts item set number 6 (i6), which was generated by shifting symbol '+' out of state i2, and which is composed of 12 items. Each item is shown with its rule, shift position marker ".", and one or more look-ahead symbols, all enclosed within '[ ]' brackets. For instance, item 1 is the item derived from the rule:
Expr : Expr '+' Factor
The item has a shift marker "." positioned to the right of the Factor symbol, and a look-ahead symbol of '+'. The brackets are followed by an "*" mark to indicate that it is a kernel item; non-kernel items are not marked.
Following the brackets is the action for the item, enclosed within "( )" parentheses. Item actions are either reduction actions or shift actions. Reductions are designated with a look-ahead symbol(s) and the rule being reduced. Shifts are designated with the symbol being shifted and the item set to transition to (goto) after the shift occurs. Actions that have not yet been established (or "marked") for an item are specified with the notation "(-, -)".
Thus the items in the example item set shown above are implicitly designated as the following kinds of items:
i6. goto ('+') from i2
 1  [Expr -> Expr '+' . Factor, '+']*    (Factor, i10)    « kernel, shift
 2  [Expr -> Expr '+' . Factor, '*']*    (Factor, i10)    « kernel, shift
 3  [Factor -> '+' . Factor, '+']*       (Factor, i10)    « kernel, shift
 4  [Factor -> '+' . Factor, '*']*       (Factor, i10)    « kernel, shift
 5  [Factor -> . '+' Factor, '+']        ('+', i11)       « closure, shift
 6  [Factor -> . '+' Factor, '*']        ('+', i11)       « closure, shift
 7  [Factor -> . Term, '+']              (Term, i12)      « closure, shift
 8  [Factor -> . Term, '*']              (Term, i12)      « closure, shift
 9  [Term -> . id, '+']                  (id, i13)        « closure, shift
10  [Term -> . id, '*']                  (id, i13)        « closure, shift
11  [Term -> . '(' Expr ')', '+']        ('(', i14)       « closure, shift
12  [Term -> . '(' Expr ')', '*']        ('(', i14)       « closure, shift
As each item set is processed by the algorithm, three things are done to it. The first step is to generate its closure, which means generating and adding its closure items. For example, consider the following item set, which contains only kernel items when it is created and before it has been processed:
i6. goto ('+') from i2
1  [Expr -> Expr '+' . Factor, '+']*    (-, -)
2  [Expr -> Expr '+' . Factor, '*']*    (-, -)
3  [Factor -> '+' . Factor, '+']*       (-, -)
4  [Factor -> '+' . Factor, '*']*       (-, -)
Adding closure items to the item set involves examining each item whose shift marker is positioned immediately to the left of a nonterminal symbol. For example, closure items are created for item 1:
1 [Expr -> Expr '+' . Factor, '+']* (-, -)
The closure items created from this item result from all of the rules in the grammar having the nonterminal Factor as their left-hand side (LHS) symbol, for example:
r5. Factor : '+' Factor
r6. Factor : Term
From these rules, the following new non-kernel closure items are generated:
5  [Factor -> . '+' Factor, '+']    (-, -)
6  [Factor -> . Term, '+']          (-, -)
Each closure item has its shift marker positioned to the left of all of the right-hand side (RHS) symbols in the rule.
The look-ahead symbol for each closure item is derived from the First sets for the symbol immediately to the right of the shift marker in the generating item, which is symbol Factor in this example. The look-ahead symbol is the left-most symbol of the sequence formed from the First symbols for the RHS symbol following the marked symbol (Factor), appended with the look-ahead symbol of the generating item ('+'). Since symbol Factor is not followed by any other RHS symbols in item 1, the look-ahead symbols are taken from First(Factor) alone, which is only the symbol '+'.
The two new closure items are non-kernel items, so they are not marked with an "*", and their actions are empty at this point.
In a similar fashion, the other items are used to generate additional closure items, and the newly added closure items may also in turn generate additional closure items. The process of generating and adding closure items continues until no more new closure items can be added to the item set.
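The closure construction described above can be sketched as follows. This Python fragment is a simplified illustration for k = 1: the grammar encoding, the tuple item representation, and the naïve First() computation (which assumes no empty rules and no left recursion through the first RHS symbol) are all inventions for this sketch, not the algorithm's actual data structures.

```python
from collections import deque

# Invented encoding: LHS -> list of RHS alternatives (tuples of symbols).
GRAMMAR = {
    'Factor': [("'+'", 'Factor'), ('Term',)],
    'Term':   [('id',)],
}
NONTERMINALS = set(GRAMMAR)

def first(symbol):
    # Naive FIRST(1): a terminal yields itself (no empty rules assumed).
    if symbol not in NONTERMINALS:
        return {symbol}
    return set().union(*(first(rhs[0]) for rhs in GRAMMAR[symbol]))

def closure(kernel):
    # Items are (lhs, rhs, dot, lookahead) tuples.
    result = set(kernel)
    work = deque(kernel)
    while work:
        lhs, rhs, dot, la = work.popleft()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            # Look-aheads come from the FIRST set of what follows the
            # marked nonterminal, falling back to the generating item's
            # own look-ahead when nothing follows it.
            las = first(rhs[dot + 1]) if dot + 1 < len(rhs) else {la}
            for alt in GRAMMAR[rhs[dot]]:
                for la2 in las:
                    item = (rhs[dot], alt, 0, la2)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return result
```

Closing over the kernel item [Expr -> Expr '+' . Factor, '+'] with this toy grammar adds the non-kernel items for Factor and Term, mirroring the repeated-until-stable process described above.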
For example, generating closures for the item set shown above results in the following items:
i6. goto ('+') from i2
 1  [Expr -> Expr '+' . Factor, '+']*    (-, -)    « kernel
 2  [Expr -> Expr '+' . Factor, '*']*    (-, -)    « kernel
 3  [Factor -> '+' . Factor, '+']*       (-, -)    « kernel
 4  [Factor -> '+' . Factor, '*']*       (-, -)    « kernel
 5  [Factor -> . '+' Factor, '+']        (-, -)    « closure
 6  [Factor -> . '+' Factor, '*']        (-, -)    « closure
 7  [Factor -> . Term, '+']              (-, -)    « closure
 8  [Factor -> . Term, '*']              (-, -)    « closure
 9  [Term -> . id, '+']                  (-, -)    « closure
10  [Term -> . id, '*']                  (-, -)    « closure
11  [Term -> . '(' Expr ')', '+']        (-, -)    « closure
12  [Term -> . '(' Expr ')', '*']        (-, -)    « closure
Reduction actions occur only for items with the shift marker positioned to the right of all of the right-hand side (RHS) symbols of the rule. Such an item specifies that the entire rule has been recognized and can be reduced, replacing all of the RHS symbols of the rule on the parser stack with the single left-hand side (LHS) symbol of the rule.
For example, the following item set contains two reduction items, which are marked with their reduce actions (rule r3) appropriately:
i8. goto (Factor) from i5
1  [Expr -> Expr '+' Factor ., '+']*    ('+', r3)    « reduce
2  [Expr -> Expr '+' Factor ., '*']*    ('*', r3)    « reduce
These items specify that whenever the parser reaches this item set (or parser state, which is built from this item set), it is to perform a reduction action for rule r3 if the current input (look-ahead) symbol is '+' or '*'.
Shift actions occur on all items where the shift marker is positioned to the left of some RHS symbol. That symbol is shifted, resulting in a transition to another item set (a "shift transition"). That new transition item set is composed of the same rule(s) from the original item set (the "come-from" set), with the shift marker moved to the right of the shifted symbol.
For example, consider the shift items in item set i6 shown above:
i6. goto ('+') from i2
 1  [Expr -> Expr '+' . Factor, '+']*    (Factor, i10)    « shift
 2  [Expr -> Expr '+' . Factor, '*']*    (Factor, i10)    « shift
 3  [Factor -> '+' . Factor, '+']*       (Factor, i10)    « shift
 4  [Factor -> '+' . Factor, '*']*       (Factor, i10)    « shift
 5  [Factor -> . '+' Factor, '+']        (Factor, i10)    « shift
 6  [Factor -> . '+' Factor, '*']        (Factor, i10)    « shift
 7  [Factor -> . Term, '+']              (Term, i11)      « shift
 8  [Factor -> . Term, '*']              (Term, i11)      « shift
 9  [Term -> . id, '+']                  (id, i12)        « shift
10  [Term -> . id, '*']                  (id, i12)        « shift
11  [Term -> . '(' Expr ')', '+']        ('(', i13)       « shift
12  [Term -> . '(' Expr ')', '*']        ('(', i13)       « shift
The shift items in i6 above generate four new transition item sets, composed of new kernel items:
i10. goto (Factor) from i6
 1  [Expr -> Expr '+' Factor ., '+']*    (-, -)    « kernel
 2  [Expr -> Expr '+' Factor ., '*']*    (-, -)    « kernel
 3  [Factor -> '+' Factor ., '+']*       (-, -)    « kernel
 4  [Factor -> '+' Factor ., '*']*       (-, -)    « kernel
 5  [Factor -> '+' . Factor, '+']*       (-, -)    « kernel
 6  [Factor -> '+' . Factor, '*']*       (-, -)    « kernel

i11. goto (Term) from i6
 7  [Factor -> Term ., '+']*             (-, -)    « kernel
 8  [Factor -> Term ., '*']*             (-, -)    « kernel

i12. goto (id) from i6
 9  [Term -> id ., '+']*                 (-, -)    « kernel
10  [Term -> id ., '*']*                 (-, -)    « kernel

i13. goto ('(') from i6
11  [Term -> '(' . Expr ')', '+']*       (-, -)    « kernel
12  [Term -> '(' . Expr ')', '*']*       (-, -)    « kernel
These newly created transition item sets do not yet have their closure items, and their actions have not yet been marked.
The items that denote shifting terminal symbols, such as id and '(', specify that the parser is to shift those symbols onto the parser stack if the parser reaches these item sets (parser states) and the current input (look-ahead) symbol matches the shift symbol.
The items that denote shifting nonterminal symbols, such as Factor and Term, specify the action that the parser takes after a reduction action, which pops zero or more RHS symbols off the parser stack and replaces them with a single LHS nonterminal symbol. If the exposed nonterminal symbol matches one of the shifted nonterminal symbols for these item sets (parser states), the parser performs a goto action, which leaves the exposed nonterminal symbol on the top of the parser stack and indicates which item set (parser state) to transition to next. Thus a goto is, for a nonterminal symbol, the equivalent of a shift action for a terminal symbol.
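The grouping of shift items by shifted symbol into new transition kernels can be sketched like this; the tuple item encoding and the function name transitions are invented for illustration, not taken from the algorithm text.

```python
from collections import defaultdict

def transitions(item_set):
    # Items are (lhs, rhs, dot, lookahead) tuples.  For every item whose
    # marker sits to the left of some symbol, advance the marker over that
    # symbol; items shifting the same symbol land in the same new kernel.
    kernels = defaultdict(set)
    for lhs, rhs, dot, la in item_set:
        if dot < len(rhs):
            kernels[rhs[dot]].add((lhs, rhs, dot + 1, la))
    return dict(kernels)
```

Applied to a set like i6 above, this yields one kernel per shifted symbol, each containing the advanced-marker kernel items of the corresponding goto set.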
The first attempt at an algorithm for constructing MLR item sets (circa Jun 2001) is a simple, naïve approach.
An "incomplete" and a "done" list of item sets are maintained. Each item set in the incomplete list is visited exactly once, popping it from the list. Closure items are added to the item set, and items with reduction actions are marked appropriately. Then all of the items with shift actions are used to generate new transition item sets, one for each unique shifted symbol. These new shift transition item sets are added to the incomplete list. After all the shift item actions are marked, the item set is marked as "complete" and moved to the "done" list.
The algorithm repeats this for all the item sets in the incomplete list until no more remain in the list. The item sets in the done list are the output (the result) of the algorithm, which is the minimal collection of item sets needed to recognize the grammar.
A simplified form of the algorithm looks like:
MLR Algorithm - Revision 1 (Jun 2001), Simplified

create initial item set i0;
add item set i0 to doneList;

while incomplete sets exist in doneList,
    pop next incomplete set from doneList;
    for each item in set,
        if item is a shift,
            create a new shift transition set newSet;
            generate closure items for newSet;
            if gSet exists in doneList identical to newSet,
                replace newSet with gSet;
            if newSet was not replaced by gSet,
                for each gSet in doneList with identical item cores as newSet,
                    if no r/r conflicts arise from merging newSet and gSet,
                        merge newSet into gSet;
                        break for;
            if newSet was not merged with gSet,
                add incomplete newSet to doneList;
    mark set as complete;
end;
The complete version of the algorithm is:
MLR Algorithm - Revision 1 (Jun 2001)

// Initialize
doneList = empty;
create initial item set i0,
    containing the initial kernel item constructed from
    the augmented grammar rule: '$accept : $goal',
    [$accept -> . $goal, $end]*;
add item set i0 to doneList;

// Construct all item sets
workLoop:
while incomplete sets exist in doneList,
    set = next incomplete item set in doneList;

    // Generate transitions from the incomplete set
    transitionLoop:
    for each item in set,
        if item is a reduce,
            mark item.action as a reduction;
        else
            // Item is a shift item: [A -> B . S X, u],
            // Create a new transition set by shifting S
            S = shifted symbol S in item;
            create new transition item set newSet;
            generate closure items for newSet;

            // Mark all items shifting S in set
            for each shift shItem in set,
                if marked symbol of shItem is S,
                    [A -> H . S J, v],
                    create new kernel item k,
                        [A -> H S . J, v]*;
                    add item k to newSet;
                    mark shItem.action as (S, newSet);
                end if;
            end for;

            // Attempt to find an existing set identical to newSet
            if gSet exists in doneList identical to newSet,
                replace newSet with gSet;
                for each item shItem in set,
                    if shItem.action is (S, newSet),
                        change shItem.action to (S, gSet);
                end for;
                newSet = null;
            end if;

            if newSet not null,
                // Attempt to merge newSet with an existing item set
                mergeLoop:
                for each gSet in doneList with identical item cores as newSet,
                    if no r/r conflicts arise from merging newSet and gSet,
                        merge newSet into gSet;
                        for each item shItem in set shifting S,
                            change shItem.action to (S, gSet);
                        break mergeLoop;
                    end if;
                end for;
            end if;

            if newSet not null and newSet was not merged with gSet,
                add incomplete newSet to doneList;
            end if;
    end transitionLoop;

    // Item set is now complete
    mark set as complete;
end workLoop;
There is a fatal flaw in the first version of the algorithm, however. Some grammars result in item sets that generate shift transitions to themselves. Consider, for example, the following grammar rules:
r4. Expr   : Expr '+' Factor
r5. Factor : '+' Factor
These rules result in the following item set being constructed:
i6.
1  [Expr -> Expr '+' . Factor, #]*    (Factor, i7)
2  [Factor -> '+' . Factor, #]*       (Factor, i7)
3  [Factor -> . '+' Factor, #]        ('+', i6)
Items 1 and 2 of the set are the kernel items generated by shifting symbol '+' from some other item set. Item 3 is a non-kernel closure item that is added as the closure of item 2. It shifts symbol '+', which results in the creation of a new transition set i8 that is identical to i6. Item sets i8 and i6 will be merged, so that shifting a '+' symbol from i6 results in a transition to i6 itself. This kind of situation commonly arises when constructing item sets for grammars with recursive production rules.
This kind of situation causes a problem in the algorithm above, because the merging of i8 and i6 occurs while transitions are still being generated from the items in i6. Merging new items from i8 into the existing set i6 disrupts the transitionLoop of the algorithm, which is still iterating over all of the items in set i6 at the time the merging occurs. Specifically, new items (if any) added into i6 from i8 might be added at a point within the collection of items in set i6 such that the transitionLoop will miss them in subsequent iterations. It might also be the case, depending on how the iteration over the items within the item set is implemented, that altering the collection of items in i6 during the merge operation will completely corrupt the transitionLoop that is currently iterating over those same items. In any case, this is a subtle but fatal flaw in the algorithm.
In order to rectify this problem, note that it would be more effective to generate all of the new transitions sets from a given set as a group, and only after this group of items is generated should any attempt be made to merge the new sets with existing sets. The simplest way to accomplish this is to separate the generation of transition sets into one phase of the algorithm and the merging of those sets into another phase.
The item set generation phase employs an additional "to-do" list to hold the new transition sets as they are created, which is separate from the "done" list that holds the previously generated item sets. After all of the transition sets are generated from a given set (the "come-from" set), the items in the "to-do" list are then merged with existing sets in the second phase of the algorithm. Any corrections to the goto actions of items in the "come-from" set that are needed due to merging of the transition sets are then made after each successful merge operation.
This separation into two phases solves the loop corruption problem by guaranteeing that newly generated transition item sets are not merged into the very sets that generated them (as was the case with sets i8 and i6 in the example above) until after the algorithm has completed the generation of the shift transitions from the generating come-from set itself.
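The hazard, and the two-phase fix, can be demonstrated in miniature. This Python fragment is an invented analogy, not the algorithm itself: mutating a collection while iterating over it is exactly the kind of loop corruption the single-phase Revision 1 risks.

```python
def single_phase(items):
    # Mirrors Revision 1: merging into the very set we are iterating over.
    for it in items:
        if it == 'closure-item':
            items.add('merged-item')    # raises: set changed during iteration
    return items

def two_phase(items):
    pending = set()
    for it in items:                    # phase 1: only collect new material
        if it == 'closure-item':
            pending.add('merged-item')
    items |= pending                    # phase 2: merge after the loop ends
    return items
```

In Python the single-phase version fails loudly with a RuntimeError; in a language that silently tolerates the mutation, the iteration may instead skip or revisit elements, which is the subtler failure described above.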
The next version of the algorithm (Oct 2004) with the above correction separates the handling of the "done" list and the "to-do" list into two separate phases.
MLR Algorithm - Revision 2 (Oct 2004), Simplified

create initial item set i0;
append item set i0 to toDoList;

while (toDoList is not empty) or (incomplete sets exist in doneList),
    // Phase 1
    pop next incomplete item set from doneList;
    generate transitions from the incomplete set;
    for each item in set,
        create a new shift transition set newSet;
        generate closure items for newSet;
        if gSet exists in doneList identical to newSet,
            newSet = gSet;
        else
            append newSet to toDoList;
    mark set as complete;

    // Phase 2
    for each set in toDoList,
        for each gSet in doneList with identical item cores as set,
            if no r/r conflicts arise from merging set and gSet,
                merge set into gSet;
                break for;
        if set was not merged with gSet,
            add incomplete set to doneList;
end while;
MLR Algorithm - Revision 2 (Oct 2004)

// Initialize
doneList = empty;
toDoList = empty;
comeFrom = null;
create initial item set i0,
    containing the initial kernel item constructed from
    the augmented grammar rule: '$accept : $goal',
    [$accept -> . $goal, $end]*;
append item set i0 to toDoList;

// Construct all item sets
workLoop:
while (toDoList is not empty) or (incomplete sets exist in doneList),
    // Phase 1
    set = next incomplete item set in doneList;
    comeFrom = set;

    // Generate transitions from the incomplete set
    transitionLoop:
    for each item in set,
        if item is a reduce,
            mark item.action as a reduction;
        else
            // Item is a shift item: [A -> B . S X, u],
            // Create a new transition set by shifting S
            S = shifted symbol S in item;
            create new transition item set newSet;

            // Mark all items shifting S in set
            for each shift shItem in set,
                if marked symbol of shItem is S,
                    [A -> H . S J, v],
                    create new kernel item k,
                        [A -> H S . J, v]*;
                    add item k to newSet;
                    mark shItem.action as (S, newSet);
                end if;
            end for;

            generate closure items for newSet;

            // Attempt to find an existing set identical to newSet
            if gSet exists in doneList identical to newSet,
                newSet = gSet;
            else
                append newSet to toDoList;
            end if;
    end transitionLoop;

    // Item set is now complete
    mark set as complete;

    // Phase 2
    while toDoList is not empty,
        set = pop first incomplete item set from toDoList;

        // Attempt to merge set with an existing item set
        mergeLoop:
        for each gSet in doneList with identical item cores as set,
            if no r/r conflicts arise from merging set and gSet,
                merge set into gSet;
                for each item shItem in comeFrom,
                    if shItem.action is (S, set),
                        change shItem.action to (S, gSet);
                end for;
                break mergeLoop;
            end if;
        end for;

        if set was not merged with gSet,
            add incomplete set to doneList;
    end while;
end workLoop;
This version of the algorithm performs much better than the first version and does not suffer from any pathological cases. It is guaranteed to construct the minimal collection of item sets needed to implement an LR(k) parser for a grammar, and works for any finite k. (Typical implementations of this algorithm use k=1, but the algorithm can employ any fixed look-ahead length provided that a suitable algorithm is used to generate the required First_{k}( ) look-ahead sets for nonterminal symbols.)
During the execution of the algorithm, all of the canonical LR(k) item sets are created for the grammar. Starting with the initial item set derived from the augmented grammar rule, each group of shift transitions from a given item set causes new item sets to be generated. Each new item set is then compared to previously generated sets, and is merged with any that are duplicates of it or that share the same kernel item cores. The result is that every possible canonical LR(k) item set is generated exactly once from some shift transition, and the resulting merged item sets are the minimal collection needed to recognize the grammar.
If the grammar is LALR, the number of item sets (and thus the number of resulting parser states) is exactly the same as the number of sets constructed using the LALR construction algorithm [Aho and Ullman, 1974]. This is because all of the states that are implicitly merged in the LALR construction algorithm (due to the synthesizing of look-aheads) are identical to the states resulting from the MLR construction algorithm (due to the merging of similar states). If the LR grammar is more complex than LALR, the number of states resulting from the MLR algorithm is slightly greater than the number of LALR states (assuming that LALR states could be generated), reflecting exactly the places in the grammar where the parser requires more context, and thus more individual states, than what is possible with LALR parser construction.
This version of the algorithm can be improved further, however. Also, some details of the algorithm have been glossed over, which now need to be explored in greater detail.
First, note that it is not necessary to generate the closure items of a new transition item set immediately after it has been constructed. That task can therefore be deferred until just before any attempt is made to merge the set with an existing set (in phase 2).
This improvement saves some memory space, since the generation of closure items is done for each transition set, one at a time, just prior to the point at which an attempt is made to merge the transition set with an existing set. If the transition set is merged, its redundant items are discarded, which occurs before the closure of the next transition set is generated. In contrast, generating closures for the transition sets as they are created requires potentially more memory for items that would only be discarded later during the merge operation.
Secondly, note that all of the reduction items in a set must be marked as such just prior to merging. This is necessary because, in order to determine whether or not two item sets can be merged, it must be known whether merging their reduction items will produce a reduce/reduce conflict. This means that all of the reduction items in a set must be present in the set and their reduce actions established prior to the merging operation.
Also note that it is possible to add reduction items to a set as part of generating its closure. To illustrate this possibility, consider the following grammar rules:
r3. Call : id '(' Parm ')'
r4. Parm : (empty)
r5. Parm : num
These rules lead to the generation of the following item set:
i5.
[Call -> id '(' . Parm ')', # ]*
[Parm -> ., ')']
[Parm -> . num, ')']
The last two items are closures generated from the first item. The second item is also a reduction item, resulting from an empty production rule of the form 'A : empty'.
The algorithm can therefore be improved by marking all of the reduction items in a set just after all of its closure items are generated and just prior to any attempt to merge the set with existing sets (in phase 2). This also means that the transitionLoop (in phase 1) only needs to deal with the shift items in the set, since the reduction items will have already been marked at that point.
Another improvement comes from the realization that the "done" list contains both complete and incomplete item sets. On closer examination, note that as the algorithm proceeds down this list, all of the item sets marked as complete end up at the front of the list, while the incomplete sets remaining to be processed accumulate at the end of the list.
The operation of retrieving the next incomplete set from the "done" list can thus be specified more precisely by using an additional "incomplete" pointer that points to the first incomplete set in the tail of the "done" list. All of the complete sets occur in front of this pointer, and as incomplete sets are processed and marked as complete, the pointer is advanced to the next incomplete set in the list. Once the end of the list is reached, there are no more incomplete sets left to process, and the "done" list contains all of the complete item sets for the grammar.
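A minimal sketch of this single list with an advancing "incomplete" pointer follows; the class and method names are invented for illustration.

```python
class ItemSetList:
    """One list holding all item sets: complete sets in front,
    incomplete sets behind an advancing index."""

    def __init__(self):
        self.sets = []
        self.incomplete = 0     # index of the first incomplete set

    def append(self, item_set):
        self.sets.append(item_set)          # new sets join the tail

    def next_incomplete(self):
        if self.incomplete < len(self.sets):
            return self.sets[self.incomplete]
        return None                         # no incomplete sets remain

    def mark_complete(self):
        self.incomplete += 1                # advance past the finished set
```

When next_incomplete() returns None, the list holds exactly the complete item sets for the grammar.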
For example, consider this visual depiction of six item sets, the first three of which are complete (done) and the last three of which are incomplete:
doneList       incList
   |              |
   v              v
   i0 - i1 - i2 - i3 - i4 - i5    Item set list
   ------------   ------------
     Complete      Incomplete
       sets           sets
Another small improvement can be gained by swapping the phases of the outer workLoop so that processing new transition sets (generating their closures and moving them to the incomplete set list) occurs first (as phase 1), and the processing of incomplete sets (generating new transition sets from their shift items) occurs last (as phase 2). This improves the execution of the algorithm a tiny bit because every item set is first placed in the "to-do" list, then moved into the "done" list as an incomplete set, then finally marked as complete. Thus rearranging the phases within the outer loop directly reflects this progression (which, admittedly, is mostly an aesthetic improvement).
The result of these improvements is the next version (Oct 2004) of the algorithm.
MLR Algorithm - Revision 3A (Oct 2004)

// Initialize
doneList = empty;
incList = empty;
toDoList = empty;
comeFrom = null;
create initial item set i0,
    containing the initial kernel item constructed from
    the augmented grammar rule: '$accept : $goal',
    [$accept -> . $goal, $end]*;
append item set i0 to toDoList;

// Construct all item sets
mainLoop:
while (toDoList is not empty) or (incList is not empty),
    // Phase 1
    while toDoList is not empty,
        set = pop first incomplete item set from toDoList;

        // Prepare the transition set for possible merging
        generate closure items for set;
        mark all reduction items in set;

        // Attempt to merge set with an existing item set
        mergeLoop:
        for each gSet in doneList with identical item cores as set,
            if no r/r conflicts arise from merging set and gSet,
                merge set into gSet;
                for each item shItem in comeFrom,
                    if shItem.action is (S, set),
                        change shItem.action to (S, gSet);
                end for;
                set = null;
                break mergeLoop;
            end if;
        end for;

        if set not null,
            append set (still incomplete) to incList (doneList tail);
    end while;

    // Phase 2
    set = next item set in incList;
    comeFrom = null;

    // Generate transitions from the incomplete set
    transitionLoop:
    for each item in set,
        if item is a shift,
            [A -> B . S X, u],
            // Create a new transition set by shifting S
            S = shifted symbol S in item;
            create new transition item set newSet;
            comeFrom = set;

            // Mark all items shifting S in set,
            // creating new kernel items for the transition
            for each shift shItem in set,
                if marked symbol of shItem is S,
                    [A -> H . S J, v],
                    create new kernel item k,
                        [A -> H S . J, v]*;
                    add item k to newSet;
                    mark shItem.action as (S, newSet);
                end if;
            end for;

            // Attempt to find an existing set identical to newSet
            if gSet exists in doneList identical to newSet,
                newSet = gSet;
            else
                append newSet to toDoList;
            end if;
    end transitionLoop;

    // Item set is now complete
    mark set as complete;
    advance incList;
end mainLoop;
Delving deeper into the algorithm, notice that newly generated transition item sets (created in phase 2) contain only kernel items, since their closures are not generated until later in phase 1 of the next iteration of mainLoop.
Also note that once a new transition set is generated (from a group of shift items in the comeFrom set), the algorithm attempts to find an already existing set (in the "done" list) that is identical to this new transition set. This is done because, during the course of generating LR transition sets, the new sets quite frequently turn out to be duplicates of sets that have already been generated.
This simply reflects the fact that the process of constructing canonical LR item sets results in many redundant sets being generated, especially if the grammar contains nonterminals that occur in the RHS of many different rules. For example, many programming languages allow arithmetic expressions to be used within expression statements, as function call arguments, as variable initializers, as array type size declarations, etc. The fact that expressions occur in so many places within a grammar means that many identical transition item sets will be generated as the sets involving these different rules (i.e., these different contexts) shift the nonterminal symbols comprising the expression rules.
To determine whether or not a transition set is identical to an existing set, their items must be compared. Note that all of the non-kernel closure items in a given set are generated from the kernel items, and that a set with a given group of kernel items will produce exactly the same closure items as another set with the same starting kernel items. It can thus be concluded that it is sufficient to compare only the kernel items of two sets in order to determine whether they are identical or not. Comparing the non-kernel closure items is simply not necessary.
Along similar lines, comparing two item sets in order to determine whether or not they can be merged together involves comparing the cores of their items (i.e., comparing their items while ignoring the look-ahead symbols of the items). As noted above, all of the non-kernel closure items in a set are generated from the starting kernel items, so two sets with identical kernel item cores will end up with identical closure item cores as well. So the conclusion is that it is only necessary to compare the cores of the kernel items of each set, and that the non-kernel item cores can be safely ignored.
Taking advantage of these two observations can reduce the amount of time that the algorithm spends comparing item sets for both replacement (because a new transition set is a duplicate of an existing set) and for possible merging.
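As an illustration, the two kernel-only comparisons just described might be sketched as follows. This is a hypothetical Python representation; the `Item` type and the function names are illustrative, not taken from any actual implementation of the algorithm:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    """An LR item: a rule, a marker position, and a look-ahead symbol."""
    rule: int        # rule number within the grammar
    pos: int         # position of the marker within the rule's RHS
    lookahead: str   # look-ahead symbol (ignored when comparing cores)

    def core(self):
        """The core of an item ignores its look-ahead symbol."""
        return (self.rule, self.pos)

def identical(kernel_a, kernel_b):
    """Two item sets are identical iff their kernel items match exactly;
    closure items are derived from the kernels, so they need not be
    compared."""
    return set(kernel_a) == set(kernel_b)

def mergeable_cores(kernel_a, kernel_b):
    """Two item sets are candidates for merging iff their kernel item
    cores match (look-ahead symbols ignored)."""
    return {i.core() for i in kernel_a} == {i.core() for i in kernel_b}
```

For example, two sets whose kernels differ only in a look-ahead symbol are not identical but do have mergeable cores.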
At this point, a rather remarkable observation can be made: the process of replacing a duplicate item set with an already existing set is a sub-operation of merging two sets, or, looked at another way, merging two sets is a super-operation of eliminating a duplicate set. Specifically, duplicate item sets have identical kernel items, while similar sets that can be merged have identical kernel item cores. This leads to the conclusion that the step in phase 2 of looking for a duplicate set matching the newly created transition set is not necessary, since essentially the same operation will be performed in phase 1 of the next iteration, when the transition set is merged with a similar set. If the transition set is an identical duplicate of an existing set, then that existing set is precisely the set that will be found in phase 1 and into which the transition set will be merged.
The conclusion is that the search for duplicate existing sets is an unnecessary and costly operation, and further, that the process of merging a set with an existing set yields the same results that the duplicate replacement operation would have achieved. Thus duplicate search can be removed from phase 2.
If there is a desire to keep track of whether the merging operation is actually the merging of identical item sets, the merging operation can be modified so that it indicates whether the sets are identical or not (since it can determine that fact as the two sets are being merged).
These changes result in version 3B (Oct 2004) of the algorithm.
MLR Algorithm - Revision 3B (Oct 2004)

    // Initialize
    globList = null;
    incList = null;
    toDoList = null;
    comeFrom = null;
    nStates = 0;

    // Create the initial item set (i0)
    set = create initial item set (i0);
    set.complete = false;
    item = create kernel item from augmented grammar production,
        [G' -> . G, $]*;
    add item to set;
    append set to end of toDoList;

    [begin optional steps]
    [these steps are not necessary because set i1 will be generated]
    [from set i0 by the algorithm anyway]

    // Create the final item set (i1)
    set = create final item set (i1);
    set.complete = false;
    item = create kernel item from final goal production,
        [G' -> G ., $]* ($, accept);
    add item to set;
    append set to end of toDoList;

    [end optional steps]

    // Main loop
    // Generate all item sets for the grammar from the initial set
    mainLoop:
    while toDoList not empty or incList not empty,
        // Phase 1
        // Process item sets on the to-do list
        phase1:
        while toDoList not empty,
            set = pop toDoList;

            // Prepare the set for possible merging
            generate all closure items for set;        // Note A
            mark all reduce actions in the set;        // Note B

            // Attempt to merge the new item set with an existing set
            mergeLoop:
            for gSet in globList,
                if set and gSet can be merged,         // Note C
                    (they have the same core kernel items and there
                    are no reduce/reduce conflicts between them),
                    merge set into gSet,               // Note D
                        moving all non-duplicate items from set into gSet,
                        setting any shift/goto actions on symbol s
                        to the same gotos existing in gSet items;
                    (otherwise mark the grammar as non-LALR)  // Note E

                    // Fix transitions to the merged set
                    if comeFrom not null,
                        for shItem in comeFrom,        // Note F
                            if shItem.action is goto(t, set),
                                shItem.action = goto(t, gSet);

                    // Discard the new set, replace with merged gSet
                    set = null;
                    break mergeLoop;
                end if;
            end for;

            // Move the new set (still incomplete) to the global list
            if set not null,
                set.stateNumber = nStates++;
                append set to end of globList;
                if incList is null,
                    incList = set;
            end if;
        end phase1;

        // Phase 2
        // Process the next incomplete item set in the global list
        comeFrom = null;
        if incList not null,
            set = first incomplete set in incList;

            // Generate new transition sets for all shift items in set
            loop1:
            for item in set,
                if item.action not null,
                    skip item, continue loop1;

                // Create a new transition item set
                sym = item.markedSymbol, which is S in [A -> B . S C, t];
                newSet = create item set;
                newSet.complete = false;

                // Create new transition items for all items shifting S
                for shItem in set,
                    if shItem.markedSymbol is sym, [X -> H . S J, u],
                        k = create kernel item, [X -> H S . J, u];
                        add k to newSet;
                        shItem.action = goto(sym, newSet);
                end for;

                // Discard the transition set if it is redundant
                loop2:
                for gSet in globList,
                    if newSet and gSet have identical kernel items,
                        // Fix transitions to the duplicate new set
                        for shItem in set,
                            if shItem.action is goto(T, newSet),
                                shItem.action = goto(T, gSet);

                        // Discard the duplicate transition set,
                        // do not add it to the to-do list
                        newSet = null;
                        break loop2;
                    end if;
                end loop2;

                // Add the new transition set to the to-do list
                if newSet not null,
                    append newSet to end of toDoList;
                    comeFrom = set;
            end loop1;

            // Add the completed set to the global list
            set.complete = true;
            incList = set.next;
        end if;
    end mainLoop;
Further enhancements to the algorithm involve trying to reduce the total amount of memory it consumes. To that end, observe that many of the non-kernel items within each item set contribute little or no extra information about the contents of the set. Specifically, many non-kernel shift items are in fact redundant; for any group of items shifting the same symbol, only one of them is actually needed in order to properly convert the item set into a parser state.
Consider, for example, the following item set:
i18.
    [Expr -> . Expr '+' Factor, $end]*    (Expr, i19)
    [Expr -> '+' . Expr, $end]*           (Expr, i19)
    [Expr -> . Expr '+' Factor, '+' ]     (Expr, i19)    « redundant
    [Expr -> . '+' Expr, '+' ]            ('+', i20)
    [Expr -> . Expr '+' Factor, '(' ]     (Expr, i19)    « redundant
    [Expr -> . '+' Expr, '(' ]            ('+', i20)     « redundant
    [Factor -> . Term, '+' ]              (Term, i21)
    [Factor -> . Factor '*' Term, '+' ]   (Factor, i22)
    [Factor -> . Term, '(' ]              (Term, i21)    « redundant
    [Factor -> . Factor '*' Term, '(' ]   (Factor, i22)  « redundant
When this item set is converted into a parser state, the only information really required is the look-ahead symbols of the reduction items (of which this set has none) and the transition symbols of the shift items. For each group of items that shift the same symbol, though, only one item is actually needed in order to know the (symbol, goto) action for the resulting state. So in this example, all of the items marked redundant are not needed and can in fact be eliminated from the set without losing any important information.
Eliminating the redundant items yields the following compressed item set:
i18.
    [Expr -> . Expr '+' Factor, $end]*    (Expr, i19)
    [Expr -> '+' . Expr, $end]*           (Expr, i19)
    [Expr -> . '+' Expr, '+' ]            ('+', i20)
    [Factor -> . Term, '+' ]              (Term, i21)
    [Factor -> . Factor '*' Term, '+' ]   (Factor, i22)
One caveat that needs to be made about compressing item sets is that the item set must have been marked as complete before any of its redundant items can be safely eliminated.
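The compression step can be sketched as follows. This is a hypothetical Python rendering, not the original implementation; the flattened `Item` representation (a kernel flag plus a (shift symbol, goto set) action, with `None` marking a reduction item) is assumed for illustration:

```python
from collections import namedtuple

# A hypothetical flattened view of an item: whether it is a kernel item,
# and its (shift symbol, goto set) action (shift_sym is None for
# reduction items).
Item = namedtuple('Item', 'kernel shift_sym goto')

def compress(item_set):
    """Compress a *completed* item set: keep every kernel item and every
    reduction item, plus one non-kernel shift item for each
    (symbol, goto) action not already covered by a kernel item."""
    kept, seen = [], set()
    for it in item_set:                      # pass 1: kernel and reduce items
        if it.kernel or it.shift_sym is None:
            kept.append(it)
            if it.shift_sym is not None:
                seen.add((it.shift_sym, it.goto))
    for it in item_set:                      # pass 2: one representative
        if not it.kernel and it.shift_sym is not None:
            action = (it.shift_sym, it.goto)
            if action not in seen:           # first item for this action
                seen.add(action)
                kept.append(it)
    return kept
```

Applied to the example set i18 above, this keeps the two kernel items plus one item each for the ('+', i20), (Term, i21), and (Factor, i22) actions, matching the five items of the compressed set.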
This enhancement can save a substantial amount of memory, as well as speeding up the comparison of item sets as candidates for merging. Adding this enhancement results in version 3D (Nov 2004) of the algorithm.
MLR Algorithm - Revision 3D (Nov 2004)

    // Initialize
    doneList = empty;
    incList = empty;
    toDoList = empty;
    comeFrom = null;
    setCount = 0;

    // Create initial item set (i0)
    create initial item set i0, containing the initial kernel item
    constructed from the augmented grammar rule: '$accept : $goal',
        [$accept -> . $goal, $end]*;
    append item set i0 to toDoList;

    // Construct all item sets
    mainLoop:
    while (toDoList is not empty) or (incList is not empty),
        // Phase 1
        while toDoList is not empty,
            set = pop first incomplete item set from toDoList;

            // Prepare the transition set for possible merging
            generate closure items for set;
            mark all reduction items in set;

            // Attempt to merge set with an existing item set
            mergeLoop:
            for each gSet in doneList with identical kernel item cores as set,
                if no r/r conflicts arise from merging set and gSet,
                    if set and gSet are not identical,
                        increment mergedSetCount;
                    merge set into gSet;
                        if gSet.complete,
                            only kernel and reduce items from set
                            need to be merged into gSet;
                        else
                            all non-duplicate items from set
                            need to be merged into gSet;

                    // Fix previous transitions to the merged set
                    for each item shItem in comeFrom,
                        if shItem.action is (S, set),
                            change shItem.action to (S, gSet);

                    discard set;
                    set = null;
                    break mergeLoop;
                else
                    the grammar is LR but not LALR;
                end if;
            end for;

            // Move the new (still incomplete) set to the incomplete list
            if set not null (was not merged or a duplicate),
                increment setCount;
                set.number = setCount;
                append set (still incomplete) to incList (doneList tail);
            end if;
        end while;

        // Phase 2
        comeFrom = null;
        if incList not empty,
            set = first (incomplete) item set in incList;

            // Generate shift transitions from the incomplete set
            transitionLoop:
            for each item in set,
                if item is a shift, [A -> B . S X, u],
                    // Create a new transition set by shifting S
                    S = shifted symbol S in item;
                    create new transition item set newSet;
                    comeFrom = set;
                    append newSet to toDoList;

                    // Mark all items shifting S in set
                    for each shift shItem in set,
                        if marked symbol of shItem is S, [A -> H . S J, v],
                            create new kernel item k, [A -> H S . J, v]*;
                            add item k to newSet;
                            mark shItem.action as (S, newSet);
                        end if;
                    end for;
                end if;
            end transitionLoop;

            // Item set is now complete
            mark set as complete;
            compress set;
            advance incList;
        end if;
    end mainLoop;
The previous algorithm still has one flaw, however. When a newly created transition item set iN is discovered to be a duplicate of a previously created item set iP, i.e., the two sets have identical kernel core items, the new set iN is merged into the existing set iP and is then discarded.
This creates a problem, however, when items from the new transition set iN are added to an existing set iP that has already been marked as completed. The set was marked completed as a result of generating new transition item sets from it. But the merge adds items that are similar to items already in the set (because the two merged sets have the same kernel item cores) yet differ in their look-ahead symbols (because the core of an item ignores its look-ahead symbols). This means that the newly added items would have required additional items in the transition item sets generated from iP, items which were never generated.
Fortunately, the solution is relatively simple. If a new transition item set iN is merged with a previously existing item set iP that has been marked as completed, then after all the new items from iN have been added to iP, the newly merged set iP is simply marked as incomplete and moved from the completed "done" list back to the incomplete list. This ensures that the modified item set will be subjected to further processing in a subsequent iteration of the algorithm, which in turn ensures that any missing items will be generated and added to the item sets resulting from all the shift transitions out of the set.
Note that in addition to marking the altered merged set iP as incomplete, all of the shift actions within it must be reset (wiped clean), removing any previously established shift actions. Reduction actions should not be reset, though, because they were established when the item set was moved from the to-do list to the incomplete list.
Resetting all of the shift actions ensures that the now-incomplete set is "clean", just as it was when it was first moved to the incomplete list. Eventually item set iP will be revisited by a subsequent iteration (phase 1) of the algorithm, and all of its shift transition items will be used to generate new transition item sets, including the new items that were added during the merge.
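This corrective step might be sketched as follows. The classes and the flag passed to `reopen_if_needed` are hypothetical scaffolding, not from the original implementation; the merge itself is assumed to have already happened and to have reported whether it added any shift items:

```python
class Action:
    """A parser action attached to an item (shift or reduce)."""
    def __init__(self, is_shift):
        self.is_shift = is_shift

class SetItem:
    """An item within an item set, with an optional established action."""
    def __init__(self, action=None):
        self.action = action

class ItemSet:
    def __init__(self):
        self.items = []
        self.complete = False

def reopen_if_needed(g_set, added_shift_items, done_list, inc_list):
    """After merging a transition set into g_set: if g_set was already
    complete and the merge added shift items, wipe its shift actions
    (reduce actions stay intact), mark it incomplete, and move it back
    to the incomplete list so it is processed again."""
    if g_set.complete and added_shift_items:
        for item in g_set.items:
            if item.action is not None and item.action.is_shift:
                item.action = None    # reset shift actions only
        g_set.complete = False
        done_list.remove(g_set)
        inc_list.append(g_set)        # will be revisited by phase 1
```

The reduce actions are deliberately left alone, since they were established when the set first moved from the to-do list to the incomplete list.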
Since it is important to retain all of the items within each completed item set, in case more items are added to the set during a merge, it is no longer possible to compress item sets as they are marked as completed. Every item added to an item set must remain in the set until the algorithm has exhausted all possibilities of merging with that set. More importantly, an item set that is altered during a merge, having one or more new items added to it, must not discard any of its items. Otherwise, there would be items that should have been used to generate shift transitions but were discarded, resulting in an incomplete and possibly erroneous collection of item sets for the grammar.
Therefore, item set compression is not an optimization that can be used with this corrected revision of the algorithm. This is unfortunate, since item set compression allowed for space and time savings, but correctness of execution is more important than the speed or size of execution.
There is another trade-off made by this correction to the algorithm. Whereas the previous revisions generated each canonical LR(k) item set exactly once, this revision generates each canonical item set at least once, and possibly more. This is because clearing a merged item set and marking it as incomplete causes it to be processed more than once by the main loop of the algorithm (which traverses the incomplete list), so the same item sets may be generated several times, possibly with slightly different items. But even though each item set may be generated multiple times, the net result is the same, because all item sets that can be merged will eventually be merged.
Revision 3E (Dec 2005) of the algorithm is shown below.
MLR Algorithm - Revision 3E (Dec 2005)

    // Initialize
    doneList = empty;
    incList = empty;
    toDoList = empty;
    comeFrom = null;
    setCount = 0;

    // Create initial item set (i0)
    create initial item set i0, containing the initial kernel item
    constructed from the augmented grammar rule: '$accept : $goal',
        [$accept -> . $goal, $end]*;
    append item set i0 to toDoList;

    // Construct all item sets
    mainLoop:
    while (toDoList is not empty) or (incList is not empty),
        // Phase 1
        while toDoList is not empty,
            set = pop first incomplete item set from toDoList;

            // Prepare the transition set for possible merging
            generate closure items for set;
            mark all reduction items in set;

            // Attempt to merge set with an existing item set
            mergeLoop:
            for each gSet in doneList with identical kernel item cores as set,
                if no r/r conflicts arise from merging set and gSet,
                    // Merge set into gSet
                    merge set into gSet;
                    if merging added items to gSet,
                        increment mergedSetCount;

                    // Fix previous transitions to the merged set
                    for each item shItem in comeFrom,
                        if shItem.action is (S, set),
                            change shItem.action to (S, gSet);

                    // Fix merged set previously marked as complete
                    if gSet.complete and merging added shift items to gSet,
                        for each item shItem in gSet,
                            if shItem.action is a shift action,
                                clear shItem.action;
                        mark gSet as incomplete;
                        move gSet from doneList to incList;
                    end if;

                    discard set;
                    set = null;
                    break mergeLoop;
                else
                    the grammar is LR but not LALR;
                end if;
            end for;

            // Move the new (still incomplete) set to the incomplete list
            if set not null (was not merged or a duplicate),
                increment setCount;
                set.number = setCount;
                append set (still incomplete) to incList (doneList tail);
            end if;
        end while;

        // Phase 2
        comeFrom = null;
        if incList not empty,
            set = first (incomplete) item set in incList;
            comeFrom = set;

            // Generate shift transitions from the incomplete set
            transitionLoop:
            for each item in set,
                if item is a shift with no action, [A -> B . S X, u],
                    // Create a new transition set by shifting S
                    S = shifted symbol S in item;
                    create new transition item set newSet;
                    append newSet to toDoList;

                    // Mark all items shifting S in set
                    for each shift shItem in set,
                        if marked symbol of shItem is S, [A -> H . S J, v],
                            create new kernel item k, [A -> H S . J, v]*;
                            add item k to newSet;
                            mark shItem.action as (S, newSet);
                        end if;
                    end for;
                end if;
            end transitionLoop;

            // Item set is now complete
            mark set as complete;
            advance incList (which moves set to end of doneList);
        end if;
    end mainLoop;
Some final improvements can be made to the algorithm. The order of phases 1 and 2 is reversed, restoring them to their original order. This means that the next item set from the incomplete list is processed first in phase 1, making it the come-from set, and new shift transition item sets are generated from it and added to the to-do list. Phase 2 then processes all of the new transition item sets from the to-do list, moving them to the incomplete list, merging them with existing item sets, or outright discarding them because they are duplicates of existing item sets.
This original ordering of the looping phases makes the algorithm easier to understand, and more accurately reflects the original idea behind the two-phase algorithm.
The final form of the Honalee LR item set generation algorithm, revision 3F (Dec 2005), is shown below.
The Honalee LR Algorithm - Revision 3F (Dec 2005), Simplified

    create initial item set and add it to toDoList;

    while (toDoList is not empty) or (incList is not empty),
        // Phase 1
        if incList is not empty,
            pop first incomplete item set in incList;
            generate new transition item sets from shift items in set,
                adding them to the toDoList;
            mark set as complete and move set to done list;

        // Phase 2
        for each set in the toDoList,
            generate closure items and mark all reduction items in set;
            if set can be merged with existing gSet,
                merge set into gSet;
                if gSet.complete and merging added shift items to gSet,
                    reset shift actions in gSet;
                    mark gSet as incomplete and move gSet to incList;
                discard set;
            if set was not discarded,
                move set to incList;
    end while;
The Honalee LR Algorithm - Revision 3F (Dec 2005)

    // Initialize
    doneList = empty;
    incList = empty;
    toDoList = empty;
    comeFrom = null;
    setCount = 0;

    // Create initial item set (i0)
    create initial item set i0, containing the initial kernel item
    constructed from the augmented grammar rule: '$accept : $goal',
        [$accept -> . $goal, $end]*;
    append item set i0 to toDoList;

    // Construct all item sets
    mainLoop:
    while incList is not empty,
        // Phase 1
        comeFrom = null;
        set = first (incomplete) item set in incList;
        comeFrom = set;

        // Generate shift transitions from the incomplete set
        transitionLoop:
        for each item in set,
            if item is a shift with no action, [A -> B . S X, u],
                // Create a new transition set by shifting S
                S = shifted symbol S in item;
                create new transition item set newSet;
                append newSet to toDoList;

                // Mark all items shifting S in set
                for each shift shItem in set,
                    if marked symbol of shItem is S, [A -> H . S J, v],
                        create new kernel item k, [A -> H S . J, v]*;
                        add item k to newSet;
                        mark shItem.action as (S, newSet);
                    end if;
                end for;
            end if;
        end transitionLoop;

        // Item set is now complete
        mark set as complete;
        move set to end of doneList;

        // Phase 2
        while toDoList is not empty,
            set = pop first incomplete item set from toDoList;

            // Prepare the transition set for possible merging
            generate closure items for set;
            mark all reduction items in set;

            // Attempt to merge set with an existing item set
            mergeLoop:
            for each gSet in doneList with identical kernel item cores as set,
                if no r/r conflicts arise from merging set and gSet,
                    // Merge set into gSet
                    merge set into gSet;
                    if merging added items to gSet,
                        increment mergedSetCount;

                    // Fix previous transitions to the merged set
                    for each item shItem in comeFrom,
                        if shItem.action is (S, set),
                            change shItem.action to (S, gSet);

                    // Fix merged set previously marked as complete
                    if gSet.complete and merging added shift items to gSet,
                        for each item shItem in gSet,
                            if shItem.action is a shift action,
                                reset shItem.action;
                        mark gSet as incomplete;
                        move gSet from doneList to incList;
                    end if;

                    discard set;
                    set = null;
                    break mergeLoop;
                else
                    the grammar is LR but not LALR;
                end if;
            end for;

            // Move the new (still incomplete) set to the incomplete list
            if set not null (was not discarded),
                increment setCount;
                set.number = setCount;
                append (incomplete) set to incList;
            end if;
        end while;
    end mainLoop;
There are other considerations to be made about implementing the algorithm in a practical manner.
Note that the process of finding an existing item set in the "done" list with a kernel core identical to a given transition set has not been explicitly described. A brute-force approach is simply to search the "done" list from beginning to end, comparing each set's kernel core with that of the transition set to be merged. This is an O(n²) approach, wasting a lot of time comparing sets that have different kernels.
A better approach is to use a table containing lists of item sets, where each list contains all of the item sets in the "done" list (i.e., all of the incomplete and complete sets) having identical kernel cores. Thus the mergeLoop in phase 1 needs to search only a relatively short list of candidate item sets to determine if any of them can be merged with the new transition set at hand.
The table of lists is most efficiently implemented as a hash table, indexed by a hash value computed from the kernel core of each item set. The hash function does not need to generate a unique value for every distinct kernel core, but it must generate the same value for item sets with identical kernel cores. In practice, reasonably simple numeric permutations of the rule number and marker position of each item core suffice as effective hash functions.
As is the nature of hash functions, each list in the hash table will contain all of the item sets having the same hash value but not necessarily the same kernel item cores. This is an acceptable inaccuracy, though, since the number of different kernel cores with the same hash value is typically quite small (even for large grammars), and each set in a given list is compared to the transition item set prior to actually merging them, so there is no danger of choosing the wrong set to merge.
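Such a kernel-core hash might look like the following sketch. The mixing constants and the `Item` representation are illustrative assumptions, not taken from the original implementation; the only requirements are that look-aheads are ignored and that the result does not depend on item order:

```python
from collections import namedtuple

# A hypothetical kernel item: rule number, marker position, look-ahead.
Item = namedtuple('Item', 'rule pos lookahead')

def kernel_core_hash(kernel_items, nbuckets=1024):
    """Hash an item set's kernel core.  Only rule numbers and marker
    positions participate (look-aheads are ignored), so sets that differ
    merely in look-aheads hash to the same bucket.  Sorting the cores
    first makes the result independent of item order."""
    h = 0
    for rule, pos in sorted((i.rule, i.pos) for i in kernel_items):
        h = h * 31 + rule * 7 + pos    # simple, arbitrary mixing
    return h % nbuckets

# Candidate sets for merging then live in the same bucket:
buckets = {}
def add_set(kernel_items, item_set):
    buckets.setdefault(kernel_core_hash(kernel_items), []).append(item_set)
```

The mergeLoop of phase 1 then searches only one short bucket list instead of the entire "done" list.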
An alternative to using a hash table is to keep lists of item sets based on their kernel core items. In other words, an array of lists is maintained, with each list indexed by a combination of the rule used by the first item of each set and the position marker within that rule. Thus all of the item sets sharing the same rule and position marker appear in the same list. Each list is relatively short (compared to the total number of item sets), and only one list needs to be searched to find candidate sets for merging with a given item set.
The Honalee Algorithm, as far as the author knows, is the first canonical LR(k) parser construction algorithm of its kind. It makes possible practical implementations of LR(k) parsers.
One last thing, concerning the name of the algorithm (with apologies to Peter Yarrow and Leonard Lipton):
Puff The Merging DragonPuff, the merging dragon lived by the sea
And frolicked in the item mist in a land called Honalee.
Little Jackie Parser loved that rascal Puff
And brought him strings and token streams and other lexer stuff.
Together they would travel on a boat with shifting stacks
Jackie kept a look-ahead for Puff's reduction acts.
Noble scripts and programs would bow when 'ere they ran
And semantic acts would set their flags when Puff derived his scan.
Oh, Puff, the merging dragon lived by the sea
And frolicked in the item mist in a land... called... Honalee.
(This should have some symbolic meaning to those familiar with the Red Dragon book, or its older edition, the Green Dragon book. Okay, so it's not great poetry - the author admits that his expertise lies in the area of programming and not in the more sublime endeavors of poetastry.)
The author can be contacted at:
david@tribble.com.
The author's home page is at:
http://david.tribble.com.
This document: http://david.tribble.com/text/honalee_hist.html.