In the previous part, we saw
- How to make LR(0) items
- How to make DFA for recognizing viable prefixes and handles
- How to build LR(0) parse table
In LR(0) parsing whenever there is an item $A \to \alpha \bullet$ $(\bullet$ at end and $A$ is not the new Start symbol) action in parsing table for the corresponding state is REDUCE for all terminals. The problem here is that, in this state, we cannot have a SHIFT action for any terminal. i.e., a CFG is LR(0) if
- If $A \to \gamma \bullet$ is in a configurating set $(A$ is not the start symbol), no other item $B \to \alpha \bullet x \beta$ is in the set (Otherwise we will get SHIFT-REDUCE conflict as shown above)
- If $A \to \alpha \bullet$ is in a configurating set, $B \to \beta \bullet$ should not be there in the same set. Otherwise we will have REDUCE-REDUCE conflist (which production to reduce with?)
- The 2 rules above means there can be maximum one complete item in any set
The above restrictions rule most of the CFGs to be not LR(0) and hence LR(0) parsing is less useful. How can we reduce the conflicts here?
- Instead of having a REDUCE move for all terminals, we can have it only for FOLLOW(A), where A is the LHS of the production used for reduction. This rule is straightforward as based on the definition of FOLLOW set, only those symbols can any way come after the given non-terminal (in any sentential form). By doing so, we do not add any extra overhead (other than finding the FOLLOW set and use one look-ahead symbol to check if in the FOLLOW set) and we can reduce SHIFT-REDUCE and REDUCE-REDUCE conflicts.
- This parser now becomes Simple LR(1) or SLR(1) parser.
- The only change in the Parsing Rule as compared to LR(0) parsing is
- If there is a complete item $A \to \alpha \bullet$ in set $i$, set ACTION[i, a] to REDUCE $A \to \alpha$ ONLY for all $a$ in FOLLOW(A) $(A$ is not the start symbol)
- SLR(1) parser does allow both SHIFT and REDUCE items to be in same state
Lets now see the condition for a grammar to be SLR(1)
- If in a state we have items $A \to \alpha \bullet$ and $A \to \beta \bullet$ $FOLLOW(\alpha)$ and $FOLLOW(\beta)$ must be disjoint
- If in a state we have items $A \to \gamma \bullet$ and $A \to \alpha \bullet x \beta$ $(x$ is a terminal), then $FOLLOW(A)$ should not contain $x.$
Problem of SLR(1)
- It reduced conflicts by using FOLLOW set for Reduction
- But FOLLOW set is not enough
- FOLLOW set is based on the entire grammar
- But in any state in the parser, the reduction should be based on only the visited states (entire stack content, not just the stack top)
- This will make some of the entries in the FOLLOW set invalid for reduction
- i.e., if we prune the FOLLOW set we can get less conflicts and a better parser
- Canonical LR(1) or CLR(1) or simply LR(1) is such a one
CLR(1)
As detailed in above section, the main difference of CLR(1) as compared to SLR(1) is in having extra information for deciding REDUCE moves as compared to using only the FOLLOW set.
- CLR(1) stores extra information in each state
- The extra information is the set of valid terminals which can cause a REDUCE move
- This set of valid terminals will be a subset of the FOLLOW(A), where $A$ is the LHS of the production used for reduction
- Storing extra information can lead to more number of states in CLR(1) as compared to SLR(1)
- An LR(1) item is of the form $A \to X_1X_2 \ldots X_i \bullet X_{i+1}\ldots X_j , a/b\dots$
- This means states corresponding to $X_1 X_2 \ldots X_i$ is on the stack
- States corresponding to X_{i+1}\ldots X_j$ is expected to be coming to stack
- After that we REDUCE but only if the lookahead token is one among $a/b/\ldots$ which is the set of valid terminals including $\$$ (subset of FOLLOW(A))
Closure of a state in LR(1) is found by repeating the following until no more production can be added:
- For each configuration $[A \to \alpha •B \beta, a]$ in $I,$ for each production $B \to w$ in $G’,$ and for each terminal $b$ in $FIRST(\beta a)$ such that $[B \to •w, b]$ is not in $I$: add $[B –> •w, b]$ to $I.$ $a$ in $(FIRST(\beta a)$ becomes relevant when $\beta$ becomes empty.
To understand how LR(1) parsing works and how the LR(1) items differ from LR(0) items lets take the example from Ullman:
- $S’ \to S$
- $S \to XX$
- $X \to aX$
- $X \to b$
The SLR and LR configurating sets are shown below. The number of states in LR increases to $10$ from $7$ in the SLR. This is due to the difference in the lookahead which necessitized creation of new states.
$$\begin{array}{c|c}
\underline{\text{SLR Configurating Sets}} &\underline{\text{LR Configurating Sets}}\\
\boxed{S’\to .S\\S \to .XX\\ X \to .aX \\X \to .b} & \boxed{S’\to .S, \$\\S \to .XX, \$\\ X \to .aX, a/ b \\X \to .b, a/ b} \\
\boxed{S’\to S.} & \boxed{S’\to S., \$}\\
\boxed{S \to X.X\\ X \to .aX \\X \to .b} & \boxed{S \to X.X, \$\\ X \to .aX, \$ \\X \to .b, \$} \\
\boxed{S \to XX.}& \boxed{S \to XX., \$}\\
\boxed{ X \to a.X\\
X \to .aX \\X \to .b} & \boxed{ X \to a.X,\$\\
X \to .aX,\$ \\X \to .b,\$} \boxed{ X \to a.X,a/b\\
X \to .aX,a/b \\X \to .b,a/b}\\
\boxed{X \to aX.} &\boxed{X \to aX., \$} \boxed{X \to aX., a/b}\\
\boxed{X \to b.}& \boxed{X \to b.,\$} \boxed{X \to b.,a/b} \end{array} $$
After the configurating sets are done, working of LR parser follows the approach of SLR or LR(0) one. You can see the systematic approach here.
Now, we can see the formal requirement for a grammar to be LR(1)
A grammar is LR(1) if the following two conditions are satisfied for each configurating set:
- If $[A \to \alpha \bullet x \beta, a],$ $x$ a terminal, is in a set, then there should be no item of the form $[B \to \gamma \bullet, x]$ in the same set. (Shift-Reduce conflicts)
- $[A \to \alpha \bullet, a]$ and $[B \to \beta \bullet, a].$ must not be in a set. (Lookaheads for all complete items must be disjoint).
(Reduce-Reduce conflicts)
So, we have seen LR(1) parser which reduces conflicts from SLR(1) parsing but at the expense of more number of states. As seen in the example, this increase in states is due to the “lookahead” symbols being part of the configuration item.
- What happens if we merge the states having same items but different lookaheads?
- We get the same number of states as in SLR(1)
- We still avoid all Shift-Reduce conflicts which are avoided by LR(1)
- We may have more Reduce-Reduce conflicts as compared to LR(1) as we are having imprecise lookaheads
- This is exactly what is LALR(1) parser.
- You mostly won’t need to know more about it but still if interested can READ THIS
PS: So that completes the parsing notes for GATE 2019. Will be having a small Test on Parsing coming up tomorrow.