In the previous part we talked about viable prefixes and handles. Today we will do a bit of background to ensure your concepts are clear.
When we started LR parsing I assumed that you know LL parsing. LL means scan input from left to right and do a left-most derivation. This is less powerful than LR – why?
- Assume for simplicity that we are talking about LL(1) and LR(1) where the ‘1’ refers to the lookahead- that is at anytime we look at the ‘next ONE’ input character to decide the parser action.
- In LL(1) we see the first symbol of the input and see the production to apply. So, if there is two productions with the same ‘first’ symbol as in the input parser gets a conflict and fails.
- In LR(1) we see the input from left till we get a handle. After this, we see one more lookahead symbol and determine the parser action. i.e., the parser has MORE information to decide its action than in LL(1) and this makes it more powerful than LL(1). More powerful means, any grammar that can be parsed by LL(1) can also be parsed by LR(1)
- This holds in general case too – for any $k$, LL(k) set of grammars is a PROPER subset of LR(k) set of grammars
Now, when we say “power of a parser” it is with respect to the GRAMMARS it can parse- not the language where as when we say “power of a grammar” it is with respect to the languages it can generate. So, what will be the language of LL(1) – that is the languages which can be generated by the set of all LL(1) grammars?
Actually the power of LL(k) grammar strictly increases with an increase in ‘k’. i.e., language of LL(k) is a strict subset of the langauge of LL(k+1).
What about LR(k)?
Power of LR(0) grammar is lower than power of LR(1).
Language of LR(0) (set of all LR(0) grammars) is same as DCFL having prefix property or the set of languages accepted by a DPDA with empty stack – this is neither a super set nor a subset of the set of regular languages.
- Language of LR(1) (set of all LR(1) grammars) is same as DCFL
- i.e., Language of LR(0) $\subset$ Language of LR(1) = Language of LR(2) = Language of LR(k), k > 0.
Since any LL(k) grammar is also an LR(k) grammar, language of any LL(k) grammar is guaranteed to be a DCFL. But since language of LL(k) is a strict subset of the langauge of LL(k+1), for any finite $k$, there exist SOME DCFL which cannot be generated by ANY LL(k) grammar.
For sake of completeness we can see the languages of LL(0) and LL(1).
- 0 in LL(0) means the parser must be able to determine the production without seeing any lookahead symbol. This basically means there cannot be multiple choices for any non-terminal. But if there are no choices, the grammar can only generate a single string or no string. So, the language of LL(0) is the set of all single string languages and also the empty language.
- LL(1) grammars have more power but as we saw earlier it cannot generate all DCFLs. But can it at least generate all regular languages?
- It does. Every regular language has a right liner grammar (also a left linear one) and this is LL(1)
So, now we can come back to LR parsing. I wont be discussing LL or Top down parsing. For that please see here. But we can see the definitions of FIRST and FOLLOW which are useful even for LR parsing.
(Greek letters means a sequence of terminals or non-terminals)
$FIRST(\alpha)$ is the set of terminals that begins strings derivable from $\alpha$ ($\alpha$ itself can be starting with a terminal). (How does it help in parsing?)
$FOLLOW(A)$ (see capital English alphabet A, so a single non-terminal only) is the set of terminals that can appear immediately to the right of $A$ in some sentential form. (So, should $FOLLOW(S)$ have $\$$ – the end marker for string?)
We can now see the condition for a grammar to be LL(1). A grammar $G$ is LL(1) only if
- $G$ is unambiguous
- $G$ is not left-recursive
- If there is a production $A \to \alpha \mid \beta,$ then
- $FIRST(\alpha) \cap FIRST(\beta) = \emptyset$ Otherwise parser gets confused which production to use
- If $FIRST(\alpha)$ contains $\epsilon$, $FIRST(\beta)$ should not contain $\epsilon$ (same reason as above)
- If $FIRST(\alpha)$ contains $\epsilon$, $FIRST(\beta) \cap FOLLOW(\alpha) = \emptyset$ (again same reason as above)
SLR(1), LALR and CLR parsing to be continued in next part…