From LR Items to LR States
Over the past few post's I've introduced data structures for representing Context Free Grammars from BNF, algorithms for computing First sets of those CFGs, and algorithms for computing the follow sets to accompany them. With that, we have everything needed to generate LL(1) parse tables. With just a bit more work though, we could generate something a bit more interesting: SLR(1) parsers.
An LR Item is a grammar production that has been augmented with a special "dot" symbol that represents the current position in the rule, or how much of the production has already been matched. Configurations or configuration sets are sets of LR Items which trace the steps of recognizing a productions right hand side. LR States make up the states of an LR automaton, which have transitions on input symbols connecting them. These automatons - CSFM's - are the heart of shift/reduce parsers, and in todays post I'm going to cover the first step in their construction.
Configuration Sets
The position of the dot in a production tells us about the progress of the parse.
Production:
A ::= B C
When the dot is before a symbol, it is said to be predicting the symbol. When the dot is located to the far right of a production, it is said to be complete.
Configuration Set:
(1) A ::= . B C
(2) A := B . C
(3) A ::= B C .
In the above example, when we begin trying to recognize the production A, in step (1) we are predicting that the current symbol is B. In step (2) we have accepted symbol B and are predicting symbol C. Once C has been processed, we are in step (3) and the configuration is complete.
Representing LR Items
At their most basic LR items are a tuple of a production and a dot position { P, index }.
struct LRItem {
Production production;
int dotPosition;
LRItem(Production p, int dp) : production(p), dotPosition(dp) { }
};
Using some of the other definitions from above, we can add some helper methods to our object which will greatly help us in constructing the CFSM. Two such methods are symbolAfterDot() and complete(), which return the next symbol to the right of the dot in the current production, and whether the current item is complete, respectively.
Symbol LRItem::symbolAfterDot() {
if (complete())
return "<fin>";
return production.rhs.at(dotPosition);
}
bool LRItem::complete() const {
return dotPosition >= production.rhs.size();
}
The method advance() returns the next LRItem in a configuration from the current item. Notice that we are generating the next LR Item in the call to advance(), as they are only created on an as-needed basis.
LRItem LRItem::advance() {
return LRItem(production, dotPosition + 1);
}
bool LRItem::operator==(const LRItem& other) const {
return production == other.production && dotPosition == other.dotPosition;
}
bool LRItem::operator!=(const LRItem& other) const {
return !(this == other);
}
Inorder to be able to store our LRItem's in an unordered set, we need to implement a hash function in addition to overriding operator==. One possible implementation of a hash function is generated by combining the production id and the dot position index:
namespace std {
template <> struct hash<LRItem> {
std::size_t operator()(const LRItem& item) const {
size_t h1 = hash<int>()(item.production.pid);
size_t h2 = hash<int>()(item.dotPosition);
return h1 ^ (h2 << 1);
}
};
}
This is preferable to using some kind of string based representation, as it is much faster to compute. We do need to implement a toString() method for our LRItems, as we will need it for sorting later.
string toString() const {
int i = 0;
string result;
result += production.lhs;
result += " ::= ";
for (i = 0; i < production.rhs.size(); i++) {
if (i == dotPosition)
result += ". ";
result += production.rhs[i];
result += " ";
}
if (dotPosition == production.rhs.size())
result += ".";
return result;
}
Representing LR States
LR States are comprised of sets of LRItem. The data structure is pretty much what you would expect: the aformentioned set of LRItem objects, and unique numerical label. You should notice that we use the string representation of the set to compare the states for equality. This is to quickly check If we manage to generate another configuration set with the same LR Items of an existing state so we dont create a duplicate state.
struct LRState {
int state_num;
unordered_set<LRItem> items;
bool operator==(const LRState& other) const {
return key() == other.key();
}
bool operator!=(const LRState& other) const {
return !(*this == other);
}
string key() const {
vector<string> strs;
for (const auto& item : items) {
strs.push_back(item.toString());
}
sort(strs.begin(), strs.end());
string result;
for (const auto& s : strs) {
result += s + "\n";
}
return result;
}
};
The sets themselves are created through a closure operation, with the states being discovered breadth first from the previous state, all the way back to the start symbol.
Closing Over LR Items
LRState closure(const Grammar& G, const LRState& state) {
LRState ret;
ret.items = state.items;
queue<LRItem> work;
for (const auto& item : state.items)
work.push(item);
while (!work.empty()) {
LRItem item = work.front();
work.pop();
Symbol X = item.symbolAfterDot();
if (G.nonterminals.find(X) == G.nonterminals.end())
continue;
for (Production p : G.productions.at(X)) {
LRItem newItem(p, 0);
if (ret.items.find(newItem) == ret.items.end()) {
ret.items.insert(newItem);
work.push(newItem);
}
}
}
return ret;
}
-
From LR Items to LR States
-
Calculating Follow Sets of a Context Free Grammar
-
Streaming Images from ESP32-CAM for viewing on a CYD-esp32
-
Constructing the Firsts Set of a Context Free Grammar
-
Inchworm Evaluation, Or Evaluating Prefix-Expressions with a Queue
-
Data Structures For Representing Context Free Grammar
-
A B Tree of Binary Search Trees
-
Implementing enhanced for loops in Bytecode
-
Top-Down Deletion for Red/Black Trees
-
Function Closures For Bytecode VMs: Heap Allocated Activation Records & Access Links
Leave A Comment