Extending Sedgewick's explicit Digraph NFA
Regular expressions are a fantastic way to describe patterns in text. Thanks to the work of pioneers like Stephen Kleene we know that for any regular expression there is cooresponding finite automaton (FA) and vice versa. This is known as Kleenes Theorem. In the beginning these expressions were comprised of just a handfull of basic operations: concatenation, closure, and the or operator. Like algebraic expressions parentheses can be used to override a given operators precedence.
The Multiway-or operator
else if (re[i] == ')') {
int ro = ops.pop();
if (re[ro] == '|') {
lp = ops.pop();
G.addEdge(lp, ro+1);
G.addEdge(ro, i);
} else if (re[ro] == '(') {
lp = ro;
} else {
lp = i;
}
}
else if (re[i] == ')') {
unordered_set<int> orOpIdx;
while (re[ops.top()] == '|') {
orOpIdx.insert(ops.pop());
}
lp = ops.pop();
for (int ro : orOpIdx) {
G.addEdge(lp, ro+1);
G.addEdge(ro, i);
}
}
Regular Expressions & Specified Sets
else if (re[i] == ']') {
lp = ops.pop();
for (int inside = lp+1; inside < i; inside++) {
G.addEdge(lp, inside);
specifiedSetIndex[inside] = i;
}
}
/*
code for other operators
*/
if (re[i] == '(' || re[i] == '*' || re[i] == '+' || re[i] == ')' || re[i] == '[' || re[i] == ']')
G.addEdge(i, i+1);
You will notice that we create an edge from the opening paren to the character, but instead of an edge going out from the character we add an entry to a hashtable that pairs the matched character with the index of the sets closing parentheses. That's because for this implementation we also need to make a change to the inner loop of the match() method.
collectReachedStates(dfs, reached);
for (int i = 0; i < text.length(); i++) {
Bag<int> matches;
for (int state : reached) {
if (state < m && (re[state] == text[i] || re[state] == '.')) {
if (specifiedSetIndex.find(state) != specifiedSetIndex.end()) {
matches.add(specifiedSetIndex[state]);
} else {
matches.add(state+1);
}
}
}
if (reached.empty()) return false;
How do they compare?
If specified sets and multiway-or both let us choose from amongst multiple options, than why do we need both? Well, it's not quite as simple as that. Sure, the two constructs can be used equivelantly, but in truth the multiway or is more powerful than specified set. However, when used for the kind of problems like the example above, using a specified set is more efficient. To understand why this might be, lets take a look at the NFA created by each to compare them.
RE: r[aeiou]n
NFA:
0(():
--> 1(r)
1(r):
2([):
--> 3(a)
--> 3(a)
--> 4(e)
--> 5(i)
--> 6(o)
--> 7(u)
3(a):
4(e):
5(i):
6(o):
7(u):
8(]):
--> 9(n)
9(n):
10()):
$ (Match).
The NFA built above is fairly simple with E-transitions from the r to each choice. Because we used the set index map to store the transition from each choice, those transitions are not represented in the NFA despite exisiting.
The regular expresion using the multiway or operator ends up building a much larger NFA with both more states, and more transitions than that built using speicified sets.
RE: r(a|e|i|o|u)n
NFA:
0(():
--> 1(r)
1(r):
2(():
--> 3(a)
--> 5(e)
--> 7(i)
--> 9(o)
--> 11(u)
3(a):
4(|):
--> 12())
5(e):
6(|):
--> 12())
7(i):
8(|):
--> 12())
9(o):
10(|):
--> 12())
11(u):
12()):
--> 13(n)
13(n):
14()):
$ (Match).
So there you have it: multiway-or operator and specified sets for regular expression matching. Until next time, Happy Hacking!
-
Generating P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
-
Lossless Compression Part I: Working with Bits in a Byte Oriented World
-
Bottom Up AVL Tree: The OG Self-Balancing Binary Search Tree
-
A Different Take on Merge Sort: Binary Search Trees?
-
Deleting Entries From Open-Address Hash tables
-
Transform any Binary Search Tree in to a Sorted Linked List
Leave A Comment