Extending Sedgewick's explicit Digraph NFA

mgcadmin 12-10-2024

Regular expressions are a fantastic way to describe patterns in text. Thanks to the work of pioneers like Stephen Kleene we know that for any regular expression there is cooresponding finite automaton (FA) and vice versa. This is known as Kleenes Theorem. In the beginning these expressions were comprised of just a handfull of basic operations: concatenation, closure, and the or operator. Like algebraic expressions parentheses can be used to override a given operators precedence.

Over time this list of operators has grown quite extensive and users have come to expect a wider selection of operators to facilitate constructing ever more powerful regular expressions. Some would argue that many of these additional operators have made the expressions decidedly non-regular, but thats a discussion for another day. Today I want to discuss two similar but distinct operations over regular expressions: the specified set and the multiway-or operator. I will be using the code from my earlier post on compiling regular expressions as the starting point for todays post so you may want to give that a read as well.

The Multiway-or operator

As currently implemented, buildEpsilonGraph() only supports the binary-or operator. This allows us to write regular expressions such as "r(a|u)n" to match

ran or run. But if our friend ron is ran or is going for a run, we'd have a hard time recognizing it. For our FA to match ran, ron, or run we would have to write it as "r((a|u)|o)n". While possible, it is both complicated and error prone. The multiway-or operator allows us to simply write "r(a|o|u)n". This simplified syntax is far more readable and so it's easier to spot any mistakes. The section of code in the buildEpsilonGraph() method responsible for building the part of the NFA which represents the or operator looks like this:

    else if (re[i] == ')') {
            int ro = ops.pop();
            if (re[ro] == '|') {
                lp = ops.pop();
                G.addEdge(lp, ro+1);
                G.addEdge(ro, i);
            } else if (re[ro] == '(') {
                    lp = ro;
            } else {
                lp = i;
            }
    }

It works by upon encountering an or operator, obtaining the index of the associated left parentheses by removing it from the stack. Armed with the indices to the proper vertices, it then adds the approriate edges to the graph to represent the NFA we are constructing.

To convert this to a multiway operator the first step is gather the indices of each or operator:

    else if (re[i] == ')') {
        unordered_set<int> orOpIdx;
        while (re[ops.top()] == '|') {
            orOpIdx.insert(ops.pop());
        }

With our indices collected we will proceed similarly to before, but now we add an edge from the left parentheses to all of the post-op choices and an edge from the closing parentheses to all of the pre-op choices:

        lp = ops.pop();
        for (int ro : orOpIdx) {
            G.addEdge(lp, ro+1);
            G.addEdge(ro, i);
        }
    }

Not too difficult. Let's now turn our attention to implementing specified sets.

Regular Expressions & Specified Sets

Specified sets can be used like syntactic sugar over the multiway or operator. Their use allow us to specify a set of choices for the next match without

the need to separate themn all by an explicit or operator. This means that we can re-write the example from above - "r(a|o|u)n" - as "r[aou]n". To implement support

for specified sets we once again return to the buildEpsilonGraph() method. Square brackets are used to designate the specified set, and as we did for parentheses we push the index of the left bracket onto the stack, and take action during parsing upon encountering the closing bracket.

    else if (re[i] == ']') {
        lp = ops.pop();
        for (int inside = lp+1; inside < i; inside++) {
            G.addEdge(lp, inside);
            specifiedSetIndex[inside] = i;
        }
    }
    /*
        code for other operators
    */
    if (re[i] == '(' || re[i] == '*' ||  re[i] == '+' || re[i] == ')' || re[i] == '[' || re[i] == ']')
                    G.addEdge(i, i+1);

You will notice that we create an edge from the opening paren to the character, but instead of an edge going out from the character we add an entry to a hashtable that pairs the matched character with the index of the sets closing parentheses. That's because for this implementation we also need to make a change to the inner loop of the match() method.

            collectReachedStates(dfs, reached);
            for (int i = 0; i < text.length(); i++) {
                Bag<int> matches;
                for (int state : reached) {
                        if (state < m && (re[state] == text[i] || re[state] == '.')) {
                            if (specifiedSetIndex.find(state) != specifiedSetIndex.end()) {
                                matches.add(specifiedSetIndex[state]);
                            } else {
                                matches.add(state+1);
                            } 
                        }
                }
                if (reached.empty()) return false;

How do they compare?

If specified sets and multiway-or both let us choose from amongst multiple options, than why do we need both? Well, it's not quite as simple as that. Sure, the two constructs can be used equivelantly, but in truth the multiway or is more powerful than specified set. However, when used for the kind of problems like the example above, using a specified set is more efficient. To understand why this might be, lets take a look at the NFA created by each to compare them.

RE: r[aeiou]n

NFA:
0(():
  --> 1(r)
1(r):
2([):
  --> 3(a)
  --> 3(a)
  --> 4(e)
  --> 5(i)
  --> 6(o)
  --> 7(u)
3(a):
4(e):
5(i):
6(o):
7(u):
8(]):
  --> 9(n)
9(n):
10()):
$ (Match).

The NFA built above is fairly simple with E-transitions from the r to each choice. Because we used the set index map to store the transition from each choice, those transitions are not represented in the NFA despite exisiting.

The regular expresion using the multiway or operator ends up building a much larger NFA with both more states, and more transitions than that built using speicified sets.

RE: r(a|e|i|o|u)n
NFA:
0(():
  --> 1(r)
1(r):
2(():
  --> 3(a)
  --> 5(e)
  --> 7(i)
  --> 9(o)
  --> 11(u)
3(a):
4(|):
  --> 12())
5(e):
6(|):
  --> 12())
7(i):
8(|):
  --> 12())
9(o):
10(|):
  --> 12())
11(u):
12()):
  --> 13(n)
13(n):
14()):
$ (Match).

So there you have it: multiway-or operator and specified sets for regular expression matching. Until next time, Happy Hacking!

MaxGCoding.com

A software Engineering Space

Extending Sedgewick's explicit Digraph NFA

The Multiway-or operator

Regular Expressions & Specified Sets

How do they compare?

Leave A Comment

Lossless Compression Part III: Huffman Coding

Lossless Compression Part II: The LZ77 Algorithm

Lossless Compression Part I: Working with Bits in a Byte Oriented World

Bottom Up AVL Tree: The OG Self-Balancing Binary Search Tree

A Different Take on Merge Sort: Binary Search Trees?

Deleting Entries From Open-Address Hash tables

Transform any Binary Search Tree in to a Sorted Linked List

From Regular Expressions To NFA by way of Abstract Syntax Trees: Thompsons Construction

Extending Sedgewick's explicit Digraph NFA

Iterative In-order B Tree Traversal