Thompsons Construction: Building NFA from Regular Expressions using a Bottom Up Approach

mgcadmin 15-10-2024

Regular expressions and NFA have been an active area of study in theoretical computer science since the late 1950's. As such, there is a large body of work exploring a multitude of ways to evaluate regular expressions for the purpose of pattern matching. In previous posts I have discussed one such implementation presented by Robert Sedgewick in his book "Algorithms" (4th ed). While being conceptually simple to understand than some other approaches, it is also more difficult to modify or extend. As I've said before, when it comes to data structures and algorithms, you can never have too many choices. With this in mind I've decided to explore some other implementations for evaluating regular expressions.

In todays post I am going to talk about one such approach to Thompson's Construction. Thompson's Construction is a process for building Non-Deterministic Finite Automata(NFA) from Regular Expreessions(RE), and is the algorithm outlined in the "dragon book", amongst many other notable texts for building pattern matching NFA[1].

The algorithm works by creating individual NFA's for each sub-expression of a regular expression, and then "wiring" them together into larger NFA's until we have constructed an NFA for the entire expression, bottom up. This is done by building an abstract syntax tree of the regular expression, which is then traversed post-order to build the NFA's and connect the sub expressions.

Non-Deterministic Finite Automata

Non-Deterministic Finite Automata are Mathematical models of computation. To put it simply though, an NFA is a set of states, with one state being designated as the starting state, and another state being designated as the accepting state. Each state has an associated (possibly empty) set of transitions from it's self to other states.

I'm sure you have quickly realized that what I just described is a directed graph. All NFA's (and DFA's for that matter) can be viewed as directed graphs. The previously discussed algorithm by sedgewick made this point very clear by making explicit use of a directed graph data structure. That algorithm however uses a vertex-labeled graph, where I will be using a vertex-indexed edge-labeled graph. Essentially, my graph has labels on the edges, while Sedgewick's graph moves that label to the vertex. This might seem like a trivial difference, but it has a number of practical implications.

In addition to the directed graph, every NFA also has an alphabet of symbols that represent the language recognized by that particular NFA.

typedef int State;

class Edge {
    public:
        virtual Token getLabel() = 0;
        virtual bool matches(char c) = 0;
        virtual bool isEpsilon() = 0;
        virtual ~Edge() { }
};

struct Transition {
    State from;
    State to;
    Edge* edge;
    Transition(State s, State t, Edge* m) {
        from = s; to = t; edge = m;
    }
};

class NFA {
    private:
        State start;
        State accept;
        unordered_map<State, vector<Transition>> states;
    public:
        NFA();
        NFA(const NFA& nfa);
        void makeState(State name) {
            if (states.find(name) == states.end()) {
                states.insert(make_pair(name, vector<Transition>()));
            }
        }
        void setStart(State ss) {
            start = ss;
        }
        void setAccept(State as) {
            accept = as;
        }
        State getStart() {
            return start;
        }
        State getAccept() {
            return accept;
        }
        void addTransition(Transition t) {
            states[t.from].push_back(t);
        }
        int size() {
            return states.size();
        }
        unordered_map<State, vector<Transition>>& getStates() {
            return states;
        }
        vector<Transition>& getTransitions(State state) {
            return states[state];
        }
        bool match(string text);
        NFA& operator=(const NFA& nfa);
};

A transition is a tuple of 2 states and an edge {f, t, e} where f is the from state, t is the to state, and e is the type of edge the transition is. The edge can be either a character labeld transition, which consumes input when traversed or an Epsilon transition, which while still causing a change of state when traversed, does not consume any input. This last fact is key because it is what allows us to "wire up" the NFA's as we go using epsilon transitions.

class CharEdge : public Edge {
    private:
        char label;
    public:
        CharEdge(char c) {
            label = c;
        }
        ~CharEdge() {

        }
        bool isEpsilon() {
            return false;
        }
        bool matches(char c) {
            return label == c;
        }
        char getLabel() {
            return label;
        }
};

class EpsilonEdge : public Edge {
    public:
        EpsilonEdge() { }
        ~EpsilonEdge() { }
        bool matches(char c) {
            return true;
        }
        bool isEpsilon() {
            return true;
        }
        char getLabel() {
            return '&';
        }
};

To build our NFA, we populate the NFA with states, and add the correct type of transitions to each source stat. Each NFA has a start and accept state which must set before being used. Once our NFA is built, the match() method is used to see if the NFA built from the supplied regular expression recognizes the input. As can be seen from the example below however, building even the most trivial of NFA's by hand can be tedious, and is prone to error. Thus being able to automate this process is important.

void testNFA() {
    NFA nfa;
    nfa.makeState(0); nfa.makeState(1); nfa.makeState(2); nfa.makeState(3);
    nfa.setStart(0);
    nfa.setAccept(3);
    nfa.addTransition(Transition(0, 1, new CharEdge('a'))); //start -> a
    nfa.addTransition(Transition(1, 2, new CharEdge('b'))); // a -> b
    nfa.addTransition(Transition(2, 2, new CharEdge('b'))); // b -> b
    nfa.addTransition(Transition(2, 3, new EpsilonEdge())); // b -> accept
    cout<<nfa.match("abb")<<endl;
    cout<<nfa.match("aab")<<endl;
    cout<<nfa.match("ab")<<endl;
    cout<<nfa.match("a")<<endl;
}

The solution to automating the constuction is to convert a Regular expression string to an abstract syntax tree, and then build our NFA automatically from the tree. We'll return to implementing match() once we have our NFA constructed from an AST, so lets get to it.

Building an AST from a Regular Expression

Parsing the Expression

Parsing of regular expressions begins by converting the parenthesized in-fix expression into it's non-parenthesized postfix representation. To accomplish this we use the shunting yard algorithm introduced by Edsgar Dijkstra, and which i've covered previously.

class Parser {
       private:
           bool isOp(char c);
           int precedence(char c);
           int leftAssociative(char c);
           string addConcatOp(string& str);
           string makePostfix(string& str);
           RegularExpression* makeTree(string& postfix);
      public:
          Parser();
          RegularExpression* parse();
};

To implement the shunting yard algorithm for regular expressions, we need to know both the precedence, and associativity of the various operators. A quick glance to chapter 3 of the dragon book by Aho, Sethi, and Ullman[1] tells us that all of the operators are left associative. Well, at least thats simple.

bool leftAssociative(char c) {
    switch (c) {
        case '*':
        case '+':
        case '?':
        case '|':
        case '@':
            return true;
        default:
            break;
    }
    return false;
}

When it comes to precedence, Closure's have the highest precedence, followed by concatenation and then alternation.

int precedence(char c) {
    switch (c) {
        case '*': return 50;
        case '+': return 50;
        case '?': return 50;
        case '@': return 30;
        case '|': return 20;
        default:
            break;
    }
    return 10;
}

Making Implicit Operators Explicit

In regular expressions the concat operator is only implicitly present. That is, we know that two symbols next to each other are concatenated, but there is no visible operator symbol like there is for closures and alternation. This presents us with a sligh problem when it comes to implementing the shunting yard algorithm: It's awfully hard to reposition an operator from infix to postfix when there is nothing to denote it's presence. As such, before we can transform the expression we begin by first explicitly inserting an '@' symbol as the symbol for conc@enation into the expression string.

string addConcatOp(string str) {
    string fixed;
    for (int i = 0; i < str.length(); i++) {
        fixed.push_back(str[i]);
        if (str[i] == '(' || str[i] == '|')
            continue;
        if (i+1 < str.length()) {
            char p = str[i+1];
            if (p == '|' || p == '*' || p == '+' || p == ')')
                continue;
            fixed.push_back('@');
        }
    }
    return fixed;
}

Converting The Expression To Postfix

With our explicit concatenation operators in place the rest of the shunting yard algoritm proceeds the same as it would for an algebraic expression.

string makePostfix(string& str) {
    str = addConcatOp(str);
    cout<<"Inserting explicit concat operators: "<<str<<endl;
    string postfix;                                                                                                   
    for (int i = 0; i < str.length(); i++) {
        if (str[i] == '(') {
            ops.push(str[i]);
        } else if (str[i] == '|' || str[i] == '*' || str[i] == '+' || str[i] == '@' || str[i] == '?') {
                if (precedence(str[i]) < precedence(ops.top()) || (precedence(str[i]) == precedence(ops.top()) && leftAssociative(str[i]))) {
                    char c = ops.pop();
                    postfix.push_back(c);
                    ops.push(str[i]);
                } else {
                    ops.push(str[i]);
                }
        } else if (str[i] == ')') {
            while (!ops.empty()) {
                char c = ops.pop();
                if (c == '(')
                    break;
                else postfix.push_back(c);
            }
        } else {
            postfix.push_back(str[i]);
        }
    }
    while (!ops.empty()) {
        char c = ops.pop();
        if (c != '(')
            postfix.push_back(c);
    }
    return postfix;
}

Building The AST

Armed with our post-fix expression, building an AST is fairly straight forward. To represent the AST i've implemented a class heirarchy which derives from the base class RegularExpression. The ExpressionLiteral class will be used to represent the literal values as the leaf nodes and the ExpressionOperator class for representing operators as the internal nodes of the AST.

class RegularExpression {
    public:
        virtual char getSymbol() = 0;
        virtual RegularExpression* getLeft() = 0;
        virtual RegularExpression* getRight() = 0;
};

class ExpressionLiteral : public RegularExpression {
    private:
        char symbol;
    public:
        ExpressionLiteral(char sym) { symbol = sym; }
        RegularExpression* getLeft() {
            return nullptr;
        }
        RegularExpression* getRight() {
            return nullptr;
        }
        char getSymbol() {
            return symbol;
        }
};

class ExpressionOperator : public RegularExpression {
    private:
        RegularExpression* left;
        RegularExpression* right;
        char symbol;
    public:
        ExpressionOperator(char c, RegularExpression* ll, RegularExpression* rr) {
            symbol = c;
            left = ll;
            right = rr;
        }
        char getSymbol() {
            return symbol;
        }
        RegularExpression* getLeft() {
            return left;
        }
        RegularExpression* getRight() {
            return right;
        }
};

The actual construction of the AST is performed by iterating over the expression string and when a non-operator symbol is encountered an ExpressionLiteral node is created which is then pushed onto a stack of temporary trees.

Operators are handled similarly except when an operator node is created, previously created trees are popped from the stack and set as the operator nodes left and right children respectively, with the new tree with our operator node as the root being pushed onto the temporary tree stack.

RegularExpression* makeTree(string postfix) {
    Stack<RegularExpression*> sf;
    for (char c : postfix) {
        cout<<"Processing: "<<c<<" ";
        if (!isOp(c)) {
            cout<<"Alphabet."<<endl;
            sf.push(new ExpressionLiteral(c));
        } else {
            cout<<"Operator."<<endl;
            auto right = sf.empty() ? nullptr:sf.pop();
            auto left = sf.empty() ? nullptr:sf.pop();
            sf.push(new ExpressionOperator(c, left, right));
        }
    }
    return sf.pop();
}

When all of the input has been consumed, the stack should contain one item: a pointer to the root of our expression tree. With our AST constructed, we can now swing back around to our NFA, and start thinking about how we can automate it's construction from our AST.

Constructing The NFA

As mentioned above, thompsons construction builds an NFA for a regular expression by building automata for its sub expressions and combining them to create the final automaton. Each operator has a distinct automata, as do character transitions, and even empty expressions. Lets take a look at the individual automata we're going to be building.

There are a few operations performed during the construction of each NFA that will take place regardless of the automata we construct and as such make sense to be abstracted out in to helper functions. One of these tasks, is the creation of a new NFA with a start and end state, as well as the generation of state names.

        int label = 0;
        int makeStateLabel() {
            return label++;
        }
        void initNextNFA(NFA& nnfa) {
            int nstart = makeStateLabel();
            nnfa.makeState(nstart);
            nnfa.setStart(nstart);
            int nend = makeStateLabel();
            nnfa.makeState(nend);
            nnfa.setAccept(nend);
        }

It should also be mentioned that when I say we are "wiring NFA's together" I'm not implying that we are some how linking NFA objects together - we're not. Instead, we merge NFA's by copying their transitions into a new NFA and throwing the old one away:

       void copyTransitions(NFA& nnfa, NFA onfa) {
            for (auto state : onfa.getStates()) {
                nnfa.makeState(state.first);
                for (auto trans : onfa.getTransitions(state.first)) {
                    nnfa.addTransition(trans);
                }
            }
        }

At each step of the construction, new NFA are created from previously constructed NFA until we arrive at the completed, desired automaton.

Atomic NFA For Symbols and E-transitions

Ok, let's start putting some NFA's together. The simplest of them are the so-called "atomic" NFA's: the empty expression, and the single character expression.

NFA singleTransitionNFA(Edge* edge) {
      NFA nfa;
      initNextNFA(nfa);
      nfa.addTransition(Transition(nfa.getStart(), nfa.getAccept(), edge)); 
      return nfa;
}

Infact, an empty expression and a character expression are the same automata, the difference being the type of edge the transition has. An empty Expression is simply a start state, an ending state, and an Epsilon transition connecting them.

NFA emptyExpr() {
       return singleTransitionNFA(new EpsilonEdge());
}

With that definition in mind, then the single character automata is a starting state connecting to an accepting state, with a character labeled edge connecting them.

NFA atomicNFA(char c) {
       return singleTransitionNFA(new CharEdge(c));
}

The Concatenation Operator

Next up is concatenation. To implement the concatenation automaton, we take the two automata to be joined, and create an epsilon transition from the first automata's accepting state to the second automata's starting state. Finally, we set the first automata's accepting state to be the second automata's accepting state. See what I mean by "wiring together" the automata?

NFA concatNFA(NFA first, NFA second) {
      NFA nnfa;
      copyTransitions(nnfa, first);
      copyTransitions(nnfa, second);
      nnfa.setStart(first.getStart());
      nnfa.setAccept(second.getAccept());
      nnfa.addTransition(Transition(first.getAccept(), second.getStart(), new EpsilonEdge()));
      return nnfa;
}

The Alternation Operator

And now we get to the Or operator. Take a moment to think about how we've implemented the previous types of automata and see if you can come up with how to wire this one.

We begin by creating a new starting state. From the new starting state we create two epsilon transitions which connect the new start state, to the start states of our two choices. We then create a new accept state and connect the accept state of our respective automata with the new one with epsilon changes.

NFA alternateNFA(NFA first, NFA second) {
        //create new NFA with start and end state.
        NFA nnfa;
        initNextNFA(nnfa);
        //copy states and transitions from both NFA's to new NFA
       copyTransitions(nnfa, first);
       copyTransitions(nnfa, second);
      //Add new E-transitions from new start state to first and second NFAs
      nnfa.addTransition(Transition(nnfa.getStart(), first.getStart(), new EpsilonEdge()));
      nnfa.addTransition(Transition(nnfa.getStart(), second.getStart(), new EpsilonEdge()));
     //Add new E-transitions from first and second accept state to new accept state.
     nnfa.addTransition(Transition(first.getAccept(), nnfa.getAccept(), new EpsilonEdge()));
     nnfa.addTransition(Transition(second.getAccept(), nnfa.getAccept(), new EpsilonEdge()));
     return nnfa;
}

The Closure Operators

Finally, we arrive at our closure operators: *, ?, and +. The Kleene star (*) is for matching a character 0 or more times, while the + closure matches those that occur at least once but possibly more times, and as such the two NFA's are constructed in a very similar fashion.

The ? operator is for matching a character either 0 or 1 times and is more closely related to the or operator. Actually, it IS an or operator, with the choices being a single character transition, or an epsilon transition.

NFA kleeneStar(NFA torepeat) {
            NFA nnfa;
            initNextNFA(nnfa);
            //create inner loop
            torepeat.addTransition(Transition(torepeat.getAccept(), nnfa.getStart(), new EpsilonEdge()));
            copyTransitions(nnfa, torepeat);
            //wire new start & accept state with E-transitions
            nnfa.addTransition(Transition(nnfa.getStart(), torepeat.getStart(), new EpsilonEdge()));
            nnfa.addTransition(Transition(torepeat.getAccept(), nnfa.getAccept(), new EpsilonEdge()));
            //create outter loop
            nnfa.addTransition(Transition(nnfa.getStart(), nnfa.getAccept(), new EpsilonEdge()));
            return nnfa;
}

NFA kleenePlus(NFA torepeat) {
            NFA nnfa;
            initNextNFA(nnfa);
            //create inner loop
            torepeat.addTransition(Transition(torepeat.getAccept(), nnfa.getStart(), new EpsilonEdge()));
            copyTransitions(nnfa, torepeat);
            //wire new start & accept state with E-transitions
            nnfa.addTransition(Transition(nnfa.getStart(), torepeat.getStart(), new EpsilonEdge()));
            nnfa.addTransition(Transition(torepeat.getAccept(), nnfa.getAccept(), new EpsilonEdge()));
            return nnfa;
}

NFA zeroOrOnce(NFA onfa) {
            NFA nnfa;
            initNextNFA(nnfa);
            copyTransitions(nnfa, onfa);
            //wire in match choice
            nnfa.addTransition(Transition(nnfa.getStart(), onfa.getStart(), new EpsilonEdge()));
            nnfa.addTransition(Transition(onfa.getAccept(), nnfa.getAccept(), new EpsilonEdge()));
            //wire in epsilon choice.
            nnfa.addTransition(Transition(nnfa.getStart(), nnfa.getAccept(), new EpsilonEdge()));
            return nnfa;
}

All that remains to do is take our AST and output the fully constructed NFA! Using the above procedures, this is a simple task. All it takes is a post-order traversal of our tree with a stack to hold our intermediate subexpressions.

 class NFACompiler {
     private: 
        Stack<NFA> nfaStack;
        NFA buildOperatorNFA(RegularExpression* ast) {
            switch (ast->getSymbol()) {
                case '@': {
                    NFA b = nfaStack.pop();
                    NFA a = nfaStack.pop();
                    return concatNFA(a, b);
                }
                break;
                case '|': {
                    NFA b = nfaStack.pop();
                    NFA a = nfaStack.pop();
                    return alternateNFA(a, b);
                }
                case '*': {
                    NFA a = nfaStack.pop();
                    return kleeneNFA(a, false);
                }
                case '+': {
                    NFA a = nfaStack.pop();
                    return kleeneNFA(a, true);
                }
                case '?': {
                    NFA a = nfaStack.pop();
                     return zeroOrOnce(a);
                }
                default:
                    break;
            }
            return NFA();
        }
        void gen_nfa(RegularExpression* ast) {
            if (ast != nullptr) {
                gen_nfa(ast->getLeft());
                gen_nfa(ast->getRight());
                if (!isOp(ast->getSymbol())) {
                    nfaStack.push(atomicNFA(ast->getSymbol()));
                } else {
                    nfaStack.push(buildOperatorNFA(ast));
                }
            }
        }
    public:
         NFACompiler() { }
         NFA compile(string pattern) {
            Parser parser;
            string re = "(" + pattern + ")";
            RegularExpression* ast = parser.parse(re);
            gen_nfa(ast);
            return nfaStack.pop();
        }
};

Finally, we're ready to simulate the NFA, so lets get back to talking about impelementing match().

Simulating the NFA for pattern matching

One of the ways we can implement match() is via an iterative Depth First Search, unfortunately the straight depth first search approach is susceptible to the same catastrophic backtracking that plagues Perl, PCRE, Java, and many other popular regex engines. It will also need some modification before we can use it on NFA's with closure's as It has the potential of getting stuck in loops of epsilon transitions. The fix for that problem is easy at least.

     struct Node {
          int strPos;
          State state;
          unordered_set<Transition> epsHistory;
          Node(int sp, State s, unordered_set<Transition> t) : strPos(sp), state(s), epsHistory(t) { }
     };
     bool matchBT(string text) {
            unordered_set<Transition> epsHistory;
            Queue<Node> sf;
            sf.push(Node(0, start, epsHistory));
            while (!sf.empty()) {
                int strPos = sf.top().strPos;
                State currState = sf.top().state;
                epsHistory = sf.top().epsHistory;
                sf.pop();
                char input = text[strPos];
                if (accept == currState) {
                    return true;
                }
                for (Transition t : states[currState]) {
                    if ((t.edge->matches(input) || t.edge->matches('.')) || t.edge->isEpsilon()) {
                        if (t.edge->isEpsilon()) { 
                            if (epsHistory.find(t) != epsHistory.end()) {
                                continue;
                            }
                            epsHistory.insert(t);
                            sf.push(Node(strPos, t.to, epsHistory));
                        } else {
                            epsHistory.clear();
                            sf.push(Node(strPos + 1, t.to, epsHistory));
                        }
                    }
                    cout<<endl;
                }
            }
            return false;
        }

To avoid getting lost in loops of Epsilon transitions, we simply cache the epsilon transitions we've already crossed, and if we encounter them again we skip them. We must be careful to clear this cache however whenever we cross a non-Epsilon transition.

Allright, let's see it in action:

#include <iostream>
#include "compiler.hpp"
using std::cout;
using std::endl;

int main(int argc, char* argv[]) {
    NFACompiler compiler;
    NFA nfa = compiler.compile("(a*b|ac)d");
    cout<<"---------------------"<<endl;
    cout<<nfa.matchBT("aaaaaabd")<<endl;
    return 0;
}

Parsing: ((a*b|ac)d)
Inserting explicit concat operators: ((a*@b|a@c)@d)
Postfix: a*b@ac@|d@
---------------------
Attempting to match: aaaaabd, Start state: 10, Accept state: 13
State: 10, Input: a
10-(&)->6

10-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: b
1-(&)->3

1-(&)->2

State: 2, Input: b
2-(&)->3

2-(&)->0

State: 0, Input: b
Dead end.

State: 3, Input: b
3-(&)->4

State: 4, Input: b
4-(b)->5

State: 5, Input: d
5-(&)->11

State: 11, Input: d
11-(&)->12

State: 12, Input: d
12-(d)->13

State: 13, Input:
Found Accept State.
1

As you can see, it works as advertised, and if we were so inclined, we could call it quits here. But if you're anything like me then that whole "catastrophic backtracking" thing is probably sitting in the corner of your mind waiting to rain on your parade. In that case, allow me to tell you about a slightly more complicated way of simulating the NFA - one that doesnt require backtracking.

A Better Way

As I previously mentioned, a straight depth first search isnt the only way to simulate the NFA - it's just the conceptually simplest. A faster method is the one originally described by Thompson and used in awk and grep. That algorithm, called "Thompsons Algorithm" (noticing a pattern here?) uses power set construction to simulate the NFA, And if you squint just right, you might just notice what looks like a depth first search hidden in there....

        unordered_set<State> move(unordered_set<State> clist, char ch) {
            unordered_set<State> nlist;
            for (State s : clist) {
                for (Transition t : states[s]) {
                    if (t.edge->matches(ch) || t.edge->matches('.')) {
                        if (t.edge->isEpsilon() == false)
                            nlist.insert(t.to);
                    }
                }
            }
            return nlist;
        }

        unordered_set<State> e_closure(unordered_set<State> clist) {
            unordered_set<State> nlist = clist;
            Stack<State> sf;
            for (State s : clist)
                sf.push(s);
            while (!sf.empty()) {
                State s = sf.pop();
                for (Transition t : states[s]) {
                    if (t.edge->isEpsilon()) {
                        if (nlist.find(t.to) == nlist.end()) {
                            nlist.insert(t.to);
                            sf.push(t.to);
                        }
                    }
                }
            }
            return nlist;
        }

        bool match(string text) {
            set<State> curr, next;
            next.insert(start);
            curr = e_closure(next);
            for (int i = 0; i < text.length(); i++) {
                next = move(curr, text[i]);
                curr = e_closure(next);
            }
            return curr.find(accept) != curr.end();
        }

So how does thompsons algorithm work? It begins by initializing the state list with all of the states reachable from the starting state by traversing only epsilon transitions - thats what the e_closure() method does. Next, for each character in the string we are searching, we find the transitions on that character which are present in our current state list - this is the move() procedure. Next, we gather all of states reachable via epsilon transitions from those states. This repeats for every character in the string until we have consumed all of the input. If, when there is no more input to process, the accept state is in our state list, then the pattern matches.

Instead of pursuing one path to its terminus and then backtracking on failure, this algorithm has the effect of running all of the paths in parallel, doing the search step wise from each state so it never needs to back track, as at each step it can determine the correct transition to take (or fail early if there is none). As far as algorithms go, I find this one to be particularly beautiful.

#include <iostream>
#include "compiler.hpp"
using std::cout;
using std::endl;

int main(int argc, char* argv[]) {
    NFACompiler compiler;
    NFA nfa = compiler.compile("(a*b|ac)d");
    cout<<"---------------------"<<endl;
    if (nfa.match("aaaaaabd")) cout<<"Matched."<<endl;
    return 0;
}

Parsing: ((a*b|ac)d)
Inserting explicit concat operators: ((a*@b|a@c)@d)
Postfix: a*b@ac@|d@
Processing: a Alphabet.
Processing: * Operator.
Processing: b Alphabet.
Processing: @ Operator.
Processing: a Alphabet.
Processing: c Alphabet.
Processing: @ Operator.
Processing: | Operator.
Processing: d Alphabet.
Processing: @ Operator.
        *
          a
      @
        b
    |
        a
      @
        c
  @
    d
---------------------
Init:
	10 - (&) ->2
	10 - (&) ->6
	2 - (&) ->0
	2 - (&) ->3
	3 - (&) ->4
a: 
	0 - (a) ->1
	6 - (a) ->7
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
	7 - (&) ->8
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
b: 
	4 - (b) ->5
	5 - (&) ->11
	11 - (&) ->12
d: 
	12 - (d) ->13
Matched.

Oh, and the set of edges thompsons algorithm collects while simulating the NFA? It just so happens to be the same set of edges you'd need to convert your NDFA to a DFA, but that is a subject for another day.

Well, thats what I got. Until next time, Happy Hacking!

MaxGCoding.com

For the love of programming

Thompsons Construction: Building NFA from Regular Expressions using a Bottom Up Approach

Non-Deterministic Finite Automata

Building an AST from a Regular Expression

Parsing the Expression

Making Implicit Operators Explicit

Converting The Expression To Postfix

Building The AST

Constructing The NFA

Atomic NFA For Symbols and E-transitions

The Concatenation Operator

The Alternation Operator

The Closure Operators

Simulating the NFA for pattern matching

A Better Way

Further Reading:

Leave A Comment

Improving mgcLisp's define syntax

Separating the Men from the Boys, Knuth Style

Reducing rotations during deletion from AVL trees

Parsing Lisp: From Data To Code

Swiz Heaps: A deterministic modification of Randomized Meldable Heaps

Let's talk Eval/Apply

BST Deletion: Removal By Merge

Dictionary Based Compression: The LZW Algorithm

Taking Action: Compiling Procedures to P-Code

Making Decisions: Compiling If Statements to P-Code