Composable Linked Digraphs: Yet another NFA Data Structure

When it comes to implementing Finite State Automata picking a data structure with which to model the machine is an interesting problem in that on the one hand the choice is obvious: Finite state machines, be them deterministic or non, can be viewed as directed graphs (digraphs). Each state in the FA is a vertex in the graph, and each transition from one state to the next is modeled as a directed edge. 

In my previous post on thompsons construction I used what I will refer to from here on out as the "adjacency list approach", createivly named as such because it employs a rather straight forward adjacency-list graph-representation as the underlying data structure for the automaton.  

typedef int State;

class Edge {
    private:
        State from;
        State to;
    public:
        Edge(State s, State t) : from(s), to(t) { }
        State getFrom() const { return from; }
        State getTo() const { return to; }
};

class NFA {
    private:
        State start;
        State accept;
        unordered_map<State, set<Edge*>> states;
    public:
        NFA() {
            start = 0;
            accept = 0;
        }
        /* addEdge(), hadEdge(), etc. */
};

  While having the benefit of being simple to implement, it also has a pretty significant downside: each step requires merging the adjacency lists of the NFA's being combined. For large patterns this can become a significant bottle neck, as mergeing them is an O(N) operation where N is the number of vertices to be merged from the smaller graph into the larger. In todays post I'm going to introduce an alternative data structure which will bring that complexity down to O(1).

Linked Digraphs

While both the adjacency list approach and the Linked Digraph approach model Non Finite Automata as directed graphs, they operate at different levels of abstraction. In the adjacency list representation states/vertices were a simple typedef for an integer. Linked Digraphs on the other hand uses explicit NFAState structures which maintains its own adjacency list in the form of a set of transitions to other NFAState objects.

struct NFAState;

struct Transition {
    char ch;
    bool is_epsilon;
    NFAState* dest;
    Transition(NFAState* d) : ch('&'), dest(d), is_epsilon(true) { }
    Transition(char c, NFAState* d) : ch(c), dest(d), is_epsilon(false) { }
    Transition() : dest(nullptr), is_epsilon(false) { }
};

struct NFAState {
    State label;
    vector<Transition> transitions;
    NFAState(State st) : label(st) { }
    void addTransition(Transition t) {
        transitions.push_back(t);
    }
};

struct NFA {
    NFAState* start;
    NFAState* accept;
    NFA(NFAState* s = nullptr, NFAState* a = nullptr) : start(s), accept(s) { }
};

The NFAState objects form a linked-list-like structure of the NFA's directed graph. The top level Linked Digraph object only needs to maintain two pointers: one to the start state and another to the accepting State. In this example i'm using raw pointers to keep the code uncluttered because this isnt a post on memory management, however If you are going to use this implementation I would either use some type of smart pointer like std::shared_ptr or other garbage collection scheme as destroying such a cross-linked structure can become very complicated. 

 Implementing Thompsons Construction with Linked Digraphs

Since NFA's are recursively defined, we will begin with laying out the procedure for constructing our simple NFA's: The single character transition and empty string automatons, represented by figure (a) in the picture below. Using these "base-case" automata we will then construct the operator automata with the assistance of epsilon transitions. These machines are pictured below with concatenation (b), alternation (c), and  closures (d).

NFA Regular Expression Matching based Electriuec Power Sensitive Data  Recognition Algorithm Design and Simulation

Finally, we can combin these larger NFA's together to create an NFA capable of recognizing any regular expression using the provided operators by performing a depth first traversal over the abstract syntax tree of a provided regular expressions.

NFA Building Blocks

Starting with our "base case" NFA's, the single character and empty string automatons comprise of two states and a transition connecting them. Character transitions "consume" a character of input while changing the internal state of the machine. Epsilon Transitions are change the internal state of NFA without consuming input.

int nextLabel() {
    static int label = 0;
    return label++;
}

NFA makeAtomic(char ch) {
    NFA nfa(new NFAState(nextLabel()),new NFAState(nextLabel()));
    nfa.start->addTransition(Transition(ch, nfa.accept));
    return nfa;
}

// "The empty string"
NFA makeEpsilonAtomic() {
    NFA nfa(new NFAState(nextLabel()),new NFAState(nextLabel()));
    nfa.start->addTransition(Transition(nfa.accept));
    return nfa;
}

Next up is concatenation (r),( s) -> (rs) , which is implemented by creating an epsilon transition from the first automatons accept state to the second automatons start state, we then assign the second automatons accept state to the first automaton, allowing us to discard the second automaton having effectively "absorbed" all of it's states/transitions in to the first.

NFA makeConcat(NFA a, NFA b) {
    a.accept->addTransition(Transition(b.start));
    a.accept = b.accept;
    return a;
}

 This may seem convoluted at first glance, but it saves us from needing to create an additional two states and two epsilon transitions that would be required if we were to implement concat the "easy" way:

NFA badConcatImpl(NFA a, NFA b) {
     NFAState* new_start = new NFAState(nextLabel());
     NFAState* new_end = new NFAState(nextLabel());
     new_start->addTransition(Transition(a.start));
     a.accept->addTransition(Transition(b.start));
     b.accept->addTransition(Transition(new_end));
     return NFA(new_start, new_end);
}

While this second version is more explicit in its construction making it easier to understand whats happening, careful scrutiny of both should convince you that they are equivelant in function, while the first is more space efficient.

Alternation (r|s) allows us to choose  between two possible paths by taking two NFA, and creating a new state with two epsilong transitions going from the new state to each of the NFA's start states, and doing the same from their accept states to a new accept state, returning these new start and accept states as a new NFA.

NFA makeAlternate(NFA a, NFA b) {
    NFAState* new_start = new NFAState(nextLabel());
    NFAState* new_end = new NFAState(nextLabel());
    new_start->addTransition(Transition(a.start));
    new_start->addTransition(Transition(b.start));
    a.accept->addTransition(Transition(new_end));
    b.accept->addTransition(Transition(new_end));
    return NFA(new_start, new_end);
}

And everybody's favorite, the closures:

NFA makeKleene(NFA a, bool must_accept) {
    NFAState* new_start = new NFAState(nextLabel());
    NFAState* new_end = new NFAState(nextLabel());
    new_start->addTransition(Transition(a.start));
    a.accept->addTransition(Transition(new_end));
    a.accept->addTransition(Transition(a.start));
    if (!must_accept) //Kleene * doesn't need to match, Kleene + does.
        new_start->addTransition(Transition(new_end));
    return NFA(new_start, new_end);
}

// ? isnt _really_ a closure, i suppose.
NFA makeZeorOrOne(NFA a) {
    return makeAlternate(a, makeEpsilonAtomic());
}

Creating the full NFA is done via post order traversal of the regular expressions abstract syntax tree:

class Compiler {
    private:
        Stack<NFA> st;
        NFA nfa;
        void trav(astnode* node) {
            if (node != nullptr) {
                if (node->type == LITERAL) {
                    cout<<"Character State: "<<node->c<<endl;
                    st.push(makeAtomic(node->c));
                } else if (node->type == CHCLASS) {
                    st.push(makeCharClass(node->ccl));
                } else {
                    cout<<"Building: "<<node->c<<endl;
                    switch (node->c) {
                        case '|': {
                            trav(node->left);
                            trav(node->right);
                            NFA rhs = st.pop();
                            NFA lhs = st.pop();
                            st.push(makeAlternate(lhs, rhs));
                        } break;
                        case '@': {
                            trav(node->left);
                            trav(node->right);
                            NFA rhs = st.pop();
                            NFA lhs = st.pop();
                            st.push(makeConcat(lhs, rhs));
                        } break;
                        case '*': {
                            trav(node->left);
                            NFA lhs = st.pop();
                            st.push(makeKleene(lhs, false));
                        } break;
                        case '+': {
                            trav(node->left);
                            NFA lhs = st.pop();
                            st.push(makeKleene(lhs, true));
                        } break;
                        case '?': {
                            trav(node->left);
                            NFA lhs = st.pop();
                            st.push(makeZeorOrOne(lhs));
                        } break;
                        default:
                            break;
                    }
                }
            }
        }
    public:
        Compiler() {

        }
        NFA compile(astnode* node) {
            trav(node);
            return st.pop();
        }
};

NFA compile(string pattern) {
    Parser parser;
    astnode* ast = parser.parse(pattern);
    print(ast, 1);
    Compiler compiler;
    return compiler.compile(ast);
}


Leave A Comment