Composable Linked Digraphs: An Efficient NFA Data Structure for Thompson's Construction

mgcadmin 10-10-2025

When it comes to implementing finite state automata, picking a data structure with which to model the machine is an interesting problem. On the one hand, the choice is obvious: finite state machines, whether deterministic (DFAs) or nondeterministic (NFAs), are generally viewed as directed graphs (digraphs). Each state in the FA is a vertex in the graph, and each transition from one state to the next is modeled as a directed edge. On the other hand, certain requirements of NFAs, such as multiple edge types, make the "normal" (i.e. vertex-centric) digraph data structures found in most DSA books somewhat awkward to use.

In my previous post on Thompson's construction I used what I will refer to from here on out as the "adjacency list approach", because it employs just such a simple adjacency-list graph representation as the underlying data structure for the automaton. While it has the benefit of being simple to implement, the adjacency list approach also has a pretty significant downside: each step requires merging the adjacency lists of the NFAs being combined. For large patterns this can become a significant bottleneck, as merging them is an O(N) operation, where N is the number of vertices to be merged from the smaller graph into the larger.

#include <set>
#include <unordered_map>

typedef int State;

class Edge {
    private:
        State from;
        State to;
    public:
        Edge(State s, State t) : from(s), to(t) { }
        State getFrom() const { return from; }
        State getTo() const { return to; }
};

class NFA {
    private:
        State start;
        State accept;
        // adjacency list: each state maps to the set of its outgoing edges
        std::unordered_map<State, std::set<Edge*>> states;
    public:
        NFA() {
            start = 0;
            accept = 0;
        }
        /* addEdge(), hasEdge(), etc. */
};
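To make the merging cost concrete, here is a rough sketch in Java (the language used for the rest of this post). It is not the code from my previous post: `TableNFA`, `absorb()`, and `concat()` are hypothetical names, edge labels are left out for brevity, and the right-hand table is always copied into the left-hand one rather than picking the smaller of the two. The loop in `absorb()` is the O(N) step.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical adjacency-list NFA: state index -> set of destination indices.
class TableNFA {
    int start;
    int accept;
    final Map<Integer, Set<Integer>> states = new HashMap<>();

    void addEdge(int from, int to) {
        states.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        states.computeIfAbsent(to, k -> new HashSet<>());
    }

    // Copies every vertex entry of other into this table and returns how
    // many entries were copied. This loop is the O(N) part of every merge.
    int absorb(TableNFA other) {
        int copied = 0;
        for (Map.Entry<Integer, Set<Integer>> e : other.states.entrySet()) {
            states.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
            copied++;
        }
        return copied;
    }

    // Concatenation: absorb rhs, then link our accept state to rhs's start.
    int concat(TableNFA rhs) {
        int copied = absorb(rhs);
        addEdge(accept, rhs.start);
        accept = rhs.accept;
        return copied;
    }
}
```

Every operator has to perform a copy like this before it can add its own edges, so the cost is paid at each step of the construction.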

It is possible to work around this by "flipping" the abstraction and placing the indices of the start/accept vertices on a stack instead of the actual NFAs, allowing the use of a single graph instance with no copying. This is the approach taken in the book "Algorithms" by Robert Sedgewick. While this offers better performance than explicitly merging adjacency lists, it makes the code more complicated and thus harder to understand. In today's post I'm going to introduce an alternative data structure which brings the O(N) complexity of merging NFAs down to O(1) without having to flip the abstraction inside out during construction.

Linked Digraphs

While both the adjacency list approach and the Linked Digraph approach model nondeterministic finite automata as directed graphs, they operate at different levels of abstraction. In the adjacency list representation, states/vertices were a simple typedef for an integer, used as an index into a table. Linked Digraphs use explicit NFAState structures, each of which maintains its own adjacency list in the form of a list of transitions to other NFAState objects.

package com.maxgcoding.fa;

import java.util.ArrayList;
import lombok.Data;


@Data
public class NFAState {
    private int label;
    private ArrayList<Transition> trans;
    public NFAState(int label) {
        this.label = label;
        this.trans = new ArrayList<>();
    }
    public void addTransition(Transition t) {
        trans.add(t);
    }
}

The NFAState objects form a linked-list-like structure that makes up the NFA's directed graph. Each Transition carries an edge label and a reference to its destination state.

package com.maxgcoding.fa;

import lombok.Data;

@Data
public class Transition {
    private String edgeLabel;
    private Boolean isEpsilon;
    private NFAState destination;
    public Transition(String label, NFAState dest) {
        this.edgeLabel = label;
        this.destination = dest;
        this.isEpsilon = false;
    }
    public Transition(NFAState dest) {
        this.edgeLabel = "eps";
        this.destination = dest;
        this.isEpsilon = true;
    }
    @Override
    public int hashCode() {
        return edgeLabel.hashCode();
    }
    @Override
    public boolean equals(Object o) {
        if (o instanceof Transition) {
            return this.edgeLabel.equals(((Transition) o).getEdgeLabel()) && this.destination.equals(((Transition) o).getDestination());
        }
        return false;
    }
}

The top-level Linked Digraph object only needs to maintain two pointers: one to the start state and another to the accepting state.

package com.maxgcoding.fa;

import lombok.AllArgsConstructor;
import lombok.Data;

@Data
@AllArgsConstructor
public class NFA {
    private NFAState start;
    private NFAState accept;
}
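Because there is no central vertex table, anything we want to know about the graph has to be discovered by walking it from the start pointer. The sketch below shows one way to do that. It uses plain-Java stand-ins for the Lombok classes above, marks epsilon edges with a null label instead of the isEpsilon flag, and `GraphWalk.reachable()` is a hypothetical helper that is not part of the post's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for the post's NFAState (no Lombok).
class NFAState {
    final int label;
    final List<Transition> trans = new ArrayList<>();
    NFAState(int label) { this.label = label; }
}

// Simplified stand-in for the post's Transition.
class Transition {
    final String edgeLabel;      // null marks an epsilon edge in this sketch
    final NFAState destination;
    Transition(String edgeLabel, NFAState destination) {
        this.edgeLabel = edgeLabel;
        this.destination = destination;
    }
}

class GraphWalk {
    // Iterative depth-first walk over every state reachable from start.
    // NFAState does not override equals/hashCode here, so the set tracks
    // states by object identity and cycles are handled safely.
    static Set<NFAState> reachable(NFAState start) {
        Set<NFAState> seen = new HashSet<>();
        Deque<NFAState> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            NFAState s = work.pop();
            if (!seen.add(s)) continue;   // already visited
            for (Transition t : s.trans) work.push(t.destination);
        }
        return seen;
    }
}
```

Calling `GraphWalk.reachable(nfa.getStart()).size()` on a finished automaton would give its state count, which is handy for sanity-checking the construction.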

Implementing Thompson's Construction with Linked Digraphs

Since NFAs are recursively defined, we will begin by laying out the procedure for constructing our simple NFAs: the single-character transition and empty-string automata, represented by figure (a) in the picture below. Using these "base-case" automata we will then construct the operator automata with the assistance of epsilon transitions. These machines are pictured below: concatenation (b), alternation (c), and closure (d).

[Figure: NFA building blocks (a) through (d). Image caption: "NFA Regular Expression Matching based Electric Power Sensitive Data Recognition Algorithm Design and Simulation"]

Finally, we can combine these larger NFAs to create an NFA capable of recognizing any regular expression that uses the provided operators, by performing a depth-first traversal over the abstract syntax tree of the regular expression.

NFA Building Blocks

Starting with our "base case" NFAs, the single-character and empty-string automata each consist of two states and a transition connecting them. Unlike a DFA, an NFA uses two types of transitions: character transitions and epsilon transitions. Character transitions are the same as DFA transitions in that they "consume" a character of input while changing the internal state of the machine. Epsilon transitions are unique to NFAs: they change the internal state of the NFA without consuming input. It is thanks to epsilon transitions that Thompson's construction is possible.

package com.maxgcoding.compile;

import java.util.Stack;

import com.maxgcoding.fa.NFA;
import com.maxgcoding.fa.NFAState;
import com.maxgcoding.fa.Transition;
import com.maxgcoding.parse.Node;
import com.maxgcoding.parse.NodeType;

public class NFACompiler {
    private Stack<NFA> st;
    private int nextLabel;
    private int makeLabel() {
        return nextLabel++;
    }
    public NFACompiler() {
        st = new Stack<>();
    }

    private NFA makeAtomic(String str) {
        NFAState ns = new NFAState(makeLabel());
        NFAState ts = new NFAState(makeLabel());
        ns.addTransition(new Transition(str, ts));
        return new NFA(ns,ts);
    }

    private NFA makeEpsilonAtomic() {
        NFAState ns = new NFAState(makeLabel());
        NFAState ts = new NFAState(makeLabel());
        ns.addTransition(new Transition(ts));
        return new NFA(ns,ts);
    }
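To see the difference between the two transition types in action, here is a small sketch of the two operations a simulator would typically need: following epsilon edges without reading input, and following character edges while consuming one symbol. The classes are simplified stand-ins for the ones above (a null label marks an epsilon edge instead of the isEpsilon flag), and `EpsilonSteps`, `closure()`, and `move()` are hypothetical names that are not part of NFACompiler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NFAState {
    final List<Transition> trans = new ArrayList<>();
}

class Transition {
    final String edgeLabel;      // null marks an epsilon edge in this sketch
    final NFAState destination;
    Transition(String edgeLabel, NFAState destination) {
        this.edgeLabel = edgeLabel;
        this.destination = destination;
    }
    boolean isEpsilon() { return edgeLabel == null; }
}

class EpsilonSteps {
    // Every state reachable from the given states using only epsilon
    // edges. No input is consumed; the input set is not modified.
    static Set<NFAState> closure(Set<NFAState> states) {
        Set<NFAState> seen = new HashSet<>();
        Deque<NFAState> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            NFAState s = work.pop();
            if (!seen.add(s)) continue;
            for (Transition t : s.trans)
                if (t.isEpsilon()) work.push(t.destination);
        }
        return seen;
    }

    // Every state reachable from the given states by consuming exactly
    // one input symbol along a character edge.
    static Set<NFAState> move(Set<NFAState> states, String symbol) {
        Set<NFAState> next = new HashSet<>();
        for (NFAState s : states)
            for (Transition t : s.trans)
                if (!t.isEpsilon() && t.edgeLabel.equals(symbol))
                    next.add(t.destination);
        return next;
    }
}
```

A matcher would alternate the two: take the closure of the current set, move on the next input character, and repeat.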

Next up is concatenation, (r)(s) -> (rs), which is implemented by creating an epsilon transition from the first automaton's accept state to the second automaton's start state. We then assign the second automaton's accept state to the first automaton, allowing us to discard the second automaton, having effectively "absorbed" all of its states and transitions into the first.

    private NFA makeConcat(NFA lhs, NFA rhs) {
        lhs.getAccept().addTransition(new Transition(rhs.getStart()));
        lhs.setAccept(rhs.getAccept());
        return lhs;
    }

This may seem convoluted at first glance, but it saves us from needing to create an additional two states and two epsilon transitions that would be required if we were to implement concat the "easy" way:

    private NFA badConcatImpl(NFA a, NFA b) {
        NFAState newStart = new NFAState(makeLabel());
        NFAState newEnd = new NFAState(makeLabel());
        newStart.addTransition(new Transition(a.getStart()));
        a.getAccept().addTransition(new Transition(b.getStart()));
        b.getAccept().addTransition(new Transition(newEnd));
        return new NFA(newStart, newEnd);
    }

While this second version is more explicit in its construction, making it easier to understand what's happening, careful scrutiny of both should convince you that they are equivalent in function, while the first is more space efficient.
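If you would rather count than scrutinize, the following sketch builds the machine for "ab" both ways and counts the states reachable from each start state. It uses unlabeled, Lombok-free stand-ins for the classes above, with a null edge label marking an epsilon transition. `ConcatComparison` and `countStates()` are hypothetical helpers written for this comparison only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NFAState {
    final List<Transition> trans = new ArrayList<>();
}

class Transition {
    final String edgeLabel;      // null marks an epsilon edge in this sketch
    final NFAState destination;
    Transition(String edgeLabel, NFAState destination) {
        this.edgeLabel = edgeLabel;
        this.destination = destination;
    }
}

class NFA {
    NFAState start, accept;
    NFA(NFAState start, NFAState accept) { this.start = start; this.accept = accept; }
}

class ConcatComparison {
    static NFA atomic(String ch) {
        NFAState s = new NFAState(), t = new NFAState();
        s.trans.add(new Transition(ch, t));
        return new NFA(s, t);
    }

    // The compact version: one epsilon edge, no new states.
    static NFA compactConcat(NFA lhs, NFA rhs) {
        lhs.accept.trans.add(new Transition(null, rhs.start));
        lhs.accept = rhs.accept;
        return lhs;
    }

    // The explicit version: two new states, three epsilon edges.
    static NFA explicitConcat(NFA a, NFA b) {
        NFAState newStart = new NFAState(), newEnd = new NFAState();
        newStart.trans.add(new Transition(null, a.start));
        a.accept.trans.add(new Transition(null, b.start));
        b.accept.trans.add(new Transition(null, newEnd));
        return new NFA(newStart, newEnd);
    }

    // Counts the states reachable from s, visiting each state once.
    static int countStates(NFAState s, Set<NFAState> seen) {
        if (!seen.add(s)) return 0;
        int n = 1;
        for (Transition t : s.trans) n += countStates(t.destination, seen);
        return n;
    }

    public static void main(String[] args) {
        int compact = countStates(compactConcat(atomic("a"), atomic("b")).start, new HashSet<>());
        int explicit = countStates(explicitConcat(atomic("a"), atomic("b")).start, new HashSet<>());
        System.out.println(compact + " vs " + explicit);   // prints "4 vs 6"
    }
}
```

The two extra states and two extra epsilon edges are added for every concatenation node, so the saving grows with the length of the pattern.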

Alternation (r|s) allows us to choose between two possible paths. It takes two NFAs and creates a new start state with two epsilon transitions, one to each NFA's start state. It then does the same on the other side, adding an epsilon transition from each NFA's accept state to a new accept state, and returns the new start and accept states as a new NFA.

    private NFA makeAlternate(NFA lhs, NFA rhs) {
        NFAState ns = new NFAState(makeLabel());
        NFAState ts = new NFAState(makeLabel());
        ns.addTransition(new Transition(lhs.getStart()));
        ns.addTransition(new Transition(rhs.getStart()));
        lhs.getAccept().addTransition(new Transition(ts));
        rhs.getAccept().addTransition(new Transition(ts));
        return new NFA(ns, ts);
    }

And last but not least, the closure operators (*, +, ?) for repetition. The difference between * and + is the presence of a single epsilon transition, the one from the new start state to the new accept state, which allows the machine to accept zero occurrences when using the '*' operator.

    // mustAccept == true builds '+' (one or more); false builds '*' (zero or more)
    private NFA makeKleene(NFA lhs, boolean mustAccept) {
        NFAState ns = new NFAState(makeLabel());
        NFAState ts = new NFAState(makeLabel());
        if (!mustAccept)
            ns.addTransition(new Transition(ts));
        ns.addTransition(new Transition(lhs.getStart()));
        lhs.getAccept().addTransition(new Transition(lhs.getStart()));
        lhs.getAccept().addTransition(new Transition(ts));
        return new NFA(ns, ts);
    }

    // ? isn't _really_ a closure, I suppose.
    private NFA makeZeroOrOne(NFA a) {
        return makeAlternate(a, makeEpsilonAtomic());
    }
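A quick way to check that single epsilon transition is to ask whether each machine accepts the empty string, which is the case exactly when the accept state can be reached from the start state through epsilon edges alone. The sketch below does this for a*, a+, and a?. It uses simplified stand-ins for the classes above (no labels, and a null edge label marks epsilon), and `ClosureOperators.acceptsEmpty()` is a hypothetical helper that is not part of the compiler.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NFAState {
    final List<Transition> trans = new ArrayList<>();
}

class Transition {
    final String edgeLabel;      // null marks an epsilon edge in this sketch
    final NFAState destination;
    Transition(String edgeLabel, NFAState destination) {
        this.edgeLabel = edgeLabel;
        this.destination = destination;
    }
}

class NFA {
    final NFAState start, accept;
    NFA(NFAState start, NFAState accept) { this.start = start; this.accept = accept; }
}

class ClosureOperators {
    static void eps(NFAState from, NFAState to) {
        from.trans.add(new Transition(null, to));
    }

    static NFA atomic(String ch) {
        NFAState s = new NFAState(), t = new NFAState();
        s.trans.add(new Transition(ch, t));
        return new NFA(s, t);
    }

    static NFA epsilonAtomic() {
        NFAState s = new NFAState(), t = new NFAState();
        eps(s, t);
        return new NFA(s, t);
    }

    static NFA alternate(NFA lhs, NFA rhs) {
        NFAState s = new NFAState(), t = new NFAState();
        eps(s, lhs.start);
        eps(s, rhs.start);
        eps(lhs.accept, t);
        eps(rhs.accept, t);
        return new NFA(s, t);
    }

    // mustAccept == true builds '+', false builds '*'.
    static NFA kleene(NFA lhs, boolean mustAccept) {
        NFAState s = new NFAState(), t = new NFAState();
        if (!mustAccept) eps(s, t);          // the edge that permits zero occurrences
        eps(s, lhs.start);
        eps(lhs.accept, lhs.start);
        eps(lhs.accept, t);
        return new NFA(s, t);
    }

    static NFA zeroOrOne(NFA a) {
        return alternate(a, epsilonAtomic());
    }

    // True when accept is reachable from start using epsilon edges only.
    static boolean acceptsEmpty(NFA nfa) {
        return epsReach(nfa.start, nfa.accept, new HashSet<>());
    }

    private static boolean epsReach(NFAState s, NFAState target, Set<NFAState> seen) {
        if (s == target) return true;
        if (!seen.add(s)) return false;
        for (Transition t : s.trans)
            if (t.edgeLabel == null && epsReach(t.destination, target, seen)) return true;
        return false;
    }
}
```

With these definitions, `acceptsEmpty` is true for `kleene(atomic("a"), false)` and `zeroOrOne(atomic("a"))`, and false for `kleene(atomic("a"), true)`.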

All right, with all of our procedures to build the machines in place, we're ready to construct the overall NFA.

Compiling the final NFA

Creating the full NFA is still done via a post-order traversal of the regular expression's abstract syntax tree. If you've read my previous post on Thompson's construction, this portion of the code does not change. The only thing we changed is the data structure representing the graph, so if we've designed and abstracted everything properly, that's how it should be. This procedure only interacts with the top-level NFA objects, since all the actual NFAState manipulation takes place in the procedures we covered above.

    private void compile(Node node) {
        if (node == null)
            return;
        if (node.getType().equals(NodeType.LITERAL)) {
            st.push(makeAtomic(node.getData()));
        } else {
            if (node.getData().equals("|")) {
                compile(node.getLeft());
                compile(node.getRight());
                NFA rhs = st.pop();
                NFA lhs = st.pop();
                st.push(makeAlternate(lhs, rhs));
            } else if (node.getData().equals("@")) {
                compile(node.getLeft());
                compile(node.getRight());
                NFA rhs = st.pop();
                NFA lhs = st.pop();
                st.push(makeConcat(lhs, rhs));
            } else if (node.getData().equals("*")) {
                compile(node.getLeft());
                NFA lhs = st.pop();
                st.push(makeKleene(lhs, false));
            }
        }
    }
    public NFA build(Node node) {
        compile(node);
        return st.pop();
    }
}

Traversing the AST, we determine which of the above machines should be constructed based on whether the current node is an internal node or a leaf node. Leaf nodes represent character literals, and possibly the empty string. Internal nodes of the AST are for the operators.

Post-order traversal is used so that the "work" of sewing the individual machines together is put off until the stack unwinds, guaranteeing that operators have readily available operands to work with. At each step we create the appropriate sub-machine and place it on a stack for use in the next machine to be constructed. When all is said and done there will be a single NFA sitting on the stack, which is our fully constructed automaton.
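To tie the pieces together, here is a condensed, self-contained sketch of the whole pipeline: a post-order walk that pushes sub-machines onto a stack, followed by a simple set-based simulation to check the result. Several things in it are assumptions made for illustration rather than the post's actual code: `Node` is a hypothetical stand-in for the AST class from com.maxgcoding.parse, the NFA classes are simplified and Lombok-free with a null edge label marking epsilon, and `matches()` is a naive simulator added only so the output can be tested.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NFAState { final List<Transition> trans = new ArrayList<>(); }

class Transition {
    final String edgeLabel;      // null marks an epsilon edge in this sketch
    final NFAState destination;
    Transition(String edgeLabel, NFAState destination) {
        this.edgeLabel = edgeLabel;
        this.destination = destination;
    }
}

class NFA {
    NFAState start, accept;
    NFA(NFAState start, NFAState accept) { this.start = start; this.accept = accept; }
}

// Hypothetical AST node: a literal leaf, or an operator ("|", "@", "*").
class Node {
    final String data;
    final Node left, right;
    Node(String data, Node left, Node right) { this.data = data; this.left = left; this.right = right; }
    boolean isLeaf() { return left == null && right == null; }
}

class MiniCompiler {
    private final Deque<NFA> st = new ArrayDeque<>();

    private static void eps(NFAState from, NFAState to) { from.trans.add(new Transition(null, to)); }

    // Post-order: both children are compiled before the node itself,
    // so an operator always finds its operands on the stack.
    private void compile(Node n) {
        if (n == null) return;
        compile(n.left);
        compile(n.right);
        if (n.isLeaf()) {
            NFAState s = new NFAState(), t = new NFAState();
            s.trans.add(new Transition(n.data, t));
            st.push(new NFA(s, t));
        } else if (n.data.equals("@")) {
            NFA rhs = st.pop(), lhs = st.pop();
            eps(lhs.accept, rhs.start);
            lhs.accept = rhs.accept;
            st.push(lhs);
        } else if (n.data.equals("|")) {
            NFA rhs = st.pop(), lhs = st.pop();
            NFAState s = new NFAState(), t = new NFAState();
            eps(s, lhs.start); eps(s, rhs.start);
            eps(lhs.accept, t); eps(rhs.accept, t);
            st.push(new NFA(s, t));
        } else if (n.data.equals("*")) {
            NFA lhs = st.pop();
            NFAState s = new NFAState(), t = new NFAState();
            eps(s, t); eps(s, lhs.start);
            eps(lhs.accept, lhs.start); eps(lhs.accept, t);
            st.push(new NFA(s, t));
        }
    }

    NFA build(Node root) { compile(root); return st.pop(); }

    // States reachable from the given set through epsilon edges only.
    static Set<NFAState> closure(Set<NFAState> states) {
        Set<NFAState> seen = new HashSet<>();
        Deque<NFAState> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            NFAState s = work.pop();
            if (!seen.add(s)) continue;
            for (Transition t : s.trans)
                if (t.edgeLabel == null) work.push(t.destination);
        }
        return seen;
    }

    // Naive simulation: track the set of live states one character at a time.
    static boolean matches(NFA nfa, String input) {
        Set<NFAState> current = closure(Set.of(nfa.start));
        for (char c : input.toCharArray()) {
            Set<NFAState> next = new HashSet<>();
            for (NFAState s : current)
                for (Transition t : s.trans)
                    if (t.edgeLabel != null && t.edgeLabel.equals(String.valueOf(c)))
                        next.add(t.destination);
            current = closure(next);
        }
        return current.contains(nfa.accept);
    }
}
```

For the pattern (a|b)*c, the tree would be a "@" node whose left child is a "*" node over "a" | "b" and whose right child is the leaf "c". Building it and calling `matches` should accept "abac" and "c" while rejecting "ab" and the empty string.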

Code Examples

1) Example in Java

2) Example in C++

MaxGCoding.com
