Fast string searching using suffix arrays

mgcadmin 03-04-2023

String searching and to a broader extent pattern matching are some of the most fundamental operations you can do on a computer. Much research has been done into pattern matching algorithms using techniques ranging from brute force, to hashing, to making use of non finite automata.

One trait that many efficient pattern matching algorithms share is preprocessing of the input in some way. Some algorithms make use of Tries. An alternative data structure which we will discuss today is the Suffix Array. A suffix array indexes all possible suffixes of a string. The advantage of this will become clear as you read on.

Suffix Array Construction

A suffix array is an an array of the indexes of an array of each suffix of a string sorted alphabetically. You might want to read that again, i know its a bit of a mouthful. So let me explain what I mean by that.

Lets take the string "engine" as an example. We know the string "engine" has 6 characters, meaning there are 6 possible suffixes of the string:

engine
ngine
gine
ine
ne
e

To build the suffix array were going to take this list and sort it, while remembering it's original unsorted index. We're going to use a helper data structure for that:

 private static class Suffix {
        public String suffix;
        public int index;
        public Suffix(String suff, int i) {
            suffix = suff;
            index = i;
        }
    }

Were going to build our list of Suffix data structures like so:

//...
  ArrayList<Suffix> suffixes = new ArrayList();
  for (int i = 0; i < inputString.length(); i++) {
     suffixes.add(new Suffix(inputString.substring(i), i));
   }
//...

With that list built, were going to want to sort the list using the suffix string for the comparison:

  suffixes.sort((Suffix a, Suffix b) -> { return a.suffix.compareTo(b.suffix); });

And now we are ready to build our suffix array, this is simply an array comprised of the index field from our sorted list:

for (int i = 0; i < suffixes.size(); i++) {
     suffArr[i] = suffixes.get(i).index;
}

Ok great, we'be built a suffix array, what can we do with it? Well for one thing, we can now use binary search to perform efficient substring matching for one, but before we get to that I'll give you the full suffix array construction method:

private static class Suffix {
        public String suffix;
        public int index;
        public Suffix(String suff, int i) {
            suffix = suff;
            index = i;
        }
    }

    public static int[] buildSuffixArray(String inputString) {
        ArrayList<Suffix> suffixes = new ArrayList();
        for (int i = 0; i < inputString.length(); i++) {
            suffixes.add(new Suffix(inputString.substring(i), i));
        }
        suffixes.sort((Suffix a, Suffix b) -> { return a.suffix.compareTo(b.suffix); });
        int[] suffArr = new int[inputString.length()];
        for (int i = 0; i < suffixes.size(); i++) {
            suffArr[i] = suffixes.get(i).index;
        }
        return suffArr;
    }

String searching with Suffix Arrays and Binary Search

As you know, binary search is a divide and conquer algorithm that efficiently searches sorted collections by halving the search space at each iteration. This leaves us with the challenge: how can we use binary search on our string without sorting the actual string?

As a famous computer scientist once said "Any problem in computer science can be solved by another level of indirection". Thanks to the preprocessing of our string, we now have that extra level indirection in the form of the indexes of our strings sorted suffixes!

public static int binaryStringSearch(String input, String pattern, int[] suffArr) {
        int l = 0;
        int r = suffArr.length - 1;
        while (l <= r) {
            int m = l + r / 2;
            String compString = input.substring(suffArr[m], suffArr[m] + pattern.length());
            System.out.print("m: " + m + ", " + compString + " == " + pattern + "?");
            int cmp = compString.compareTo(pattern);
            if (cmp > 0) {
                System.out.println(" No.");
                r = m - 1;
            } else if (cmp < 0) {
                System.out.println(" No.");
                l = m + 1;
            } else {
                System.out.println(" Found!");
                return m;
            }
        }
        throw new ItemNotFoundException("%s not found", pattern);
    }

Lets take the whole thing for a test drive and see how she goes:

import java.util.ArrayList;

public class suffixArray {
    private static class Suffix {
        public String suffix;
        public int index;
        public Suffix(String suff, int i) {
            suffix = suff;
            index = i;
        }
    }

    public static int[] buildSuffixArray(String inputString) {
        ArrayList<Suffix> suffixes = new ArrayList();
        for (int i = 0; i < inputString.length(); i++) {
            suffixes.add(new Suffix(inputString.substring(i), i));
        }
        suffixes.sort((Suffix a, Suffix b) -> { return a.suffix.compareTo(b.suffix); });
        int[] suffArr = new int[inputString.length()];
        for (int i = 0; i < suffixes.size(); i++) {
            suffArr[i] = suffixes.get(i).index;
        }
        return suffArr;
    }

    public static void binaryStringSearch(String input, String pattern, int[] suffArr) {
        int l = 0;
        int r = suffArr.length - 1;
        while (l <= r) {
            int m = l + r / 2;
            String compString = input.substring(suffArr[m], suffArr[m] + pattern.length());
            System.out.print("m: " + m + ", " + compString + " == " + pattern + "?");
            int cmp = compString.compareTo(pattern);
            if (cmp > 0) {
                System.out.println(" No.");
                r = m - 1;
            } else if (cmp < 0) {
                System.out.println(" No.");
                l = m + 1;
            } else {
                System.out.println(" Found!");
                return;
            }
        }

        System.out.println("Not Found!");
    } 

    public static void main(String[] args) {
        String input = "Software Engineering";
        String pattern = "Engine";
        binaryStringSearch(input, pattern, buildSuffixArray(input));
    }
}

[Running] javac suffixArray.java && java suffixArray

m: 9, gineer == Engine? No.
m: 4, e Engi == Engine? No.
m: 1, Engine == Engine? Found!

And there you have it, For the cost of O(n log n) to build the suffix array, we can now perform pattern matching on strings in O(log N) time. If you have a large immutable string that will be searched multiple times, the cost of building the suffix array is decent trade off, as it only has to be built once.

The use case that brought us this technique actually comes from searching DNA sequences, which you may or may not know, are represented by very long strings of four character groups (T, C, G, A) that make up distinct sequences which model our genes! Cool stuff.

MaxGCoding.com

For the love of programming

Fast string searching using suffix arrays

Suffix Array Construction

String searching with Suffix Arrays and Binary Search

Leave A Comment

Inchworm Evaluation, Or Evaluating Prefix-Expressions with a Queue

Data Structures For Representing Context Free Grammar

A B Tree of Binary Search Trees

Implementing enhanced for loops in Bytecode

Top-Down Deletion for Red/Black Trees

Function Closures For Bytecode VMs: Heap Allocated Activation Records & Access Links

Pascal & Bernoulli & Floyd: Triangles

A Quick tour of MGCLex

Compiling Regular Expressions for "The VM Approach"

Composable Linked Digraphs: An efficient NFA Data Structure for Thompsons Construction