Fast string searching using suffix arrays
String searching and to a broader extent pattern matching are some of the most fundamental operations you can do on a computer. Much research has been done into pattern matching algorithms using techniques ranging from brute force, to hashing, to making use of non finite automata.
One trait that many efficient pattern matching algorithms share is preprocessing of the input in some way. Some algorithms make use of Tries. An alternative data structure which we will discuss today is the Suffix Array. A suffix array indexes all possible suffixes of a string. The advantage of this will become clear as you read on.
Suffix Array Construction
A suffix array is an an array of the indexes of an array of each suffix of a string sorted alphabetically. You might want to read that again, i know its a bit of a mouthful. So let me explain what I mean by that.
Lets take the string "engine" as an example. We know the string "engine" has 6 characters, meaning there are 6 possible suffixes of the string:
- engine
- ngine
- gine
- ine
- ne
- e
To build the suffix array were going to take this list and sort it, while remembering it's original unsorted index. We're going to use a helper data structure for that:
private static class Suffix {
public String suffix;
public int index;
public Suffix(String suff, int i) {
suffix = suff;
index = i;
}
}
Were going to build our list of Suffix data structures like so:
//...
ArrayList<Suffix> suffixes = new ArrayList();
for (int i = 0; i < inputString.length(); i++) {
suffixes.add(new Suffix(inputString.substring(i), i));
}
//...
With that list built, were going to want to sort the list using the suffix string for the comparison:
suffixes.sort((Suffix a, Suffix b) -> { return a.suffix.compareTo(b.suffix); });
And now we are ready to build our suffix array, this is simply an array comprised of the index field from our sorted list:
for (int i = 0; i < suffixes.size(); i++) {
suffArr[i] = suffixes.get(i).index;
}
Ok great, we'be built a suffix array, what can we do with it? Well for one thing, we can now use the powerful binary search algorithm to perform efficient pattern matching on our String! but before we get to that, I'll give you the full suffix array construction method:
private static class Suffix {
public String suffix;
public int index;
public Suffix(String suff, int i) {
suffix = suff;
index = i;
}
}
public static int[] buildSuffixArray(String inputString) {
ArrayList<Suffix> suffixes = new ArrayList();
for (int i = 0; i < inputString.length(); i++) {
suffixes.add(new Suffix(inputString.substring(i), i));
}
suffixes.sort((Suffix a, Suffix b) -> { return a.suffix.compareTo(b.suffix); });
int[] suffArr = new int[inputString.length()];
for (int i = 0; i < suffixes.size(); i++) {
suffArr[i] = suffixes.get(i).index;
}
return suffArr;
}
String searching with Suffix Arrays and Binary Search
As you know, binary search is a divide and conquer algorithm that efficiently searches sorted collections by halving the search space at each iteration. This leaves us with the challenge: how can we use binary search on our string without sorting the actual string?
As a famous computer scientist once said "Any problem in computer science can be solved by another level of indirection". Thanks to the preprocessing of our string, we now have that extra level indirection in the form of the indexes of our strings sorted suffixes!
public static void binaryStringSearch(String input, String pattern, int[] suffArr) {
int l = 0;
int r = suffArr.length - 1;
while (l <= r) {
int m = l + r / 2;
String compString = input.substring(suffArr[m], suffArr[m] + pattern.length());
System.out.print("m: " + m + ", " + compString + " == " + pattern + "?");
int cmp = compString.compareTo(pattern);
if (cmp > 0) {
System.out.println(" No.");
r = m - 1;
} else if (cmp < 0) {
System.out.println(" No.");
l = m + 1;
} else {
System.out.println(" Found!");
return;
}
}
System.out.println("Not Found!");
}
Lets take the whole thing for a test drive and see how she goes:
import java.util.ArrayList;
public class suffixArray {
private static class Suffix {
public String suffix;
public int index;
public Suffix(String suff, int i) {
suffix = suff;
index = i;
}
}
public static int[] buildSuffixArray(String inputString) {
ArrayList<Suffix> suffixes = new ArrayList();
for (int i = 0; i < inputString.length(); i++) {
suffixes.add(new Suffix(inputString.substring(i), i));
}
suffixes.sort((Suffix a, Suffix b) -> { return a.suffix.compareTo(b.suffix); });
int[] suffArr = new int[inputString.length()];
for (int i = 0; i < suffixes.size(); i++) {
suffArr[i] = suffixes.get(i).index;
}
return suffArr;
}
public static void binaryStringSearch(String input, String pattern, int[] suffArr) {
int l = 0;
int r = suffArr.length - 1;
while (l <= r) {
int m = l + r / 2;
String compString = input.substring(suffArr[m], suffArr[m] + pattern.length());
System.out.print("m: " + m + ", " + compString + " == " + pattern + "?");
int cmp = compString.compareTo(pattern);
if (cmp > 0) {
System.out.println(" No.");
r = m - 1;
} else if (cmp < 0) {
System.out.println(" No.");
l = m + 1;
} else {
System.out.println(" Found!");
return;
}
}
System.out.println("Not Found!");
}
public static void main(String[] args) {
String input = "Software Engineering";
String pattern = "Engine";
binaryStringSearch(input, pattern, buildSuffixArray(input));
}
}
[Running] javac suffixArray.java && java suffixArray
m: 9, gineer == Engine? No.
m: 4, e Engi == Engine? No.
m: 1, Engine == Engine? Found!
And there you have it, For the cost of O(n log n) to build the suffix array, we can now perform pattern matching on strings in O(log N) time. If you have a large immutable string that will be searched multiple times, the cost of building the suffix array is decent trade off, as it only has to be built once.
The use case that brought us this technique actually comes from searching DNA sequences, which you may or may not know, are represented by very long strings of four character groups (T, C, G, A) that make up distinct sequences which model our genes! Cool stuff.
-
Pratt Parsing: Top-Down Operator Precedence
-
Generating P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
-
Lossless Compression Part I: Working with Bits in a Byte Oriented World
-
Bottom Up AVL Tree: The OG Self-Balancing Binary Search Tree
-
A Different Take on Merge Sort: Binary Search Trees?
-
Deleting Entries From Open-Address Hash tables
Leave A Comment