Dictionary Based Compression: The LZW Algorithm
The LZ family of lossless compression algorithms has many, many variants, each making use of various optimizations to eke out every bit of performance for a given application. There are so many variants that they are generally broken into two categories: LZ77-based algorithms, which use an implicit dictionary, and LZ78-based algorithms, which make use of an explicit dictionary data structure. Having previously covered the LZ77 algorithm, today's post will focus on the LZ78 side of the family. More specifically, I want to discuss one of the more... controversial members of the LZ78 family: LZW.
I'm sure you're probably thinking "controversial? An algorithm?", but I assure you its release stirred up a whole host of issues regarding patent law and algorithms. In short, the author published the algorithm, and then patented it. "Ok, and?". Well, this was 1984. So in between the algorithm being published and the author being granted a patent, it was adopted and used in a number of commercial Unix distributions under the impression that the algorithm was not encumbered by patents.
But all of that is history, and the patent has since expired, so we're free to take a look at what is a pretty interesting variant of the LZ78 lossless compression algorithm, and that's exactly what we're going to do. So grab some coffee, get comfortable, and let's take a peek at implementing LZW.
From LZ Tuples to LZ Pairs to Codewords
The LZ78 family of algorithms had two key advantages over the earlier LZ77-based algorithms. Where LZ77 always had to start from the beginning of the compressed stream, LZ78 algorithms could pick up from any point in the stream. Of greater significance, where LZ77's canonical implementation used a 3-tuple of <offset, length, lookahead>, LZ78 algorithms use pairs of <index, next character>.
LZW takes this space crunching one step further. By preloading the dictionary with every possible single-character string, we now require only the index, so our compressed stream is reduced to a stream of 12-bit integers. These integers are indexes into the decoding dictionary and are called "codewords".
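Since standard integer types are 16 or 32 bits wide, the codewords have to be packed to actually realize the 12-bit savings on disk. The helper below is a hypothetical sketch (not part of the encoder proper) of the usual trick: two 12-bit codewords share three bytes.

```cpp
#include <cstdint>
#include <vector>
using namespace std;

// Illustrative helper: packs 12-bit codewords two-per-three-bytes.
// An unpacker would simply reverse these shifts.
vector<uint8_t> pack12(const vector<int>& codewords) {
    vector<uint8_t> bytes;
    for (size_t i = 0; i < codewords.size(); i += 2) {
        int a = codewords[i];
        bytes.push_back((a >> 4) & 0xFF);                  // high 8 bits of a
        if (i + 1 < codewords.size()) {
            int b = codewords[i + 1];
            bytes.push_back(((a & 0xF) << 4) | (b >> 8));  // low 4 of a, high 4 of b
            bytes.push_back(b & 0xFF);                     // low 8 bits of b
        } else {
            bytes.push_back((a & 0xF) << 4);               // low 4 bits of a, zero padded
        }
    }
    return bytes;
}
```

Real-world LZW variants (GIF, for example) go further and grow the code width dynamically from 9 bits upward, but fixed 12-bit codes keep the bookkeeping simple.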
Squeezing Text the LZW Way
There are actually two approaches to compression with LZW, and they differ in the data structure used to implement the compression table. A data structure optimized for querying the longest common prefix of a search key, such as a TST or a prefix trie, lets us implement the algorithm in a very simple way. Lacking such a specialized data structure, we can make do with a hash table or even a balanced binary search tree.
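To make the first approach concrete, here is a minimal prefix-trie sketch. The node layout and function names are illustrative assumptions, not the code used later in this post; the point is that a single longest-prefix walk replaces the character-by-character probing a hash table forces on us.

```cpp
#include <array>
#include <memory>
#include <string>
#include <utility>
using namespace std;

struct TrieNode {
    int code = -1;                              // codeword stored at this node, -1 if none
    array<unique_ptr<TrieNode>, 256> next;      // one child per possible byte
};

// Add a phrase and its codeword to the trie.
void insert(TrieNode& root, const string& s, int code) {
    TrieNode* cur = &root;
    for (unsigned char c : s) {
        if (!cur->next[c]) cur->next[c] = make_unique<TrieNode>();
        cur = cur->next[c].get();
    }
    cur->code = code;
}

// Walk the trie as far as the input allows, remembering the deepest node
// holding a codeword; returns <matched length, codeword>.
pair<int,int> longestPrefix(const TrieNode& root, const string& s, int pos) {
    const TrieNode* cur = &root;
    int bestLen = 0, bestCode = -1;
    for (int i = pos; i < (int)s.size(); i++) {
        unsigned char c = s[i];
        if (!cur->next[c]) break;
        cur = cur->next[c].get();
        if (cur->code != -1) { bestLen = i - pos + 1; bestCode = cur->code; }
    }
    return {bestLen, bestCode};
}
```

With this, each encoding step becomes one longestPrefix call to emit a codeword, followed by one insert of the matched phrase plus its next character.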
Our dictionary can have a maximum of 4096 codewords. This is so that we can represent all of the indices in the decoding table using 12-bit integers instead of the usual 16 bits.
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

const int max_code_words = 4096;

void initEncodingTable(unordered_map<string,int>& dict) {
    int i = 0;
    while (i < 256) {
        string tmp;
        tmp.push_back((char)i);
        dict.emplace(tmp, i++);
    }
}
We begin the compression process by initializing the encoding table with each of the 256 single-character ASCII values. Once the table is initialized we can begin encoding the data. This is done by building phrases up one character at a time and checking for their existence in the encoding table.
vector<int> squeeze(string input) {
    unordered_map<string,int> dict;
    initEncodingTable(dict);
    int next_code = 257;  // codes 0-255 are the single chars; 256 is reserved
    string phrase = "";
    vector<int> compressed;
    for (size_t j = 0; j < input.size(); j++) {
        string prev = phrase;
        phrase.push_back(input[j]);
        if (dict.find(phrase) == dict.end()) {
            compressed.push_back(dict[prev]);
            if (next_code < max_code_words)
                dict.emplace(phrase, next_code++);
            phrase.clear();
            phrase.push_back(input[j]);
        }
    }
    if (!phrase.empty())
        compressed.push_back(dict[phrase]);
    return compressed;
}
Uncompressing with LZW
Just as we did with the encoding process, the decoding process begins by initializing the decoding table with all 256 single-character ASCII strings.
void initDecodingTable(unordered_map<int,string>& dict) {
    int i;
    for (i = 0; i < 256; i++) {
        string s;
        s.push_back((char)i);
        dict[i] = s;
    }
    dict[i] = "";  // reserve codeword 256; new entries begin at 257
}
Once the table has been initialized, decoding can begin.
string unsqueeze(vector<int> compd) {
    unordered_map<int, string> dict;
    initDecodingTable(dict);
    int next_code = 257;
    int codeword = compd[0];
    string sofar = dict[codeword];
    string output;
    for (size_t j = 1; j < compd.size(); j++) {
        output += sofar;
        codeword = compd[j];
        string s = dict[codeword];
        if (next_code == codeword) {   // "KwKwK" case: the entry we need is
            s = sofar;                 // the very one we are about to add
            s.push_back(sofar[0]);
        }
        if (next_code < max_code_words) {
            sofar.push_back(s[0]);
            dict[next_code++] = sofar;
        }
        sofar = s;
    }
    output += sofar;
    return output;
}
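To convince ourselves the two routines really invert each other, here is a compact, self-contained restatement of both (same 4096-entry table, same next-codeword counter starting at 257; the names are prefixed to avoid clashing with the versions above) suitable for a quick round-trip test.

```cpp
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

const int kMaxCodeWords = 4096;

vector<int> lzwSqueeze(const string& input) {
    unordered_map<string,int> dict;
    for (int i = 0; i < 256; i++) dict.emplace(string(1, (char)i), i);
    int next_code = 257;                       // 256 stays reserved
    string phrase;
    vector<int> out;
    for (char c : input) {
        string prev = phrase;
        phrase.push_back(c);
        if (!dict.count(phrase)) {
            out.push_back(dict[prev]);
            if (next_code < kMaxCodeWords) dict.emplace(phrase, next_code++);
            phrase.assign(1, c);
        }
    }
    if (!phrase.empty()) out.push_back(dict[phrase]);
    return out;
}

string lzwUnsqueeze(const vector<int>& compd) {
    if (compd.empty()) return "";
    unordered_map<int,string> dict;
    for (int i = 0; i < 256; i++) dict[i] = string(1, (char)i);
    int next_code = 257;
    string sofar = dict[compd[0]], output;
    for (size_t j = 1; j < compd.size(); j++) {
        output += sofar;
        string s;
        if (dict.count(compd[j])) s = dict[compd[j]];
        else s = sofar + sofar[0];             // the KwKwK corner case
        if (next_code < kMaxCodeWords) dict[next_code++] = sofar + s[0];
        sofar = s;
    }
    return output + sofar;
}
```

Tracing a tiny input by hand is instructive: "aaaa" compresses to the three codewords 97, 257, 97, and decoding the 257 exercises the KwKwK branch, since that entry does not yet exist in the decoder's table when the codeword arrives.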