Hash Tables with Separate Chaining: Quick and Easy associative arrays in C++
How to painlessly implement an associative array using a hash table.
Hash Tables
When it comes to the literature Hash Tables are often glossed over. Data Structures and Algorithms(DSA) books tend to focus on more complicated, "exotic" Data Structures like B+ Trees, Red/Black Trees, and Fibonacci Heaps(Algorithm authors REALLY like trees). The outstanding "Algorithms in C++" by Robert Sedgewick dedicates a mere 13 pages to the topic of hash tables(in the first edition) while "Data Structures & Algorithm Analysis in Java" by Clifford A. Shaffer manages to muster up 20 pages(third edition) on the topic.
Despite having some things in common with a middle child, Hash Tables are very powerful data structures that allow for fast insertion, deletion, and look up times. They're perfect for implementing dictionaries as they are natural associative arrays(key/value pair containers). Almost every programming language's standard library has a hash table. The C++ Standard Template Library uses a hash table for std::unordered_set and std::unordered_map, Java has Hashtable which underpins Dictionary and Map. Perl has always used hash tables for associative arrays:
#!/usr/bin/perl my %toon_towns = ( Cartman => 'South Park', Stewey => 'Quahog', Homer => 'Springfield', Barney => 'Bedrock' ); foreach $character (keys %toon_towns) { print "$character lives in $toon_towns{$character}\n"; } Output: Homer lives in Springfield Stewey lives in Quahog Cartman lives in South Park Barney lives in Bedrock
And the list goes on: Swift, PHP, python, JSON, etc, etc, etc.
So why the lackluster coverage? Well, as mentioned, most languages do the work of implementing them for you. Also, they don't require much explaining and they're pretty darn simple to implement.
So let's dig in to Hash Tables. By the end of the article you'll have another powerful data structure to keep in the tool box.
So what IS a Hash Table?
Hash Tables are a key/value container that utilizes several techniques to imlement fast look up and insertion times. There exists several ways to implement a Hash Table. Any Hash Table consists of three main parts:
- An array
- A hash function
- A way to solve conflict resolution, or "hash crash strategy".
The method i am going to explain is whats called "separate chaining". Separate chaining is a conflict resolution strategy in which items that have keys which hash to the same index are placed in a list together, commonly called "buckets". A hash function is a way of turning a key, often a string, into a number that fits in a certain range - the number of slots available in the hash table array.
The algorithm for the hash function we're going to use is called "Horners Method" (Sedgewick, pg 233). Implemented in C++ Horners method is implemented as follows:
unsigned int HashTable::horner(string key) { int hash = 0; for (char c : key) { hash = (64*hash + c) % M; } return hash; }
Where 'M' is the number of buckets we are using to hold the items of our hash table. Its common practice to use a prime number for M. Horner's method works by treating each character in the string as a digit. In this case, 64 is not a "magic" number. According to "Algorithms in C++", different numbers can be used, thought it should be close to the number of letters in the alphabet. In this case, 64 = 26 upper case letters + 26 lower case letters + the digits 0 - 9 + 1 for 'space'. (and a little padding tossed in for good measure).
Our hash table utilizes an array of linked lists for our buckets. When inserting an item into the hash table, our hash function is used to process the key which yields the index of the array position of the list which will contain our item.
In other words the array is not storing the item directly, the array is an array of lists. It is these lists that we're adding that item to, the array is used to direct the item to the proper list.
The process looks something like this:
key -> hash function -> array -> list-> item
You might be wondering why we're going through all this trouble if in the end we're just going to stick the item into a list, why not just use a list from the beginning? The answer is that by implementing the hash table ADT, were guaranteed certain performance characteristics - faster insertion, deletion, and searching times when compared to using a single linked list.
If we're somehow lucky enough to have the "perfect" hash function which maps every key with zero hash crashes, all of these operations can be achieved in O(1) time! One needs to look at the "birthday rule" in statistics to know that in all likelihood this is NOT going to happen. However, by spreading our data across multiple small linked lists instead of one big linked list, we can utilize sequential search with far better performance on the list of items whose keys hash to the same value than if we had to search a list containing EVERY key.
For a hash table with separate chaining the time complexity is the time to perform the array operation (O(1)) + the time to perform the list operation (worst case (O(n)).
Implementation
The concept of a hash table with separate chaining is so refreshingly simple that once you understand the idea, the code its self does not require much explaining. Using STL containers, the whole shebang can be wrapped up in about 50 lines of code.
With that being said, here is my code which leverages the benefits of std::vector, std::list, and std::pair for an incredibly simple C++ implementation:
#include <iostream> #include <vector> #include <list> #include <tuple> using namespace std; class HashTable { private: typedef std::tuple<bool, string, int> htret; typedef unsigned int uint; int M; //default is 101, should always be a prime number int numRecords; typedef std::pair<string, string> record; std::vector<std::list<record> > table; public: HashTable(int sz = 101) { M = sz; numRecords = 0; table.resize(M); } uint horner(string key); void put(string key, string value); htret get(string key); void remove(string key); int size() { return numRecords; } bool empty() { return numRecords == 0; } string operator[](const string& key) { auto [flag, ret, pos] = get(key); return ret; } }; unsigned int HashTable::horner(string key) { int hash = 0; for (char c : key) { hash = (64*hash + c) % M; } return hash; } void HashTable::put(string key, string value) { unsigned int mapped = horner(key); table[mapped].push_front(make_pair(key,value)); numRecords++; } std::tuple<bool, string, int> HashTable::get(string key) { bool found = false; unsigned int mapped = horner(key); unsigned int position = 0; cout<<"Fetching data associated with: "<<key<<endl; for (auto rec : table[mapped]) { if (rec.first == key) { return make_tuple(true, rec.second, position); } position++; } return make_tuple(false, "not found.", -1); } void HashTable::remove(string key) { auto [found, result, pos] = get(key); if (found) { auto itr = table[horner(key)].begin(); advance(itr, pos); table[horner(key)].erase(itr); } }
I borrowed the idea of overloading the subscript operator[] from std::map and it functions in a similar way. You can use operator[] to fetch data by key or search data by key, if the search finds nothing it simply returns the string "not found".
The return type of the get() function is an std::tuple<bool, string, unsigned int> these values signify the following:
- bool - for determining If the key exists in the table or not
- string - the data being searched for
- the position in the list that record is stored at, for use in removing a record.
I built a small driver program that builds an associative array of authors and books and then manipulate it with our implemented functions. Lets see our hash table in action:
int main(int argc, char *argv[]) { //Declare and initialize the table using M = 101 for our prime. HashTable myHT(101); /********************************** populate the hash table ************************************/ myHT.put("Robert Sedgewick", "Algorithms in C++"); myHT.put("Clifford A. Shaffer", "Data Structures & Algorithms Analysis in Java"); myHT.put("Larry Wall", "Programming Perl"); myHT.put("Randall Munroe", "xkcd"); myHT.put("Donald Knuth", "The art of computer programming"); /************************************************ Testing operations on the table ****************************************************/ cout<<myHT["Windows Programming"]<<endl; cout<<myHT["Larry Wall"]<<endl; myHT.remove("Larry Wall"); cout<<myHT["Larry Wall"]<<endl; return 0; } Output: ./htcpp Fetching data associated with: Windows Programming not found. Fetching data associated with: Larry Wall Programming Perl Fetching data associated with: Larry Wall not found.
We populated the hash table, and searched for windows programming, which was not included, so it outputs not found. We then search for Larry Wall, which returns programming perl. After deleting the record associated with Larry Wall, the search for it this time now returns "not found". Success.
With Class templates this code can be adapted for use with any data type, not just strings. Keep in mind that the hash function may have to be altered depending on what type one uses as a key.
Separate chaining is just one hash crash resolution strategy. Horners method is just on way to compute a hash function. There are many interesting ways to implement both of these parts of hash tables, go out and code them all!
The source code for the examples in this article are available on my github at:
https://github.com/maxgoren/examples/blob/main/hashtable-stl.cpp
-
How Many Hanukkah Candles?
-
Pratt Parsing: Top-Down Operator Precedence
-
Generating P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
-
Lossless Compression Part I: Working with Bits in a Byte Oriented World
-
Bottom Up AVL Tree: The OG Self-Balancing Binary Search Tree
-
A Different Take on Merge Sort: Binary Search Trees?
Leave A Comment