Optimizing Mergesort for Fast Stable Sorting
Despite being one of the oldest sorting algorithms, and possibly the first ever implemented on a computer, mergesort remains one of the fastest general-purpose sorting algorithms available. Almost every programming language ships some mergesort-based algorithm in its standard library.
And yet many programmers believe that mergesort is slow and wastes resources; after all, it requires O(n) auxiliary memory when many other sorting algorithms get by with O(1) or O(log n). But mergesort offers a combination no other comparison sort does: O(n log n) comparisons in every case, matching the proven lower bound for comparison-based sorting, while also maintaining the stability of input items of equal value, which is a big deal when sorting anything other than primitives.
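To see why stability matters for non-primitives, consider sorting records by a single key. The sketch below (names and data are my own, for illustration) sorts (name, score) pairs by score with std::stable_sort; records with equal scores keep their original relative order, which an unstable sort is free to scramble:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sort (name, score) records by score only. Because std::stable_sort is
// stable, records with equal scores keep their original relative order.
std::vector<std::pair<std::string, int>> sortByScore(
        std::vector<std::pair<std::string, int>> records) {
    std::stable_sort(records.begin(), records.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    return records;
}
```

Given {"alice", 2}, {"bob", 1}, {"carol", 2}, a stable sort must place bob first and keep alice ahead of carol, since both scored 2.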
So why does mergesort get a bad rap? In my opinion, it boils down to poor implementation choices, and perhaps a dogmatic emphasis on big-O performance. Still, no reputation is without reason. As we shall see, despite being a straightforward algorithm, mergesort has a few implementation subtleties that can cause serious performance bottlenecks if care is not taken to avoid them. Worse yet, these bottlenecks are present in many (if not most) web examples, and in quite a few textbooks to boot. In this post I will focus on top-down mergesort and how to avoid these common pitfalls. I am only concerned here with how mergesort is applied to arrays; when it comes to linked lists, nothing beats mergesort, and that's just facts.
Implementing a Merge Sort
As its name implies, merge sorting relies on merging two sorted arrays into one larger sorted array, and as you might imagine, a large portion of the work takes place in that merge step. The following code is a straightforward algorithm for merging two sorted arrays into one:
void merge(vector<int>& c, vector<int>& a, vector<int>& b) {
    size_t i = 0, j = 0, k = 0;
    // take from the front array on ties (<=) to preserve stability
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) {
            c[k] = a[i];
            i++; k++;
        } else {
            c[k] = b[j];
            j++; k++;
        }
    }
    // copy whatever remains of either input
    while (i < a.size()) {
        c[k] = a[i];
        i++; k++;
    }
    while (j < b.size()) {
        c[k] = b[j];
        j++; k++;
    }
}
Unfortunately, this merge algorithm is not well suited to performing a merge sort. Worse yet, if this is the merge algorithm you choose, you're pretty much guaranteed to end up with a "naive" mergesort.
So what does a naive mergesort look like? It's any implementation that splits the input array into two actual, separate arrays, an entirely unnecessary step. Let's take a look at some pseudocode to better understand how a situation like this might happen.
Mergesort(Array):
    IF (Array has 1 item, or is empty)
        Return
    Find the midpoint of the array
    Recursively call Mergesort on the first half of the array
    Recursively call Mergesort on the second half of the array
    Merge the sorted front half with the sorted back half, back into the input array
This is correct, and very commonly encountered, pseudocode for mergesort. Unfortunately, some people read it and wind up implementing something that looks like the following:
void mergesort(vector<int>& a) {
    if (a.size() <= 1) {
        return;
    }
    // copy each half into its own separate vector: this is the naive part
    vector<int> front, back;
    size_t i = 0;
    for (i = 0; i < a.size()/2; i++) front.push_back(a.at(i));
    for ( ; i < a.size(); i++) back.push_back(a.at(i));
    mergesort(front);
    mergesort(back);
    merge(a, front, back);
}
As it stands, we DO have a working mergesort, so what's the problem? It's slow! Let's do a little profiling and compare our implementation to std::stable_sort, the C++ standard library's mergesort implementation. We time both our mergesort and std::stable_sort sorting a vector of 25,000 random int32s between 1 and 99,000, with the following results:
max@MaxGorenLaptop:~/DS/Algorithms$ ./vecms
Naiive Mergesort: 21.3142ms
Sorted.
C++ STL mergesort: 5.06596ms
Sorted.
max@MaxGorenLaptop:~/DS/Algorithms$
std::stable_sort is a solid 4x faster than our current implementation. Granted, std::stable_sort IS a highly optimized implementation, while ours has NO optimizations.
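The timing harness itself isn't shown in this post; a minimal sketch along these lines (function names are my own, and exact numbers will vary by machine) produces comparable measurements:

```cpp
#include <algorithm>
#include <chrono>
#include <random>
#include <vector>

// Run a sorting routine on a copy of the data and report elapsed milliseconds.
template <typename Sorter>
double timeSortMs(Sorter sort, std::vector<int> data) {
    auto start = std::chrono::steady_clock::now();
    sort(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Build a vector of n random ints in [lo, hi]; fixed seed for repeatable runs.
std::vector<int> randomInts(int n, int lo, int hi) {
    std::mt19937 rng(1234);
    std::uniform_int_distribution<int> dist(lo, hi);
    std::vector<int> v(n);
    for (int& x : v) x = dist(rng);
    return v;
}
```

Each sort under test is passed in as a lambda, e.g. `timeSortMs([](std::vector<int>& a){ std::stable_sort(a.begin(), a.end()); }, input)`, so both implementations see an identical copy of the input.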
Avoiding Small Arrays
Let's give ours the same small-subarray insertion sort cutoff that std::stable_sort uses, and inline the merge step to avoid the overhead of the extra function call.
void insertionsort(vector<int>& a, int l, int r) {
    for (int i = l; i < r; i++) {
        int j = i, v = a[i];
        while (j > l && a[j - 1] > v) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = v;
    }
}
void mergesort(vector<int>& a) {
    // small arrays: cut over to insertion sort instead of recursing
    if (a.size() < 8) {
        insertionsort(a, 0, a.size());
        return;
    }
    vector<int> front, back;
    size_t i = 0;
    for (i = 0; i < a.size()/2; i++) front.push_back(a.at(i));
    for ( ; i < a.size(); i++) back.push_back(a.at(i));
    mergesort(front);
    mergesort(back);
    // merge inlined; <= takes from the front half on ties, preserving stability
    size_t j = 0, k = 0;
    i = 0;
    while (i < front.size() && j < back.size()) {
        if (front[i] <= back[j]) {
            a[k++] = front[i++];
        } else {
            a[k++] = back[j++];
        }
    }
    while (i < front.size()) {
        a[k++] = front[i++];
    }
    while (j < back.size()) {
        a[k++] = back[j++];
    }
}
max@MaxGorenLaptop:~/DS/Algorithms$ ./vecms
Naiive Mergesort: 13.3653ms
Sorted.
C++ STL mergesort: 4.94367ms
Sorted.
max@MaxGorenLaptop:~/DS/Algorithms$
All right, not bad. We've gotten ourselves a pretty decent improvement, almost 50%, but we're still falling about 2-3x short of std::stable_sort. Let's re-implement our mergesort using the abstract in-place method and see what we get.
Optimizing The Merge Process
The abstract in-place method of mergesort does away with the step that splits the original array into two separate arrays, which, as you will see, isn't necessary and accomplishes nothing besides unneeded copying. Abstract in-place mergesort delays copying data until we are actually inside the merge step, where it copies the section of the array being merged into an auxiliary buffer and, by careful indexing, treats the front half and back half of the same array as two separate lists.
void msort(vector<int>& a, int l, int r) {
    if (r - l <= 1) return;
    int m = (l + r) / 2;
    msort(a, l, m);
    msort(a, m, r);
    // merge a[l..m) and a[m..r) through a temporary buffer
    vector<int> aux(r - l);
    int i = l, j = m, k = 0;
    while (i < m && j < r) {
        // <= takes from the front half on ties, preserving stability
        aux[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    }
    while (i < m) { aux[k++] = a[i++]; }
    while (j < r) { aux[k++] = a[j++]; }
    for (int k = l, t = 0; k < r; k++, t++)
        a[k] = aux[t];
}
max@MaxGorenLaptop:~/DS/Algorithms$ ./vecms
Abstract Inplace Mergesort: 8.92042ms
Sorted.
C++ STL mergesort: 5.25117ms
Sorted.
max@MaxGorenLaptop:~/DS/Algorithms$
Wow! Another 50% improvement in performance, and we didn't even use the insertion sort trick in this version. Let's add that back in and run it again to see how it fares:
max@MaxGorenLaptop:/mnt/c/Users/mgoren/Desktop/DS/Algorithms$ ./vecms
Abstract Inplace Mergesort: 6.82823ms
Sorted.
C++ STL mergesort: 4.97674ms
Sorted.
max@MaxGorenLaptop:/mnt/c/Users/mgoren/Desktop/DS/Algorithms$
Hmm, we're clearly in the neighborhood, but there must be a bottleneck besides the depth of recursion. If we go back to our naive implementation and think about what the insertion sort cutoff is doing besides limiting recursion, we get a hint of how to speed up this version.
Have you figured it out? It's the allocation of the auxiliary buffer that is causing a huge slowdown, and when you think about it, we don't need to reallocate on every single recursive call: we can allocate an appropriately sized auxiliary buffer ONCE at the beginning and just keep reusing it:
void msort(vector<int>& a, vector<int>& aux, int l, int r) {
    if (r - l < 8) {
        insertionsort(a, l, r);
        return;
    }
    int m = (l + r) / 2;
    msort(a, aux, l, m);
    msort(a, aux, m, r);
    int i = l, j = m, k = 0;
    while (i < m && j < r) {
        // <= takes from the front half on ties, preserving stability
        aux[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    }
    while (i < m) { aux[k++] = a[i++]; }
    while (j < r) { aux[k++] = a[j++]; }
    // aux[0..r-l) holds the merged run; copy it back into a[l..r)
    for (int k = l; k < r; k++)
        a[k] = aux[k - l];
}
void mergesortOpt(vector<int>& a) {
    // the buffer must be constructed with a real size (not just reserved
    // capacity), since msort indexes into it directly
    vector<int> aux(a.size());
    msort(a, aux, 0, a.size());
}
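Whenever reworking a sort like this, it's worth a quick correctness check. A small harness in this spirit (names are my own) verifies that a routine produces a sorted permutation of its input; it's shown here exercising std::stable_sort, and any of the variants above can be swapped in the same way:

```cpp
#include <algorithm>
#include <vector>

// Check that a sorting routine produces output that is both ordered
// and a permutation of the original input.
template <typename Sorter>
bool sortsCorrectly(Sorter sort, std::vector<int> input) {
    std::vector<int> original = input;
    sort(input);
    return std::is_sorted(input.begin(), input.end()) &&
           std::is_permutation(input.begin(), input.end(), original.begin());
}
```

Running it over a handful of edge cases (empty input, all-equal keys, reversed input) catches most of the off-by-one mistakes these index-juggling merges invite.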
Allocation is an inherently costly operation, so on large arrays, where the depth of recursion will be substantial even with an insertion sort cutoff, the time savings should also be substantial. And yet, for some reason, this flaw is present in almost every book and article about mergesort that you encounter. Having said that, let's see how we did:
max@MaxGorenLaptop:~/DS/Algorithms$ ./vecms
Abstract Inplace Mergesort: 2.7981ms
Sorted.
C++ STL mergesort: 4.67738ms
Sorted.
max@MaxGorenLaptop:~/Desktop/DS/Algorithms$
I'll let the numbers speak for themselves.
Until next time, happy hacking!