Lossless Compression Part I: Working with Bits in a Byte Oriented World
In the early days of computing, storage space was at a premium. It may be hard to reconcile now, with terabytes of disk space and gigabytes of RAM in the average laptop, that a computer having megabytes of storage wouldn't be practical until the 1970s, and wouldn't be the norm until the 1980s. Once the storage obstacle had been tackled, we decided it would be best to store our data on other computers and access it over a network. At first, this could only be done at the slowest of speeds, kilobytes per second, then megabits, then gigabits. All of these advances in technology owe no small bit of thanks to data compression.
Data compression is the act of taking data stored in one format and transforming it into another format that requires less storage space. Lossless data compression does this in such a way that there is zero degradation in the quality of the data being compressed. Unfortunately, lossless compression is not possible for all data formats, but it is available for many and not required for certain types of data: images, video, and sound can all be compressed with loss and still be usable. With text or executable binary data, lossless compression is a must.
It is my intention over the next series of posts to walk you through encoding and decoding files with Huffman trees. In order to do this, however, we have to lay quite a bit of groundwork. Data compression happens at the bit level, which complicates matters, as modern computers are byte oriented and C++ offers no data type below a byte suitable for our purposes. This doesn't mean we can't access individual bits, it just means we're going to have to go a bit lower level than we usually do. In today's post I'm going to cover implementing a BitStream class, as well as reading and writing that BitStream to files.
The BitStream Class
C++ has a really great data type called std::bitset, but unfortunately for us it is not a dynamically resizing set, and so it does not suit our purposes. While vector<bool> is an optimized dynamic bitset, that only helps while it is in memory: we are still faced with the challenge of writing it to a file bitwise, as the normal C++ streams will write it bool by bool instead of bit by bit. A bool is written as a full byte, so we would actually grow our file doing that.
Instead, we will implement our own bit set, and make it a BitStream. For our BitStream we will use a vector<unsigned char> to hold the buffer. We will also track the bit count and the bit position (in the stream, for iteration). An unsigned char is one byte, or 8 bits wide, and the binary representation of unsigned char(0) is 00000000. By using bitwise operators we can read the value of each bit, as well as set each bit to 0 or 1.
#include <vector>
#include <iostream>

typedef unsigned char BYTE;

// returns b with the bit at position x set to 1
BYTE setBitAtX(BYTE b, int x) {
    b |= (1 << x);
    return b;
}

// returns a non-zero value if the bit at position x of b is set, 0 otherwise
int getBitAtX(BYTE b, int x) {
    return b & (1 << x);
}
class BitStream {
    private:
        std::vector<unsigned char> buffer;
        int bitpos;
        int bitcount;
        bool getBitAtX(int byteIndex, int bitIndex);
        void setBitAtX(int byteIndex, int bitIndex);
    public:
        BitStream() : bitpos(0), bitcount(0) { }
        int size();
        bool readBit();
        char readChar();
        int readInt();
        int readInt(int width);
        void writeBit(bool bit);
        void writeInt(int value, int numBits);
        void writeChar(char value, int numBits = 8);
        bool done();
        int offset();
        BitStream& start();
        void flush();
};
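One thing worth noting: readBit() and writeBit() below rely on the private getBitAtX()/setBitAtX() members, whose definitions don't appear in this post. A minimal sketch of what they might look like, simply delegating to the free functions above on the indexed byte of the buffer, is:
// Assumed implementations of the private helpers (not shown in the original post).
// The leading :: calls the free functions, which are otherwise hidden by the members.
bool BitStream::getBitAtX(int byteIndex, int bitIndex) {
    return ::getBitAtX(buffer[byteIndex], bitIndex) != 0;
}

void BitStream::setBitAtX(int byteIndex, int bitIndex) {
    buffer[byteIndex] = ::setBitAtX(buffer[byteIndex], bitIndex);
}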
When it comes to the read and write operations, the most important thing is the ability to calculate the correct offsets for the current bit position being read from or written to. This is made possible by tracking the current bit position: from it we can compute the byte index by dividing the bit position by 8, and the bit index within that byte by taking the bit position modulo 8. For example, bit position 19 lives in byte 19 / 8 = 2, at bit 19 % 8 = 3 within that byte.
bool BitStream::readBit() {
    int byteIndex = bitpos / 8;    // which byte in the buffer
    int bitIndex = bitpos % 8;     // which bit within that byte
    if (byteIndex >= (int)buffer.size()) {
        std::cout<<"End of stream."<<std::endl;
        return false;
    }
    bool bit = getBitAtX(byteIndex, bitIndex);
    bitpos++;
    return bit;
}
void BitStream::writeBit(bool bit) {
    int byteIndex = bitpos / 8;
    int bitIndex = bitpos % 8;
    if (byteIndex >= (int)buffer.size())   // grow the buffer one byte at a time
        buffer.push_back(0);
    if (bit)
        setBitAtX(byteIndex, bitIndex);    // new bytes start zeroed, so only 1s need setting
    bitpos++;
    bitcount++;
}
By utilizing the above methods, extending them to multi-bit widths is straightforward:
void BitStream::writeInt(int value, int numBits) {
    for (int i = numBits - 1; i >= 0; --i) {   // most significant bit first
        writeBit((value >> i) & 1);
    }
}

void BitStream::writeChar(char value, int numBits) {
    for (int i = numBits - 1; i >= 0; --i) {
        writeBit((value >> i) & 1);
    }
}
char BitStream::readChar() {
    char value = 0;
    for (int i = 7; i >= 0; --i) {      // read 8 bits, most significant first
        value |= (readBit() << i);
    }
    return value;
}

int BitStream::readInt() {              // default width of 8 bits
    int value = 0;
    for (int i = 7; i >= 0; --i) {
        value |= (readBit() << i);
    }
    return value;
}

int BitStream::readInt(int width) {     // read an arbitrary-width integer
    int value = 0;
    for (int i = width - 1; i >= 0; --i) {
        value |= (readBit() << i);
    }
    return value;
}
The other major functionality of the BitStream class is, as the name suggests, the ability to read the data as a stream. This is a non-destructive way of iterating forward over the stream, facilitated through the use of start() and done() in combination with the readBit(), readChar(), or readInt() methods. The offset() method returns the current bit position (from 0) during streaming, with size() returning the total bit count of the stream. The flush() method clears the buffer and resets the stream.
int BitStream::size() {
    return bitcount;
}

BitStream& BitStream::start() {
    bitpos = 0;
    return *this;
}

int BitStream::offset() {
    return bitpos;
}

bool BitStream::done() {
    return bitpos >= bitcount;
}

void BitStream::flush() {
    buffer.clear();
    bitpos = 0;
    bitcount = 0;
}
With that, we have everything we need to read and write data as individual bits. But so far we've only addressed half the battle: we want to be able to save the compressed message, and to do that we still have to play in a byte-aligned world. Let's see how we can tackle that problem.
Writing BitStreams to Files
What makes our BitStream class so practical for use with compression algorithms is that we are no longer constrained to our system's bit widths for types, as you will see when I cover the Lempel-Ziv algorithm, which makes use of both 12-bit and 8-bit integers. Despite this utility, when it comes to writing our bitstream to a file, we must write it byte-wise. To do this, we convert groups of 8 bits to an unsigned char and write that to the file, until we have exhausted our stream.
#include <fstream>

void writeBitStreamToFile(BitStream compressed, string filename) {
    std::ofstream file(filename, std::ios::out | std::ios::binary);
    compressed.start();
    int j = 0;
    while (!compressed.done()) {
        int i = 0;
        unsigned char byte = 0;
        // pack the next (up to) 8 bits into one byte, most significant bit first
        while (j < compressed.size() && i < 8) {
            if (compressed.readBit()) {
                byte |= (1 << (7-i));
            }
            i++;
            j++;
        }
        file.write(reinterpret_cast<const char*>(&byte), sizeof(byte));
    }
    file.close();
}
Reading a file into a BitStream takes no special consideration, as we simply read the file char by char and use writeChar() to populate the BitStream.
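The post doesn't show that function, but a minimal sketch, assuming a helper named readBitStreamFromFile, could look like this:
// Hypothetical helper (name and signature are my own): read the file byte by byte
// and append each byte to the stream with writeChar().
BitStream readBitStreamFromFile(string filename) {
    std::ifstream file(filename, std::ios::in | std::ios::binary);
    BitStream bs;
    char c;
    while (file.get(c)) {
        bs.writeChar(c, 8);
    }
    return bs;
}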
As an example of working with data at the bit level, let's see how we can print the binary representation of a string using our newly created BitStream class.
Using the BitStream Class
As a warm-up, let's see how we can use the BitStream class to convert strings to BitStreams and print them. We begin by iterating over the string and using writeChar() to write each character to the stream. Optionally, we can supply the number of bits of the char to write; the default is 8, as a char is one byte. Once the string has been written to the stream, we can print its binary representation by calling readBit() in groups of 8, or its text by calling readChar().
#include <iostream>
#include <string>
#include "bitstream.hpp"
using namespace std;

void printBitStream(BitStream bs) {
    cout<<"Binary: \n";
    auto asBinary = bs.start();
    auto asText = bs.start();
    int i = 1;
    int j = 0;
    while (!asBinary.done()) {
        cout<<(asBinary.readBit() ? '1':'0');
        if (i % 8 == 0) { cout<<" "; j++; }
        if (j == 4) {
            // after four complete bytes, print them as text and start a new row
            for (int i = 0; i < j; i++)
                cout<<asText.readChar()<<" ";
            cout<<endl;
            j = 0;
        }
        i++;
    }
    if (j > 0) {
        // pad the final, partial row so the text column lines up
        for (int t = j; t < 4; t++)
            for (int k = 0; k < 9; k++) cout<<" ";
        for (int i = 0; i < j; i++)
            cout<<asText.readChar()<<" ";
    }
    cout<<endl;
}

void stringToBitStream(string str) {
    BitStream bs;
    for (char c : str) {
        bs.writeChar(c, 8);
    }
    printBitStream(bs);
}

int main() {
    stringToBitStream("Hello, world!");
    return 0;
}
max@MaxGorenLaptop:~/$ ./bitstream
Binary:
01001000 01100101 01101100 01101100 H e l l
01101111 00101100 00100000 01110111 o , w
01101111 01110010 01101100 01100100 o r l d
00100001 !
max@MaxGorenLaptop:~/$
That's where I am going to leave off for today. In the next post we will really put our BitStream class to use with some compression algorithms. Until then, Happy Hacking!