This project is C++ implementation of a simple lossless compression algorithm, based on the Huffman Coding. It was made as a practical exam for the Data Structures (INF 213) course of the Computer Science major of the Federal University of Viçosa (UFV).
To compile the code, just run the Makefile:
~$ make
If it doesn't work, try to compile the files manualy with the following commands:
~$ g++ main.cpp -std=c++11 -O3 -c
~$ g++ Huffman.cpp -std=c++11 -O3 -c
~$ g++ main.o Huffman.o -std=c++11 -O3 -o huffman
If you updated the code and got a 'up to date' message when trying to run the Makefile, try:
~$ make clear
or
~$ rm main.o Huffman.o huffman
To compress a file, run:
~$ ./huffman c input_file_name compressed_output_name
To decompress a file, run:
~$ ./huffman d compressed_file_name decompressed_output_name
Proposed by David A. Huffman in 19511, the Huffman Coding an optimal prefix code commonly used for lossless data compression. It generates a variable-length code to represent each symbol a of sequence, based on their frequency of occurrence in the given stream or on their probability of appearing in it. The implementation receives a file as input, counts the number of occurrences of each one of the 256 ASCII symbols, builds a tree of frequencies and generates a set of binary codes based on such tree. After that, it creates a file composed by the frequencies of each symbol and the encoded symbols of the original file. To decompress the created files, the implementation uses the frequencies to rebuild the tree and reverse engineers the symbols using the codes.
A simple example of how the algorithm would encode the string "ABBBABBBCBDB".
The figure below represents a possible tree generated by the algorithm.
Each one of the leaves represents a symbol of the string and the path from root to them describes their binary code, with each movement to left child being a 1 and each movement to the right child being a 0. By the way, note how the most frequent symbols stay on the leaves closer to the root, creating a shorter code for them.
Result:
Symbol | Frequency | ASCII Code | Huffman Code |
---|---|---|---|
A | 2 | 1000001 | 01 |
B | 8 | 1000010 | 1 |
C | 1 | 1000011 | 001 |
D | 1 | 1000100 | 000 |
1. HUFFMAN, David A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, v. 40, n. 9, p. 1098-1101, 1952.