Compress, decompress and benchmark using Huffman coding and LZW
Please note! The current LZW implementation works, but it is very slow with larger files (over 0.5 MB). The app will become unresponsive for a long time if you try to compress a large file with LZW or use a large file with the benchmarking functionality. Consider using smaller files for now.
In this project, you can compress files using either Huffman coding or LZW (Lempel-Ziv-Welch). Decompression, as well as comparisons between the two algorithms, are also possible.
You can download a pre-built JAR file from here.
General interest:
- Manual (how to use)
Advanced interest:
- Definition (what was planned)
- Implementation (how it turned out)
- Testing (how testing was done)
- Performance (how the algorithms stack up)
Geek interest:
- Javadoc (code documentation, updated manually)
- Week 1 Report (first week)
- Week 2 Report
- Week 3 Report
- Week 4 Report
- Week 5 Report
- Week 6 Report (final week)
The goal is to reduce the size of a file, and to be able to restore the original file from the size-reduced version. This is also known as "lossless compression" (as opposed to "lossy compression", which results in irrevocable data loss, as with JPEG files).
There are two main reasons to apply compression. The first is storage: smaller files fit more easily onto storage media, whether HDD, SSD, optical media (CD/DVD/Blu-ray) or tape drives. The second is that less bandwidth is needed to transfer the data.
A compressed file must be decompressed before it is usable, which incurs a time penalty. On slower systems, decompressing a file can take a considerable amount of time (this is one reason why installing new software can take so long: the installer is decompressing files).
Two different algorithms are compared in this project.
algorithm | GitHub (source) | Wikipedia (info) | time complexity | space complexity |
---|---|---|---|---|
Huffman coding | source | info | O(n) | O(n) |
LZW (Lempel-Ziv-Welch) | source | info | O(n) | O(n) |
Huffman works by counting the occurrences of each character in the data. It then builds a tree structure that maps each character to a sequence of bits: characters that appear more frequently get a shorter bit string representation, and uncommon/rare characters get a longer one.
Usually, each character is encoded with either 8 or 16 bits. For example, the character 'A' is 01000001 in ASCII. Huffman coding might encode 'A' as 01 or 0110, potentially saving a lot of space.
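To make the idea concrete, here is a minimal sketch of the counting and tree-building steps in Java. This is not the implementation used in this project (see the source link in the table above for that); the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {

    // A node is either a leaf (holding a character) or an internal node.
    static class Node implements Comparable<Node> {
        final int freq;
        final char ch;         // only meaningful for leaves
        final Node left, right;

        Node(char ch, int freq) { this(ch, freq, null, null); }
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        @Override
        public int compareTo(Node other) { return Integer.compare(freq, other.freq); }
    }

    // Count character frequencies, build the tree, and read off the codes.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> pq = new PriorityQueue<>();
        freq.forEach((c, f) -> pq.add(new Node(c, f)));

        // Repeatedly merge the two least frequent nodes into one.
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        collect(pq.poll(), "", codes);
        return codes;
    }

    // Walk the tree: left edges append 0, right edges append 1.
    static void collect(Node node, String path, Map<Character, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.ch, path.isEmpty() ? "0" : path); // single-symbol edge case
            return;
        }
        collect(node.left, path + "0", codes);
        collect(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        // 'a' occurs most often in "abracadabra", so it gets the shortest code.
        buildCodes("abracadabra").forEach((c, code) ->
                System.out.println("'" + c + "' -> " + code));
    }
}
```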
LZW works differently from Huffman. It builds a dictionary of strings (preferably long ones) that map to binary sequences. For example, if textual data contains the word "welcome" many times, each occurrence of that word gets a binary representation that is shorter than the original (in the optimal scenario).
Thus, LZW works best with data that contains many long, repeating strings. This is usually the case with text files, but not with binary files. The compression loop is sketched below.
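The following is a minimal sketch of the LZW compression loop in Java; again, it is not this project's actual implementation, and the names are illustrative only. It shows how the dictionary grows as longer repeated strings are encountered, so repeats collapse into single codes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwSketch {

    // Compress a string into a list of dictionary codes.
    static List<Integer> compress(String text) {
        // Start with a dictionary of all single 8-bit characters (codes 0-255).
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
        int nextCode = 256;

        List<Integer> out = new ArrayList<>();
        String current = "";
        for (char c : text.toCharArray()) {
            String extended = current + c;
            if (dict.containsKey(extended)) {
                current = extended;              // keep growing the match
            } else {
                out.add(dict.get(current));      // emit code for longest known string
                dict.put(extended, nextCode++);  // learn the new, longer string
                current = "" + c;
            }
        }
        if (!current.isEmpty()) out.add(dict.get(current));
        return out;
    }

    public static void main(String[] args) {
        // Repeated substrings of "welcome " collapse into single codes.
        System.out.println(compress("welcome welcome welcome"));
    }
}
```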
You'll find sample data that you can use in the data folder. The contents of that folder are in the public domain.
More data (not created by me) can be found online with the search query "data compression corpora".
name | type | format | size | contents | compresses |
---|---|---|---|---|---|
cities.sql | SQL | SQL data | 6,39 KB | list of cities in Finland | very well |
keychain.jpeg | JPEG† | JPEG picture | 83,9 KB | an old picture of my keychain | very poorly |
lorem_ipsum.docx | DOCX | Word document | 19,6 KB | Lorem Ipsum | poorly |
lorem_ipsum.pdf | PDF | PDF document | 95,4 KB | Lorem Ipsum | poorly |
lorem_ipsum.txt | TXT | text file | 17,2 KB | Lorem Ipsum | very well |
screenshot.png | PNG | PNG image | 53,6 KB | screenshot from Windows | poorly |
† the proper name of a JPEG file includes the E (for "Experts"); the shorter JPG variant stems from the 8.3 filename limit of older systems