Compress, decompress and benchmark using Huffman coding and LZW
Please note! The current LZW implementation works, but it is very slow with larger files (over 0.5 MB). The app will become unresponsive for a long time if you try to compress a large file with LZW or use a large file with the benchmarking functionality. Consider using smaller files for now.
In this project, you can compress files using either Huffman coding or LZW (Lempel-Ziv-Welch). Decompression, as well as comparisons between the two algorithms, are also possible.
You can download a pre-built JAR file from here.
General interest:
- Manual (how to use)
Advanced interest:
- Definition (what was planned)
- Implementation (how it turned out)
- Testing (how testing was done)
- Performance (how the algorithms stack up)
Geek interest:
- Javadoc (code documentation, updated manually)
- Week 1 Report (first week)
- Week 2 Report
- Week 3 Report
- Week 4 Report
- Week 5 Report
- Week 6 Report (final week)
The goal is to reduce the size of a file, and to be able to restore the original file from the size-reduced version. This is also known as "lossless compression" (as opposed to "lossy compression", which results in irrevocable data loss, as with JPEG files).
There are two main reasons to apply compression. The first is storage: smaller files fit more easily onto storage media, whether HDD, SSD, optical media (CD/DVD/Blu-ray) or tape drives. The second is that less bandwidth is needed to transfer the data.
A compressed file must be decompressed before it is usable, which incurs a time penalty. On slower systems, decompressing a file can take a considerable amount of time (this is one reason why installing new software can take so long: the installer is decompressing files).
Two different algorithms are compared in this project.
algorithm | GitHub (source) | Wikipedia (info) | time complexity | space complexity |
---|---|---|---|---|
Huffman coding | source | info | O(n) | O(n) |
LZW (Lempel-Ziv-Welch) | source | info | O(n) | O(n) |
Huffman works by counting the occurrences of each character in the data. It then builds a tree structure that maps each character to a sequence of bits: characters that appear more frequently get a shorter bit string representation, and uncommon/rare characters get a longer one.
Usually, each character is encoded with either 8 or 16 bits. For example, the character 'A' is 01000001 in ASCII. Huffman coding might encode 'A' as 01 or 0110, potentially saving a lot of space.
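To make the idea concrete, here is a minimal sketch of the counting and tree-building steps in Java. This is not the implementation used in this project (see the source link in the table above for that); the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {

    // A node is either a leaf (holding a character) or an internal node.
    static class Node implements Comparable<Node> {
        final int freq;
        final char ch;         // only meaningful for leaves
        final Node left, right;

        Node(char ch, int freq) { this(ch, freq, null, null); }
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        @Override
        public int compareTo(Node other) { return Integer.compare(freq, other.freq); }
    }

    // Count character frequencies, build the tree, and read off the codes.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> pq = new PriorityQueue<>();
        freq.forEach((c, f) -> pq.add(new Node(c, f)));

        // Repeatedly merge the two least frequent nodes into one.
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        collect(pq.poll(), "", codes);
        return codes;
    }

    // Walk the tree: left edges append 0, right edges append 1.
    static void collect(Node node, String path, Map<Character, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.ch, path.isEmpty() ? "0" : path); // single-symbol edge case
            return;
        }
        collect(node.left, path + "0", codes);
        collect(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        // 'a' occurs most often in "abracadabra", so it gets the shortest code.
        buildCodes("abracadabra").forEach((c, code) ->
                System.out.println("'" + c + "' -> " + code));
    }
}
```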
LZW works differently from Huffman. It builds a dictionary of strings (preferably long ones) that map to binary sequences. For example, if textual data contains the word "welcome" many times, each occurrence of that word gets a binary representation that is shorter than the original (in the optimal scenario).
Thus, LZW works best with data that contains many long, repeating strings. This is usually the case with text files, but not with binary files. The compression loop is sketched below.
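The following is a minimal sketch of the LZW compression loop in Java; again, it is not this project's actual implementation, and the names are illustrative only. It shows how the dictionary grows as longer repeated strings are encountered, so repeats collapse into single codes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwSketch {

    // Compress a string into a list of dictionary codes.
    static List<Integer> compress(String text) {
        // Start with a dictionary of all single 8-bit characters (codes 0-255).
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
        int nextCode = 256;

        List<Integer> out = new ArrayList<>();
        String current = "";
        for (char c : text.toCharArray()) {
            String extended = current + c;
            if (dict.containsKey(extended)) {
                current = extended;              // keep growing the match
            } else {
                out.add(dict.get(current));      // emit code for longest known string
                dict.put(extended, nextCode++);  // learn the new, longer string
                current = "" + c;
            }
        }
        if (!current.isEmpty()) out.add(dict.get(current));
        return out;
    }

    public static void main(String[] args) {
        // Repeated substrings of "welcome " collapse into single codes.
        System.out.println(compress("welcome welcome welcome"));
    }
}
```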
You'll find sample data that you can use in the data folder. The contents of that folder are in the public domain.
More data (not created by me) can be found online with the search query "data compression corpora".
name | type | format | size | contents | compresses |
---|---|---|---|---|---|
cities.sql | SQL | SQL data | 6,39 KB | list of cities in Finland | very well |
keychain.jpeg | JPEG† | JPEG picture | 83,9 KB | an old picture of my keychain | very poorly |
lorem_ipsum.docx | DOCX | Word document | 19,6 KB | Lorem Ipsum | poorly |
lorem_ipsum.pdf | PDF | PDF document | 95,4 KB | Lorem Ipsum | poorly |
lorem_ipsum.txt | TXT | text file | 17,2 KB | Lorem Ipsum | very well |
screenshot.png | PNG | PNG image | 53,6 KB | screenshot from Windows | poorly |
† the proper name of a JPEG file includes the E (for "Experts"); the shorter JPG variant stems from the 8.3 filename limit of older systems