Feature Request: MIME-type aware compression decision for whole files #7403

Closed
RubenKelevra opened this issue Apr 6, 2018 · 8 comments

Comments

@RubenKelevra

RubenKelevra commented Apr 6, 2018

Currently, LZ4 compression bails out quickly when it detects that a recordsize block isn't compressible. GZIP invests a huge amount of CPU time to reach the same conclusion (if it bails out early at all; I'm not sure this is implemented).

Adding the ability to detect whether each block is worth compressing is usually not necessary: we all make this judgment naturally when we want to send an email. Nobody would seriously consider spending an hour of CPU time compressing JPGs or a video file with LZMA2, because it's never worth it.

Why don't we just rely on the MIME type of a file to judge whether ZFS should try to compress it in the first place? This information should nearly always be available (except for sparse files that don't contain the head of the file), and it's a very fast decision:

`file -i random_filename.mp4` gives me `video/mp4` in 47 ms real, 4 ms user and 4 ms sys on a spinning disk running ZFS, while a Linux kernel is being compiled on the same volume. The recordsize on this pool is up to 1M.

We can easily provide a default blacklist of MIME types (for archive formats, audio and video files, and similar) and let users add additional MIME types if they have custom needs.

This feature request is a successor of #7373

Edit:

I've compiled a list of mime types which are not worth any compression effort:

application/font-woff
font/woff
font/woff2
application/gzip
application/java-archive
application/java-serialized-object
application/java-vm
application/vnd.android.package-archive
application/vnd.apple.pkpass
application/vnd.google-earth.kmz
application/x-cfs-compressed
application/x-lzip
application/x-lzma
application/x-lzop
application/x-snappy-framed
application/x-xz
application/x-compress
application/x-dgc-compressed
application/x-dar
application/x-apple-diskimage
application/x-gca-compressed
application/x-gtar
application/x-cbr
application/x-bdoc
application/x-java-jnlp-file
application/x-gzip
application/x-redhat-package-manager
application/x-shockwave-flash
application/vnd.openofficeorg.extension
application/vnd.ms-3mfdocument
image/apng
image/gif
image/jp2
image/jpeg
image/jpm
image/jpx
image/pjpeg
image/png
image/vnd.mozilla.apng
image/webp
model/vnd.dwf
model/x3d+binary
application/x-virtualbox-vbox-extpack
application/x-xpinstall

#already compressed, however a small compression ratio can be achieved
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.openxmlformats-officedocument.presentationml.slide
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.wordprocessingml.template
image/vnd.adobe.photoshop

#usually compressed archive format
application/x-7z-compressed
application/x-ace-compressed
application/x-arj
application/x-bzip
application/x-bzip2
application/x-alz-compressed
application/x-b1
application/vnd.ms-cab-compressed
application/x-lzh
application/x-lzx
application/x-rar-compressed
application/x-stuffit
application/x-stuffitx
application/zip
application/x-zoo
application/x-par2
application/x-debian-package
application/x-deb
application/x-dgc-compressed
application/x-msdownload
image/tiff

#media containers
application/ogg
application/x-dvi
application/vnd.ms-asf

#encrypted
application/pgp-encrypted
application/x-astrotite-afa
application/x-silverlight-app
multipart/encrypted

#encryption keys
application/x-pkcs12
application/x-pkcs7-certificates

#usually very low compression ratio
application/pdf
application/vnd.apple.installer+xml
application/x-blorb

#wildcard filter:
video/* (except video/raw)
audio/* (except audio/vnd.wave, audio/wav, audio/wave, audio/x-aiff, audio/x-wav, audio/midi)
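
As a rough illustration of how such a list could be evaluated, here is a minimal user-space sketch in Python. It is not ZFS code: the function name `should_skip_compression` and the constant names are hypothetical, and only a handful of entries from the list above are included. It assumes the input looks like the output of `file -i`, which may carry a `; charset=...` suffix.

```python
# Hypothetical sketch: exact entries, wildcard prefixes and exceptions.
# Only a small, illustrative subset of the list above is included.
BLOCKED_EXACT = frozenset({
    "application/gzip",
    "application/zip",
    "application/x-7z-compressed",
    "application/pdf",
    "image/jpeg",
    "image/png",
})

# The wildcard filters video/* and audio/* expressed as prefixes.
BLOCKED_PREFIXES = ("video/", "audio/")

# Types excluded from the wildcard filters because they are uncompressed.
WILDCARD_EXCEPTIONS = frozenset({
    "video/raw",
    "audio/vnd.wave",
    "audio/wav",
    "audio/wave",
    "audio/x-aiff",
    "audio/x-wav",
    "audio/midi",
})


def should_skip_compression(mime_type: str) -> bool:
    """Return True if the MIME type is on the blacklist."""
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = mime_type.split(";", 1)[0].strip().lower()
    if mime in WILDCARD_EXCEPTIONS:
        return False
    if mime in BLOCKED_EXACT:
        return True
    return mime.startswith(BLOCKED_PREFIXES)
```

With this sketch, `should_skip_compression("video/mp4; charset=binary")` is true, while `video/raw` and `text/plain` would still be handed to the compressor.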
@RubenKelevra RubenKelevra changed the title from "Feature Request: MIME-type aware compression decision for full files" to "Feature Request: MIME-type aware compression decision for whole files" Apr 6, 2018
@RubenKelevra
Author

@kpande might you explain your opinion?

@DeHackEd
Contributor

DeHackEd commented Apr 7, 2018

Kernel drivers cannot make a decision at the "MIME type" level. Running the user application "file" on everything being written is absurd.

@behlendorf
Contributor

behlendorf commented Apr 7, 2018

@RubenKelevra on the surface I can see why this sounds like an appealing idea. It makes good sense from the perspective of a user space application. Unfortunately, the problem looks a little different from the perspective of a kernel space driver, which is expected to handle arbitrary blobs of data as efficiently as possible.

Saso Kiselkov gave a talk a few years ago at an OpenZFS development summit discussing this issue; I'm sure you'd find it interesting. His proposed solution isn't too different from yours, but instead of checking the MIME type to determine whether something is incompressible, we track some per-file compression statistics. This has the advantage of being completely generic and would let us decide beforehand if we're likely to benefit from compression. Now we just need to implement it, or something like it!

http://open-zfs.org/w/images/4/4d/Compression-Saso_Kiselkov.pdf
http://www.youtube.com/watch?v=TZF92taa_us
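
To make the per-file statistics idea a bit more concrete, here is a rough user-space sketch in Python. It is not the design from the talk and not ZFS code: the class `FileCompressionStats`, the function `write_block`, the 12.5% saving threshold, the smoothing factor and the re-probe interval are all assumptions chosen for illustration, and `zlib` stands in for whatever compressor the dataset uses.

```python
import zlib


class FileCompressionStats:
    """Hypothetical per-file record of how well recent blocks compressed."""

    def __init__(self, min_saving=0.125, probe_interval=8):
        self.min_saving = min_saving          # assumed "worth it" threshold
        self.probe_interval = probe_interval  # re-test after this many skips
        self.avg_saving = None                # smoothed fraction of bytes saved
        self.skipped = 0

    def should_try(self):
        """Decide beforehand whether compressing the next block is worthwhile."""
        if self.avg_saving is None or self.avg_saving >= self.min_saving:
            return True
        # The file looked incompressible so far; only probe occasionally.
        self.skipped += 1
        if self.skipped >= self.probe_interval:
            self.skipped = 0
            return True
        return False

    def record(self, raw_len, compressed_len):
        """Fold the result of one compression attempt into the average."""
        saving = 1.0 - compressed_len / raw_len
        if self.avg_saving is None:
            self.avg_saving = saving
        else:
            self.avg_saving = 0.5 * self.avg_saving + 0.5 * saving


def write_block(stats, block):
    """Return (payload, is_compressed) for one block of a file."""
    if not block or not stats.should_try():
        return block, False
    packed = zlib.compress(block, 6)
    stats.record(len(block), len(packed))
    if len(packed) <= len(block) * (1.0 - stats.min_saving):
        return packed, True
    return block, False
```

The point of the sketch is that after one failed attempt the following blocks of the same file are stored without touching the compressor, and an occasional probe lets the file recover if its content changes.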

@openzfs openzfs unlocked this conversation Apr 7, 2018
@RubenKelevra
Author

@behlendorf thanks for the clarification.

So my thoughts are: if you run a /home directory on ZFS, it will commonly be filled with many different types of files. From running some nginx sites, I know that it's pretty common to decide whether to compress a file based on its MIME type, which works great even under large workloads.

So spending 8 ms of computing time per file written, on the first block of each file, to determine whether the MIME type is on a blacklist still sounds like a good idea compared with throwing several hundred MB of incompressible binary data at a compression algorithm, when a simple 1-bit flag in the file metadata could save all this effort.
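
A minimal sketch of what checking the first block could look like, written in Python for readability rather than as kernel code. The names `MAGIC_SIGNATURES`, `sniff_mime` and `skip_compression_flag` are hypothetical, and the table holds only a few well-known file signatures; a real implementation would need a much larger table and would not call an external program.

```python
# A few well-known leading byte signatures of already-compressed formats.
# Illustrative only; the table and function names are hypothetical.
MAGIC_SIGNATURES = (
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"7z\xbc\xaf\x27\x1c", "application/x-7z-compressed"),
    (b"\xfd7zXZ\x00", "application/x-xz"),
)


def sniff_mime(first_block: bytes):
    """Guess a MIME type from the first bytes of a file, or return None."""
    for magic, mime in MAGIC_SIGNATURES:
        if first_block.startswith(magic):
            return mime
    return None


def skip_compression_flag(first_block: bytes) -> bool:
    """The 1-bit decision: True means 'do not try to compress this file'."""
    # Every type in this small table is assumed to be on the blacklist.
    return sniff_mime(first_block) is not None
```

The flag would be computed once from the first block and then consulted for every later block of the same file, so the per-block cost is a single bit test.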

Sure, there might be some drawbacks. For example, if the blacklist is changed, all old flags must be reevaluated, but I guess this could be done based on the timestamps of file creation and of the last change to the MIME blacklist.

I still think this idea isn't that crazy. :)

@RubenKelevra
Author

@kpande

Well, you already wrote your opinion. Then you closed this ticket, refused to answer my question about why you think that way, and blocked all further discussion of this topic.

I don't like your attitude. I don't think an open discussion of ideas hurts this project, and if you do, you might be wrong. Censoring opinions you don't like should have no place in an open source project. Sort out your frustrations somewhere else, not here.

That said, I understand that you dislike the idea of calling an external program to decide whether a block should be compressed, but that's not what I wanted to suggest here.

So some facts:
- We know that file magic works; we use it every day on OS X and Unix to decide how to open a file.
- We know that most of the file types I've listed are incompressible.
- We know that file magic is a very good indicator of what type of file we are looking at.
- In other high-performance/low-latency scenarios where a fast decision about compressibility is needed, applications use file magic to decide with low latency whether the CPU cycles spent trying to compress the file would be wasted.

Furthermore, I like the auto compression idea, but it's a solution for not overloading a system with compression effort.

Applying just auto compression to something like a /home folder with mixed content might leave compressible files uncompressed, while other files that are not compressible are still selected for a compression attempt.

So adding the ability to predict whether compression is worth trying increases the chance of selecting blocks that are actually compressible, which would improve the compression ratio while keeping CPU usage at the same level.

4 participants
@behlendorf @RubenKelevra @DeHackEd and others