Feature Request: MIME-type aware compression decision for whole files #7403
Comments
@kpande might you explain your opinion?
Kernel drivers cannot make decisions at the "MIME type" level. Running the userspace application "file" on everything being written is absurd.
@RubenKelevra on the surface I can see why this sounds like an appealing idea. It makes good sense from the perspective of a userspace application. Unfortunately, the problem looks a little different from a kernel-space driver, which is expected to handle arbitrary blobs of data as efficiently as possible. Saso Kiselkov gave a talk a few years ago at an OpenZFS Developer Summit discussing this issue; I'm sure you'd find it interesting. His proposed solution isn't too different from yours, but instead of checking the MIME type to determine whether something is incompressible, it tracks some per-file compression statistics. This has the advantage of being completely generic and would let us decide beforehand whether we're likely to benefit from compression. Now we just need to implement it, or something like it! http://open-zfs.org/w/images/4/4d/Compression-Saso_Kiselkov.pdf
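For readers who don't want to dig through the slides, here is a minimal sketch of the per-file statistics idea. Everything in it is hypothetical (the struct, the field names, the 4 MiB warm-up window, the 1/8 savings threshold); it is not OpenZFS code, just an illustration of the mechanism the comment describes:

```c
/*
 * Hypothetical sketch of per-file compression statistics -- not
 * OpenZFS code. All names and thresholds are made up for illustration.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct file_compress_stats {
	uint64_t fcs_logical_bytes;	/* bytes handed to the compressor */
	uint64_t fcs_physical_bytes;	/* bytes actually stored on disk */
} file_compress_stats_t;

/* Record the outcome of one compressed record write. */
static void
fcs_update(file_compress_stats_t *fcs, uint64_t lsize, uint64_t psize)
{
	fcs->fcs_logical_bytes += lsize;
	fcs->fcs_physical_bytes += psize;
}

/*
 * Decide whether the next record of this file is worth compressing.
 * During a warm-up window we always try; afterwards we require that
 * compression has saved at least 1/8 of the logical bytes so far.
 */
static bool
fcs_should_compress(const file_compress_stats_t *fcs)
{
	const uint64_t warmup = 4ULL * 1024 * 1024;	/* sample 4 MiB first */

	if (fcs->fcs_logical_bytes < warmup)
		return (true);
	return (fcs->fcs_physical_bytes <=
	    fcs->fcs_logical_bytes - (fcs->fcs_logical_bytes >> 3));
}
```

The appeal, as the comment notes, is that this is format-agnostic: it needs no knowledge of MIME types and adapts if a file's content changes.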
@behlendorf thanks for the clarification. My thoughts are: a /home directory on ZFS is commonly filled with many different types of files. Running some nginx sites, I know it's pretty common to decide whether to compress a file based on its MIME type, and that works great even under heavy load. So spending 8 ms of computing power per file written, on the first block of each file, to determine whether the MIME type is on a blacklist still sounds like a good idea, rather than throwing several hundred MB of incompressible binary data at a compression algorithm, when a simple 1-bit flag in the file metadata could save all that effort. Sure, there are drawbacks: if the blacklist is changed, all old flags must be reevaluated, but I guess that could be done by comparing each file's creation timestamp against the time of the last change to the MIME blacklist (see the sketch below). I still think this idea isn't that crazy. :)
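A minimal sketch of the 1-bit flag plus timestamp invalidation described above. This is purely hypothetical: the struct, field names, and the notion of caching the verdict in file metadata belong to the proposal, not to anything that exists in ZFS:

```c
/*
 * Hypothetical bookkeeping for the proposal above: cache a one-bit
 * "incompressible" verdict per file and re-sniff the MIME type only
 * when the blacklist has changed since the verdict was computed.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct file_meta {
	uint64_t fm_flag_time;		/* when the verdict was computed */
	bool	 fm_incompressible;	/* cached MIME-blacklist verdict */
} file_meta_t;

/* The flag is stale if the blacklist was modified after it was set. */
static bool
flag_is_stale(const file_meta_t *fm, uint64_t blacklist_mtime)
{
	return (fm->fm_flag_time < blacklist_mtime);
}
```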
Well, you already wrote your opinion. Then you closed this ticket, refused to respond to my question about why you think that way, and blocked all further discussion of this topic. I don't like your attitude. I don't think an open discussion of ideas hurts this project, and if you do, you might be wrong. Censorship of opinions you don't like should have no place in an open source project. Get your frustrations sorted out somewhere else, not here.

That said, I understand you dislike the idea of calling an external program to decide whether a block should be compressed, but that's not what I wanted to suggest here. So, some facts:

Furthermore, I like the auto-compression idea, but it's a solution for not overloading a system with compression effort. Applying just auto-compression to something like a /home folder with mixed content might leave compressible files uncompressed, while files that aren't compressible get selected for compression attempts. Adding the ability to predict whether compression is worth trying increases the chance of selecting actually compressible blocks, which would improve the compression ratio while keeping CPU usage at the same level.
Currently, LZ4 compression uses a fast bail-out when it detects that a recordsize block isn't compressible. GZIP invests a huge amount of CPU power to do the same (if it even bails out early; I'm not sure that's implemented).
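For reference, here is a userspace toy (not the ZFS source) that reproduces that bail-out mechanism with the stock liblz4 API: ZFS gives the compressor an output buffer roughly one eighth smaller than the input, so a record that can't save about 12.5% fails fast and is stored uncompressed. The buffer sizes and sample input below are arbitrary:

```c
/*
 * Userspace toy illustrating the bail-out described above, using
 * liblz4 (compile with -llz4). The output buffer is sized to 87.5%
 * of the input, so LZ4_compress_default() returns 0 for any block
 * that cannot save at least one eighth -- the cue to store the
 * record uncompressed.
 */
#include <stdio.h>
#include <string.h>
#include <lz4.h>

int
main(void)
{
	static char src[4096];
	static char dst[4096];

	memset(src, 'A', sizeof (src));	/* highly compressible input */

	int d_len = (int)(sizeof (src) - (sizeof (src) >> 3));
	int c_len = LZ4_compress_default(src, dst, (int)sizeof (src), d_len);

	if (c_len == 0)
		printf("bail out: store the record uncompressed\n");
	else
		printf("compressed %zu -> %d bytes\n", sizeof (src), c_len);
	return (0);
}
```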
Deciding whether a block is worth compressing usually needs no justification; we all make this judgment naturally when sending an email. Nobody would seriously consider investing an hour of CPU time compressing JPEGs or a video file with LZMA2, because it's never worth it.
Why don't we just rely on the MIME type of a file to judge whether ZFS should try to compress it in the first place? This information should almost always be available (except for sparse files that don't contain the head of the file), and it's a very fast decision:
`file -i random_filename.mp4` gives me `video/mp4` in 47 ms real, 4 ms user, and 4 ms sys on a spinning disk running ZFS, while a Linux kernel compile is running on the same volume, with a recordsize of up to 1M on this pool.
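For anyone who wants to reproduce that measurement without forking file(1) per file: the same classification is available in-process through libmagic, the library behind file(1), and only the head of the file needs to be read. A minimal sketch follows (userspace only, which is exactly the kernel-space objection raised above; the 128 KiB sniff size is an arbitrary choice):

```c
/*
 * Classify a file's MIME type from its first record only, in-process,
 * via libmagic (compile with -lmagic). Userspace sketch -- nothing
 * like this exists inside ZFS.
 */
#include <stdio.h>
#include <magic.h>

int
main(int argc, char **argv)
{
	static char buf[128 * 1024];	/* sniff only the head of the file */

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}

	FILE *fp = fopen(argv[1], "rb");
	if (fp == NULL) {
		perror("fopen");
		return (1);
	}
	size_t n = fread(buf, 1, sizeof (buf), fp);
	fclose(fp);

	magic_t m = magic_open(MAGIC_MIME_TYPE);
	if (m == NULL || magic_load(m, NULL) != 0) {
		fprintf(stderr, "libmagic init failed\n");
		return (1);
	}

	const char *mime = magic_buffer(m, buf, n);
	printf("%s\n", mime != NULL ? mime : "unknown");
	magic_close(m);
	return (0);
}
```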
We can easily provide a standard blacklist of MIME types (for archive formats, audio and video files, and the like) and give users the ability to add additional MIME types for custom needs.
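The lookup itself would be trivial; here is a hypothetical default list and membership test. The entries are merely examples of already-compressed formats, not a vetted list, and the function name is made up:

```c
/*
 * Hypothetical default blacklist of already-compressed MIME types and
 * a linear membership test. Entries are illustrative examples only.
 */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static const char *mime_blacklist[] = {
	"video/mp4",
	"audio/mpeg",
	"image/jpeg",
	"image/png",
	"application/zip",
	"application/gzip",
};

static bool
mime_is_blacklisted(const char *mime)
{
	size_t n = sizeof (mime_blacklist) / sizeof (mime_blacklist[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(mime, mime_blacklist[i]) == 0)
			return (true);
	}
	return (false);
}
```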
This feature request is a successor of #7373
Edit:
I've compiled a list of MIME types that are not worth any compression effort: