
[WIP] gzip compression heuristic #7481

Closed

Conversation


@rbrewer123 rbrewer123 commented Apr 27, 2018

Description

This change uses lz4 as a heuristic to decide if a block is worth compressing with gzip levels 6-9. If lz4 provides any compression at all, gzip is attempted. Otherwise, the block is assumed incompressible and gzip is skipped.
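As a sketch, the control flow looks roughly like this (names and signatures are approximated from the ZFS compression code, not the exact patch):

/*
 * Rough sketch of the heuristic, not the exact patch. The real
 * lz4/gzip entry points in ZFS take (src, dst, s_len, d_len, level)
 * style arguments; treat these signatures as approximations.
 */
static size_t
gzip_with_lz4_probe(void *src, void *dst, size_t s_len, size_t d_len,
    int gzip_level)
{
	/* Cheap probe: lz4 is fast even when it fails to compress. */
	size_t lz4_len = lz4_compress(src, dst, s_len, s_len, 0);

	if (lz4_len >= s_len) {
		/* lz4 found nothing: assume incompressible, skip gzip. */
		return (s_len);
	}

	/* lz4 saved something, so the block is likely worth gzip's time. */
	return (gzip_compress(src, dst, s_len, d_len, gzip_level));
}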

Motivation and Context

The goal of this change is to improve gzip compression speed for incompressible data, without significantly affecting performance on compressible data. #7373 provides additional discussion and rationale.

How Has This Been Tested?

I haven't tested this yet. I will add benchmarks here as I test this. I encourage others with particular workloads to post their own benchmarks.

This change does not properly deal with QAT hardware compression yet. If this heuristic check proves viable, it may belong in gzip_compress(); the issue with that placement is that d_len has already been reduced by 12.5% there. I could use s_len instead, but that feels error-prone. For now I'm posting this for benchmarking purposes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@codecov

codecov bot commented Apr 27, 2018

Codecov Report

Merging #7481 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #7481      +/-   ##
==========================================
- Coverage   76.58%   76.57%   -0.01%     
==========================================
  Files         336      335       -1     
  Lines      107137   107107      -30     
==========================================
- Hits        82047    82020      -27     
+ Misses      25090    25087       -3
Flag      Coverage Δ
#kernel   76.79% <100%> (-0.15%) ⬇️
#user     66.11% <75%> (+0.29%) ⬆️


@ahrens
Member

ahrens commented Apr 27, 2018

@allanjude may be interested in this

	d_len = s_len - 1;
} else {
	d_len = s_len;
}
Contributor

@behlendorf behlendorf May 1, 2018


For this to be worthwhile we need to save at least a full sector, since that's what it will be padded out to. Let's go ahead and assume a 4k sector size since the actual ashift isn't available, and this is only a heuristic.
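In other words, something like the following (the 4K sector size is hard-coded as an assumption, since the real ashift isn't in scope here):

/*
 * Sketch: only accept compression that saves at least one assumed-4K
 * sector, since smaller savings are padded back out on disk anyway.
 */
#define	HEURISTIC_SECTOR	4096

if (s_len > HEURISTIC_SECTOR)
	d_len = s_len - HEURISTIC_SECTOR;
else
	d_len = s_len - 1;	/* tiny blocks: take any saving */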

Member


Making the assumption of 4K sectors would mean that gzip is mostly disabled on small-block files/zvols, right? That seems like an unnecessary regression in functionality. Could we somehow have the spa_t accessible here, so that we could check the spa_min_ashift?

Contributor


That would be good, and after a quick glance it does look like it's available in all the callers so it can be passed in.
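A sketch of that variant, assuming the spa_t gets threaded through to the compress call (spa_min_ashift is a real spa_t field; the surrounding plumbing here is hypothetical):

/*
 * Sketch, assuming a spa_t argument is now available: derive the
 * minimum useful saving from the pool's smallest ashift instead of
 * hard-coding 4K, so small-block pools keep gzip enabled.
 */
size_t minsave = (size_t)1 << spa->spa_min_ashift;

if (s_len > minsave)
	d_len = s_len - minsave;
else
	d_len = s_len - 1;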


@RubenKelevra RubenKelevra May 5, 2018


@behlendorf I'm not sure this is a safe assumption. If the data contains all zeros, it might be possible to compress data larger than one block down to a very small size, small enough to fit in the block pointer via the embedded_data feature. If we don't add an additional check to prevent this, wouldn't the check you propose completely break the embedded_data feature?

Contributor


@RubenKelevra If the data is all zeros, it will be caught by the zio_compress_zeroed_cb() check on line 112, and written as a hole rather than as data.

The point @behlendorf was trying to make is that the SAVINGS needs to be at least one sector, else doing any compression is a waste, since it won't actually consume any less space on disk.
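For reference, that zero-detection callback amounts to a word-by-word scan of each buffer segment, roughly (paraphrased from zio_compress.c):

/*
 * Paraphrase of zio_compress_zeroed_cb(): returns nonzero as soon as
 * any non-zero word is seen, so an all-zero block falls through and
 * is written as a hole instead of being compressed.
 */
static int
zio_compress_zeroed_cb(void *data, size_t len, void *private)
{
	uint64_t *end = (uint64_t *)((char *)data + len);
	uint64_t *word;

	for (word = (uint64_t *)data; word < end; word++)
		if (*word != 0)
			return (1);

	return (0);
}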


@kpande I don't feel comfortable with your replies anymore. Just stop.


@allanjude well, my point was: if the data is just one sector, no sector can be saved, so the check will always report zero sectors saved. Data that compresses very well but is also very small uncompressed would then no longer be compressed, and thus the embedded_data feature isn't triggered.

Member


@kpande: I think that @RubenKelevra is engaging in a technical discussion about the code in question. One needn't be an expert to ask questions and try to understand the code.

Contributor


Everyone is welcome to contribute to the technical discussion as long as it's done respectfully.

@RubenKelevra the embedded block pointers do complicate things slightly. For small data blocks there may be a benefit in doing the compression even if a full sector isn't saved. This is actually also true for metadata blocks. My comment, which @allanjude clarified nicely, was only for the common medium-to-large block case. There are quite a few cases here which need to be considered.

@allanjude
Contributor

This is an interesting idea. I wonder how we might make it more generic so it can be used for zstd and future compression algorithms as well. Like an 'is_slow' flag on any compression type/level that is slower than 100 MB/s on an average core, or something.
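For instance (a hypothetical extension of the existing per-algorithm table, not actual ZFS code):

/*
 * Hypothetical: extend the compression table with a "slow" flag so
 * the generic path knows when an lz4 pre-check is worth running.
 */
typedef struct zio_compress_info {
	char		*ci_name;
	int		ci_level;
	boolean_t	ci_is_slow;	/* roughly: < 100 MB/s per core */
	/* ... existing compress/decompress function pointers ... */
} zio_compress_info_t;

/*
 * gzip-9 would set the flag, lz4 would not:
 *	{ "gzip-9", 9, B_TRUE,  ... },
 *	{ "lz4",    0, B_FALSE, ... }
 */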

@RubenKelevra

@allanjude maybe we should test all compression algorithms on load and flag those which are slow, depending on the system in use? GZIP should be fast when using an extension card, so I wonder whether those tests are useful in that case.

@rbrewer123
Author

All, thanks for the helpful comments. I'm trying some new ways of building my benchmark test environment for this and it's been taking a while to get it all working.

@behlendorf I'm hoping some of the benchmarking will help tease out a good lz4 threshold to use. The squash compression benchmark https://quixdb.github.io/squash-benchmark/ has some interesting comparisons between algorithms. If you choose the "x-ray" file and "s-desktop" processor, it shows that lz4 level 1 gets a compression ratio of 1.00 at a speed over 700 MB/s, while gzip level 6 gets a ratio of 1.40 at 20 MB/s. It also lists an lz4f algorithm; I'm not sure how that compares to ZFS's lz4.

Based on that and other data on the squash benchmark site, I think there are three cases to consider for the heuristic, based on how well gzip is able to compress a file.

"Very compressible" case. If gzip can compress a given file greater than a ratio of, say, 1.8 (a rough guess for now), lz4 should also be able to compress the file (but less than gzip). The heuristic will decide the file is "very compressible" and invoke gzip to do full compression. Users see full gzip compression and a minimal speed reduction compared to without the heuristic.

"Incompressible" case. If gzip is unable to compress a file to a ratio higher than the zfs minimum of 1.14 (aka 87.5%), lz4 will likely provide no compression on the file at all. In this case, the heuristic will skip gzip altogether and save signficant CPU time. Users see a significant speedup compared to not using the heuristic. This case is the big win and the reason to use the heuristic.

"Somewhat compressible" case. These files have a gzip compression ratio between 1.8 and 1.14.
Users who choose gzip want whatever compression they can get on these files because they chose the CPU-internsive gzip to get as much compression as possible. Unfortunately, lz4 may not find much if any compression in this case, and the heuristic may incorrectly consider the file "incompressible" and skip gzip. Users would see a significant speedup but would miss out on the compression they expected from gzip compared to not using the heuristic.

Ideally, we would find an lz4 compression ratio, say greater than 1.01, that predicts that gzip would achieve better than the zfs minimum of 1.14. If the ratio we use for lz4 is too high compared to the ideal, users get faster and faster speed with less compression as we increase the ratio. If the ratio is too low, users get slower and slower speed (due to more invocations of gzip) but with maximal compression.
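Expressed as a tunable, that cutoff might look like the following sketch (names are hypothetical; 1.01 is the rough guess from above):

/*
 * Hypothetical tunable: minimum lz4 ratio (fixed-point, x100) before
 * we trust gzip to beat the ZFS 12.5% minimum saving.
 */
int zfs_gzip_lz4_ratio_min = 101;	/* 1.01 */

static boolean_t
lz4_predicts_gzip_win(size_t s_len, size_t lz4_len)
{
	/* s_len / lz4_len >= 1.01, compared without floating point */
	return (s_len * 100 >= lz4_len * (size_t)zfs_gzip_lz4_ratio_min);
}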

If we find that tuning the compression heuristic just doesn't work very well for these somewhat-compressible files, we could consider putting it back on the user and giving them "gzip6" and "gzip6+lz4" compression options. Or we could try to find a better predictor of gzip compressibility.

@ahrens ahrens added the Type: Feature Feature request or new feature label Jun 7, 2018
@ahrens
Member

ahrens commented Jun 7, 2018

@rbrewer123 I agree with your analysis. Can you gather some data on the "somewhat compressible" cases, that could help us choose the lz4 ratio cutoff, or show us that there is no reasonable cutoff value?

@rbrewer123
Author

@ahrens, life has been getting in the way lately but I do intend to collect some data to inform what to do next.

@RubenKelevra

@rbrewer123

I like your approach to fine-tuning those values, but why not just autotune them based on the actual data?

We could start with your numbers on boot, then keep a sliding average of the best value for the compress/no-compress threshold by simply picking some random blocks, say 1 in 10,000, and running both an lz4 and a gzip compression on them to gather additional data points.
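A rough sketch of that sampling idea (all names are hypothetical; ratios are fixed-point x100 as in the discussion above):

/*
 * Hypothetical autotune sketch: on roughly 1 in 10,000 blocks, run
 * both lz4 and gzip, and fold the observed lz4 ratio into a slowly
 * sliding average used as the threshold.
 */
#define	SAMPLE_EVERY	10000

static uint64_t sample_counter;
static uint64_t lz4_ratio_threshold = 101;	/* seeded at 1.01 */

static boolean_t
should_sample_block(void)
{
	/* atomic_inc_64_nv() is the existing ZFS/SPL atomic helper. */
	return (atomic_inc_64_nv(&sample_counter) % SAMPLE_EVERY == 0);
}

static void
update_threshold(uint64_t observed_lz4_ratio)
{
	/* 15/16 decay keeps the average moving slowly. */
	lz4_ratio_threshold =
	    (15 * lz4_ratio_threshold + observed_lz4_ratio) / 16;
}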

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Jun 15, 2018
@ggzengel
Contributor

Have you read this?
https://wr.informatik.uni-hamburg.de/_media/research:theses:florian_ehmke_adaptive_compression_for_the_zettabyte_file_system.pdf

@behlendorf behlendorf added Status: Inactive Not being actively updated and removed Status: Work in Progress Not yet ready for general review labels Sep 25, 2018
@ahrens ahrens added the Status: Design Review Needed Architecture or design is under discussion label Sep 27, 2018
@softwarecreations

softwarecreations commented Jul 21, 2019

What's happening with this? I'd like to use this feature.
It seems very important, considering #7373 is still unresolved.

How can we move this forward?

@lnicola
Contributor

lnicola commented Jul 21, 2019

@softwarecreations in the meantime, you should watch the ZSTD PR (#9024), which might be really close these days.

@PrivatePuffin
Contributor

@softwarecreations
What happened is the usual thing with GitHub:
both @RubenKelevra and @rbrewer123 stopped working on their respective PRs (or being on GitHub, mostly)...

@ColinIanKing and @softwarecreations
Although it's almost a necro, it's currently more worthwhile watching this one for ZSTD, as it seems to be getting somewhere, is actually working, and is based on a more recent ZSTD version:
#8941

@BrainSlayer
Contributor

@Ornias1993 I had the same idea already, but never implemented it, since zstd alone is much faster than gzip and lz4 would just waste CPU time here.

@BrainSlayer
Contributor

Checking whether a block is incompressible is also not that hard without lz4; a statistical approach, for instance checking for a Gaussian distribution, works too.
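For example, a cheap byte-frequency test in that spirit (a standalone sketch, not ZFS code; the threshold is a guess):

#include <stddef.h>
#include <stdint.h>

/*
 * Statistical sketch: histogram the bytes and flag the block as
 * incompressible when the distribution is near-uniform, i.e. there
 * is no symbol bias for an entropy coder to exploit.
 */
static int
looks_incompressible(const uint8_t *buf, size_t len)
{
	uint32_t hist[256] = { 0 };
	uint64_t dev = 0;
	size_t i, expected = len / 256;

	for (i = 0; i < len; i++)
		hist[buf[i]]++;

	/* Total absolute deviation from a perfectly uniform histogram. */
	for (i = 0; i < 256; i++)
		dev += (hist[i] > expected) ?
		    hist[i] - expected : expected - hist[i];

	/* Guessed threshold: tune against real workloads. */
	return (dev < len / 4);
}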

@PrivatePuffin
Contributor

PrivatePuffin commented Nov 25, 2019

@BrainSlayer I wasn't suggesting anything, just pointing them to a changed link ;)

That being said:
If you want to talk synergy with other PRs, Adaptive Compression might be a nice addon in the future ;)

@PrivatePuffin
Contributor

Okay, is this PR actually still relevant?
Considering we just merged ZSTD, I don't think it's worth doing such a thing anymore (and honestly, GZIP itself is now mostly irrelevant for general dataset compression).

Simply put: ZSTD is able to give GZIP-level compression at 50-75% of LZ4's speed, so adding any type of LZ4 heuristic on top would be useless.

I think this can quite safely be closed.

@adamdmoss
Contributor

IMHO it's fair to treat gzip as deprecated for purposes other than interop.

@PrivatePuffin
Contributor

IMHO it's fair to treat gzip as deprecated for purposes other than interop.

  • Interop, legacy, and QAT support.
    But yes, besides those I think there is not much need for CPU-based GZIP anymore, and as heuristics are not really very relevant for zstd, I think this is pretty much done-for.

@allanjude
Contributor

IMHO it's fair to treat gzip as deprecated for purposes other than interop.

* Interop, legacy, and QAT support.
  But yes, besides those I think there is not much need for CPU-based GZIP anymore, and as heuristics are not really very relevant for zstd, I think this is pretty much done-for.

A note that QAT-accelerated gzip likely has similar problems to ZSTD re: recompression. If a block was originally compressed with software gzip, QAT gzip may compress it differently and produce a different checksum. This may cause L2ARC issues (with compressed_arc disabled), but also cause problems with nopwrite, etc.

It is likely it would "inherit" the solution we come up with for ZSTD future-proofing.

@PrivatePuffin
Contributor

PrivatePuffin commented Aug 24, 2020

@allanjude This PR has nothing to do with GZIP in general, please add your note elsewhere because it's not relevant for this PR at all.

@ahrens
Member

ahrens commented Jun 4, 2021

I noticed that this PR hasn't been updated in some time. It seems like this might still have some benefit even with zstd? But it seems that it isn't quite complete, so we're going to close it for now. When you have time to revisit this work, feel free to reopen this PR or open a new one. Thanks for your contribution to OpenZFS, and please continue to open PRs to address issues that you encounter!

@ahrens ahrens closed this Jun 4, 2021