[WIP] gzip compression heuristic #7481
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #7481      +/-   ##
==========================================
- Coverage   76.58%   76.57%   -0.01%
==========================================
  Files         336      335       -1
  Lines      107137   107107      -30
==========================================
- Hits        82047    82020      -27
+ Misses      25090    25087       -3
Continue to review full report at Codecov.
@allanjude may be interested in this
```c
	d_len = s_len - 1;
} else {
	d_len = s_len;
}
```
For this to be worthwhile we need to save at least a full sector since that's what it will be padded out to. Let's go ahead and assume a 4k sector size since the actual ashift isn't available, and this is only a heuristic.
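To make the arithmetic concrete, here is a minimal sketch of that sector-count comparison under the 4K assumption; the constant and function names are illustrative, not existing ZFS symbols.

```c
#include <stddef.h>

/* Placeholder for the assumed 4K sector size (not a real ZFS constant). */
#define ASSUMED_SECTOR_SIZE	4096u

/* Round a byte count up to whole assumed sectors. */
static size_t
sectors_used(size_t len)
{
	return ((len + ASSUMED_SECTOR_SIZE - 1) / ASSUMED_SECTOR_SIZE);
}

/*
 * Compression is only worthwhile if the compressed length occupies at
 * least one fewer on-disk sector than the source length.
 */
static int
saves_a_full_sector(size_t s_len, size_t c_len)
{
	return (sectors_used(c_len) < sectors_used(s_len));
}
```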
Making the assumption of 4K sectors would mean that gzip is mostly disabled on small-block files/zvols, right? That seems like an unnecessary regression in functionality. Could we somehow have the spa_t accessible here, so that we could check the spa_min_ashift?
That would be good, and after a quick glance it does look like it's available in all the callers so it can be passed in.
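If the spa is passed through, the same comparison could be driven by the pool's minimum ashift instead of a constant. A rough sketch, assuming a min_ashift argument is plumbed in (the helper name and interface are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sector-savings check parameterized by the pool's minimum ashift
 * (e.g. derived from spa_min_ashift) rather than assuming 4K.
 */
static int
saves_a_full_sector_ashift(size_t s_len, size_t c_len, uint64_t min_ashift)
{
	size_t sector = (size_t)1 << min_ashift;
	size_t s_sectors = (s_len + sector - 1) >> min_ashift;
	size_t c_sectors = (c_len + sector - 1) >> min_ashift;

	return (c_sectors < s_sectors);
}
```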
@behlendorf I'm not sure this is a safe assumption. If the data contains all zeros, it might be possible to compress data larger than one block down to a very small size, small enough to fit in the block pointer via the embedded_data feature. If we don't add an additional check to avoid this, we completely break the embedded_data feature with the check you propose, right?
@RubenKelevra If the data is all zeros, it will be caught by the zio_compress_zeroed_cb() check on line 112, and written as a hole rather than as data.
The point @behlendorf was trying to make is that the SAVINGS needs to be at least one sector, else doing any compression is a waste, since it won't actually consume any less space on disk.
@kpande I don't feel comfortable with your replies anymore. Just stop.
@allanjude well, my point was: if the data is just one sector, no full sector can ever be saved, so the check will always report zero sectors saved. If the data compresses very well but is also very small uncompressed, it will no longer be compressed at all, and thus the embedded_data feature won't be triggered.
@kpande: I think that @RubenKelevra is engaging in a technical discussion about the code in question. One needn't be an expert to ask questions and try to understand the code.
Everyone is welcome to contribute to the technical discussion as long as it's done respectfully.
@RubenKelevra the embedded block pointers do complicate things slightly. For small data blocks there may be a benefit in doing the compression even if a full sector isn't saved. This is actually also true for metadata blocks. My comment, which @allanjude clarified nicely, was only about the common medium-to-large block case. There are quite a few cases here which need to be considered.
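One possible way to reconcile the two concerns is to exempt results small enough for an embedded block pointer from the full-sector rule. A rough sketch, with both thresholds treated as placeholder assumptions rather than the real ZFS constants:

```c
#include <stddef.h>

/* Stand-in for the embedded-data payload limit (~100 bytes); assumed value. */
#define EMBEDDED_PAYLOAD_MAX	112u
/* Assumed sector size, as discussed above. */
#define ASSUMED_SECTOR_SIZE	4096u

static int
compression_worthwhile(size_t s_len, size_t c_len)
{
	/* Tiny results may fit in the block pointer itself. */
	if (c_len <= EMBEDDED_PAYLOAD_MAX)
		return (1);

	/* Otherwise require at least one full sector of savings. */
	size_t s_sectors = (s_len + ASSUMED_SECTOR_SIZE - 1) / ASSUMED_SECTOR_SIZE;
	size_t c_sectors = (c_len + ASSUMED_SECTOR_SIZE - 1) / ASSUMED_SECTOR_SIZE;
	return (c_sectors < s_sectors);
}
```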
This is an interesting idea. I wonder how we might make it more generic so it can be used for zstd and future compression algorithms as well. Like an 'is_slow' flag on any compression type/level that is slower than 100 MB/s on an average core, or something.
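As a sketch of what such a flag might look like, here is a hypothetical per-algorithm table; the struct, field names, and entries are illustrative and not the existing zio_compress_info_t:

```c
#include <stdint.h>

/*
 * Hypothetical per-algorithm entry: algorithms marked "slow" (roughly
 * under ~100 MB/s per core) would be gated behind the lz4 pre-test.
 */
typedef struct compress_entry {
	const char	*ce_name;
	int		ce_level;
	uint8_t		ce_slow;	/* 1 = gate behind the lz4 heuristic */
} compress_entry_t;

static const compress_entry_t compress_table[] = {
	{ "lz4",	 0, 0 },
	{ "gzip-6",	 6, 1 },
	{ "gzip-9",	 9, 1 },
	{ "zstd-19",	19, 1 },	/* example entry only */
};
```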
@allanjude maybe we should test all compression algorithms on load and flag those which are slow on the system in use? GZIP should be fast when using an extension card; I wonder whether those tests are useful in that case.
All, thanks for the helpful comments. I'm trying some new ways of building my benchmark test environment for this and it's been taking a while to get it all working. @behlendorf I'm hoping some of the benchmarking will help tease out a good lz4 threshold to use. The squash compression benchmark https://quixdb.github.io/squash-benchmark/ has some interesting comparisons between algorithms. If you choose the "x-ray" file and the "s-desktop" processor, it shows that lz4 level 1 gets a compression ratio of 1.00 at a speed over 700 MB/s, while gzip level 6 gets a ratio of 1.40 at a speed of 20 MB/s. It also lists an lz4f algorithm; I'm not sure how that compares to zfs's lz4.

Based on that and other data on the squash site, I think there are 3 cases to consider for the heuristic, based on how well gzip is able to compress a file:

- "Very compressible" case. If gzip can compress a given file beyond a ratio of, say, 1.8 (a rough guess for now), lz4 should also be able to compress the file (though less than gzip). The heuristic will decide the file is "very compressible" and invoke gzip to do full compression. Users see full gzip compression and a minimal speed reduction compared to not using the heuristic.
- "Incompressible" case. If gzip is unable to compress a file to a ratio higher than the zfs minimum of 1.14 (i.e. 87.5%), lz4 will likely provide no compression on the file at all. In this case, the heuristic will skip gzip altogether and save significant CPU time. Users see a significant speedup compared to not using the heuristic. This case is the big win and the reason to use the heuristic.
- "Somewhat compressible" case. These files have a gzip compression ratio between 1.14 and 1.8. Ideally, we would find an lz4 compression ratio, say greater than 1.01, that predicts that gzip would achieve better than the zfs minimum of 1.14. If the lz4 ratio we use is too high compared to the ideal, users get faster and faster speed with less compression as we increase the ratio. If the ratio is too low, users get slower and slower speed (due to more invocations of gzip) but with maximal compression. If we find that tuning the heuristic just doesn't work very well for these somewhat-compressible files, we could consider putting it back on the user and giving them "gzip6" and "gzip6+lz4" compression options. Or we could try to find a better predictor of gzip compressibility.
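Put in code form, the cutoff logic for these cases might look like the following sketch; the 1.01 cutoff is the rough guess above and the function is illustrative, not part of this PR:

```c
#include <stddef.h>
#include <stdbool.h>

/*
 * Decide from a trial lz4 result whether gzip is likely to clear ZFS's
 * 12.5% minimum (ratio ~1.14).  Below the cutoff the block is treated as
 * effectively incompressible and gzip is skipped; above it, gzip runs and
 * the existing 12.5% check discards marginal results.
 */
static bool
lz4_predicts_gzip_worthwhile(size_t s_len, size_t lz4_len)
{
	double lz4_ratio = (double)s_len / (double)lz4_len;

	return (lz4_ratio > 1.01);	/* rough cutoff from the discussion */
}
```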
@rbrewer123 I agree with your analysis. Can you gather some data on the "somewhat compressible" cases that could help us choose the lz4 ratio cutoff, or show us that there is no reasonable cutoff value?
@ahrens, life has been getting in the way lately, but I do intend to collect some data to inform what to do next.
I like your approach to fine-tuning those values, but why not just autotune them based on the actual data? We could start with your numbers on boot, then keep a sliding average of the best value for the compress/no-compress threshold by simply picking some random blocks, say 1 in 10,000, and doing both an lz4 and a gzip compression to gather additional data points.
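As a very rough illustration of that autotuning idea, the sample-and-average step could look something like the sketch below; the sampling rate check, EMA weight, and update rule are all assumptions made up for illustration:

```c
#include <stdlib.h>

/* Starting cutoff guess taken from this thread. */
static double lz4_cutoff = 1.01;

/*
 * Called for blocks where both an lz4 and a gzip trial were run.  Roughly
 * 1 in 10,000 blocks are sampled; the observed ratios nudge the cutoff
 * via an exponential moving average.
 */
static void
maybe_update_cutoff(double lz4_ratio, double gzip_ratio)
{
	if ((rand() % 10000) != 0)
		return;		/* only sample ~0.01% of blocks */

	/*
	 * If gzip cleared the 12.5% minimum (ratio ~1.14), pull the cutoff
	 * toward the lz4 ratio that predicted it; otherwise push it a bit
	 * above that lz4 ratio.
	 */
	double target = (gzip_ratio >= 1.14) ? lz4_ratio : lz4_ratio + 0.01;

	lz4_cutoff = 0.99 * lz4_cutoff + 0.01 * target;
}
```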
What's happening with this? I'd like to use this feature. How can we move this forward?
@softwarecreations in the meanwhile, you should watch the ZSTD PR (#9024), which might be really close these days.
@softwarecreations @ColinIanKing and @softwarecreations
@Ornias1993 I had the same idea already, but never implemented it since zstd alone is much faster than gzip, and lz4 would just waste CPU time here.
Checking whether a block is incompressible is also not that hard without lz4; a statistical approach, for instance checking for a Gaussian distribution, works too.
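For example, a byte-histogram entropy estimate along those lines might look like this sketch; the 7.9-bit cutoff is an arbitrary placeholder and the function is not part of this PR:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Estimate the Shannon entropy of the block's byte histogram and treat
 * near-random data (close to 8 bits/byte) as incompressible.
 */
static int
looks_incompressible(const uint8_t *buf, size_t len)
{
	size_t counts[256] = { 0 };
	double entropy = 0.0;

	for (size_t i = 0; i < len; i++)
		counts[buf[i]]++;

	for (int b = 0; b < 256; b++) {
		if (counts[b] == 0)
			continue;
		double p = (double)counts[b] / (double)len;
		entropy -= p * log2(p);
	}

	return (entropy > 7.9);	/* placeholder threshold */
}
```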
@BrainSlayer I wasn't suggesting anything, just pointing them to a changed link ;) That being said:
Okay, is this PR actually still relevant? Simply put: ZSTD is able to give GZIP-level compression at 50-75% of LZ4 speed; adding any kind of LZ4 heuristic on top would be useless. I think this can quite safely be closed.
IMHO it's fair to treat gzip as deprecated for purposes other than interop.
A note that QAT-accelerated gzip likely has similar problems to ZSTD re: recompression. If a block was originally compressed with software gzip, QAT gzip may compress it differently and result in a different checksum. This may cause L2ARC issues (with compressed_arc disabled), but also cause problems with nopwrite etc. It is likely it would "inherit" the solution we come up with for ZSTD future-proofing.
@allanjude This PR has nothing to do with GZIP in general; please add your note elsewhere, because it's not relevant to this PR at all.
I noticed that this PR hasn't been updated in some time. It seems like this might still have some benefit even with zstd? But it seems that it isn't quite complete, so we're going to close it for now. When you have time to revisit this work, feel free to reopen this PR or open a new one. Thanks for your contribution to OpenZFS and please continue to open PRs to address issues that you encounter!
Description
This change uses lz4 as a heuristic to decide if a block is worth compressing with gzip levels 6-9. If lz4 provides any compression at all, gzip is attempted. Otherwise, the block is assumed incompressible and gzip is skipped.
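In outline, the heuristic reduces to a single predicate on the trial lz4 result. A minimal sketch, assuming the caller has already run a trial lz4 pass (the function name is illustrative, not the code in this PR):

```c
#include <stddef.h>
#include <stdbool.h>

/*
 * `lz4_len` is the length returned by a trial lz4 pass over the block;
 * by convention the ZFS compress callbacks return s_len when they
 * achieve nothing.  The caller would only run gzip 6-9 when this
 * returns true.
 */
static bool
worth_gzipping(size_t s_len, size_t lz4_len)
{
	/*
	 * If lz4 could not shrink the block at all, assume gzip will not
	 * reach ZFS's required 12.5% savings either and skip it.
	 */
	return (lz4_len < s_len);
}
```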
Motivation and Context
The goal of this change is to improve gzip compression speed for incompressible data, without significantly affecting performance on compressible data. #7373 provides additional discussion and rationale.
How Has This Been Tested?
I haven't tested this yet. I will add benchmarks here as I test this. I encourage others with particular workloads to post their own benchmarks.
This change does not properly deal with the QAT hardware compression yet. It might be that, if this heuristic check is viable, it should go in gzip_compress(). The issue with that is that d_len has already been reduced by 12.5%. I could use s_len instead, but it feels error prone. For now I'm posting this for benchmarking purposes.
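For reference, the s_len-based alternative mentioned above would amount to recomputing the 12.5% target from the original size; a minimal sketch (the helper is hypothetical):

```c
#include <stddef.h>

/*
 * Recompute the maximum acceptable compressed size from the original
 * block size, i.e. require at least 12.5% savings: s_len - s_len/8.
 */
static size_t
required_compressed_size(size_t s_len)
{
	return (s_len - (s_len >> 3));
}
```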