
patch-from speed optimization #3545

Merged

Conversation

daniellerozenblit
Contributor

@daniellerozenblit daniellerozenblit commented Mar 10, 2023

TLDR

This PR is a response to issue #2189, which requests a speedier version of --patch-from compression.

This PR offers a solution: load only a suffix of the dictionary into the normal match finders, rather than the entire dictionary. The entire dictionary is still loaded into the LDM match finders.

We load into the normal match finders only the portion that can reasonably be indexed by our hash tables: 8 * (1 << max(hashLog, chainLog)) bytes. Note that the 8 here is an arbitrary multiplier that shows good results (I also experimented with 4 as a multiplier, which was slightly faster but seemed to perform worse on the zstd regression test). This feature is disabled for strategies >= ZSTD_btultra.
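The suffix-capping idea can be sketched as follows. This is a hypothetical illustration, not the actual zstd code; the function name and signature are invented for clarity:

```c
#include <stddef.h>

/* Hypothetical sketch of the idea above (not the actual zstd code):
 * the regular match finders can usefully index at most 8x the larger
 * of the hash and chain table sizes, so only that many trailing bytes
 * of the dictionary are loaded into them. The LDM match finders still
 * see the whole dictionary. */
static size_t suffix_to_load(size_t dictSize, unsigned hashLog, unsigned chainLog)
{
    unsigned const tableLog = hashLog > chainLog ? hashLog : chainLog;
    size_t const maxDictSize = (size_t)8 << tableLog;  /* 8 * (1 << tableLog) */
    return dictSize < maxDictSize ? dictSize : maxDictSize;
}

/* The normal match finders would then index only the last
 * suffix_to_load(...) bytes:
 *   const char* suffix = dict + dictSize - suffix_to_load(dictSize, hLog, cLog);
 */
```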

This optimization offers great improvements on compression speed, with very minimal increase in patch size.

Credit and thanks to @terrelln for the optimization idea.

Benchmarking

I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz with core isolation and turbo disabled.

I benchmarked --patch-from compression on the linux kernel tree tarball v6.0 -> v6.2. For speed measurements, I ran each scenario five times interleaved and chose the highest result.

Compression Speed

There are significant improvements in compression / patch-creation speed across a range of compression levels. These speed improvements are especially pronounced at higher compression levels (e.g. a ~617.6% speedup at compression level 15).

[Screenshot: compression-speed benchmark results across compression levels]

Patch Size

There is some increase in patch size across compression levels. The increase grows with compression level but remains fairly minimal (e.g. a ~0.47% increase at compression level 15).

[Screenshot: patch-size benchmark results across compression levels]

@daniellerozenblit daniellerozenblit marked this pull request as ready for review March 10, 2023 14:32
@daniellerozenblit daniellerozenblit marked this pull request as draft March 10, 2023 14:38
@daniellerozenblit daniellerozenblit marked this pull request as ready for review March 10, 2023 15:02
@daniellerozenblit daniellerozenblit marked this pull request as draft March 10, 2023 15:06
@daniellerozenblit daniellerozenblit marked this pull request as ready for review March 10, 2023 18:09
}

/* If the dict is larger than we can reasonably index in our tables, only load the suffix. */
{ U32 maxDictSize = 8U * (1U << MIN(MAX(params->cParams.hashLog, params->cParams.chainLog), 29));
Contributor

@Cyan4973 Cyan4973 Mar 10, 2023


This is a bit too much.
8 * (1<<29) is equivalent to 1<<32, so this will overflow on 32-bit values.
1<<31 (2 GB) shall be the max; it is (currently) our window size limit anyway.
8 * (1 << MIN(MAX(...), 28)) should do the trick.
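The overflow can be seen with a tiny illustration, assuming 32-bit unsigned arithmetic (the helper name is invented for this example):

```c
#include <stdint.h>

/* Illustration of the overflow: with 32-bit unsigned arithmetic,
 * 8 * (1 << 29) is 2^32, which wraps around to 0. Capping the shift
 * at 28 keeps the product at 2^31 (2 GB), the current window-size
 * limit mentioned above. */
static uint32_t scaled_table_size(unsigned tableLog)
{
    return 8u * (1u << tableLog);
}
/* scaled_table_size(29) wraps to 0; scaled_table_size(28) == 1u << 31 */
```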

Contributor Author


Oops, silly mistake! Thanks for catching this.

@Cyan4973
Contributor

Cyan4973 commented Mar 10, 2023

Great results @daniellerozenblit !

Is it practical to test this patch at compression level --ultra -22?

The trade-off achieved here is very good, and most users will gladly trade a little compression ratio for a lot of speed. Furthermore, if they really want more compression, they could still increase the compression level and get a better ratio while remaining faster than before (this could be shown effectively with a speed/ratio graph).

But users of level 22 are typically ready to sacrifice speed to get the best possible compression ratio. And a 0.5% ratio regression would be a lot for them.
However, I suspect that your PR actually doesn't impact this case. Yet, it would be better to measure it.

@daniellerozenblit daniellerozenblit force-pushed the patch-from-speed-optimization branch 2 times, most recently from f086cfe to 7cba253 Compare March 13, 2023 18:54
@Cyan4973
Contributor

Cyan4973 commented Mar 13, 2023

  1. Could we get some benchmark results from the latest changes for levels 1, 19 and 22, please?
    (Note: not necessarily graphs; just the numbers would be enough.)

  2. Could you test multithreading too (even if it's just -T2)?
    The point is to show that these optimizations remain perfectly valid with multi-threading enabled.

At this stage, we just want to confirm the results and show that they are globally positive.

The code looks good to me, changes are sufficiently simple to be properly reviewed.

@terrelln
Contributor

Could you test multithreading too (even if it's just -T2) ?

And also test explicit single-threading, if you are using the CLI, with --single-thread.

@terrelln
Contributor

The code LGTM once we have the benchmark results for the mentioned scenarios!

@daniellerozenblit
Contributor Author

Additional Benchmarking

The results described earlier are consistent across threading scenarios: --single-thread, -T2, and the default. However, with --single-thread the optimization costs more compression ratio.

Single Thread

For single-threaded compression, speed improvements are consistent with previous results. However, there is a larger loss in compression ratio across compression levels (~4%).
[Screenshot: --single-thread benchmark results]

T2

Results for -T2 are very consistent with the default scenario; the speed improvements even appear slightly larger.
[Screenshot: -T2 benchmark results]

Levels 1, 19, 22

There are no significant changes in either compression ratio or speed for levels 1, 19, and 22 in the default scenario.
[Screenshot: levels 1, 19, 22 benchmark results]

@terrelln
Contributor

Awesome!
