
New algorithms for the long distance matcher #2483

Merged: 4 commits on Feb 11, 2021

Conversation

@mpu (Contributor) commented on Feb 4, 2021

This PR proposes to replace the hashing algorithm used in the long distance matcher (LDM). The replacement proposed is a combination of gear hash and xxhash that offers significant speedups at low compression levels with no measurable regressions on compression rates.

Overview

The original rolling hash algorithm was used for two purposes: first, find split points in the input, and second, compute a checksum over a small window of data. In the new code these two objectives are realized by two different faster algorithms: split points are determined using a gear hash algorithm, and checksums are computed with xxhash. This combination is motivated by the fact that gear hash is a fine content-defined chunking (CDC) algorithm but a very poor checksumming algorithm, and xxhash is a fast checksumming algorithm unsuitable for CDC.

Even greater speed might be achieved by moving to a threaded gear hash, but this requires using recent SIMD instructions (AVX2) and dynamic dispatch based on CPUID. There is currently no precedent for such techniques in zstd, and I did not have the time budget to pursue this engineering task.

Code

The changes I propose make use of a couple of low-level performance tricks:

  1. Test for zero bits in the rolling hash instead of ones
  2. Mark split criterion branches as UNLIKELY
  3. Unroll the gear hash inner loop to reduce the loop overhead
  4. Process several split points at once and prefetch the corresponding hash table buckets

The tricks are listed roughly in order of impact.

To help the review, note that `ip` is now `minMatchLength` bytes ahead of where it was in the previous version of the code.

The gear hash constants were generated by my computer's pseudo-random number generator /dev/urandom.

Benchmarks

The baseline I used is the current dev branch (f5b3f64). Each configuration is run 5 times and the best timing is used. Deflate deltas are computed as the difference between the compression ratios, in percentage points (higher is better).

| FILE | CONFIG | DEFLATE Δ | TIME Δ |
| --- | --- | --- | --- |
| hhvm-rt.tar | --long=27 -1 | +00.02 | -33.76% |
| l1m.tar | --long=27 -1 | +00.00 | -34.03% |
| l1y.tar | --long=27 -1 | +00.00 | -34.12% |
| l5.tar | --long=27 -1 | +00.00 | -34.38% |
| hhvm-rt.tar | --long=27 -3 | +00.02 | -26.97% |
| l1m.tar | --long=27 -3 | +00.01 | -25.54% |
| l1y.tar | --long=27 -3 | +00.01 | -23.93% |
| l5.tar | --long=27 -3 | +00.01 | -23.73% |
| hhvm-rt.tar | --long=27 -8 | +00.01 | -08.00% |
| l1m.tar | --long=27 -8 | +00.00 | -10.00% |
| l1y.tar | --long=27 -8 | +00.00 | -09.78% |
| l5.tar | --long=27 -8 | +00.00 | -08.48% |
| hhvm-rt.tar | --long=30 -1 | +00.01 | -34.81% |
| l1m.tar | --long=30 -1 | -00.01 | -41.10% |
| l1y.tar | --long=30 -1 | -00.03 | -39.76% |
| l5.tar | --long=30 -1 | +00.01 | -34.18% |
| hhvm-rt.tar | --long=30 -3 | +00.01 | -26.68% |
| l1m.tar | --long=30 -3 | -00.01 | -33.51% |
| l1y.tar | --long=30 -3 | -00.02 | -32.22% |
| l5.tar | --long=30 -3 | +00.00 | -22.48% |
| hhvm-rt.tar | --long=30 -8 | +00.00 | -08.26% |
| l1m.tar | --long=30 -8 | -00.01 | -16.05% |
| l1y.tar | --long=30 -8 | -00.03 | -14.00% |
| l5.tar | --long=30 -8 | +00.01 | -07.37% |

@Cyan4973 (Contributor) commented on Feb 4, 2021

Thanks @mpu!
These are impressive results!

@Cyan4973 (Contributor) commented on Feb 4, 2021

I see there are some remaining minor warnings, notably a minor silent cast warning, but assuming it gets fixed, this PR looks good to me.

@terrelln (Contributor) left a comment

Looks good to me, I would just like to move the large arrays out of the stack frame.

We really care about stack space in kernel environments. But even outside the kernel, zstd runs in stack-constrained environments such as fibers, and on threads that users have configured with smaller stacks.

BYTE const* const base = ldmState->window.base;
BYTE const* const istart = ip;
ldmRollingHashState_t hashState;
size_t splits[LDM_LOOKAHEAD_SPLITS];

Can this be moved into the ldmState_t? This is using 512B of stack space.

Comment on lines 322 to 328
size_t splits[LDM_LOOKAHEAD_SPLITS];
struct {
BYTE const* split;
U32 hash;
U32 checksum;
ldmEntry_t* bucket;
} candidates[LDM_LOOKAHEAD_SPLITS];

Same here: Can both of these be moved into the LDM state as well? This is using 2KB of stack space.

@terrelln (Contributor) left a comment

Looks good to me!

@mpu (Contributor, Author) commented on Feb 11, 2021

FYI, for completeness I re-ran the entire evaluation on the latest commit and got results nearly identical to the ones in the PR description.

@Cyan4973 (Contributor) commented:
Thanks for this excellent speed improvement @mpu !
