Pipelined Implementation of ZSTD_fast (~+5% Speed) #2749
Amusingly, it seems to be a non-trivial performance hit to add in final searches or even hash table insertions during cleanup. So let's not. It doesn't seem to make any meaningful difference in compression ratio.
Unrolling the loop to handle 2 positions in each iteration allows us to reduce the frequency of some operations that don't need to happen at every position. One such operation is the step calculation, which is a very rough heuristic anyway. It's fine if we do this a position later. The other operation is the repcode check. But since the repcode check already tries expanding back one position, we're really not missing much of importance by only trying it every other position. This commit also slightly reorders some operations.
This removes the old `ZSTD_compressBlock_fast_generic()` and renames the new `ZSTD_compressBlock_fast_generic_pipelined()` to replace it. This is functionally a no-op.
It's a bit strange, because this is hitting the dictionary special case where the dictionary is contiguous with the input and still runs in the single-segment path. We should probably change that to hit the `extDict` path instead?
The maintenance complexity of the PR is pretty good, at a tractable level. For additional control, I've been benchmarking this PR on a stable desktop system, using a variety of compilers. These results make this PR a clear gain.
This PR introduces a new implementation of the `ZSTD_fast` parser for single-segment compressions. This new match-finder achieves up to 5% speed improvements, and slightly improves compression ratio on average.

Description
If you squint hard enough (and ignore repcodes), the search operation at any given position is broken into 4 stages:

1. Hash the bytes at the current position.
2. Look up the candidate match index in the hash table, at the address derived from that hash.
3. Load the bytes at the candidate match position.
4. Compare them against the bytes at the current position.
Each of these steps involves a memory read at an address which is computed from the previous step. This means that for each position, these steps must be sequenced and their latencies are cumulative.
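As a rough illustration of that dependency chain, here is a minimal sketch in C. The shapes are made up for illustration: `read32`, `search_one_position`, and the multiplicative hash are stand-ins, not the actual zstd source.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for an unaligned 4-byte read (zstd has its own helpers). */
static uint32_t read32(const uint8_t* p) { uint32_t v; memcpy(&v, p, sizeof v); return v; }

/* Each stage's memory read depends on the result of the previous one,
 * so the latencies of the four stages accumulate serially. */
static int search_one_position(const uint8_t* ip, const uint8_t* base,
                               const uint32_t* hashTable, unsigned hBits)
{
    uint32_t const hash  = (read32(ip) * 2654435761u) >> (32 - hBits); /* stage 1: hash (reads input)      */
    uint32_t const index = hashTable[hash];                            /* stage 2: hash table read         */
    const uint8_t* match = base + index;                               /* stage 3: candidate match address */
    return read32(match) == read32(ip);                                /* stage 3+4: read match, compare   */
}
```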
Originally, `ZSTD_fast` simply did each step sequentially, completing the whole search at one position before starting the next. In #1562, @terrelln changed the implementation to work on two positions at a time. Both strategies are sketched below.
This PR changes to a different strategy of parallelizing the work: it keeps several positions in flight at once, each one stage further along, so that each loop iteration executes a different stage for each in-flight position (see the sketch below).
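A hedged sketch of what such a pipelined loop can look like; the exact staging and pipeline depth in the PR differ, and the `stage*` functions are illustrative stubs:

```c
#include <stdint.h>

static uint32_t stage1_hash(const uint8_t* p)                { (void)p; return 0; }
static uint32_t stage2_lookup(uint32_t h)                    { (void)h; return 0; }
static uint32_t stage3_load(uint32_t idx)                    { (void)idx; return 0; }
static int      stage4_compare(uint32_t m, const uint8_t* p) { (void)m; (void)p; return 0; }

/* Software pipelining: in each iteration, position N is at stage 3-4
 * while position N+1 is at stage 1-2, so their reads overlap. */
static void pipelined(const uint8_t* ip, const uint8_t* iend)
{
    /* Prime the pipeline: stages 1-2 for the first position. */
    uint32_t h   = stage1_hash(ip);
    uint32_t idx = stage2_lookup(h);
    for (; ip + 1 < iend; ip++) {
        uint32_t const m = stage3_load(idx); /* stage 3, position N               */
        h = stage1_hash(ip + 1);             /* stage 1, position N+1, issued     *
                                              * while the stage-3 load is pending */
        if (stage4_compare(m, ip))           /* stage 4, position N               */
            return;                          /* match found: dump the pipeline    */
        idx = stage2_lookup(h);              /* stage 2, position N+1             */
    }
}
```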
This is very much analogous to the pipelining of execution in a CPU, and has the same benefits and drawbacks. This approach appears to more successfully parallelize read latencies.
However, just like a CPU, we have to dump the pipeline when we find a match (take a branch). When this happens, we throw away our current state, record the match, and then re-prime the pipeline before re-entering the loop.
This is also the work we do at the beginning to enter the loop initially.
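A minimal sketch of that priming step, again with illustrative stub stages rather than the real helpers:

```c
#include <stdint.h>

static uint32_t stage1_hash(const uint8_t* p) { (void)p; return 0; }
static uint32_t stage2_lookup(uint32_t h)     { (void)h; return 0; }

/* The steady-state loop consumes stage-1/stage-2 results produced one
 * iteration ahead, so the leading stages for the first position must
 * be computed up front -- both at initial entry and after a match. */
static void prime_pipeline(const uint8_t* ip,
                           uint32_t* h_out, uint32_t* idx_out)
{
    *h_out   = stage1_hash(ip);       /* stage 1 for the first position */
    *idx_out = stage2_lookup(*h_out); /* stage 2 for the first position */
    /* The main loop can now re-enter at stage 3 for this position. */
}
```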
In addition to this broad rearchitecture, various implementation details are tweaked to coax the best performance possible. This includes only storing 2 hash variables (e2afc28), tweaking the step calculation (687c591), and slightly reordering some operations.
Parsing Differences
This PR parses slightly differently than the current strategy.
A big change is that the sensitivity to the acceleration factor derived from negative compression levels is greatly increased. In Nick's implementation, the step was applied only every other advance (each pair of searches in a loop iteration remained 1 byte apart). Here, we return to the pre-#1562 behavior of applying the step between every search, as sketched below.
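A simplified sketch of the two behaviors; `search` and the shape of `step` handling are assumed for illustration only:

```c
#include <stddef.h>
#include <stdint.h>

static void search(const uint8_t* p) { (void)p; }

/* #1562-style: the pair of searches in an iteration stays 1 byte
 * apart, so the step is only applied once per pair of positions. */
static void step_per_pair(const uint8_t* ip, const uint8_t* iend, size_t step)
{
    while (ip + 1 < iend) {
        search(ip);
        search(ip + 1);
        ip += step;
    }
}

/* This PR (pre-#1562 behavior): the step is applied between every
 * search, so higher acceleration skips positions more aggressively. */
static void step_per_search(const uint8_t* ip, const uint8_t* iend, size_t step)
{
    while (ip < iend) {
        search(ip);
        ip += step;
    }
}
```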
Benchmarks
As expected given the above discussion, this is comparatively faster on less compressible inputs, but roughly neutral on more compressible inputs, especially those with short matches. Here are some benchmarks:
As of 69b8ee9 ("Initial Pipelined Implementation for ZSTD_fast"):
(Benchmarked on an Intel Xeon E5-1650 v3 @ 3.50GHz.)
As of e2afc28 ("Nit: Only Store 2 Hash Variables"):
(Benchmarked on an Intel Xeon E5-1650 v3 @ 3.50GHz.)
(Benchmarked on a Raspberry Pi 4 aka "Broadcom BCM2711 SoC with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor".)
As of 687c591 ("Tweak Step"):
(Benchmarked on an Intel Xeon E5-1650 v3 @ 3.50GHz.)
(Benchmarked on a Raspberry Pi 4 aka "Broadcom BCM2711 SoC with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor".)
Extended Benchmark on Multiple Compilers, Levels, and Corpuses:
(Benchmarked on an Intel Xeon E5-2680 v4 @ 2.40GHz.)
Status
I am satisfied with this PR and feel that it is ready to merge.
To-Do:
- Search at the end of the block. (Decided not to.)