Reduce size of dctx by reutilizing dst buffer #2751
Conversation
Still fixing some fuzzer errors that are blocking CI; updated performance numbers below (all in MB/s).
I presume you are using the synthetic content generator. Also, look at individual files (not the whole package) in the corpus.
The memory reductions are nice! I took a quick look at this on dev. The stats unfortunately are not that helpful, since instructions per cycle decreases without a very clear culprit, but they seem to indicate a 13-14% decompression speed regression for me.
for ( ; ; ) {
    seq_t sequence = ZSTD_decodeSequence(&seqState, isLongOffset);
    size_t oneSeqSize;
    if (litPtr + sequence.litLength > dctx->litBufferEnd)
One thing that might be helpful is looking at the perf annotations, with perf record --call-graph=dwarf -e cycles:ppp -- ./zstd [...]. I see the following here: [perf annotation output not captured in this transcript]. Which means this branch is maybe the culprit for some extra cycles? If I annotate this condition with LIKELY, then for some reason I get a 5-6% speedup in decompression (900 -> 950 MB/s on enwik7). So maybe it's worth investigating this branch/loop in particular? This loop is the hot loop in decompression, and is pretty sensitive to any changes.

EDIT: Given that the LIKELY condition doesn't seem to make much sense, it might have something to do with alignment?
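For readers following along, a minimal sketch of the kind of annotation being described. zstd defines LIKELY/UNLIKELY in lib/common/compiler.h roughly as below; the loop shape mirrors the diff above and is illustrative, not the PR's exact code:

```c
/* Branch-prediction hints, roughly as defined in zstd's lib/common/compiler.h */
#if defined(__GNUC__)
#  define LIKELY(x)   (__builtin_expect(!!(x), 1))
#  define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Annotated version of the hot-loop branch from the diff above: */
for ( ; ; ) {
    seq_t sequence = ZSTD_decodeSequence(&seqState, isLongOffset);
    size_t oneSeqSize;
    if (LIKELY(litPtr + sequence.litLength > dctx->litBufferEnd)) {
        /* split-buffer path: literals cross the end of the current buffer */
    }
    /* ... rest of the loop unchanged ... */
}
```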
Updated: fixed all the fuzzer errors and rebased onto the current version; there is still one more error in the stream fuzzer (zstreamtest) that should not affect these performance results. Shifted the alignment and annotations; performance on enwik at compression level 1 now looks good, as well as across the rest of the silesia corpus. However, performance at high compression levels seems to be lagging.
In general, high compression levels tend to generate more and smaller sequences, on top of far fewer literals, meaning that the resulting literals buffer will be pretty small, and the number of bytes per literals read will likely be very small too. A way to check that would be to print the number of sequences and the number of literals decoded (see the sketch below). Anyway, this could be a hint for analyzing the performance of your modification.
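A hedged sketch of such instrumentation: DEBUGLOG is zstd's internal debug macro (lib/common/debug.h), active when built with a high enough DEBUGLEVEL; the names nbSeq and litSize stand in for whatever per-block counts are in scope at the chosen spot, e.g. inside ZSTD_decompressBlock_internal:

```c
/* Temporary instrumentation: log per-block sequence and literal counts.
 * Build with e.g. `make CFLAGS=-DDEBUGLEVEL=6` for DEBUGLOG(6, ...) to fire.
 * nbSeq / litSize are placeholders for the locals actually in scope. */
DEBUGLOG(6, "decoded block: %d sequences, %u literal bytes",
         nbSeq, (unsigned)litSize);
```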
const BYTE* const prefixStart = (const BYTE*) (dctx->prefixStart);
const BYTE* const vBase = (const BYTE*) (dctx->virtualStart);
const BYTE* const dictEnd = (const BYTE*) (dctx->dictEnd);
unsigned litInDst = litPtr >= ostart && litPtr < oend; |
There is an opportunity here to reduce litInDst. What you want to avoid is overwriting the literals stored in dst. Knowing that you decode a block, you also know that this block can't be larger than maxBlockSize. You may not formally know maxBlockSize at this point, since it depends on windowSize, but you could simply assume that it's necessarily <= 128 KB.

Here oend is the end of the dst buffer. When this buffer is large, it's much farther away than 128 KB. So you could instead calculate a maxBlockEnd pointer, and compare litPtr to that. This should reduce the number of times litInDst == 1, thus using the simpler & faster decoding loop more often.
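A minimal sketch of that suggestion, using the names from the surrounding diff (MIN and ZSTD_BLOCKSIZE_MAX are existing zstd definitions; where exactly this would land in the PR is not shown here):

```c
/* A block can write at most ZSTD_BLOCKSIZE_MAX (128 KB) bytes past the
 * current output position, so literals further away than that cannot be
 * overwritten by this block. Bound the test by that distance instead of
 * by the end of the whole dst buffer. */
BYTE* const maxBlockEnd = op + MIN((size_t)(oend - op), (size_t)ZSTD_BLOCKSIZE_MAX);
unsigned const litInDst = (litPtr >= ostart) && (litPtr < maxBlockEnd);
```

With a large dst, oend can be many megabytes past op, so comparing against maxBlockEnd keeps litInDst == 0 (and the fast loop active) far more often.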
Some further performance improvements: the new version fairly consistently wins at lower compression levels, and now has some wins and losses at higher compression levels across enwik and silesia.
@@ -99,6 +100,7 @@ size_t ZSTD_decodeLiteralsBlock(ZSTD_DCtx* dctx,
    U32 const lhlCode = (istart[0] >> 2) & 3;
    U32 const lhc = MEM_readLE32(istart);
    size_t hufSuccess;
    size_t expectedWriteSize = MIN(ZSTD_BLOCKSIZE_MAX, dstCapacity);
This is the cause of the last remaining streaming fuzzer error. It appears that in some cases, having the lit buffer stored ZSTD_BLOCKSIZE_MAX beyond dst is enough for it to stomp the extended dictionary in streaming mode, so if a match is at a fairly long offset and points into this area, it can copy incorrectly from the lit buffer. I had thought that ZSTD_BLOCKSIZE_MAX was guaranteed not to be in the range where we needed the extDict from previous offsets; still investigating what the correct safe offset should be here.
So is this fixed?
Yes, the fuzzer error was fixed by the increased streaming buffer size.
All fuzzing test issues have been fixed, and CI tests now pass. Performance looks generally good on Mac, with consistent performance improvements on top of the memory savings across enwik7 and silesia at default compression levels. At maximum compression levels there are some wins and some losses. On the devserver there were initially some stark losses; adjustments to alignment across the two loops and to the scratch buffer size have improved these losses, but they are still present.
@@ -106,6 +106,8 @@ typedef struct {
    size_t ddictPtrCount;
} ZSTD_DDictHashSet;

#define ZSTD_LITBUFFEREXTRASIZE 8192 /* extra buffer reduces amount of dst required to store litBuffer */
8 KB is a significant budget. Ideally, I would have preferred this permanent internal buffer to be reduced to something like 512 bytes (or less). But hey, if you state that this large size is necessary because it otherwise impacts performance, there is room for discussion.
I don't know that it is necessary; bumping it up was enough to get several cases that otherwise wouldn't have to run entirely within the buffer, but it's not 100% clear that that is a requirement for adequate performance.
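To make the trade-off concrete, here is a loose sketch of the kind of buffer-selection logic this constant feeds; the field name litExtraBuffer and the exact split policy are assumptions for illustration, not necessarily the PR's code:

```c
/* Illustrative only: small literal payloads fit entirely in the dctx's
 * fixed scratch buffer, keeping the fast single-buffer decode loop in
 * play; larger payloads must borrow space at the tail of dst instead. */
if (litSize <= ZSTD_LITBUFFEREXTRASIZE) {
    dctx->litBuffer = dctx->litExtraBuffer;   /* permanent internal scratch */
} else {
    dctx->litBuffer = (BYTE*)dst + expectedWriteSize - litSize;  /* tail of dst */
}
```

A larger ZSTD_LITBUFFEREXTRASIZE therefore trades permanent dctx memory for how often the first, cheaper branch is taken.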
I've been testing this PR on my (stable) desktop. Corpus tested: individual components of […] (the results table was not captured in this transcript). I haven't tested more compilers nor more corpora, but these regressions are large enough by themselves to warrant an investigation.
#if defined(__GNUC__) && defined(__x86_64__)
    /* Align the decompression loop to 32 + 16 bytes.
This was a relatively useful and informative comment.
Unless you have reason to believe that the comment is misleading or useless,
please preserve this knowledge in the source code.
Ah thanks, I had intended to move it slightly, not remove it.
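For context, this kind of loop alignment is typically expressed with inline asm padding directives; a sketch of the general technique (zstd's actual directives and byte counts may differ):

```c
#if defined(__GNUC__) && defined(__x86_64__)
    /* Pad so the hot decompression loop starts 16 bytes past a 32-byte
     * boundary: first align to 32 bytes, then emit 16 one-byte nops. */
    __asm__(".p2align 5");                               /* 2^5 = 32-byte align */
    __asm__("nop; nop; nop; nop; nop; nop; nop; nop");   /* 8 bytes of padding  */
    __asm__("nop; nop; nop; nop; nop; nop; nop; nop");   /* +8 more = 16 bytes  */
#endif
```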
I've unwrapped one of the loops in a manner that should help the compiler optimize better for the dev server; initial performance tests look significantly better than before, but overall performance on devvm still isn't quite where I'd like it. macOS performance remains good after the changes.
Good news! Just a note: on other occasions, keeping commits separated is actually better for review. Unfortunately, this is not a trivial clear-cut rule; one has to use one's own judgment. In this case, for example, I'm unable to tell what "unwrapped one of the loops" means, as this design choice is swamped inside the large PR. So if it's a point worth reviewing, it's better to make it stand out by giving it its own commit.
Median of 5 runs each (there was a bit of variation). As before, performance looks favorable on macOS. On devvm, it looks generally comparable across gcc, gcc.par, and clang.par, with some tradeoffs at higher compression levels. On devbig there are still regressions, mainly on gcc and gcc.par.
@@ -121,8 +123,28 @@ size_t ZSTD_decodeLiteralsBlock(ZSTD_DCtx* dctx,
        litCSize = (lhc >> 22) + ((size_t)istart[4] << 10);
        break;
    }
    RETURN_ERROR_IF(litSize > 0 && dst == NULL, dstSize_tooSmall, "NULL not handled");
Question: is this scenario possible? For example, could it happen if the user were to provide a NULL pointer as destination? Or should it normally never happen, for example because the code calling this function ensures it never provides NULL as dst?

If it's the second case, this condition is rather an assert(), and it should also be clearly documented, so that the caller knows that this scenario is not allowed. It seems like a small detail (is it an error? or is it forbidden?), but this kind of decision trickles down throughout the whole code, so it's important to be familiar with the logic.
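To illustrate the distinction being asked about, a minimal sketch of the two alternatives (the surrounding context is assumed from the diff above):

```c
/* Alternative 1: NULL dst is a runtime error the library reports. */
RETURN_ERROR_IF(litSize > 0 && dst == NULL, dstSize_tooSmall, "NULL not handled");

/* Alternative 2: NULL dst is a contract violation the caller must never
 * commit; document the precondition and enforce it in debug builds only. */
assert(dst != NULL || litSize == 0);
```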
There's a fuzzer test that passes NULL and expects this error response; this check was added to maintain that behavior:
https://github.com/facebook/zstd/blob/dev/tests/fuzzer.c#L700
nit: I would rewrite this check as litSize > dstCapacity, which is more general (see the sketch below).

Edit: the check below, expectedWriteSize < litSize, should cover this case.
Benchmarks on Yann's private server after separating the execSequence functions into split-buffer versions and recalculating alignments across the three loops. The difference between the 16k and max-size lit buffers is not that large; however, the overall gains from the changes do not reach the full parity that was hoped for. I attempted some benchmarks after equalizing the wildcopy, safecopy, and copy16 behavior as well, and they did not make much difference.

16k buffer, gcc-7
128k buffer, gcc-7
16k buffer, gcc-8
128k buffer, gcc-8
16k buffer, gcc-9
128k buffer, gcc-9
16k buffer, gcc-10
128k buffer, gcc-10
16k buffer, clang-12
128k buffer, clang-12
On Nick's suggestion, moved the split branch up from where it would result in inlined paths, and performance now looks much better across both the 16k and 128k cases on Yann's server; there is a failure in the fuzz and regression tests that I am still trying to debug, as well as benchmarks still to gather on Nick's server.

16k buffer, gcc-7
128k buffer, gcc-7
16k buffer, gcc-8
128k buffer, gcc-8
16k buffer, gcc-9
128k buffer, gcc-9
16k buffer, gcc-10
128k buffer, gcc-10
16k buffer, clang-12
128k buffer, clang-12
After fixes and some testing, the 32k buffer case looks performant for the most part and still gets a factor-of-4 memory improvement, except on level 22 enwik8/9 in the full benchmark on the dev server for gcc and clang.par, where it shows some significant regressions (this is the case that exercises the long-match loop). It is noteworthy that the same case gets significant benefits on gcc.par. Increasing the lit extra buffer does not address this; I am continuing to investigate.

16k buffer, gcc-7 nondev server
32k buffer, gcc-7 nondev server
128k buffer, gcc-7 nondev server
16k buffer, gcc-8 nondev server
32k buffer, gcc-8 nondev server
128k buffer, gcc-8 nondev server
16k buffer, gcc-9 nondev server
32k buffer, gcc-9 nondev server
128k buffer, gcc-9 nondev server
16k buffer, gcc-10 nondev server
32k buffer, gcc-10 nondev server
128k buffer, gcc-10 nondev server
16k buffer, clang-12 nondev server
32k buffer, clang-12 nondev server
128k buffer, clang-12 nondev server
32k buffer, gcc.par devbig server
32k buffer, gcc devbig server
32k buffer, clang.par devbig server
Fixed the regressions on enwik8/9 by improving the long-match function to avoid use of the split exec functions where possible.

32k buffer, gcc.par devbig server
32k buffer, gcc devbig server
32k buffer, clang.par devbig server
Added alignment for gcc11 and increased the buffer size to 64k, which for our corpora seems to perform as well as the full 128k buffer, with some memory savings.

64k buffer, gcc7 yann server
64k buffer, gcc8 yann server
64k buffer, gcc9 yann server
64k buffer, gcc10 yann server
64k buffer, clang12 yann server
64k buffer, gcc11 nick server
64k buffer, clang12 nick server
These results look good to me.
Updated devbig server tests with the 64k buffer and alignment changes (which should not make a difference, since gcc11 is not used on devbig).

64k buffer, gcc.par devbig server
64k buffer, gcc devbig server
64k buffer, clang.par devbig server
Awesome! The performance looks good to me!
WIP: this round of optimizations has gotten performance much closer to parity, though it has introduced a checksum error in the 270MB file test that I'm still tracking down. This hasn't affected the smaller-size tests; benchmarks indicate that in some cases we now see performance improvements on top of the memory reduction, thanks to the improved cache behavior. However, there are other cases, at small file sizes and high compressibility, where we are still about 1% behind parity.
Benchmark
old performance

(All runs: ./tests/fullbench -b2 -B<sample size> -P<compressibility %>; scenario "2#decompress"; results in MB/s.)
*** Zstandard speed analyzer 1.5.0 64-bits, by Yann Collet (Aug 19 2021) ***

Sample size |      P0 |    P10 |    P50 |    P90 |    P100
       1000 |  4987.5 |  612.0 |  585.8 | 2597.8 |  2635.7
      10000 | 36167.5 | 1292.4 | 1671.9 | 3205.2 |  6179.7
     100000 | 51880.0 | 1237.1 | 2151.4 | 3193.0 |  7095.5
    1000000 | 34106.0 | 1309.3 | 1973.5 | 2637.8 | 14852.5
new performance

(Same commands and analyzer banner as above.)

Sample size |      P0 |    P10 |    P50 |    P90 |    P100
       1000 |  4999.4 |  609.1 |  583.5 | 2402.1 |  2587.4
      10000 | 37441.8 | 1297.5 | 1656.7 | 3081.0 |  6127.2
     100000 | 52215.9 | 1252.2 | 2146.6 | 3614.6 |  7084.7
    1000000 | 33857.1 | 1288.9 | 2095.4 | 2786.2 | 15258.4