Optimal huf depth #3285
Conversation
    return minBits;
}

unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue, void* workSpace, size_t wkspSize, HUF_CElt* table, const unsigned* count, HUF_depth_mode depthMode)
Some interface implementation detail: when expressed this way, HUF_CElt* table implies that table is an expected output of the function. But it's not. Effectively, table is only provided as a kind of temporary workspace; anything it may contain is just thrown away afterwards.
How to make the difference? Well, to follow the established convention, table should not exist as a separate parameter, but be blended inside workSpace.
Now, I appreciate that it's probably easier to employ workSpace and HUF_CElt* separately, because that's how they exist on the caller side, and trying to bundle them together in a single workSpace might end up being messier for the caller. So okay, implementation complexity is a criterion.
In which case, please document clearly that table is just a specialized "workspace", not an expected output of the function.
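One possible way to make this explicit in the code (a sketch of a doc comment, not necessarily the wording the PR ends up using):

/* HUF_optimalTableLog():
 * Returns the table log that minimizes total (header + encoded) size
 * for high-compression strategies.
 * @table is used purely as scratch space for the trial compressions;
 *        its contents are NOT a meaningful output of this function.
 * @workSpace / @wkspSize describe the regular workspace buffer. */
unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue,
                             void* workSpace, size_t wkspSize,
                             HUF_CElt* table, const unsigned* count,
                             HUF_depth_mode depthMode);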
lib/compress/huf_compress.c
Outdated
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
    U32 minBits = minBitsSrc < minBitsSymbols ? minBitsSrc : minBitsSymbols;
    if (minBits < FSE_MIN_TABLELOG) minBits = FSE_MIN_TABLELOG;
This minimum restriction, FSE_MIN_TABLELOG, is specific to FSE. It's not necessary for Huffman. Consequently, this line can be dropped.
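For illustration, a sketch of the function with that clamp removed (based on the snippet above; not necessarily the exact final code):

unsigned HUF_minTableLog(size_t srcSize, unsigned maxSymbolValue)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
    /* No FSE_MIN_TABLELOG floor: that restriction applies to FSE tables,
     * not to Huffman tables. */
    return minBitsSrc < minBitsSymbols ? minBitsSrc : minBitsSymbols;
}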
lib/compress/huf_compress.c
Outdated
unsigned HUF_minTableLog(size_t srcSize, unsigned maxSymbolValue)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
The little detail that matters: note that, in this formula, + 2 is itself a heuristic. That's because this code is taken from FSE_minTableLog(), where it's presumed to be employed in combination with a fast heuristic. Consequently, one of the goals was to avoid providing too low a minimum, which would end up being, in general, a rather poor candidate.
But now, with this new brute-force strategy, as long as a candidate is sometimes the better one, it's worth investigating. So it's possible to change this formula to + 1, which is the real minimum.
(Note: the real minimum should actually be based on the cardinality of the distribution, for which maxSymbolValue is merely a cheap upper bound.)
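Concretely, the suggested change amounts to something like this (a sketch assuming the + 1 variant, not necessarily the exact code merged):

unsigned HUF_minTableLog(size_t srcSize, unsigned maxSymbolValue)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    /* +1 instead of +2: with the brute-force search, the true minimum is
     * worth testing even if it is rarely the winning candidate.
     * (Ideally this would use the distribution's cardinality;
     * maxSymbolValue is only a cheap upper bound.) */
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 1;
    return minBitsSrc < minBitsSymbols ? minBitsSrc : minBitsSymbols;
}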
I tried that, with a quick test on silesia.tar.
The new low limit allows finding a few more bytes here and there, achieving an additional 1-2 KB of savings at higher compression modes. Not big in absolute value, but relative to the existing savings it improves this strategy by ~20%, so that's a non-negligible contributor.
Where it shines, though, is in combination with small blocks. For example, when cutting silesia.tar into blocks of 1 KB, it achieves additional savings of almost 300 KB. Not a mistake. So it's a game changer for this scenario.
The downside of this strategy is that there is now one more distribution to test, so it's even slower. That could imply revisiting the algorithm's triggering threshold.
Alternatively, it could also provide an incentive to invest time into a more optimized method that requires less CPU effort, which, I believe, is possible without loss of compression ratio.
Maybe this could be dealt with in a follow-up...
I'm somewhat concerned by the speed delta measured for small blocks (…)
I am also concerned about the large speed delta. Considering the additional wins that you found after modifying (…)
lib/compress/huf_compress.c
Outdated
unsigned HUF_minTableLog(size_t srcSize, unsigned symbolCardinality)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
I would simplify this function.
The reason srcSize was previously part of the formula is that maxSymbolValue used to be a generous upper bound on the real cardinality. Using srcSize as a second estimator would matter in cases where srcSize < maxSymbolValue, producing a tighter upper-bound estimate for those cases.
But now that we have the real cardinality, there is no need for another approximate upper bound. Instead, derive the result from symbolCardinality directly.
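A sketch of the simplified form being suggested (hypothetical; the exact final signature may differ):

unsigned HUF_minTableLog(unsigned symbolCardinality)
{
    /* The minimum usable table log is determined by how many symbols are
     * actually present; srcSize is no longer needed as a second estimator. */
    return ZSTD_highbit32(symbolCardinality) + 1;
}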
This makes sense, thanks for explaining; I initially wasn't sure why we did the first check.
@@ -29,9 +29,9 @@ github.tar, level 7, compress
 github.tar, level 9, compress simple, 36760
 github.tar, level 13, compress simple, 35501
 github.tar, level 16, compress simple, 40471
-github.tar, level 19, compress simple, 32134
+github.tar, level 19, compress simple, 32149
Do I read correctly that compressed size is (slightly) worse for this scenario? (github.tar + level 19)
I assume that this is the case that Nick suggested, where the chosen depth might be locally optimal, but is actually worse when used repeatedly for the next block(s).
I can't really think of another reason why the local optimum would not be globally optimal.
Yes, and it's likely a good guess.
I was thinking it was a good opportunity to investigate, and to transform a hypothesis into confirmed knowledge.
That's very good, @daniellerozenblit! Just one last minor simplification suggestion, and I believe this PR is good to go! Also, I believe it would be interesting to understand why the compressed size becomes (slightly) worse for the github.tar + level 19 scenario.
TLDR
This PR modifies HUF_optimalTableLog so that it behaves differently for high and low compression levels. Previously, we used the same heuristic to find the table log for all compression levels. This PR introduces a change that, for high compression levels, tests all valid table depths and chooses the optimal one based on minimizing encoded + header size. This allows us to find additional compression-ratio gains in the entropy stage for high compression levels.
The threshold for finding the optimal table log is compression level 18 (or strategy ZSTD_btultra). Any compression level below this will continue to use the previous heuristic. Note that we intend to introduce some speed optimizations in a follow-up PR that should allow us to expand the use of this feature to lower compression levels.
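As a rough illustration of the high-compression path described above (hypothetical helper names, not the PR's actual code), the search boils down to trying every candidate table log and keeping the cheapest one:

#include <stddef.h>

/* Brute-force selection: try each valid table log and keep the one that
 * minimizes header size + encoded size. trialCompress stands in for
 * building the Huffman table at the given log and measuring the result. */
static unsigned pickOptimalTableLog(unsigned minTableLog, unsigned maxTableLog,
                                    size_t (*trialCompress)(unsigned tableLog, void* ctx),
                                    void* ctx)
{
    unsigned bestLog = maxTableLog;
    size_t bestSize = (size_t)-1;
    unsigned log;
    for (log = minTableLog; log <= maxTableLog; log++) {
        size_t const totalSize = trialCompress(log, ctx);  /* header + encoded bytes */
        if (totalSize < bestSize) { bestSize = totalSize; bestLog = log; }
    }
    return bestLog;
}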
Benchmarking
I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212M), compiling with clang15. I experimented with various combinations of compression level and chunk size. I ran each scenario 5 times and chose the maximum speed value.
Silesia.tar
Compression Ratio: results for -B1KB, -B16KB, and -B32KB (charts not reproduced here).
Compression Speed: results for -B1KB, -B16KB, and -B32KB (charts not reproduced here).
Silesia unpacked
I did some additional benchmarking on the individual files within silesia.tar to investigate which files might contribute more to the overall reduction in compressed size.
Within the corpus, the files mozilla, mr, x-ray, and osdb seem to reap the most benefits from the optimization. The compressed sizes for dickens and nci, on the other hand, do not appear to significantly change, especially when considering the relatively large sizes of these files.
Per-file results (charts not reproduced here): dickens, mozilla, mr, nci, ooffice, osdb, reymont, samba, sao, webster, xml, x-ray.