Optimal huf depth #3285

Merged 10 commits into facebook:dev on Oct 18, 2022

Conversation

@daniellerozenblit (Contributor) commented on Oct 11, 2022:

TLDR

This PR modifies HUF_optimalTableLog so that it behaves differently for high and low compression levels. Previously, we used the same heuristic to find the table log at every compression level. With this change, high compression levels instead test every valid table depth and choose the one that minimizes encoded size + table header size. This allows us to find additional compression ratio gains in the entropy stage at high compression levels.

The optimal table log search is enabled at compression level 18 (strategy ZSTD_btultra) and above; any compression level below this continues to use the previous heuristic. Note that we intend to introduce some speed optimizations in a follow-up PR, which should allow us to extend this feature to lower compression levels.
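For illustration, the selection can be sketched as a straightforward brute-force loop. This is a minimal sketch of the idea, not the code added by this PR; it assumes the existing zstd internal helpers (`HUF_buildCTable_wksp`, `HUF_writeCTable_wksp`, `HUF_estimateCompressedSize`, `HUF_isError`) keep the signatures declared in `huf.h`, and that `workSpace` is large enough for both table construction and header serialization. The function name is a placeholder.

```c
/* Illustrative sketch only -- not the exact implementation in this PR.
 * Try every candidate table log and keep the one that minimizes
 * (estimated encoded size + serialized table header size). */
static unsigned HUF_chooseTableLog_bruteForce(const unsigned* count, unsigned maxSymbolValue,
                                              unsigned minTableLog, unsigned maxTableLog,
                                              HUF_CElt* table,     /* scratch CTable */
                                              void* workSpace, size_t wkspSize)
{
    unsigned optLog = maxTableLog;
    size_t optSize = (size_t)-1;
    unsigned huffLog;
    BYTE header[256];   /* generously sized scratch for the serialized table header */

    for (huffLog = minTableLog; huffLog <= maxTableLog; huffLog++) {
        /* Build a candidate Huffman table constrained to huffLog bits. */
        size_t const maxBits = HUF_buildCTable_wksp(table, count, maxSymbolValue, huffLog, workSpace, wkspSize);
        size_t hSize, cSize;
        if (HUF_isError(maxBits)) continue;
        /* Cost of this candidate = header bytes + estimated payload bytes. */
        hSize = HUF_writeCTable_wksp(header, sizeof(header), table, maxSymbolValue, (unsigned)maxBits, workSpace, wkspSize);
        if (HUF_isError(hSize)) continue;
        cSize = HUF_estimateCompressedSize(table, count, maxSymbolValue);
        if (hSize + cSize < optSize) { optSize = hSize + cSize; optLog = huffLog; }
    }
    return optLog;
}
```

The level-18 gate keeps this extra work out of the lower levels, which continue to use the single-shot heuristic.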

Benchmarking

I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212M), compiled with clang 15, experimenting with various combinations of compression level and chunk size. I ran each scenario 5 times and took the maximum speed value.

Silesia.tar

Compression Ratio

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 3 | 66508266 | 66507308 | -958 |
| 6 | 61468353 | 61466015 | -2338 |
| 9 | 59288102 | 59286416 | -1686 |
| 12 | 58180337 | 58178667 | -1670 |
| 13 | 57978236 | 57976678 | -1558 |
| 15 | 57159315 | 57157571 | -1744 |
| 18 | 53452679 | 53446202 | -6477 |
| 19 | 53007368 | 53000712 | -6656 |

-B1KB

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 105689934 | 105327236 | -362698 |
| 15 | 105604824 | 105241376 | -363448 |
| 18 | 105596533 | 105230572 | -365961 |
| 19 | 105596492 | 105230533 | -365959 |

-B16KB

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 72974335 | 72953613 | -20722 |
| 15 | 72671308 | 72651124 | -20184 |
| 18 | 72417147 | 72397273 | -19874 |
| 19 | 72417034 | 72397172 | -19862 |

-B32KB

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 71405580 | 71396709 | -8871 |
| 15 | 68878942 | 68866239 | -12703 |
| 18 | 68590940 | 68578200 | -12740 |
| 19 | 68330534 | 68317337 | -13197 |

Compression Speed

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 3 | 130.8 | 129.9 | -0.688 |
| 6 | 52.7 | 52.1 | -1.13 |
| 9 | 33.7 | 33.5 | -0.593 |
| 12 | 17.4 | 17.2 | -1.163 |
| 13 | 8.16 | 8.27 | +1.348 |
| 15 | 5.31 | 5.19 | -2.260 |
| 18 | 2.28 | 2.25 | -1.316 |
| 19 | 1.85 | 1.82 | -1.622 |

-B1KB

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 13 | 8.82 | 7.33 | -16.89 |
| 15 | 7.05 | 6.06 | -14.04 |
| 18 | 7.06 | 6.06 | -14.16 |
| 19 | 7.05 | 6.06 | -14.04 |

-B16KB

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 13 | 6.08 | 6.02 | -0.987 |
| 15 | 4.35 | 4.31 | -0.920 |
| 18 | 2.19 | 2.19 | +0.00 |
| 19 | 2.19 | 2.19 | +0.00 |

-B32KB

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 13 | 9.70 | 9.63 | -0.722 |
| 15 | 5.83 | 5.80 | -0.515 |
| 18 | 3.55 | 3.54 | -0.282 |
| 19 | 2.04 | 2.04 | +0.00 |

Silesia unpacked

I did some additional benchmarking on the individual files within silesia.tar to investigate which files might contribute more to the overall reduction in compressed size.

Within the corpus, the files mozilla, mr, x-ray, and osdb seem to reap the most benefits from the optimization. The compressed sizes for dickens and nci, on the other hand, do not appear to significantly change, especially when considering the relatively large sizes of these files.

dickens

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3712646 | 3712545 | -101 |
| 15 | 3662864 | 3662730 | -134 |

mozilla

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 18059041 | 18054319 | -4722 |
| 15 | 17263761 | 17258825 | -4936 |

mr

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3436245 | 3434871 | -1374 |
| 15 | 3446356 | 3445058 | -1298 |

nci

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 2821065 | 2820980 | -85 |
| 15 | 2405267 | 2405244 | -23 |

ooffice

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3029211 | 3028749 | -462 |
| 15 | 2897444 | 2897162 | -282 |

osdb

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3713925 | 3712640 | -1285 |
| 15 | 3721588 | 3720448 | -1140 |

reymont

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 1815541 | 1815460 | -81 |
| 15 | 1717733 | 1717655 | -78 |

samba

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 5118655 | 5118507 | -148 |
| 15 | 4935781 | 4935616 | -165 |

sao

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 5216394 | 5216147 | -247 |
| 15 | 5153777 | 5153780 | +3 |

webster

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 11912905 | 11912788 | -117 |
| 15 | 11498834 | 11498778 | -56 |

xml

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 677105 | 677062 | -43 |
| 15 | 619435 | 619423 | -12 |

x-ray

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 6078800 | 6078114 | -686 |
| 15 | 5693609 | 5693051 | -558 |

@daniellerozenblit changed the title from "Optimal huff depth" to "Optimal huf depth" on Oct 11, 2022.
```c
    return minBits;
}

unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue, void* workSpace, size_t wkspSize, HUF_CElt* table, const unsigned* count, HUF_depth_mode depthMode)
```
Contributor:

An interface implementation detail:
`HUF_CElt* table`: expressed this way, it implies that `table` is an expected output of the function.

But it's not.
Effectively, `table` is only provided as a kind of temporary workspace; anything it may contain is simply thrown away afterwards.

How to make that distinction? To follow the established convention, `table` should not exist as a separate parameter, but be folded into `workSpace`.

Now, I appreciate that it's probably easier to keep `workSpace` and `HUF_CElt*` separate because that's how they exist on the caller side, and trying to bundle them into a single `workSpace` might end up messier for the caller.

So okay, implementation complexity is a valid criterion.

In which case, please document clearly that `table` is just a specialized "workspace", not an expected output of the function.
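One way to satisfy that last request (the wording is illustrative, not the PR's actual comment) is a doc comment that spells out the contract for `table`:

```c
/* HUF_optimalTableLog():
 * @note `table` is used purely as scratch space for building candidate CTables.
 *       Its contents on return are NOT a valid output of this function;
 *       callers must rebuild the CTable themselves before using it. */
unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue,
                             void* workSpace, size_t wkspSize,
                             HUF_CElt* table, const unsigned* count, HUF_depth_mode depthMode);
```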

```c
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
    U32 minBits = minBitsSrc < minBitsSymbols ? minBitsSrc : minBitsSymbols;
    if (minBits < FSE_MIN_TABLELOG) minBits = FSE_MIN_TABLELOG;
```
Contributor:

The FSE_MIN_TABLELOG restriction is specific to FSE.
It's not necessary for Huffman.
Consequently, this line can be dropped.

```c
unsigned HUF_minTableLog(size_t srcSize, unsigned maxSymbolValue)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
```
@Cyan4973 (Contributor) commented on Oct 13, 2022:

The little detail that matters:

Note that, in this formula, + 2 is itself a heuristic.
That's because this code is taken from FSE_minTableLog(), where it is presumed to be used in combination with a fast heuristic. Consequently, one of its goals was to avoid returning a minimum that is too low, which would generally make for a rather poor candidate.

But with this new brute-force strategy, as long as a candidate is sometimes better, it's worth investigating.

So it's possible to change this formula to + 1, which is the real minimum.
(Note: the real minimum should actually be based on the cardinality of the distribution, for which maxSymbolValue is merely a cheap upper bound.)

I tried that, with a quick test on silesia.tar.
The new lower limit finds a few more bytes here and there, achieving an additional 1-2 KB of savings at higher compression modes. Not big in absolute value, but relative to the existing savings it improves this strategy by ~20%, so it's a non-negligible contributor.

Where it shines, though, is in combination with small blocks.
For example, when cutting silesia.tar into blocks of 1 KB, it achieves additional savings of almost 300 KB. Not a mistake. It's a game changer for that scenario.

The downside of this strategy is that there is now one more distribution to test, so it's even slower.
That could imply revisiting the algorithm's triggering threshold.

Alternatively, it could also provide an incentive to invest time in a more optimized method that requires less CPU effort, which I believe is possible without loss of compression ratio.
Maybe this could be dealt with in a follow-up...
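As a sketch of the cardinality idea from the note above (the helper name and placement are illustrative, not necessarily how the PR implements it), the real floor would come from counting how many symbols actually occur in the histogram rather than trusting `maxSymbolValue`:

```c
/* Illustrative helper: number of symbols that actually appear in the histogram.
 * maxSymbolValue is only a cheap upper bound on this value. */
static unsigned HUF_cardinality(const unsigned* count, unsigned maxSymbolValue)
{
    unsigned cardinality = 0;
    unsigned i;
    for (i = 0; i < maxSymbolValue + 1; i++) {
        if (count[i] != 0) cardinality += 1;
    }
    return cardinality;
}
```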

@Cyan4973 (Contributor) commented:

I'm somewhat concerned by the speed delta measured for small blocks (-B1K).
This may invite a more complex heuristic to decide when to enable the feature, with srcSize or blockSize as a potential parameter.
Or better optimization...

@daniellerozenblit (Author) replied:

> I'm somewhat concerned by the speed delta measured for small blocks (-B1K). This may invite a more complex heuristic to decide when to enable the feature, with srcSize or blockSize as a potential parameter. Or better optimization...

I am also concerned about the large speed delta. Considering the additional wins that you found after modifying HUF_minTableLog(), I am inclined to first attempt better optimization. It would be a shame to omit this feature for small blocks, when it appears that this is where we see the biggest win.
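For illustration only, the kind of gate being discussed might look like the sketch below. The function name and the block-size threshold are hypothetical placeholders, not anything proposed or tuned in this PR; `ZSTD_strategy` and `ZSTD_btultra` are the existing public enum values from `zstd.h`.

```c
#include "zstd.h"   /* ZSTD_strategy, ZSTD_btultra */

/* Hypothetical gate: only pay for the exhaustive depth search when the
 * strategy is slow enough and the block is large enough to amortize it.
 * The 4096-byte threshold is an arbitrary placeholder, not a tuned value. */
static int ZSTD_useOptimalHufDepth(ZSTD_strategy strategy, size_t blockSize)
{
    if (strategy < ZSTD_btultra) return 0;   /* lower levels keep the fast heuristic */
    if (blockSize < 4096) return 0;          /* tiny blocks: search overhead dominates */
    return 1;
}
```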


```c
unsigned HUF_minTableLog(size_t srcSize, unsigned symbolCardinality)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
```
@Cyan4973 (Contributor) commented on Oct 15, 2022:

I would simplify this function.

The reason srcSize was previously part of the formula
is that maxSymbolValue used to be a generous upper bound on the real cardinality.
Using srcSize as a second estimator helped cases where srcSize < maxSymbolValue, producing a tighter upper-bound estimate for them.

But now that we have the real cardinality, there is no need for another approximate upper bound.
Instead, derive the result from symbolCardinality directly.
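A rough sketch of that simplification (the exact final form is whatever lands in the PR; assumes symbolCardinality >= 1):

```c
/* Sketch: derive the floor directly from the cardinality.
 * ZSTD_highbit32(symbolCardinality) + 1 bits of depth is always enough to
 * give every present symbol a code, so it serves as the search's lower bound. */
unsigned HUF_minTableLog(unsigned symbolCardinality)
{
    U32 const minBitsSymbols = ZSTD_highbit32(symbolCardinality) + 1;
    return minBitsSymbols;
}
```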

@daniellerozenblit (Author) replied:

This makes sense, thanks for explaining. I initially wasn't sure why we did the first check.

```diff
@@ -29,9 +29,9 @@ github.tar, level 7, compress
 github.tar, level 9, compress simple, 36760
 github.tar, level 13, compress simple, 35501
 github.tar, level 16, compress simple, 40471
-github.tar, level 19, compress simple, 32134
+github.tar, level 19, compress simple, 32149
```
@Cyan4973 (Contributor) commented on Oct 15, 2022:

Do I read correctly that the compressed size is (slightly) worse for this scenario (github.tar at level 19)?

@daniellerozenblit (Author) replied:

I assume this is the case Nick suggested, where the chosen depth is locally optimal but actually worse when reused for the next block(s).

I can't really think of another reason why the local optimum would not be the global optimum.

Contributor:

Yes, and it's likely a good guess.
I was thinking it was a good opportunity to investigate and turn a hypothesis into confirmed knowledge.

@Cyan4973 (Contributor) commented:

That's very good, @daniellerozenblit!

Just one last minor simplification suggestion, and I believe this PR is good to go!

Also, I believe it would be interesting to understand why the compressed size becomes (slightly) worse for the github.tar + level 19 scenario. It might not be fixable, but it's still interesting to understand what's going on.

@daniellerozenblit marked this pull request as ready for review on Oct 17, 2022.
@daniellerozenblit merged commit 0d5d571 into facebook:dev on Oct 18, 2022.
@daniellerozenblit deleted the optimal-huff-depth branch on Jan 4, 2023.
@Cyan4973 mentioned this pull request on Feb 9, 2023.