Optimal huf depth #3285

Merged 10 commits into facebook:dev on Oct 18, 2022

Conversation

@daniellerozenblit (Contributor) commented on Oct 11, 2022:

TLDR

This PR modifies HUF_optimalTableLog so that it behaves differently for high and low compression levels. Previously, we used the same heuristic to find the table log at every compression level. With this change, high compression levels instead test every valid table depth and choose the one that minimizes encoded size + table header size. This allows us to find additional compression ratio gains in the entropy stage at high compression levels.

The optimal table log search is enabled at compression level 18 (strategy ZSTD_btultra) and above; any compression level below this continues to use the previous heuristic. Note that we intend to introduce some speed optimizations in a follow-up PR, which should allow us to extend this feature to lower compression levels.
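For illustration, the selection can be sketched as a straightforward brute-force loop. This is a minimal sketch of the idea, not the code added by this PR; it assumes the existing zstd internal helpers (`HUF_buildCTable_wksp`, `HUF_writeCTable_wksp`, `HUF_estimateCompressedSize`, `HUF_isError`) keep the signatures declared in `huf.h`, and that `workSpace` is large enough for both table construction and header serialization. The function name is a placeholder.

```c
/* Illustrative sketch only -- not the exact implementation in this PR.
 * Try every candidate table log and keep the one that minimizes
 * (estimated encoded size + serialized table header size). */
static unsigned HUF_chooseTableLog_bruteForce(const unsigned* count, unsigned maxSymbolValue,
                                              unsigned minTableLog, unsigned maxTableLog,
                                              HUF_CElt* table,     /* scratch CTable */
                                              void* workSpace, size_t wkspSize)
{
    unsigned optLog = maxTableLog;
    size_t optSize = (size_t)-1;
    unsigned huffLog;
    BYTE header[256];   /* generously sized scratch for the serialized table header */

    for (huffLog = minTableLog; huffLog <= maxTableLog; huffLog++) {
        /* Build a candidate Huffman table constrained to huffLog bits. */
        size_t const maxBits = HUF_buildCTable_wksp(table, count, maxSymbolValue, huffLog, workSpace, wkspSize);
        size_t hSize, cSize;
        if (HUF_isError(maxBits)) continue;
        /* Cost of this candidate = header bytes + estimated payload bytes. */
        hSize = HUF_writeCTable_wksp(header, sizeof(header), table, maxSymbolValue, (unsigned)maxBits, workSpace, wkspSize);
        if (HUF_isError(hSize)) continue;
        cSize = HUF_estimateCompressedSize(table, count, maxSymbolValue);
        if (hSize + cSize < optSize) { optSize = hSize + cSize; optLog = huffLog; }
    }
    return optLog;
}
```

The level-18 gate keeps this extra work out of the lower levels, which continue to use the single-shot heuristic.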

Benchmarking

I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212M), compiled with clang 15, experimenting with various combinations of compression level and chunk size. I ran each scenario 5 times and took the maximum speed value.

Silesia.tar

Compression Ratio

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 3 | 66508266 | 66507308 | -958 |
| 6 | 61468353 | 61466015 | -2338 |
| 9 | 59288102 | 59286416 | -1686 |
| 12 | 58180337 | 58178667 | -1670 |
| 13 | 57978236 | 57976678 | -1558 |
| 15 | 57159315 | 57157571 | -1744 |
| 18 | 53452679 | 53446202 | -6477 |
| 19 | 53007368 | 53000712 | -6656 |

-B1KB

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 105689934 | 105327236 | -362698 |
| 15 | 105604824 | 105241376 | -363448 |
| 18 | 105596533 | 105230572 | -365961 |
| 19 | 105596492 | 105230533 | -365959 |

-B16KB

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 72974335 | 72953613 | -20722 |
| 15 | 72671308 | 72651124 | -20184 |
| 18 | 72417147 | 72397273 | -19874 |
| 19 | 72417034 | 72397172 | -19862 |

-B32KB

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 71405580 | 71396709 | -8871 |
| 15 | 68878942 | 68866239 | -12703 |
| 18 | 68590940 | 68578200 | -12740 |
| 19 | 68330534 | 68317337 | -13197 |

Compression Speed

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 3 | 130.8 | 129.9 | -0.688 |
| 6 | 52.7 | 52.1 | -1.13 |
| 9 | 33.7 | 33.5 | -0.593 |
| 12 | 17.4 | 17.2 | -1.163 |
| 13 | 8.16 | 8.27 | +1.348 |
| 15 | 5.31 | 5.19 | -2.260 |
| 18 | 2.28 | 2.25 | -1.316 |
| 19 | 1.85 | 1.82 | -1.622 |

-B1KB

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 13 | 8.82 | 7.33 | -16.89 |
| 15 | 7.05 | 6.06 | -14.04 |
| 18 | 7.06 | 6.06 | -14.16 |
| 19 | 7.05 | 6.06 | -14.04 |

-B16KB

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 13 | 6.08 | 6.02 | -0.987 |
| 15 | 4.35 | 4.31 | -0.920 |
| 18 | 2.19 | 2.19 | +0.00 |
| 19 | 2.19 | 2.19 | +0.00 |

-B32KB

| Compression Level | Dev Speed (MB/s) | Test Speed (MB/s) | Speed Delta (%) |
|---|---|---|---|
| 13 | 9.70 | 9.63 | -0.722 |
| 15 | 5.83 | 5.80 | -0.515 |
| 18 | 3.55 | 3.54 | -0.282 |
| 19 | 2.04 | 2.04 | +0.00 |

Silesia unpacked

I did some additional benchmarking on the individual files within silesia.tar to investigate which files might contribute more to the overall reduction in compressed size.

Within the corpus, the files mozilla, mr, x-ray, and osdb seem to reap the most benefits from the optimization. The compressed sizes for dickens and nci, on the other hand, do not appear to significantly change, especially when considering the relatively large sizes of these files.

dickens

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3712646 | 3712545 | -101 |
| 15 | 3662864 | 3662730 | -134 |

mozilla

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 18059041 | 18054319 | -4722 |
| 15 | 17263761 | 17258825 | -4936 |

mr

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3436245 | 3434871 | -1374 |
| 15 | 3446356 | 3445058 | -1298 |

nci

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 2821065 | 2820980 | -85 |
| 15 | 2405267 | 2405244 | -23 |

ooffice

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3029211 | 3028749 | -462 |
| 15 | 2897444 | 2897162 | -282 |

osdb

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 3713925 | 3712640 | -1285 |
| 15 | 3721588 | 3720448 | -1140 |

reymont

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 1815541 | 1815460 | -81 |
| 15 | 1717733 | 1717655 | -78 |

samba

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 5118655 | 5118507 | -148 |
| 15 | 4935781 | 4935616 | -165 |

sao

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 5216394 | 5216147 | -247 |
| 15 | 5153777 | 5153780 | +3 |

webster

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 11912905 | 11912788 | -117 |
| 15 | 11498834 | 11498778 | -56 |

xml

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 677105 | 677062 | -43 |
| 15 | 619435 | 619423 | -12 |

x-ray

| Compression Level | Dev Size (B) | Test Size (B) | Size Delta (B) |
|---|---|---|---|
| 13 | 6078800 | 6078114 | -686 |
| 15 | 5693609 | 5693051 | -558 |

@daniellerozenblit changed the title from "Optimal huff depth" to "Optimal huf depth" on Oct 11, 2022.
```c
    return minBits;
}

unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue, void* workSpace, size_t wkspSize, HUF_CElt* table, const unsigned* count, HUF_depth_mode depthMode)
```
Contributor:

An interface implementation detail:
`HUF_CElt* table`: expressed this way, it implies that `table` is an expected output of the function.

But it's not.
Effectively, `table` is only provided as a kind of temporary workspace; anything it may contain is simply thrown away afterwards.

How to make that distinction? To follow the established convention, `table` should not exist as a separate parameter, but be folded into `workSpace`.

Now, I appreciate that it's probably easier to keep `workSpace` and `HUF_CElt*` separate because that's how they exist on the caller side, and trying to bundle them into a single `workSpace` might end up messier for the caller.

So okay, implementation complexity is a valid criterion.

In which case, please document clearly that `table` is just a specialized "workspace", not an expected output of the function.
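One way to satisfy that last request (the wording is illustrative, not the PR's actual comment) is a doc comment that spells out the contract for `table`:

```c
/* HUF_optimalTableLog():
 * @note `table` is used purely as scratch space for building candidate CTables.
 *       Its contents on return are NOT a valid output of this function;
 *       callers must rebuild the CTable themselves before using it. */
unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue,
                             void* workSpace, size_t wkspSize,
                             HUF_CElt* table, const unsigned* count, HUF_depth_mode depthMode);
```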

```c
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
    U32 minBits = minBitsSrc < minBitsSymbols ? minBitsSrc : minBitsSymbols;
    if (minBits < FSE_MIN_TABLELOG) minBits = FSE_MIN_TABLELOG;
```
Contributor:

The FSE_MIN_TABLELOG restriction is specific to FSE.
It's not necessary for Huffman.
Consequently, this line can be dropped.

```c
unsigned HUF_minTableLog(size_t srcSize, unsigned maxSymbolValue)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
    U32 minBitsSymbols = ZSTD_highbit32(maxSymbolValue) + 2;
```
@Cyan4973 (Contributor) commented on Oct 13, 2022:

The little detail that matters:

Note that, in this formula, + 2 is itself a heuristic.
That's because this code is taken from FSE_minTableLog(), where it is presumed to be used in combination with a fast heuristic. Consequently, one of its goals was to avoid returning a minimum that is too low, which would generally make for a rather poor candidate.

But with this new brute-force strategy, as long as a candidate is sometimes better, it's worth investigating.

So it's possible to change this formula to + 1, which is the real minimum.
(Note: the real minimum should actually be based on the cardinality of the distribution, for which maxSymbolValue is merely a cheap upper bound.)

I tried that, with a quick test on silesia.tar.
The new lower limit finds a few more bytes here and there, achieving an additional 1-2 KB of savings at higher compression modes. Not big in absolute value, but relative to the existing savings it improves this strategy by ~20%, so it's a non-negligible contributor.

Where it shines, though, is in combination with small blocks.
For example, when cutting silesia.tar into blocks of 1 KB, it achieves additional savings of almost 300 KB. Not a mistake. It's a game changer for that scenario.

The downside of this strategy is that there is now one more distribution to test, so it's even slower.
That could imply revisiting the algorithm's triggering threshold.

Alternatively, it could also provide an incentive to invest time in a more optimized method that requires less CPU effort, which I believe is possible without loss of compression ratio.
Maybe this could be dealt with in a follow-up...
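As a sketch of the cardinality idea from the note above (the helper name and placement are illustrative, not necessarily how the PR implements it), the real floor would come from counting how many symbols actually occur in the histogram rather than trusting `maxSymbolValue`:

```c
/* Illustrative helper: number of symbols that actually appear in the histogram.
 * maxSymbolValue is only a cheap upper bound on this value. */
static unsigned HUF_cardinality(const unsigned* count, unsigned maxSymbolValue)
{
    unsigned cardinality = 0;
    unsigned i;
    for (i = 0; i < maxSymbolValue + 1; i++) {
        if (count[i] != 0) cardinality += 1;
    }
    return cardinality;
}
```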

@Cyan4973 (Contributor) commented:

I'm somewhat concerned by the speed delta measured for small blocks (-B1K).
This may invite a more complex heuristic to decide when to enable the feature, with srcSize or blockSize as a potential parameter.
Or better optimization...

@daniellerozenblit (Author) replied:

> I'm somewhat concerned by the speed delta measured for small blocks (-B1K). This may invite a more complex heuristic to decide when to enable the feature, with srcSize or blockSize as a potential parameter. Or better optimization...

I am also concerned about the large speed delta. Considering the additional wins that you found after modifying HUF_minTableLog(), I am inclined to first attempt better optimization. It would be a shame to omit this feature for small blocks, when it appears that this is where we see the biggest win.
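For illustration only, the kind of gate being discussed might look like the sketch below. The function name and the block-size threshold are hypothetical placeholders, not anything proposed or tuned in this PR; `ZSTD_strategy` and `ZSTD_btultra` are the existing public enum values from `zstd.h`.

```c
#include "zstd.h"   /* ZSTD_strategy, ZSTD_btultra */

/* Hypothetical gate: only pay for the exhaustive depth search when the
 * strategy is slow enough and the block is large enough to amortize it.
 * The 4096-byte threshold is an arbitrary placeholder, not a tuned value. */
static int ZSTD_useOptimalHufDepth(ZSTD_strategy strategy, size_t blockSize)
{
    if (strategy < ZSTD_btultra) return 0;   /* lower levels keep the fast heuristic */
    if (blockSize < 4096) return 0;          /* tiny blocks: search overhead dominates */
    return 1;
}
```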


```c
unsigned HUF_minTableLog(size_t srcSize, unsigned symbolCardinality)
{
    U32 minBitsSrc = ZSTD_highbit32((U32)(srcSize)) + 1;
```
@Cyan4973 (Contributor) commented on Oct 15, 2022:

I would simplify this function.

The reason srcSize was previously part of the formula
is that maxSymbolValue used to be a generous upper bound on the real cardinality.
Using srcSize as a second estimator helped cases where srcSize < maxSymbolValue, producing a tighter upper-bound estimate for them.

But now that we have the real cardinality, there is no need for another approximate upper bound.
Instead, derive the result from symbolCardinality directly.
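A rough sketch of that simplification (the exact final form is whatever lands in the PR; assumes symbolCardinality >= 1):

```c
/* Sketch: derive the floor directly from the cardinality.
 * ZSTD_highbit32(symbolCardinality) + 1 bits of depth is always enough to
 * give every present symbol a code, so it serves as the search's lower bound. */
unsigned HUF_minTableLog(unsigned symbolCardinality)
{
    U32 const minBitsSymbols = ZSTD_highbit32(symbolCardinality) + 1;
    return minBitsSymbols;
}
```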

@daniellerozenblit (Author) replied:

This makes sense, thanks for explaining. I initially wasn't sure why we did the first check.

```diff
@@ -29,9 +29,9 @@ github.tar, level 7, compress
 github.tar, level 9, compress simple, 36760
 github.tar, level 13, compress simple, 35501
 github.tar, level 16, compress simple, 40471
-github.tar, level 19, compress simple, 32134
+github.tar, level 19, compress simple, 32149
```
@Cyan4973 (Contributor) commented on Oct 15, 2022:

Do I read correctly that the compressed size is (slightly) worse for this scenario (github.tar at level 19)?

@daniellerozenblit (Author) replied:

I assume this is the case Nick suggested, where the chosen depth is locally optimal but actually worse when reused for the next block(s).

I can't really think of another reason why the local optimum would not be the global optimum.

Contributor:

Yes, and it's likely a good guess.
I was thinking it was a good opportunity to investigate and turn a hypothesis into confirmed knowledge.

@Cyan4973 (Contributor) commented:

That's very good, @daniellerozenblit!

Just one last minor simplification suggestion, and I believe this PR is good to go!

Also, I believe it would be interesting to understand why the compressed size becomes (slightly) worse for the github.tar + level 19 scenario. It might not be fixable, but it's still interesting to understand what's going on.

@daniellerozenblit marked this pull request as ready for review on Oct 17, 2022.
@daniellerozenblit merged commit 0d5d571 into facebook:dev on Oct 18, 2022.
@daniellerozenblit deleted the optimal-huff-depth branch on Jan 4, 2023.
@Cyan4973 mentioned this pull request on Feb 9, 2023.