Block splitter control parameter #4180

Cyan4973 · 2024-10-28T23:53:13Z

Make it possible to explicitly select a block splitter level, via a new CCtx parameter ZSTD_c_blockSplitter_level.

This capability is then exploited in a test, ensuring that incompressible data is not overly split, even in the presence of adversarial input (crafted with full knowledge of the sampling pattern).
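To illustrate why over-splitting matters for incompressible data, here is a minimal arithmetic sketch. It assumes every split piece ends up stored as a raw block, and it relies on the 3-byte block header defined by the zstd frame format. The helper names are hypothetical and are not part of the library or of this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Each zstd block starts with a 3-byte header (zstd frame format). */
#define BLOCK_HEADER_SIZE 3u

/* Hypothetical helper: bytes emitted when `srcSize` incompressible bytes
 * are stored as `nbBlocks` raw blocks (payload + one header per block). */
size_t rawBlocksSize(size_t srcSize, size_t nbBlocks)
{
    assert(nbBlocks >= 1);
    return srcSize + nbBlocks * BLOCK_HEADER_SIZE;
}

/* Hypothetical helper: extra bytes paid for splitting a single block
 * into `nbBlocks` pieces, compared to leaving it unsplit. */
size_t splitOverhead(size_t nbBlocks)
{
    assert(nbBlocks >= 1);
    return (nbBlocks - 1) * BLOCK_HEADER_SIZE;
}
```

Under these assumptions, splitting one 128 KB block of incompressible data into 8 pieces costs `splitOverhead(8)`, that is 21 extra bytes, with nothing gained in return. The cost per split is small, but it grows linearly with the number of splits, which is what an adversarial input would try to maximize.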

Note: possible follow-up:
This PR just adds a new parameter, to control the behavior of the new block splitter.
It doesn't modify the existing parameters.

But as a consequence, there are now 2 parameters for block splitters,
one (legacy) that is controlling the post block-splitter (after sequences are determined)
and a new one that is controlling the new pre block-splitter (before sequences are produced).
Already, it's debatable if it's useful for a user to be exposed to these concepts.
More importantly, the distinction between pre and post block splitter is not clear from the current parameters' names (ZSTD_c_useBlockSplitter vs ZSTD_c_blockSplitter_level).

So it raises the question of refactoring these parameters.
For example, maybe both parameters could be fused into a single one, the new ZSTD_c_blockSplitter_level, which would be in charge of enabling both splitters when the level is high enough.
Or maybe there is still value in keeping these parameters separate, for example for an optimizer tool which could more naturally influence both code paths and maybe find a better combination for some specific use case. In which case, it's probably still useful to debate meaningful parameter names.
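The "fused parameter" option could look roughly like the sketch below. This is only an illustration of the idea: the struct, the function, and especially the threshold at which the post splitter turns on are assumptions made up for this example, not code from the PR. Only the maximum level of 6 mirrors `ZSTD_BLOCKSPLITTER_LEVEL_MAX` from the diff.

```c
#include <assert.h>

/* Hypothetical settings pair: not an actual zstd structure. */
typedef struct {
    int preSplitLevel;     /* splitter running before sequences are produced; 0 = off */
    int postSplitEnabled;  /* splitter running after sequences are determined */
} SplitSettings;

#define SPLIT_LEVEL_MAX      6  /* mirrors ZSTD_BLOCKSPLITTER_LEVEL_MAX in the PR */
#define POST_SPLIT_MIN_LEVEL 4  /* assumed threshold, for illustration only */

/* Map a single fused level onto both splitters:
 * the pre splitter follows the level directly,
 * the post splitter is enabled once the level is high enough. */
SplitSettings settingsFromFusedLevel(int level)
{
    SplitSettings s;
    assert(level >= 0 && level <= SPLIT_LEVEL_MAX);
    s.preSplitLevel = level;
    s.postSplitEnabled = (level >= POST_SPLIT_MIN_LEVEL);
    return s;
}
```

The trade-off is visible in the mapping: a single knob is simpler to expose, but combinations such as "pre splitter at a high level, post splitter off" become unreachable, which is the argument for keeping two parameters.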


terrelln · 2024-10-29T17:09:39Z

lib/zstd.h

+ * to ensure expansion guarantees in presence of incompressible data.
+ */
+#define ZSTD_BLOCKSPLITTER_LEVEL_MAX 6
+#define ZSTD_c_blockSplitter_level ZSTD_c_experimentalParam20


nit: Let's call this ZSTD_c_blockSplitterLevel to match the naming convention of the rest of the parameters.
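For context, a level parameter capped by a `_MAX` macro is typically validated against a closed range. The sketch below shows that kind of check. The upper bound is the value from the diff above; the lower bound of 0 and both helper names are assumptions made for illustration, not the library's actual validation code.

```c
#include <assert.h>

#define ZSTD_BLOCKSPLITTER_LEVEL_MAX 6  /* value taken from the diff above */
#define BLOCKSPLITTER_LEVEL_MIN      0  /* assumption: 0 is the lowest accepted value */

/* Hypothetical validator: returns 1 when `level` lies within the accepted range. */
int blockSplitterLevel_isValid(int level)
{
    return (level >= BLOCKSPLITTER_LEVEL_MIN)
        && (level <= ZSTD_BLOCKSPLITTER_LEVEL_MAX);
}

/* Hypothetical helper: clamp an arbitrary request into the accepted range. */
int blockSplitterLevel_clamp(int level)
{
    int clamped = level;
    if (clamped < BLOCKSPLITTER_LEVEL_MIN) clamped = BLOCKSPLITTER_LEVEL_MIN;
    if (clamped > ZSTD_BLOCKSPLITTER_LEVEL_MAX) clamped = ZSTD_BLOCKSPLITTER_LEVEL_MAX;
    assert(blockSplitterLevel_isValid(clamped));
    return clamped;
}
```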


Cyan4973 · 2024-10-30T18:07:50Z

I think I'll keep both parameters,
because both splitting methods (before and after sequences) may evolve independently,
and the way they combine or compete could change over time.

But it's necessary to change the names, so that it's less misleading.

In particular, ZSTD_c_useBlockSplitter implies a full on/off control over anything block-splitter related,
but that's not what this parameter is doing: it only controls the second splitter (which used to be the only splitter), which is triggered after sequences are determined. So the parameter name should reflect that scope.

Current name in mind: ZSTD_c_sequenceSplitter, which emphasizes the fact that it's related to sequences. Importantly, it makes it clear that ZSTD_c_blockSplitterLevel and ZSTD_c_sequenceSplitter are 2 separate decisions.
However, the name could also be understood as splitting sequences,
as opposed to splitting blocks according to sequences already found.


Cyan4973 · 2024-10-31T20:46:33Z

Finally settled on ZSTD_c_splitAfterSequences.

Cyan4973 added 4 commits October 28, 2024 16:31

add internal compression parameter preBlockSplitter_level

01474bf

not yet exposed to the interface. Also: renames `useBlockSplitter` to `postBlockSplitter` to better qualify the difference between the 2 settings.

expose new parameter ZSTD_c_blockSplitter_level

226ae73

added a test

37706a6

test both that the new parameter works as intended, and that the over-split protection works as intended

fixed minor conversion warning

fcbf6b0

Cyan4973 self-assigned this Oct 28, 2024

facebook-github-bot added the CLA Signed label Oct 28, 2024

removed trace left over

f593ccd

Cyan4973 changed the title ~~Block splitter parameter~~ Block splitter control parameter Oct 29, 2024

terrelln approved these changes Oct 29, 2024


changed variable name to ZSTD_c_blockSplitterLevel

4f93206

suggested by @terrelln

Cyan4973 force-pushed the split_param branch from 4d41911 to 4f93206 Compare October 29, 2024 18:12

Cyan4973 marked this pull request as ready for review October 31, 2024 17:31

change experimental parameter name

bbaba45

from ZSTD_c_useBlockSplitter to ZSTD_c_splitAfterSequences.

Cyan4973 merged commit 15c2916 into dev Oct 31, 2024
94 checks passed
