Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block splitter control parameter #4180

Merged
merged 7 commits into from
Oct 31, 2024
Merged

Block splitter control parameter #4180

merged 7 commits into from
Oct 31, 2024

Conversation

Cyan4973
Copy link
Contributor

@Cyan4973 Cyan4973 commented Oct 28, 2024

Make it possible to explicit select a block splitter level, via a new CCtx parameter ZSTD_c_blockSplitter_level.

This capability is then exploited in a test, ensuring that incompressible data is not overly split, even in presence of an adversarial input (with full knowledge of the sampling pattern).

Note: possible follow-up:
This PR just adds a new parameter, to control the behavior of the new block splitter.
It doesn't modify the existing parameters.

But as a consequence, there are now 2 parameters for block splitters,
one (legacy) that is controlling the post block-splitter (after sequences are determined)
and a new one that is controlling the new pre block-splitter (before sequences are produced).
Already, it's debatable if it's useful for a user to be exposed to these concepts.
More importantly, the distinction between pre and post block splitter is not clear from the current parameters' names (ZSTD_c_useBlockSplitter vs ZSTD_c_blockSplitter_level).

So it opens the question of a refactoring of these parameters.
For example, maybe both parameters could be fused into a single one, the new ZSTD_c_blockSplitter_level, that would be charged to enable both when level is high enough.
Or maybe there is still value in keeping both these parameters separated, for example for an optimizer tool which could more naturally influence both code paths and maybe find a better combination for some specific use case. In which case, it's probably still useful to debate about meaningful parameter names.

not yet exposed to the interface.

Also: renames `useBlockSplitter` to `postBlockSplitter`
to better qualify the difference between the 2 settings.
test both that the new parameter works as intended,
and that the over-split protection works as intended
@Cyan4973 Cyan4973 changed the title Block splitter parameter Block splitter control parameter Oct 29, 2024
lib/zstd.h Outdated
* to ensure expansion guarantees in presence of incompressible data.
*/
#define ZSTD_BLOCKSPLITTER_LEVEL_MAX 6
#define ZSTD_c_blockSplitter_level ZSTD_c_experimentalParam20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Lets call this ZSTD_c_blockSplitterLevel to match the naming convention of the rest of the parameters.

@Cyan4973
Copy link
Contributor Author

I think I'll keep both parameters,
because both splitting methods (before and after sequences) may evolve independently,
and the way they combine or compete could change over time.

But it's necessary to change the names, so that it's less misleading.

In particular, ZSTD_c_useBlockSplitter implies a full on/off control over anything block-splitter related,
but that's not what this parameter is doing: it only controls the second splitter (which used to be the only splitter), which is triggered after sequences determination. So the parameter name should reflect that scope.

Current name in mind: ZSTD_c_sequenceSplitter, which emphasizes the fact that it's related to sequences. Importantly, it makes it clear that ZSTD_c_blockSplitterLevel and ZSTD_c_sequenceSplitter are 2 separate decisions.
However, the name could also be understood as splitting sequences,
as opposed to splitting blocks according to sequences already found.

@Cyan4973 Cyan4973 marked this pull request as ready for review October 31, 2024 17:31
from ZSTD_c_useBlockSplitter to ZSTD_c_splitAfterSequences.
@Cyan4973
Copy link
Contributor Author

Finally settled on ZSTD_c_splitAfterSequences.

@Cyan4973 Cyan4973 merged commit 15c2916 into dev Oct 31, 2024
94 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants