
ALP exponent sampling improvements #919

Open
lwwmanning opened this issue Sep 24, 2024 · 0 comments

lwwmanning commented Sep 24, 2024

The original ALP paper, as well as the duckdb implementation, uses two-level sampling to find exponents, which should materially improve compression throughput for larger datasets.

Notably, in duckdb, row groups consist of 60 vectors of 2048 elements each (i.e., 122880 elements). They take a stratified sample (8 samples of 32 elements each) within those ~122K elements to find the top ~5 combinations of exponents, and then pick from those top 5 exponent pairs to compress each 2048-element vector individually.
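For illustration, both levels can be sketched in Python. Note the cost function here is a deliberately crude stand-in (round-trip exception count plus a small-exponent tie-breaker) for duckdb's actual criterion, and the decimal `(e, f)` round-trip below is only an approximation of ALP's real integer encoding, not the library's API:

```python
from collections import Counter

def stratified_sample(data, n_samples=8, sample_size=32):
    """Take n_samples equally spaced slices of sample_size elements."""
    stride = max(len(data) // n_samples, sample_size)
    return [data[i * stride : i * stride + sample_size]
            for i in range(n_samples) if i * stride + sample_size <= len(data)]

def encode_cost(values, e, f):
    """Count values that fail a decimal (e, f) round trip (i.e. would become
    exceptions), plus a tiny penalty favoring smaller exponents. This is an
    illustrative stand-in for duckdb's actual scoring, not a copy of it."""
    exceptions = 0
    for v in values:
        enc = round(v * 10**e / 10**f)
        if enc * 10**f / 10**e != v:
            exceptions += 1
    return exceptions * 1000 + e  # exceptions dominate; prefer small e

def top_k_exponents(data, k=5, max_exp=10):
    """First level: rank (e, f) pairs by total cost across a stratified sample."""
    samples = stratified_sample(data)
    totals = Counter()
    for s in samples:
        for e in range(max_exp + 1):
            for f in range(e + 1):
                totals[(e, f)] += encode_cost(s, e, f)
    return [pair for pair, _ in sorted(totals.items(), key=lambda kv: kv[1])[:k]]

def best_for_vector(vector, candidates):
    """Second level: pick the best candidate pair for one 2048-element vector."""
    return min(candidates, key=lambda ef: encode_cost(vector, *ef))
```

The point of the second level is that each vector only evaluates the ~5 shortlisted pairs instead of the full cross product of exponents and factors.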

I like the duckdb criteria for determining top N combinations: https://github.com/duckdb/duckdb/blob/v1.1.1/src/include/duckdb/storage/compression/alp/algorithm/alp.hpp#L136-L154

We could either:

  1. (Simplest) take the top N candidate exponent pairs from the first chunk of 65K elements, and only consider those as candidates for subsequent chunks of that column
  2. (More complex, potentially more robust?) do a stratified sample across all rows of the entire column (i.e., across all chunks), fit per sample, find top N exponents, then pass that set to compress per chunk.
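Option 2 might look something like the following sketch, where a single decimal exponent stands in for the full (exponent, factor) pair and scoring is a simple round-trip exception count on the pooled sample (both simplifications for illustration, not the real ALP criterion):

```python
def cross_chunk_candidates(chunks, samples_per_chunk=2, sample_size=32, top_n=5):
    """Option 2 sketch: stratified-sample every chunk of the column, score
    candidate exponents on the pooled sample, and return the top N. Each
    chunk would then be compressed using only this shared candidate set."""
    pooled = []
    for chunk in chunks:
        stride = max(len(chunk) // samples_per_chunk, 1)
        for i in range(samples_per_chunk):
            pooled.extend(chunk[i * stride : i * stride + sample_size])

    def exceptions(e):
        # count pooled values that don't survive a round trip at scale 10**e
        return sum(round(v * 10**e) / 10**e != v for v in pooled)

    # stable sort: among equally good exponents, the smallest wins
    return sorted(range(11), key=exceptions)[:top_n]
```

Relative to option 1, this spends one extra pass over the column's chunks up front, but the candidate set reflects the whole column rather than whatever happened to land in the first 65K elements.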