The original ALP paper, as well as the DuckDB implementation, uses two-level sampling to find exponents; doing the same here should materially improve compression throughput for larger datasets.

Notably, DuckDB uses row groups of 60 vectors of 2048 elements each (i.e., 122,880 elements). It takes a stratified sample (8 samples of 32 elements) within those ~122K elements to find the top ~5 exponent combinations, and then each 2048-element vector is compressed individually using whichever of those top 5 exponent pairs fits it best.
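For concreteness, here is a rough Rust sketch of that two-level scheme. Everything below is illustrative: the names (`Candidate`, `find_top_candidates`, `best_for_vector`), the sampling layout, and the scoring shortcut (counting exact round-trips instead of a real size-based cost model) are assumptions for this sketch, not the actual DuckDB or ALP code.

```rust
/// A candidate (exponent, factor) pair, ALP-style:
/// encoded = round(v * 10^e / 10^f); decoded = encoded * 10^f / 10^e.
#[derive(Clone, Copy)]
struct Candidate {
    exponent: u8,
    factor: u8,
}

/// True if `v` survives the ALP-style round trip exactly for this candidate.
fn round_trips(v: f64, c: Candidate) -> bool {
    let e = 10f64.powi(c.exponent as i32);
    let f = 10f64.powi(c.factor as i32);
    let enc = (v * e / f).round();
    enc.abs() < (1u64 << 51) as f64 && enc * f / e == v
}

/// First level: stratified sample (e.g. 8 slices of 32 values) over a row group,
/// score every candidate pair on the sampled values, keep the best `k`.
fn find_top_candidates(row_group: &[f64], slices: usize, slice_len: usize, k: usize) -> Vec<Candidate> {
    let stride = (row_group.len() / slices).max(1);
    let mut scored: Vec<(usize, Candidate)> = Vec::new();
    for exp in 0..=18u8 {
        for fac in 0..=exp {
            let cand = Candidate { exponent: exp, factor: fac };
            // Score = sampled values that round-trip exactly (a stand-in for a
            // real size-based cost model).
            let mut score = 0;
            for s in 0..slices {
                let start = (s * stride).min(row_group.len());
                score += row_group[start..]
                    .iter()
                    .take(slice_len)
                    .filter(|&&v| round_trips(v, cand))
                    .count();
            }
            scored.push((score, cand));
        }
    }
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.truncate(k);
    scored.into_iter().map(|(_, c)| c).collect()
}

/// Second level: each 2048-element vector just picks the best of the pre-selected candidates.
fn best_for_vector(vector: &[f64], candidates: &[Candidate]) -> Candidate {
    *candidates
        .iter()
        .max_by_key(|&&c| vector.iter().filter(|&&v| round_trips(v, c)).count())
        .expect("candidate list must be non-empty")
}
```

The key point is that the expensive search over all (exponent, factor) pairs happens once per row group on a small sample, and the per-vector work is reduced to choosing among ~5 candidates.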
We could either:

(Simplest) take the top N candidate exponent pairs from the first chunk of 65K elements, and only consider those as candidates for subsequent chunks of that column (see the sketch below).
(More complex, potentially more robust?) do a stratified sample across all rows of the entire column (i.e., across all chunks), fit each sample, find the top N exponents, and then pass that set down to compress each chunk.
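A minimal sketch of the simplest option, reusing the hypothetical helpers from the sketch above; the chunked-column shape and the function name are invented for illustration and are not this repo's actual API.

```rust
/// Option (1) sketch: run the sampled candidate search only on the first chunk,
/// cache the top-N candidates, and restrict every later chunk to that cached set.
fn choose_exponents_per_chunk(chunks: &[Vec<f64>], top_n: usize) -> Vec<Candidate> {
    let mut per_chunk = Vec::with_capacity(chunks.len());
    let Some(first) = chunks.first() else {
        return per_chunk;
    };
    // Full (stratified) candidate search happens once, on the first ~65K-element chunk.
    let candidates = find_top_candidates(first, 8, 32, top_n);
    for chunk in chunks {
        // Later chunks only pick among the cached candidates instead of re-searching.
        per_chunk.push(best_for_vector(chunk, &candidates));
    }
    per_chunk
}
```

The trade-off between the two options is where the sample comes from: option (1) trusts the first chunk to be representative of the whole column, while option (2) pays for a column-wide stratified sample up front to guard against skew between chunks.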
I like the DuckDB criteria for determining the top N combinations: https://github.com/duckdb/duckdb/blob/v1.1.1/src/include/duckdb/storage/compression/alp/algorithm/alp.hpp#L136-L154
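Loosely, that kind of criterion could look like the following; this is a paraphrase of the idea (rank combinations by how often they win a sample, tie-breaking on estimated size), not a transcription of the linked code.

```rust
/// One plausible "top N combinations" criterion: each stratified sample votes for
/// its single best (exponent, factor) pair; combinations are ranked by vote count,
/// with ties broken by the smaller estimated encoded size.
fn rank_combinations(mut tally: Vec<(Candidate, u32, u64)>, n: usize) -> Vec<Candidate> {
    // tally entries are (candidate, votes across samples, estimated size in bytes).
    tally.sort_by(|a, b| b.1.cmp(&a.1).then(a.2.cmp(&b.2)));
    tally.into_iter().take(n).map(|(c, _, _)| c).collect()
}
```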