
ALP exponent sampling improvements #919

Open
lwwmanning opened this issue Sep 24, 2024 · 0 comments

lwwmanning commented Sep 24, 2024

The original ALP paper, as well as the duckdb implementation, uses two-level sampling to find exponents, which should materially improve compression throughput for larger datasets.

Notably, in duckdb, row groups consist of 60 vectors of 2048 elements each (i.e., 122880 elements). They take a stratified sample (8 samples of 32 elements each) within those ~122K elements to find the top ~5 combinations of exponents, and then pick from those top 5 exponent pairs to compress each 2048-element vector individually.
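For illustration, both levels can be sketched in Python. Note the cost function here is a deliberately crude stand-in (round-trip exception count plus a small-exponent tie-breaker) for duckdb's actual criterion, and the decimal `(e, f)` round-trip below is only an approximation of ALP's real integer encoding, not the library's API:

```python
from collections import Counter

def stratified_sample(data, n_samples=8, sample_size=32):
    """Take n_samples equally spaced slices of sample_size elements."""
    stride = max(len(data) // n_samples, sample_size)
    return [data[i * stride : i * stride + sample_size]
            for i in range(n_samples) if i * stride + sample_size <= len(data)]

def encode_cost(values, e, f):
    """Count values that fail a decimal (e, f) round trip (i.e. would become
    exceptions), plus a tiny penalty favoring smaller exponents. This is an
    illustrative stand-in for duckdb's actual scoring, not a copy of it."""
    exceptions = 0
    for v in values:
        enc = round(v * 10**e / 10**f)
        if enc * 10**f / 10**e != v:
            exceptions += 1
    return exceptions * 1000 + e  # exceptions dominate; prefer small e

def top_k_exponents(data, k=5, max_exp=10):
    """First level: rank (e, f) pairs by total cost across a stratified sample."""
    samples = stratified_sample(data)
    totals = Counter()
    for s in samples:
        for e in range(max_exp + 1):
            for f in range(e + 1):
                totals[(e, f)] += encode_cost(s, e, f)
    return [pair for pair, _ in sorted(totals.items(), key=lambda kv: kv[1])[:k]]

def best_for_vector(vector, candidates):
    """Second level: pick the best candidate pair for one 2048-element vector."""
    return min(candidates, key=lambda ef: encode_cost(vector, *ef))
```

The point of the second level is that each vector only evaluates the ~5 shortlisted pairs instead of the full cross product of exponents and factors.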

I like the duckdb criteria for determining top N combinations: https://github.com/duckdb/duckdb/blob/v1.1.1/src/include/duckdb/storage/compression/alp/algorithm/alp.hpp#L136-L154

We could either:

  1. (Simplest) take the top N candidate exponent pairs from the first chunk of 65K elements, and only consider those as candidates for subsequent chunks of that column
  2. (More complex, potentially more robust?) do a stratified sample across all rows of the entire column (i.e., across all chunks), fit per sample, find top N exponents, then pass that set to compress per chunk.
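Option 2 might look something like the following sketch, where a single decimal exponent stands in for the full (exponent, factor) pair and scoring is a simple round-trip exception count on the pooled sample (both simplifications for illustration, not the real ALP criterion):

```python
def cross_chunk_candidates(chunks, samples_per_chunk=2, sample_size=32, top_n=5):
    """Option 2 sketch: stratified-sample every chunk of the column, score
    candidate exponents on the pooled sample, and return the top N. Each
    chunk would then be compressed using only this shared candidate set."""
    pooled = []
    for chunk in chunks:
        stride = max(len(chunk) // samples_per_chunk, 1)
        for i in range(samples_per_chunk):
            pooled.extend(chunk[i * stride : i * stride + sample_size])

    def exceptions(e):
        # count pooled values that don't survive a round trip at scale 10**e
        return sum(round(v * 10**e) / 10**e != v for v in pooled)

    # stable sort: among equally good exponents, the smallest wins
    return sorted(range(11), key=exceptions)[:top_n]
```

Relative to option 1, this spends one extra pass over the column's chunks up front, but the candidate set reflects the whole column rather than whatever happened to land in the first 65K elements.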