feat: hilbert clustering #16296

Merged
merged 17 commits into from
Sep 2, 2024

@zhyass zhyass commented Aug 20, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Refer to https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=bfd6d94c98627756989b0147a68b7ab1f881a0d

Traditional linear clustering sorts data by the cluster key in linear order. Hilbert clustering instead encodes the cluster key with the Hilbert curve (a space-filling curve that preserves locality) and sorts the data by the encoded values.

Hilbert clustering optimizes the data layout for queries whose predicates are on columns other than the leading cluster key column, so that pruning stays effective for every column in the key.
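To make the encoding step concrete, here is a minimal two-dimensional sketch of mapping a key to its position on the Hilbert curve and sorting by that position. It uses the textbook bit-manipulation algorithm on a small grid; the function name `hilbert_index` and the sample rows are illustrative assumptions, and this is not the encoder added by this PR.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2**order x 2**order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Hypothetical (id1, id2) rows on a 4 x 4 grid (order 2).
rows = [(3, 0), (0, 0), (1, 1), (0, 3), (2, 2), (1, 0)]

# Hilbert clustering in miniature: sort rows by the encoded key.
by_hilbert = sorted(rows, key=lambda r: hilbert_index(2, r[0], r[1]))
print(by_hilbert)
```

Rows that end up next to each other in `by_hilbert` are close in both `id1` and `id2`, which is the locality property the description relies on.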

Syntax

CREATE TABLE table_name (
    column_definitions
)
CLUSTER BY [linear | hilbert] (exprs);

ALTER TABLE table_name CLUSTER BY [linear | hilbert] (exprs);

Performance

Each test table contains 100,000,000 rows.

create table test_hilbert(
    id1 bigint not null,
    id2 bigint not null,
) cluster by hilbert(id1,id2);

create table test_linear(
    id1 bigint not null,
    id2 bigint not null,
) cluster by (id1,id2);

explain select * from test_hilbert where id1=94495891752316098;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 252 to 20, bloom pruning: 20 to 1>]

explain select * from test_hilbert where id2=2575674957183765083;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 252 to 20, bloom pruning: 20 to 1>]

explain select * from test_linear where id1=94495891752316098;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 251 to 1, bloom pruning: 1 to 1>]

explain select * from test_linear where id2=2575674957183765083;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 251 to 251, bloom pruning: 251 to 5>]
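The shape of these results (linear order prunes very well on `id1` and not at all on `id2`, while Hilbert order prunes moderately on both) can be reproduced in a toy model. The sketch below assumes a 64 x 64 grid of distinct keys, blocks of 64 rows, and range pruning that keeps a block when the predicate value falls inside the block's min/max. The helper names are hypothetical, and none of this is Databend's pruning code.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2**order x 2**order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def kept_blocks(rows, block_size, column, value):
    """Count blocks whose [min, max] on `column` contains `value`."""
    kept = 0
    for start in range(0, len(rows), block_size):
        vals = [r[column] for r in rows[start:start + block_size]]
        if min(vals) <= value <= max(vals):
            kept += 1
    return kept


rows = [(id1, id2) for id1 in range(64) for id2 in range(64)]  # 4096 rows
linear = sorted(rows)  # order by (id1, id2)
hilbert = sorted(rows, key=lambda r: hilbert_index(6, r[0], r[1]))

for name, layout in (("linear", linear), ("hilbert", hilbert)):
    on_id1 = kept_blocks(layout, 64, 0, 17)
    on_id2 = kept_blocks(layout, 64, 1, 17)
    print(f"{name}: id1=17 keeps {on_id1} of 64 blocks, id2=17 keeps {on_id2} of 64 blocks")
```

In this model each Hilbert block covers an 8 x 8 square of the key space, so an equality predicate on either column keeps the same small number of blocks, whereas each linear block covers one `id1` value and the full `id2` range.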

Important Consideration

Hilbert clustering is sensitive to how the data is distributed in each dimension. If one dimension is unevenly distributed, or if the dimensions differ significantly in their distribution characteristics, clustering quality can degrade and queries may need to scan more blocks. When choosing Hilbert clustering, check that the selected keys adequately represent the data's distribution characteristics.

create table test_hilbert(
    id1 int16 not null,
    id2 bigint not null,
) cluster by hilbert(id1,id2);

create table test_linear(
    id1 int16 not null,
    id2 bigint not null,
) cluster by (id1,id2);

explain select * from test_hilbert where id1=8113;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 208 to 104, bloom pruning: 104 to 104>]

explain select * from test_hilbert where id2=1416852658076925472;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 208 to 4, bloom pruning: 4 to 1>]

explain select * from test_linear where id1=8113;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 208 to 1, bloom pruning: 1 to 1>]

explain select * from test_linear where id2=1416852658076925472;
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 208 to 208, bloom pruning: 208 to 1>]
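One way such an imbalance can arise is a mismatch in value ranges between dimensions. The following toy model is an assumption for illustration, not a reproduction of the measurement above: column `a` takes values 0..63, while column `b` takes 64 distinct values spread over a range 64 times wider. Encoding the raw values on a shared grid makes the Hilbert order degenerate into an order dominated by the wide column.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2**order x 2**order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def kept_blocks(rows, block_size, column, value):
    """Count blocks whose [min, max] on `column` contains `value`."""
    kept = 0
    for start in range(0, len(rows), block_size):
        vals = [r[column] for r in rows[start:start + block_size]]
        if min(vals) <= value <= max(vals):
            kept += 1
    return kept


# 4096 rows; `a` is narrow (0..63), `b` is stretched to 0..4032 in steps of 64.
rows = [(a, b * 64) for a in range(64) for b in range(64)]

# Encode raw values on a 4096 x 4096 grid (order 12), large enough for `b`.
raw = sorted(rows, key=lambda r: hilbert_index(12, r[0], r[1]))

narrow = kept_blocks(raw, 64, 0, 17)
wide = kept_blocks(raw, 64, 1, 17 * 64)
print(f"a=17 keeps {narrow} of 64 blocks, b={17 * 64} keeps {wide} of 64 blocks")
```

Here every block ends up holding a single `b` value and the full range of `a`, so predicates on `a` cannot be pruned by min/max at all.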

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@zhyass zhyass marked this pull request as draft August 20, 2024 16:03
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Aug 20, 2024
@zhyass zhyass force-pushed the feature_cluster branch 2 times, most recently from c7961f9 to 65939f5 Compare August 23, 2024 16:05
@zhyass zhyass added the ci-cloud Build docker image for cloud test label Aug 26, 2024

Docker Image for PR

  • tag: pr-16296-3ea63dd-1724664794

note: this image tag is only available for internal use,
please check the internal doc for more details.

@zhyass zhyass marked this pull request as ready for review August 29, 2024 11:28

dantengsky commented Sep 1, 2024

Looks good as a baseline implementation of Hilbert clustering for internal evaluation.

This feature is not yet mature enough to be used by users (including private beta).

Suggested next steps:

  • Map the values of the columns involved in clustering to an appropriate range based on their distribution.

    For example, for each column in the given blocks, partition the rows into (up to) 2^n partitions so that the number of rows in each partition is as evenly distributed as possible. Then, during curve filling, use the partition IDs as the "dimension" for mapping data onto the Hilbert curve.

    We could start by leveraging block-level min/max and row count statistics to infer partition boundaries, which seems like a good starting point. Alternatively, if we need more accurate boundaries, we could sample the rows from the blocks later.

    This should allow the data to be more evenly and efficiently mapped onto the Hilbert curve, in my opinion.

  • The strategy of "iteratively" re-clustering used in linear clustering may no longer fit with Hilbert clustering.
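The first suggestion above, mapping each column to partition IDs with roughly equal row counts before curve filling, can be sketched as follows. As a simplification, this sketch sorts all values of a column to pick split points, whereas the comment proposes inferring them from block-level min/max and row-count statistics or from samples. The function names and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def equi_depth_bounds(values, bits):
    """Pick up to 2**bits - 1 split points so partitions hold similar row counts.

    Assumes `values` is non-empty.
    """
    ordered = sorted(values)
    parts = 1 << bits
    bounds = []
    for i in range(1, parts):
        b = ordered[i * len(ordered) // parts]
        if not bounds or b > bounds[-1]:  # skip duplicate split points
            bounds.append(b)
    return bounds


def partition_id(bounds, value):
    """Partition ID of `value`: the number of split points that are <= value."""
    return bisect_right(bounds, value)


# Skewed column: squares of 0..63, so most values crowd the low end of the range.
values = [v * v for v in range(64)]
bounds = equi_depth_bounds(values, bits=3)

equi_depth = Counter(partition_id(bounds, v) for v in values)
# Naive alternative for comparison: scale the raw value linearly into 8 buckets.
linear_scaled = Counter(v * 8 // (max(values) + 1) for v in values)

print("equi-depth rows per partition:", sorted(equi_depth.items()))
print("linear-scaled rows per bucket:", sorted(linear_scaled.items()))
```

The partition IDs, rather than the raw values, would then serve as the per-column coordinate passed to the Hilbert encoding, so each dimension contributes a comparable number of evenly populated steps.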

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 1, 2024

github-actions bot commented Sep 1, 2024

Docker Image for PR

  • tag: pr-16296-eb0b013-1725208290

note: this image tag is only available for internal use,
please check the internal doc for more details.

@dantengsky dantengsky added this pull request to the merge queue Sep 2, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Sep 2, 2024
@BohuTANG BohuTANG added this pull request to the merge queue Sep 2, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Sep 2, 2024
@BohuTANG BohuTANG merged commit 6766899 into databendlabs:main Sep 2, 2024
82 checks passed