feat: estimate selectivity by table sample #16362

Merged: 8 commits, Sep 5, 2024

Conversation

xudong963
Member

@xudong963 xudong963 commented Sep 1, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Support estimating filter selectivity from a table sample (0.2%).

Dynamic sampling may increase query runtime, so to avoid a performance regression this PR adds a time budget that limits how long sampling may take. See the setting dynamic_sample_time_budget_ms (default: 0 ms).
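As a rough illustration of the idea (not the code in this PR), the Python sketch below estimates a predicate's selectivity from a small random sample and gives up once a time budget is exhausted. The function name, the fallback selectivity of 0.2, and the choice to treat a budget of 0 as "skip sampling" are assumptions made for this example. Only the 0.2% sample ratio and the millisecond time budget come from the description above.

```python
import random
import time


def estimate_selectivity(rows, predicate, sample_ratio=0.002,
                         time_budget_ms=0, fallback=0.2, seed=None):
    """Estimate the fraction of `rows` that satisfy `predicate` by sampling.

    Hypothetical helper for illustration only:
    - sample_ratio=0.002 mirrors the 0.2% sample mentioned in the PR.
    - time_budget_ms <= 0 is ASSUMED to mean "do not sample".
    - fallback is an ASSUMED default selectivity, used when sampling is
      skipped, runs out of time, or selects no rows.
    """
    if time_budget_ms <= 0:
        return fallback

    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_ms / 1000.0
    sampled = 0
    matched = 0

    for row in rows:
        # Stop as soon as the budget is spent and use the fallback,
        # so a slow sample never delays planning beyond the budget.
        if time.monotonic() > deadline:
            return fallback
        # Bernoulli sampling: keep each row with probability sample_ratio.
        if rng.random() < sample_ratio:
            sampled += 1
            if predicate(row):
                matched += 1

    if sampled == 0:
        return fallback
    return matched / sampled
```

With a budget of 0 the function returns the fallback immediately, so in this sketch sampling is strictly opt-in. Whether the real setting behaves the same way at its default of 0 ms is not stated in this PR description.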

For example, take this query from the Join Order Benchmark.

With sampling:

EXPLAIN ANALYZE PARTIAL
SELECT
  *
FROM
  movie_companies AS mc,
  movie_info_idx AS mi_idx
WHERE
  mc.movie_id = mi_idx.movie_id
  AND mc.note NOT LIKE '%(as Metro-Goldwyn-Mayer Pictures)%'
  AND (
    mc.note LIKE '%(co-production)%'
    OR mc.note LIKE '%(presents)%'
  )

-[ EXPLAIN ]-----------------------------------
HashJoin
├── estimated rows: 84512.79
├── output rows: 62.66 thousand
├── Filter
│   ├── filters: [is_true(NOT like(mc.note (#4), '%(as Metro-Goldwyn-Mayer Pictures)%')), is_true(like(mc.note (#4), '%(co-production)%') OR like(mc.note (#4), '%(presents)%'))]
│   ├── estimated rows: 28495.95
│   ├── output rows: 28.89 thousand
│   └── TableScan
│       ├── table: default.imdb.movie_companies
│       ├── estimated rows: 2609129.00
│       └── output rows: 2.61 million
└── TableScan
    ├── table: default.imdb.movie_info_idx
    ├── estimated rows: 1380035.00
    └── output rows: 1.38 million

15 rows explain in 1.831 sec. Processed 0 rows, 0 B (0 rows/s, 0 B/s)

For the Filter node, the estimated rows are close to the output rows (the actual row count).

Without sampling:

-[ EXPLAIN ]-----------------------------------
HashJoin
├── estimated rows: 15584.87
├── output rows: 62.66 thousand
├── Filter
│   ├── filters: [is_true(NOT like(mc.note (#4), '%(as Metro-Goldwyn-Mayer Pictures)%')), is_true(like(mc.note (#4), '%(co-production)%') OR like(mc.note (#4), '%(presents)%'))]
│   ├── estimated rows: 5254.89
│   ├── output rows: 28.89 thousand
│   └── TableScan
│       ├── table: default.imdb.movie_companies
│       ├── estimated rows: 2609129.00
│       └── output rows: 2.61 million
└── TableScan
    ├── table: default.imdb.movie_info_idx
    ├── estimated rows: 1380035.00
    └── output rows: 1.38 million
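
To put a number on the difference between the two plans above, one common measure is the q-error, max(estimated / actual, actual / estimated), where 1.0 is a perfect estimate. The sketch below applies it to the Filter and HashJoin nodes. The helper names are made up for this example, and the actual row counts are the rounded values printed by EXPLAIN ANALYZE (28.89 thousand and 62.66 thousand), so the results are approximate.

```python
def q_error(estimated, actual):
    """Return max(est/actual, actual/est); 1.0 means a perfect estimate."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("row counts must be positive")
    return max(estimated / actual, actual / estimated)


# (estimated rows, output rows) copied from the two plans above.
# Output rows are the rounded figures shown in the EXPLAIN ANALYZE output.
PLANS = {
    "with sampling": {
        "Filter": (28495.95, 28_890),
        "HashJoin": (84512.79, 62_660),
    },
    "without sampling": {
        "Filter": (5254.89, 28_890),
        "HashJoin": (15584.87, 62_660),
    },
}


def summarize(plans):
    """Compute the q-error for every operator in every plan."""
    return {
        label: {op: q_error(est, act) for op, (est, act) in ops.items()}
        for label, ops in plans.items()
    }


if __name__ == "__main__":
    for label, ops in summarize(PLANS).items():
        for op, err in ops.items():
            print(f"{label:>16} | {op:<8} | q-error {err:.2f}")
```

By this measure the Filter estimate goes from roughly 5.5x off without sampling to about 1.01x with sampling, and the HashJoin estimate from roughly 4.0x off to about 1.35x.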

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Sep 1, 2024
@xudong963 xudong963 marked this pull request as draft September 1, 2024 15:17
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Sep 1, 2024
@dosubot dosubot bot added the A-planner Area: planner/optimizer label Sep 1, 2024
@xudong963 xudong963 marked this pull request as ready for review September 4, 2024 09:23
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Sep 4, 2024
@dosubot dosubot bot added the C-feature Category: feature label Sep 4, 2024
@xudong963 xudong963 force-pushed the selectivity_sample branch 2 times, most recently from ad4112a to aec0de4 Compare September 4, 2024 10:08
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 5, 2024
@Dousir9 Dousir9 added this pull request to the merge queue Sep 5, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Sep 5, 2024
@BohuTANG BohuTANG merged commit fb9fc9a into databendlabs:main Sep 5, 2024
78 checks passed
@xudong963 xudong963 deleted the selectivity_sample branch September 6, 2024 02:16
Labels

  • A-planner (Area: planner/optimizer)
  • C-feature (Category: feature)
  • lgtm (This PR has been approved by a maintainer)
  • pr-feature (this PR introduces a new feature to the codebase)
  • size:XXL (This PR changes 1000+ lines, ignoring generated files.)

3 participants