feat: update sort algorithm using loser tree for multi sort merge #15869

Merged
BohuTANG merged 30 commits into databendlabs:main from forsaken628:loser-tree on Jul 4, 2024

Conversation

forsaken628
Collaborator

@forsaken628 forsaken628 commented Jun 23, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Fixes #11604

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

Signed-off-by: coldWater <forsaken628@gmail.com>
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jun 23, 2024
@sundy-li
Member

Is there any performance comparison data for this PR?

@forsaken628
Collaborator Author

Is there any performance comparison data for this PR?

Not yet. Do I need to add an algorithm-level benchmark, or is ci-benchmark enough?

@sundy-li
Member

You can perf it locally, e.g.:

create table t(a int, b string, d float, e date);
create table t_random like t engine = random;

## generate data
insert into t select * from t_random limit 5000000;
insert into t select * from t_random limit 5000000;
insert into t select * from t_random limit 5000000;
insert into t select * from t_random limit 5000000;
...

## sort perf 

select * from t order by a,b ignore_result;
...

@forsaken628
Collaborator Author

No noticeable change in performance. The flamegraph suggests a slight improvement, but it is not obvious, and it could also be due to the distribution of the data.

flamegraph

data set

read rows: 25000000
read size: 202.54 MiB
partitions total: 24
partitions scanned: 24
pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 24 to 24>]
push downs: [filters: [], limit: NONE]
estimated rows: 25000000.00

sql

sort

select * from t order by a,b ignore_result;
pprof.cpu.heap.sort.pb.gz
pprof.cpu.loser.sort.pb.gz

window

SELECT MAX(d) OVER (PARTITION BY a) FROM t ignore_result;
pprof.cpu.heap.window.pb.gz
pprof.cpu.loser.window.pb.gz

@sundy-li
Member

The code LGTM. Some comments about this PR:

It would be better to add a new setting named enable_loser_tree_merge_sort, defaulting to 1.

Then we can create the merger in MultiSortMergeProcessor according to that setting at runtime. If a bug shows up, we can switch to the other implementation.
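
To illustrate the suggestion, here is a minimal sketch of choosing between a heap-based and a loser-tree-based k-way merge from a flag. It is not Databend's code: only the setting name enable_loser_tree_merge_sort comes from the comment above, while `merge_runs`, `heap_merge`, `loser_tree_merge`, and the use of plain `Vec<i32>` runs are assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Current head of run `i`; `None` means the run is exhausted.
fn head(runs: &[Vec<i32>], pos: &[usize], i: usize) -> Option<i32> {
    runs[i].get(pos[i]).copied()
}

/// True if run `a` wins against run `b`: smaller head wins, ties go to the
/// lower run index, and an exhausted run never wins.
fn beats(runs: &[Vec<i32>], pos: &[usize], a: usize, b: usize) -> bool {
    match (head(runs, pos, a), head(runs, pos, b)) {
        (Some(x), Some(y)) => (x, a) < (y, b),
        (Some(_), None) => true,
        (None, _) => false,
    }
}

/// K-way merge with a loser tree (hypothetical helper, not Databend's API).
fn loser_tree_merge(runs: &[Vec<i32>]) -> Vec<i32> {
    let k = runs.len();
    let mut out = Vec::new();
    if k == 0 {
        return out;
    }
    let mut pos = vec![0usize; k]; // next unread index per run
    // Leaves sit at k..2k. tree[n] (1 <= n < k) stores the loser of the match
    // played at node n; tree[0] stores the overall winner.
    let mut winners = vec![0usize; 2 * k];
    let mut tree = vec![0usize; k];
    for i in 0..k {
        winners[k + i] = i;
    }
    for n in (1..k).rev() {
        let (l, r) = (winners[2 * n], winners[2 * n + 1]);
        if beats(runs, &pos, l, r) {
            winners[n] = l;
            tree[n] = r;
        } else {
            winners[n] = r;
            tree[n] = l;
        }
    }
    tree[0] = winners[1];
    while let Some(v) = head(runs, &pos, tree[0]) {
        out.push(v);
        let mut w = tree[0];
        pos[w] += 1;
        // Replay only the path from the advanced leaf up to the root.
        let mut node = (k + w) / 2;
        while node >= 1 {
            if beats(runs, &pos, tree[node], w) {
                std::mem::swap(&mut tree[node], &mut w);
            }
            node /= 2;
        }
        tree[0] = w;
    }
    out
}

/// K-way merge with a binary min-heap of (value, run, index) entries.
fn heap_merge(runs: &[Vec<i32>]) -> Vec<i32> {
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i, j))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[i].get(j + 1) {
            heap.push(Reverse((next, i, j + 1)));
        }
    }
    out
}

/// Pick the implementation at runtime; the flag mirrors the proposed setting.
fn merge_runs(runs: &[Vec<i32>], enable_loser_tree_merge_sort: bool) -> Vec<i32> {
    if enable_loser_tree_merge_sort {
        loser_tree_merge(runs)
    } else {
        heap_merge(runs)
    }
}
```

Both paths produce the same output, so flipping the flag only changes which structure does the comparisons. After each output row the loser tree replays a single leaf-to-root path, one comparison per level.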

@forsaken628 forsaken628 requested a review from sundy-li June 28, 2024 03:16
@forsaken628
Collaborator Author

Here is another set of changes to the Merger:

forsaken628/databend@loser-tree...forsaken628:databend:peek-mut

The cursor contains an Arc, and repeatedly cloning and dropping it incurs a lot of overhead, so I'm replacing cloning with in-place updates.

Should these two sets of changes go in separate PRs, or be combined?
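
As a rough illustration of the in-place idea, the sketch below advances the top cursor of a `std::collections::BinaryHeap` through `peek_mut` instead of popping it and pushing a clone. The `Cursor` type, its fields, and `merge_in_place` are hypothetical stand-ins, not the types used in the PR; the actual change may differ.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::binary_heap::PeekMut;
use std::collections::BinaryHeap;
use std::sync::Arc;

/// Hypothetical stand-in for a sort cursor: shared rows plus a position.
struct Cursor {
    rows: Arc<Vec<i32>>,
    pos: usize,
}

impl Cursor {
    fn current(&self) -> i32 {
        self.rows[self.pos]
    }
}

// Cursors are ordered by the row they currently point at.
impl Ord for Cursor {
    fn cmp(&self, other: &Self) -> Ordering {
        self.current().cmp(&other.current())
    }
}
impl PartialOrd for Cursor {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Cursor {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Cursor {}

/// Merge sorted inputs without cloning any cursor (and so without touching
/// the Arc reference count) inside the loop.
fn merge_in_place(inputs: Vec<Arc<Vec<i32>>>) -> Vec<i32> {
    let mut heap: BinaryHeap<Reverse<Cursor>> = inputs
        .into_iter()
        .filter(|rows| !rows.is_empty())
        .map(|rows| Reverse(Cursor { rows, pos: 0 }))
        .collect();
    let mut out = Vec::new();
    while let Some(mut top) = heap.peek_mut() {
        out.push(top.0.current());
        if top.0.pos + 1 < top.0.rows.len() {
            // Advance in place; the heap is re-sifted when `top` is dropped.
            top.0.pos += 1;
        } else {
            // Cursor exhausted: remove it from the heap.
            PeekMut::pop(top);
        }
    }
    out
}
```

The only Arc drop happens when an exhausted cursor is popped, once per input rather than once per row.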

@sundy-li
Member

sundy-li commented Jul 1, 2024

Should these two sets of changes go in separate PRs, or be combined?

You can add it to this PR directly.

@sundy-li sundy-li requested a review from zhang2014 July 2, 2024 08:43
@BohuTANG BohuTANG merged commit c49a27a into databendlabs:main Jul 4, 2024
72 of 73 checks passed
@forsaken628 forsaken628 deleted the loser-tree branch July 4, 2024 13:08
Development

Successfully merging this pull request may close these issues.

Feature: Update sort algorithm using Loser Tree
4 participants