
ISSUE-3121: simple chunk strategy in insertion #3122

Merged
merged 18 commits into main on Nov 30, 2021

Conversation

dantengsky
Member

@dantengsky dantengsky commented Nov 26, 2021

I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

Summary

The simple strategy goes like this:

For each query node that processes the insertion, while consuming the SendableDataBlockStream:

  • for each DEFAULT_CHUNK_BLOCK_NUM blocks, say B, the following reshape is applied:
    successive data blocks in B are accumulated until the sum of their memory_size exceeds BLOCK_SIZE_THRESHOLD, at which point they are merged into one larger block (see the sketch after the NOTE list below).

NOTE:

  • currently, DEFAULT_CHUNK_BLOCK_NUM and BLOCK_SIZE_THRESHOLD are fetched from the table options first;
    if they are not set there, default values are applied
  • the max size of a merged block will be 2 * block_size_threshold (in memory size)
  • after compression, the on-storage size of a block varies
  • blocks that are already larger than block_size_threshold will NOT be split;
    it might be better to apply a finer-grained strategy in table compact/optimization
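
For illustration, here is a minimal, self-contained sketch of the reshape described above. The `Block` type, the `concat` helper, and the constant values are stand-ins (the real implementation works on `DataBlock` / `DataBlock::concat_blocks` and reads the thresholds from table options), so signatures and defaults may differ.

```rust
// Minimal sketch of the reshape: stand-in types and constants only.
// `Block` plays the role of DataBlock, `concat` the role of
// DataBlock::concat_blocks; the constants mirror the names in the PR
// description, not the actual defaults.

const DEFAULT_CHUNK_BLOCK_NUM: usize = 128; // assumed value, for illustration
const BLOCK_SIZE_THRESHOLD: usize = 100 * 1024 * 1024; // assumed value, in bytes

#[derive(Clone)]
struct Block {
    bytes: Vec<u8>, // placeholder for real column data
}

impl Block {
    fn memory_size(&self) -> usize {
        self.bytes.len()
    }
}

// Stand-in for DataBlock::concat_blocks: merge several blocks into one.
fn concat(blocks: &[Block]) -> Block {
    let mut bytes = Vec::new();
    for b in blocks {
        bytes.extend_from_slice(&b.bytes);
    }
    Block { bytes }
}

// Reshape one chunk: successive blocks are accumulated until their total
// memory_size reaches BLOCK_SIZE_THRESHOLD, then merged into one larger
// block. Since an oversized block is never split, a merged block can grow
// up to roughly 2 * BLOCK_SIZE_THRESHOLD.
fn reshape_chunk(chunk: Vec<Block>) -> Vec<Block> {
    let mut merged = Vec::new();
    let mut pending: Vec<Block> = Vec::new();
    let mut pending_size = 0usize;

    for block in chunk {
        pending_size += block.memory_size();
        pending.push(block);
        if pending_size >= BLOCK_SIZE_THRESHOLD {
            merged.push(concat(&pending));
            pending.clear();
            pending_size = 0;
        }
    }
    // The tail that never reached the threshold stays as one (smaller) block.
    if !pending.is_empty() {
        merged.push(concat(&pending));
    }
    merged
}

// Consume a sequence of blocks (standing in for the SendableDataBlockStream)
// and reshape every DEFAULT_CHUNK_BLOCK_NUM of them.
fn reshape_stream(blocks: impl IntoIterator<Item = Block>) -> Vec<Block> {
    let mut out = Vec::new();
    let mut chunk = Vec::with_capacity(DEFAULT_CHUNK_BLOCK_NUM);
    for b in blocks {
        chunk.push(b);
        if chunk.len() == DEFAULT_CHUNK_BLOCK_NUM {
            out.extend(reshape_chunk(std::mem::take(&mut chunk)));
        }
    }
    if !chunk.is_empty() {
        out.extend(reshape_chunk(chunk));
    }
    out
}
```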

After this simple strategy is applied, read performance may improve somewhat:

A very non-scientific, local debug test:

SELECT
    count(1),
    avg(Year),
    sum(DayOfWeek)
FROM ontime

┌─count(1)─┬─avg(Year)─┬─sum(DayOfWeek)─┐
│  2970856 │      2020 │       11657448 │
└──────────┴───────────┴────────────────┘

1 rows in set. Elapsed: 1.313 sec.
  • this PR
select count(1) ,avg(Year), sum(DayOfWeek)  from ontime;
+----------+-----------+----------------+
| count(1) | avg(Year) | sum(DayOfWeek) |
+----------+-----------+----------------+
|  2970856 |      2020 |       11657448 |
+----------+-----------+----------------+
1 row in set (0.37 sec)
Read 0 rows, 0 B in 0.287 sec., 0 rows/sec., 0 B/sec.


select count(1) ,avg(Year), sum(DayOfWeek)  from ontime;
+----------+-----------+----------------+
| count(1) | avg(Year) | sum(DayOfWeek) |
+----------+-----------+----------------+
| 11883424 |      2020 |       46629792 |
+----------+-----------+----------------+
1 row in set (1.10 sec)
Read 0 rows, 0 B in 0.894 sec., 0 rows/sec., 0 B/sec.

The data in ontime was generated by a v1 streaming_load first, and then by running insert into ontime select * from ontime several times.

Changelog

  • Improvement
  • Not for changelog (changelog entry is not required)

Related Issues

Fixes #3121

Test Plan

Unit Tests

Stateless Tests

@databend-bot
Member

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@sundy-li
Member

Why not make the merge strategy by num of rows?

@dantengsky
Member Author

dantengsky commented Nov 26, 2021

Why not make the merge strategy by num of rows?

because DataBlock::concat_blocks is too handy :-)

just kidding — merging by number of rows or by block size is more sensible; it will be addressed in upcoming PR(s)

DataBlock::slice_block_by_size is handy as well! let me implement the by-number-of-rows variant too

EDIT:

A simple merge-by-block-size strategy has been added. Please let me reconsider where/when to put the merge-by-number-of-rows strategy.
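
As a rough illustration of the merge-by-number-of-rows idea discussed here, the sketch below accumulates blocks until a row threshold is hit; `ROW_THRESHOLD` and the `Block`/`concat` stand-ins are assumptions for illustration, not the actual Databend API.

```rust
// Hypothetical sketch of a merge-by-number-of-rows variant; the threshold
// and types are illustrative stand-ins, not Databend's actual API.

const ROW_THRESHOLD: usize = 1_000_000; // assumed value

#[derive(Clone)]
struct Block {
    rows: usize, // placeholder: only the row count matters for this sketch
}

// Stand-in for DataBlock::concat_blocks.
fn concat(blocks: &[Block]) -> Block {
    Block { rows: blocks.iter().map(|b| b.rows).sum() }
}

// Accumulate successive blocks until ROW_THRESHOLD rows are collected,
// then merge them into one block; leftovers become one final block.
fn merge_by_row_count(blocks: Vec<Block>) -> Vec<Block> {
    let mut out = Vec::new();
    let mut pending = Vec::new();
    let mut pending_rows = 0usize;

    for b in blocks {
        pending_rows += b.rows;
        pending.push(b);
        if pending_rows >= ROW_THRESHOLD {
            out.push(concat(&pending));
            pending.clear();
            pending_rows = 0;
        }
    }
    if !pending.is_empty() {
        out.push(concat(&pending));
    }
    out
}
```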

@codecov-commenter

codecov-commenter commented Nov 26, 2021

Codecov Report

Merging #3122 (e792e48) into main (7dae4cb) will decrease coverage by 0%.
The diff coverage is 98%.

Impacted file tree graph

@@          Coverage Diff           @@
##            main   #3122    +/-   ##
======================================
- Coverage     66%     66%    -1%     
======================================
  Files        667     672     +5     
  Lines      34858   35172   +314     
======================================
+ Hits       23304   23451   +147     
- Misses     11554   11721   +167     
Impacted Files Coverage Δ
query/src/storages/fuse/io/location_gen.rs 100% <ø> (ø)
query/src/storages/fuse/io/mod.rs 100% <ø> (ø)
query/src/storages/fuse/operations/commit.rs 88% <ø> (ø)
query/src/storages/fuse/operations/truncate.rs 92% <ø> (ø)
query/src/storages/fuse/table.rs 87% <66%> (-1%) ⬇️
query/src/storages/fuse/io/block_appender.rs 95% <97%> (+1%) ⬆️
query/src/storages/fuse/index/min_max_test.rs 92% <100%> (ø)
query/src/storages/fuse/io/block_appender_test.rs 98% <100%> (+5%) ⬆️
query/src/storages/fuse/operations/append.rs 96% <100%> (+7%) ⬆️
query/src/storages/fuse/operations/read.rs 93% <100%> (ø)
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dae4cb...e792e48. Read the comment docs.

@mergify
Contributor

mergify bot commented Nov 29, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @dantengsky please rebase it 🙏

@dantengsky dantengsky marked this pull request as ready for review November 29, 2021 13:21
@dantengsky
Member Author

@sundy-li PTAL

@databend-bot
Member

Wait for another reviewer approval

@BohuTANG BohuTANG merged commit 379a7c0 into databendlabs:main Nov 30, 2021