
ISSUE-3121: simple chunk strategy in insertion #3122

Merged
merged 18 commits into main on Nov 30, 2021

Conversation

dantengsky
Member

@dantengsky dantengsky commented Nov 26, 2021

I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

Summary

The simple strategy goes like this:

For each query node that processes the insertion, while consuming the SendableDataBlockStream:

  • for each DEFAULT_CHUNK_BLOCK_NUM blocks, say B, the following reshape is applied:
    successive data blocks in B are accumulated until the sum of their memory_size exceeds BLOCK_SIZE_THRESHOLD, at which point they are merged into one larger block (see the sketch after the NOTE list below).

NOTE:

  • currently, DEFAULT_CHUNK_BLOCK_NUM and BLOCK_SIZE_THRESHOLD are fetched from the table options first;
    if they are not set there, default values are applied
  • the max size of a merged block will be 2 * block_size_threshold (in memory size)
  • after compression, the on-storage size of a block varies
  • blocks that are already larger than block_size_threshold will NOT be split;
    it might be better to apply a finer-grained strategy in table compact/optimization
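
For illustration, here is a minimal, self-contained sketch of the reshape described above. The `Block` type, the `concat` helper, and the constant values are stand-ins (the real implementation works on `DataBlock` / `DataBlock::concat_blocks` and reads the thresholds from table options), so signatures and defaults may differ.

```rust
// Minimal sketch of the reshape: stand-in types and constants only.
// `Block` plays the role of DataBlock, `concat` the role of
// DataBlock::concat_blocks; the constants mirror the names in the PR
// description, not the actual defaults.

const DEFAULT_CHUNK_BLOCK_NUM: usize = 128; // assumed value, for illustration
const BLOCK_SIZE_THRESHOLD: usize = 100 * 1024 * 1024; // assumed value, in bytes

#[derive(Clone)]
struct Block {
    bytes: Vec<u8>, // placeholder for real column data
}

impl Block {
    fn memory_size(&self) -> usize {
        self.bytes.len()
    }
}

// Stand-in for DataBlock::concat_blocks: merge several blocks into one.
fn concat(blocks: &[Block]) -> Block {
    let mut bytes = Vec::new();
    for b in blocks {
        bytes.extend_from_slice(&b.bytes);
    }
    Block { bytes }
}

// Reshape one chunk: successive blocks are accumulated until their total
// memory_size reaches BLOCK_SIZE_THRESHOLD, then merged into one larger
// block. Since an oversized block is never split, a merged block can grow
// up to roughly 2 * BLOCK_SIZE_THRESHOLD.
fn reshape_chunk(chunk: Vec<Block>) -> Vec<Block> {
    let mut merged = Vec::new();
    let mut pending: Vec<Block> = Vec::new();
    let mut pending_size = 0usize;

    for block in chunk {
        pending_size += block.memory_size();
        pending.push(block);
        if pending_size >= BLOCK_SIZE_THRESHOLD {
            merged.push(concat(&pending));
            pending.clear();
            pending_size = 0;
        }
    }
    // The tail that never reached the threshold stays as one (smaller) block.
    if !pending.is_empty() {
        merged.push(concat(&pending));
    }
    merged
}

// Consume a sequence of blocks (standing in for the SendableDataBlockStream)
// and reshape every DEFAULT_CHUNK_BLOCK_NUM of them.
fn reshape_stream(blocks: impl IntoIterator<Item = Block>) -> Vec<Block> {
    let mut out = Vec::new();
    let mut chunk = Vec::with_capacity(DEFAULT_CHUNK_BLOCK_NUM);
    for b in blocks {
        chunk.push(b);
        if chunk.len() == DEFAULT_CHUNK_BLOCK_NUM {
            out.extend(reshape_chunk(std::mem::take(&mut chunk)));
        }
    }
    if !chunk.is_empty() {
        out.extend(reshape_chunk(chunk));
    }
    out
}
```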

After this simple strategy is applied, read performance may improve somewhat:

A very non-scientific, local debug test:

SELECT
    count(1),
    avg(Year),
    sum(DayOfWeek)
FROM ontime

┌─count(1)─┬─avg(Year)─┬─sum(DayOfWeek)─┐
│  2970856 │      2020 │       11657448 │
└──────────┴───────────┴────────────────┘

1 rows in set. Elapsed: 1.313 sec.
  • this PR
select count(1) ,avg(Year), sum(DayOfWeek)  from ontime;
+----------+-----------+----------------+
| count(1) | avg(Year) | sum(DayOfWeek) |
+----------+-----------+----------------+
|  2970856 |      2020 |       11657448 |
+----------+-----------+----------------+
1 row in set (0.37 sec)
Read 0 rows, 0 B in 0.287 sec., 0 rows/sec., 0 B/sec.


select count(1) ,avg(Year), sum(DayOfWeek)  from ontime;
+----------+-----------+----------------+
| count(1) | avg(Year) | sum(DayOfWeek) |
+----------+-----------+----------------+
| 11883424 |      2020 |       46629792 |
+----------+-----------+----------------+
1 row in set (1.10 sec)
Read 0 rows, 0 B in 0.894 sec., 0 rows/sec., 0 B/sec.

The data in ontime was generated by a v1 streaming_load first, and then by running insert into ontime select * from ontime several times.

Changelog

  • Improvement
  • Not for changelog (changelog entry is not required)

Related Issues

Fixes #3121

Test Plan

Unit Tests

Stateless Tests

@databend-bot
Member

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@sundy-li
Member

Why not make the merge strategy by num of rows?

@dantengsky
Member Author

dantengsky commented Nov 26, 2021

Why not make the merge strategy by num of rows?

because DataBlock::concat_blocks is too handy :-)

just kidding — merging by number of rows or by block size is more sensible; it will be addressed in upcoming PR(s)

DataBlock::slice_block_by_size is handy as well! let me implement the by-number-of-rows variant too

EDIT:

A simple merge-by-block-size strategy has been added. Please let me reconsider where/when to put the merge-by-number-of-rows strategy.
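
As a rough illustration of the merge-by-number-of-rows idea discussed here, the sketch below accumulates blocks until a row threshold is hit; `ROW_THRESHOLD` and the `Block`/`concat` stand-ins are assumptions for illustration, not the actual Databend API.

```rust
// Hypothetical sketch of a merge-by-number-of-rows variant; the threshold
// and types are illustrative stand-ins, not Databend's actual API.

const ROW_THRESHOLD: usize = 1_000_000; // assumed value

#[derive(Clone)]
struct Block {
    rows: usize, // placeholder: only the row count matters for this sketch
}

// Stand-in for DataBlock::concat_blocks.
fn concat(blocks: &[Block]) -> Block {
    Block { rows: blocks.iter().map(|b| b.rows).sum() }
}

// Accumulate successive blocks until ROW_THRESHOLD rows are collected,
// then merge them into one block; leftovers become one final block.
fn merge_by_row_count(blocks: Vec<Block>) -> Vec<Block> {
    let mut out = Vec::new();
    let mut pending = Vec::new();
    let mut pending_rows = 0usize;

    for b in blocks {
        pending_rows += b.rows;
        pending.push(b);
        if pending_rows >= ROW_THRESHOLD {
            out.push(concat(&pending));
            pending.clear();
            pending_rows = 0;
        }
    }
    if !pending.is_empty() {
        out.push(concat(&pending));
    }
    out
}
```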

@codecov-commenter

codecov-commenter commented Nov 26, 2021

Codecov Report

Merging #3122 (e792e48) into main (7dae4cb) will decrease coverage by 0%.
The diff coverage is 98%.

Impacted file tree graph

@@          Coverage Diff           @@
##            main   #3122    +/-   ##
======================================
- Coverage     66%     66%    -1%     
======================================
  Files        667     672     +5     
  Lines      34858   35172   +314     
======================================
+ Hits       23304   23451   +147     
- Misses     11554   11721   +167     
Impacted Files Coverage Δ
query/src/storages/fuse/io/location_gen.rs 100% <ø> (ø)
query/src/storages/fuse/io/mod.rs 100% <ø> (ø)
query/src/storages/fuse/operations/commit.rs 88% <ø> (ø)
query/src/storages/fuse/operations/truncate.rs 92% <ø> (ø)
query/src/storages/fuse/table.rs 87% <66%> (-1%) ⬇️
query/src/storages/fuse/io/block_appender.rs 95% <97%> (+1%) ⬆️
query/src/storages/fuse/index/min_max_test.rs 92% <100%> (ø)
query/src/storages/fuse/io/block_appender_test.rs 98% <100%> (+5%) ⬆️
query/src/storages/fuse/operations/append.rs 96% <100%> (+7%) ⬆️
query/src/storages/fuse/operations/read.rs 93% <100%> (ø)
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dae4cb...e792e48. Read the comment docs.

@mergify
Contributor

mergify bot commented Nov 29, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @dantengsky please rebase it 🙏

@dantengsky dantengsky marked this pull request as ready for review November 29, 2021 13:21
@dantengsky
Member Author

@sundy-li PTAL

@databend-bot
Member

Wait for another reviewer approval

@BohuTANG BohuTANG merged commit 379a7c0 into databendlabs:main Nov 30, 2021