ISSUE-3121: simple chunk strategy in insertion #3122
Conversation
Thanks for the contribution! Please review the labels and make any necessary changes.
Why not make the merge strategy by num of rows?
Because… EDIT: A simple merge-by-block-size strategy has been added. Please let me reconsider where/when to put it.
Codecov Report
@@            Coverage Diff            @@
##             main    #3122     +/-   ##
=========================================
- Coverage      66%      66%      -1%
=========================================
  Files         667      672       +5
  Lines       34858    35172     +314
=========================================
+ Hits        23304    23451     +147
- Misses      11554    11721     +167
Continue to review full report at Codecov.
This pull request has merge conflicts that must be resolved before it can be merged. @dantengsky please rebase it 🙏
@sundy-li PTAL
Waiting for another reviewer's approval.
I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/
Summary
The simple strategy goes like this:

For each query node that processes the insertion, while consuming the `SendableDatablockStream`, for every `DEFAULT_CHUNK_BLOCK_NUM` of blocks, let's say B, the following reshape is applied: for each n successive data blocks in B, if the sum of their `memory_size` exceeds `BLOCK_SIZE_THRESHOLD`, they will be merged into one larger block (see the sketch after the notes below).

NOTE:
- `DEFAULT_CHUNK_BLOCK_NUM` and `BLOCK_SIZE_THRESHOLD` will be fetched from the table options first; if not set there, default values will be applied.
- Blocks already larger than `block_size_threshold` will NOT be split; it might be better to apply a fine-grained strategy during table compact/optimization.
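For illustration, here is a minimal, self-contained sketch of the reshape described above. `Block`, `merge_blocks`, and the 100 MiB default threshold are simplified stand-ins assumed for this sketch, not the actual databend-query types or values (the real code operates on `DataBlock`s pulled from a `SendableDatablockStream`):

```rust
/// Simplified stand-in for a data block; only its size matters here.
#[derive(Debug, Clone)]
struct Block {
    memory_size: usize,
}

// Hypothetical defaults, used when the table options do not specify values
// (the names mirror the constants mentioned in the summary).
const DEFAULT_CHUNK_BLOCK_NUM: usize = 128;
const BLOCK_SIZE_THRESHOLD: usize = 100 * 1024 * 1024; // 100 MiB, assumed

/// Merge a run of small blocks into one larger block (placeholder:
/// the real implementation would concatenate the column data).
fn merge_blocks(run: Vec<Block>) -> Block {
    Block {
        memory_size: run.iter().map(|b| b.memory_size).sum(),
    }
}

/// Reshape one chunk of up to DEFAULT_CHUNK_BLOCK_NUM blocks:
/// successive blocks accumulate until their combined memory_size
/// reaches the threshold, then the run is merged into one block.
/// A block already larger than the threshold is never split; it
/// simply flows through as a run of one.
fn reshape(chunk: Vec<Block>, threshold: usize) -> Vec<Block> {
    let mut out = Vec::new();
    let mut run: Vec<Block> = Vec::new();
    let mut run_size = 0usize;

    for block in chunk {
        run_size += block.memory_size;
        run.push(block);
        if run_size >= threshold {
            // the accumulated run is big enough: emit it as one block
            out.push(merge_blocks(std::mem::take(&mut run)));
            run_size = 0;
        }
    }
    // trailing blocks that never reached the threshold are merged as-is
    if !run.is_empty() {
        out.push(merge_blocks(run));
    }
    out
}

fn main() {
    // ten 30 MiB blocks -> merged into roughly threshold-sized blocks
    let chunk: Vec<Block> = (0..DEFAULT_CHUNK_BLOCK_NUM.min(10))
        .map(|_| Block { memory_size: 30 * 1024 * 1024 })
        .collect();
    for b in reshape(chunk, BLOCK_SIZE_THRESHOLD) {
        println!("merged block: {} bytes", b.memory_size);
    }
}
```

In this form a run is flushed as soon as its accumulated `memory_size` crosses the threshold, so an oversized incoming block passes through unmodified, consistent with the "no splitting" note above.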
After this simple strategy is applied, the read performance might be improved somewhat:
A very non-scientific, local debug test: data of `ontime` is generated first by `v1 streaming_load`, and then by `insert into ontime select * from ontime` several times.

Changelog
Related Issues
Fixes #3121
Test Plan
Unit Tests
Stateless Tests