ISSUE-3225: Make parquet reader work with only one stream reader #3757
Thanks for the contribution! Please review the labels and make any necessary changes.
This pull request is being automatically deployed with Vercel (learn more). 🔍 Inspect: https://vercel.com/databend/databend/6FHWisJQQ6fDerkupYzymkhc9y63 [Deployment for d6163b0 canceled]
Codecov Report
@@          Coverage Diff          @@
##            main   #3757   +/-   ##
=====================================
- Coverage     59%     59%    -1%
=====================================
  Files        707     708     +1
  Lines      38206   38253    +47
=====================================
- Hits       22808   22778    -30
- Misses     15398   15475    +77
Continue to review full report at Codecov.
let mut data_cols = Vec::with_capacity(cols.len());
for (col_meta, idx) in cols {
    let col_pages =
        get_page_stream(&col_meta, &mut self.reader, vec![], Arc::new(|_, _| true))
So we no longer read columns in parallel here. Will that slow down performance?
For the fuse engine, the newly added `BlockReader` will be used, which keeps reading the columns in parallel.
@dantengsky
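To make the trade-off in this thread concrete, here is a minimal sketch of the two read patterns. It is not the Databend code: `ColumnChunk`, `read_columns_sequential`, and `read_columns_parallel` are hypothetical names, the "file" is an in-memory buffer, and plain threads with synchronous `Read + Seek` stand in for the async readers used in the PR.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::sync::Arc;
use std::thread;

/// Hypothetical column location inside a file: byte offset and length.
#[derive(Clone, Copy)]
struct ColumnChunk {
    offset: u64,
    len: usize,
}

/// Sequential: a single shared `&mut` reader, so columns are read one after another.
fn read_columns_sequential<R: Read + Seek>(
    reader: &mut R,
    cols: &[ColumnChunk],
) -> std::io::Result<Vec<Vec<u8>>> {
    let mut out = Vec::with_capacity(cols.len());
    for col in cols {
        reader.seek(SeekFrom::Start(col.offset))?;
        let mut buf = vec![0u8; col.len];
        reader.read_exact(&mut buf)?;
        out.push(buf);
    }
    Ok(out)
}

/// Parallel: every column gets its own reader over the shared bytes,
/// so the reads do not have to wait for each other.
fn read_columns_parallel(
    data: Arc<Vec<u8>>,
    cols: &[ColumnChunk],
) -> std::io::Result<Vec<Vec<u8>>> {
    let handles: Vec<_> = cols
        .iter()
        .map(|&col| {
            let data = Arc::clone(&data);
            thread::spawn(move || -> std::io::Result<Vec<u8>> {
                // Each thread owns an independent cursor (its own seek position).
                let mut reader = Cursor::new(data.as_slice());
                reader.seek(SeekFrom::Start(col.offset))?;
                let mut buf = vec![0u8; col.len];
                reader.read_exact(&mut buf)?;
                Ok(buf)
            })
        })
        .collect();
    // Join in spawn order so the output order matches `cols`.
    handles
        .into_iter()
        .map(|h| h.join().expect("column reader thread panicked"))
        .collect()
}
```

The sequential variant needs only one reader, which is the point of this PR for `ParquetSource`; the parallel variant needs one reader per column, which is why it would live in a separate type.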
/LGTM
Wait for another reviewer approval
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
- `common_streams::ParquetSource`: an `AsyncRead + AsyncSeek` reader is enough; it no longer depends on the whole DataAccessor trait.
- Add a dedicated `BlockReader` to fuse storage, which
  - keeps reading the columns in parallel
  - has a different signature, e.g. returns `Result<DataBlock>` instead of `Result<Option<DataBlock>>`

Unlike `ParquetSource`, where the caller is supposed to keep calling the `Source::read` method until `Err` or `Ok(None)` is returned, `BlockReader` is meant to read the columns of a parquet file as a block in one go.

@Veeupup `cache` has been removed from `BlockReader`, since for the time being it is only used during the construction phase of `BlockReader`. May we add it back later if necessary?

Changelog
Related Issues
Fixes #3225
Test Plan
Unit Tests
Stateless Tests
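
For readers comparing the two calling conventions described in the Summary, the following is a rough sketch, not the actual Databend types. `StreamingSource`, `OneShotReader`, `ReadResult`, `drain`, and the simplified `DataBlock` are hypothetical stand-ins, and the real methods are async.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for a data block: just a batch of values.
#[derive(Debug, PartialEq)]
struct DataBlock(Vec<i64>);

/// Assumed error type for the sketch; the real code uses its own error type.
type ReadResult<T> = std::result::Result<T, String>;

/// Streaming convention: each call yields `Ok(Some(block))`,
/// and `Ok(None)` signals that the source is exhausted.
struct StreamingSource {
    pending: VecDeque<DataBlock>,
}

impl StreamingSource {
    fn read(&mut self) -> ReadResult<Option<DataBlock>> {
        Ok(self.pending.pop_front())
    }
}

/// One-shot convention: a single call returns the whole block or an error,
/// so there is no `Option` layer in the return type.
struct OneShotReader {
    rows: Vec<i64>,
}

impl OneShotReader {
    fn read(&self) -> ReadResult<DataBlock> {
        if self.rows.is_empty() {
            return Err("no rows to read".to_string());
        }
        Ok(DataBlock(self.rows.clone()))
    }
}

/// Caller side of the streaming convention: loop until `Ok(None)` or `Err`.
fn drain(source: &mut StreamingSource) -> ReadResult<Vec<DataBlock>> {
    let mut blocks = Vec::new();
    while let Some(block) = source.read()? {
        blocks.push(block);
    }
    Ok(blocks)
}
```

With the one-shot shape the caller does not need a loop or an end-of-stream check, which fits reading all requested columns of one parquet file as a single block.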