ISSUE-3225: Make parquet reader work with only one stream reader #3757

Merged: 9 commits merged into databendlabs:main on Jan 5, 2022

Conversation

@dantengsky (Member) commented Jan 4, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

  • common_streams::ParquetSource

    • reads the columns serially, so that only "one stream reader" is needed while processing the parquet source
    • an AsyncRead + AsyncSeek reader is enough; it no longer depends on the whole DataAccessor trait
  • Adds a dedicated BlockReader to fuse storage, which

    • keeps reading the columns in parallel

    • has a different signature

      e.g. it returns Result<DataBlock> instead of Result<Option<DataBlock>>

      Unlike ParquetSource, where the call site is expected to keep calling Source::read until an Err or Ok(None) is returned, BlockReader is meant to read the columns of a parquet block in one go (see the sketch after this list).

  • @Veeupup

    • The field cache has been removed from BlockReader, since for the time being it is only used during the construction phase of BlockReader. May we add it back later if necessary?
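
For illustration only, a minimal sketch of the two read interfaces described above, assuming the futures and async-trait crates; apart from Source::read and BlockReader::read, the names are placeholders, not the actual Databend definitions:

```rust
// Hedged sketch, not the actual Databend code.
use std::io;

use futures::io::{AsyncRead, AsyncSeek};

// Stand-in for Databend's DataBlock, for the purpose of this sketch only.
struct DataBlock;

#[async_trait::async_trait]
trait Source: Send {
    // Streaming interface: the caller keeps polling until Ok(None) or an Err.
    async fn read(&mut self) -> io::Result<Option<DataBlock>>;
}

// ParquetSource only needs one seekable byte stream,
// not the whole DataAccessor trait.
struct ParquetSource<R: AsyncRead + AsyncSeek + Unpin + Send> {
    reader: R,
}

#[async_trait::async_trait]
impl<R: AsyncRead + AsyncSeek + Unpin + Send> Source for ParquetSource<R> {
    async fn read(&mut self) -> io::Result<Option<DataBlock>> {
        // Read the columns serially through self.reader and return one block,
        // or Ok(None) once the source is exhausted.
        Ok(None)
    }
}

// BlockReader reads all requested columns of a block in one go,
// hence Result<DataBlock> rather than Result<Option<DataBlock>>.
struct BlockReader;

impl BlockReader {
    async fn read(&mut self) -> io::Result<DataBlock> {
        // Fetch every column and assemble one DataBlock.
        Ok(DataBlock)
    }
}
```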

Changelog

  • Improvement
  • Not for changelog (changelog entry is not required)

Related Issues

Fixes #3225

Test Plan

Unit Tests

Stateless Tests

@databend-bot (Member) commented:

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

vercel bot commented Jan 4, 2022

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/databend/databend/6FHWisJQQ6fDerkupYzymkhc9y63
✅ Preview: Canceled

[Deployment for d6163b0 canceled]

codecov-commenter commented Jan 4, 2022

Codecov Report

Merging #3757 (d1cae1c) into main (a019084) will decrease coverage by 0%.
The diff coverage is 61%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #3757   +/-   ##
=====================================
- Coverage     59%     59%   -1%     
=====================================
  Files        707     708    +1     
  Lines      38206   38253   +47     
=====================================
- Hits       22808   22778   -30     
- Misses     15398   15475   +77     
Impacted Files Coverage Δ
common/streams/src/sources/source_factory.rs 0% <0%> (ø)
common/streams/src/sources/source_parquet.rs 0% <0%> (-79%) ⬇️
query/src/interpreters/interpreter_copy.rs 0% <0%> (ø)
query/src/storages/fuse/io/mod.rs 100% <ø> (ø)
query/src/storages/fuse/operations/read.rs 92% <60%> (+2%) ⬆️
query/src/storages/fuse/io/block_reader.rs 90% <90%> (ø)
common/streams/src/sources/source_csv.rs 0% <0%> (-90%) ⬇️
query/src/clusters/cluster.rs 40% <0%> (-1%) ⬇️
common/datavalues/src/types/data_type_coercion.rs 33% <0%> (-1%) ⬇️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a019084...d1cae1c.

Review thread on this diff hunk (the serial column read loop; the hunk is truncated in the review view):

```rust
let mut data_cols = Vec::with_capacity(cols.len());
for (col_meta, idx) in cols {
    let col_pages =
        get_page_stream(&col_meta, &mut self.reader, vec![], Arc::new(|_, _| true))
```

A member commented:
So we do not read columns in parallel, will it slow down the performance?

dantengsky (Member, Author) replied:
For the fuse engine, the newly added BlockReader will be used, which keeps reading the columns in parallel.
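
For illustration, a hedged sketch of the serial versus concurrent strategies, assuming the futures crate and a hypothetical read_column helper; it is not the actual ParquetSource or BlockReader code:

```rust
use futures::future::try_join_all;

// Hypothetical column metadata: where a column chunk lives in the file.
#[derive(Clone, Copy)]
struct ColumnMeta {
    offset: u64,
    len: usize,
}

// Hypothetical single-column read, e.g. a range read against object storage.
async fn read_column(col: ColumnMeta) -> std::io::Result<Vec<u8>> {
    let _ = (col.offset, col.len);
    Ok(Vec::new())
}

// Serial strategy (ParquetSource): one read finishes before the next starts,
// so a single AsyncRead + AsyncSeek reader can be reused for every column.
async fn read_serially(cols: &[ColumnMeta]) -> std::io::Result<Vec<Vec<u8>>> {
    let mut out = Vec::with_capacity(cols.len());
    for col in cols {
        out.push(read_column(*col).await?);
    }
    Ok(out)
}

// Concurrent strategy (BlockReader): issue all column reads at once and join
// them, so the per-column I/O overlaps instead of running back to back.
async fn read_concurrently(cols: &[ColumnMeta]) -> std::io::Result<Vec<Vec<u8>>> {
    try_join_all(cols.iter().copied().map(read_column)).await
}
```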

@Veeupup (Contributor) commented Jan 4, 2022

@dantengsky
Agreed! The storage cache may be used to read a column from cache in BlockReader; we can add it back when it is ready.

@dantengsky dantengsky marked this pull request as ready for review January 5, 2022 01:48
@dantengsky dantengsky requested review from Veeupup and sundy-li January 5, 2022 01:48
@Veeupup (Contributor) left a comment:

/LGTM

@databend-bot (Member) commented:

Wait for another reviewer approval

@sundy-li sundy-li merged commit b4c0215 into databendlabs:main Jan 5, 2022
Development

Successfully merging this pull request may close these issues.

Make parquet reader work with only one stream reader. (#3225)
5 participants