refactor: optimize read of small row group in parquet #13530
Conversation
Can we do a performance bench for this PR?
@BohuTANG Let's not do a performance bench first; @youngsofun, please review it first.
I think the implementation is wrong. We should read all the small row groups at once. The Parquet2Groups processing in this PR is actually the same as the original Parquet2RowGroup.
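For illustration only (this is not the code in this PR; the type names and the threshold below are made up), "read all the small row groups at once" can be understood as coalescing consecutive small row groups into one batch before planning reads:

```rust
// Hypothetical sketch (not the code in this PR): coalesce consecutive small
// row groups into one batch so they can be fetched together, while a large
// row group still becomes a batch of its own. The threshold is assumed.
const SMALL_GROUP_THRESHOLD: u64 = 4 * 1024 * 1024; // assumed 4 MiB cutoff

struct RowGroupMeta {
    index: usize,
    compressed_size: u64,
}

fn coalesce_small_groups(groups: &[RowGroupMeta]) -> Vec<Vec<usize>> {
    let mut batches: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_size = 0u64;

    for g in groups {
        if g.compressed_size >= SMALL_GROUP_THRESHOLD {
            // Flush any pending batch of small groups first.
            if !current.is_empty() {
                batches.push(std::mem::take(&mut current));
                current_size = 0;
            }
            // Large row groups are read on their own, as before.
            batches.push(vec![g.index]);
        } else {
            current.push(g.index);
            current_size += g.compressed_size;
            // Cap the merged batch so a single read does not grow unbounded.
            if current_size >= SMALL_GROUP_THRESHOLD {
                batches.push(std::mem::take(&mut current));
                current_size = 0;
            }
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```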
@RinChanNOWWW OK, let me fix it.
Yes, but I think it's better to unify to
```rust
let mut groups = HashMap::with_capacity(parts.len());

for (gid, p) in parts.into_iter().enumerate() {
    max_compression_ratio = max_compression_ratio
        .max(p.uncompressed_size() as f64 / p.compressed_size() as f64);
    max_compressed_size = max_compressed_size.max(p.compressed_size());
    partitions.push(Arc::new(
        Box::new(ParquetPart::Parquet2RowGroup(p)) as Box<dyn PartInfo>
    ));
    groups.insert(gid, p);
```
Sorry for the late review. Do you mean treating each file as a part? That doesn't make sense to me.
Maybe you can explain more in code comments and/or the PR summary, so we can understand your code better.
I am fixing it; the code will be pushed tonight.
@youngsofun Please help check.
@zenus Thank you for your persistence and contribution!
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
Normally, reading a Parquet file on S3 is broken down into several steps:
1. Read the file metadata during read_partitions.
2. Read the column chunks that are needed.

When the row group is small, it is more efficient to merge the reads of its column chunks into a single request (see the sketch below). This is nothing new; I just reuse the idea of the block_reader in the fuse engine.
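To make the merged read concrete, here is a minimal sketch, assuming a generic ranged-read callback rather than any real databend or OpenDAL API: a single request covers the whole small row group, and the individual column chunks are sliced back out of the merged buffer afterwards.

```rust
// Hypothetical sketch of a merged read. `read_range` stands in for an
// object-store ranged GET; it is not a real databend API.
struct ColumnChunkRange {
    offset: u64,
    length: u64,
}

fn merged_read(
    read_range: impl Fn(u64, u64) -> Vec<u8>,
    chunks: &[ColumnChunkRange],
) -> Vec<Vec<u8>> {
    // Byte span covering every needed column chunk of the row group.
    let start = chunks.iter().map(|c| c.offset).min().unwrap_or(0);
    let end = chunks.iter().map(|c| c.offset + c.length).max().unwrap_or(0);

    // One ranged GET for the whole row group instead of one per chunk.
    let buf = read_range(start, end - start);

    // Slice each column chunk back out of the merged buffer.
    chunks
        .iter()
        .map(|c| {
            let s = (c.offset - start) as usize;
            let e = s + c.length as usize;
            buf[s..e].to_vec()
        })
        .collect()
}
```

This trades some over-read (bytes in the gaps between the selected chunks) for far fewer requests, which is usually a win for small row groups on object storage.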
I did a simple test and it indeed shows a benefit.