Add `advanced_parquet_index.rs` example of indexing into parquet files
#10701
@@ -143,6 +145,12 @@ pub use writer::plan_to_parquet;
/// custom reader is used, it supplies the metadata directly and this parameter
/// is ignored. [`ParquetExecBuilder::with_metadata_size_hint`] for more details.
///
/// * External indexes: you can use custom external indexes to control exactly
This is the new feature of ParquetExec: provide a custom external index
@@ -127,6 +127,13 @@ impl Column {
})
}

/// return the column's name.
I added some small accessor APIs to make the example easier to read. I can revert these or put them in a different PR if reviewers prefer.
/// Create a new IndexTableProvider
/// * `dir` - the directory containing the parquet files
/// * `object_store` - the object store implementation to use for reading the files
Seems like this is only taking an object store now and creating the directory internally
good call -- fixed in 79e1476
Nice example! Thanks @alamb 👍
Thank you @Weijun-H Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
/// [`SessionContext::read_parquet`] or [`ListingTable`], which also do file
/// pruning based on parquet statistics (using the same underlying APIs)
///
/// # Diagram
I added a diagram here to try and explain what is going on visually for those so inclined
Just curious, do you generate these by hand? I was making them by hand with auto complete assistance, but wondering if there's a better tool.
I use this: https://monodraw.helftone.com/
I think there are other online ASCII art diagrammers, but Monodraw works well for me.
This PR is just waiting for a committer to approve it so I can merge it in. @crepererum or @tustvold might you have a moment to do so?
Thanks for the example, the ascii doc is pure art.
// in this case, the access plan specifies skipping 8 row groups
// and scanning 2 of them. The skipped row groups are not read at all
//
// [Skip, Skip, Scan, Skip, Skip, Skip, Skip, Scan, Skip, Skip]
//
// Note that the parquet reader only does 2 IOs - one for the data from each
// row group.
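For readers following along, the row-group-level plan quoted in this comment can be modelled in a few lines of plain Rust. This is only a toy sketch built on the standard library: the `Access` enum and the `row_groups_to_scan` and `example_plan` functions are hypothetical names made up for illustration, not the DataFusion access plan API that the example itself uses.

```rust
/// Hypothetical stand-in for a per-row-group access decision
/// (not the DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    /// Do not read this row group at all.
    Skip,
    /// Read every row in this row group.
    Scan,
}

/// Return the indexes of the row groups that the plan asks the reader to scan.
fn row_groups_to_scan(plan: &[Access]) -> Vec<usize> {
    plan.iter()
        .enumerate()
        .filter(|(_, access)| **access == Access::Scan)
        .map(|(idx, _)| idx)
        .collect()
}

/// The plan from the quoted comment: 10 row groups, 8 skipped and 2 scanned.
fn example_plan() -> Vec<Access> {
    use Access::{Scan, Skip};
    vec![Skip, Skip, Scan, Skip, Skip, Skip, Skip, Scan, Skip, Skip]
}
```

Calling `row_groups_to_scan(&example_plan())` yields the two scanned positions, which matches the "one IO per scanned row group" reasoning in the comment under the assumption that each scanned row group is fetched with a single request.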
hmm, I think it might be clearer to move these comments to L184 and L185?
As written, they seem a bit unclear and odd given the comments that follow in L197.
yeah, I agree having the comments about what happened after the statement that does it is a bit strange. I updated it in 58b2ba6
While doing that I realized the comments are old -- this final query makes only a single IO. I updated that as well
// In this case, the access plan specifies skipping all but the last row group
// and within the last row group, reading only the row with id 950
//
// [Skip, Skip, Skip, Skip, Skip, Skip, Skip, Skip, Skip, Selection(skip 49, select 1, skip 50)]
//
// In order to prune pages, the Page Index must be loaded. This PageIndex is
// loaded in a separate IO request, so the parquet reader makes 2 IO
// requests for this query.
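The `Selection(skip 49, select 1, skip 50)` entry in the quoted comment is a run-length description of which rows to read inside one row group. A minimal sketch of how such a selection expands into row offsets is shown below. It uses only the standard library, and the `Selector` enum, `selected_offsets`, and `total_rows` are hypothetical names for illustration rather than the parquet crate's own selection types. Offsets are zero-based positions within the row group; how they map to `id` values depends on how the example file was written and is not modelled here.

```rust
/// Hypothetical stand-in for one run in a row selection
/// (not the parquet crate type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Selector {
    /// Skip the next `n` rows.
    Skip(usize),
    /// Read the next `n` rows.
    Select(usize),
}

/// Expand a run-length selection into the zero-based row offsets
/// (within a single row group) that would be read.
fn selected_offsets(selection: &[Selector]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut next = 0; // offset of the next row not yet covered by a run
    for step in selection {
        match *step {
            Selector::Skip(n) => next += n,
            Selector::Select(n) => {
                offsets.extend(next..next + n);
                next += n;
            }
        }
    }
    offsets
}

/// Total number of rows the selection covers (skipped plus selected).
fn total_rows(selection: &[Selector]) -> usize {
    selection
        .iter()
        .map(|step| match *step {
            Selector::Skip(n) | Selector::Select(n) => n,
        })
        .sum()
}
```

For the selection in the comment, the runs cover 100 rows in total and pick out a single row, so a reader that can prune at page granularity only needs the page containing that one offset.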
ditto.
@crepererum do you by any chance have a few minutes to review and approve this PR?
I can't merge this PR until a committer approves it: @Dandandan or @thinkharderdev any chance you have some time to review an example (mostly comments)?
very nice!
/// which will skip unneeded data pages:
///
/// ```text
/// ┌───────────────────────┐ If the RowSelection does not include any
Maybe add a note here that this can only happen if the parquet file has a PageIndex?
Good call -- done in d2f6477
Thank you everyone for the great comments and feedback.
apache#10701)
* Add `advanced_parquet_index.rs` example of indexing into parquet files
* pre-load page index
* fix comment
* Apply suggestions from code review
  Thank you @Weijun-H
  Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* Add ASCII ART
* Update datafusion-examples/README.md
  Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* Update datafusion-examples/examples/advanced_parquet_index.rs
  Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* Improve / clarify comments based on review
* Add page index caveat
---------
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Which issue does this PR close?
Closes #10580
Rationale for this change
See #10580
This shows how to use the APIs in #9929
Building and using external indexes in DataFusion is an important feature. Adding an example of how to do so will help drive the design and APIs and help other people discover the feature more easily.
Specifically, this example illustrates how to configure the ParquetExec to read only some row groups / pages within a file, and shows how to avoid reading the metadata each time. Both are required for low-latency parquet query access.
What changes are included in this PR?
Add a new example:
parquet_index_advanced
Are these changes tested?
Are there any user-facing changes?
This is an example