Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add advanced_parquet_index.rs example of index in into parquet files #10701

Merged
merged 13 commits into from
Jun 22, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 28, 2024

Which issue does this PR close?

Closes #10580

Rationale for this change

See #10580

This shows how to use the APIs in #9929

Building and using external indexes in DataFusion is an important feature. Adding an example of how to do so will help drive the design and APIs and help other people discover it more easiy

Specifically, this example illustrates how to configure the ParquetExec to read only some row groups / pages within a file and shows how to avoid reading the metadata each time. This is required for low latency parquet query access

What changes are included in this PR?

Add a new example: parquet_index_advanced

Are these changes tested?

Are there any user-facing changes?

This is an example

@alamb alamb changed the title Alamb/advanced parquet index Add parquet_index_advanced.rs example of index in into parquet files May 28, 2024
@alamb alamb force-pushed the alamb/advanced_parquet_index branch 2 times, most recently from 2f1b23f to 3221378 Compare May 29, 2024 21:35
@github-actions github-actions bot added the core Core DataFusion crate label May 29, 2024
@alamb alamb force-pushed the alamb/advanced_parquet_index branch from b7af342 to 2e860ad Compare June 3, 2024 19:53
@@ -143,6 +145,12 @@ pub use writer::plan_to_parquet;
/// custom reader is used, it supplies the metadata directly and this parameter
/// is ignored. [`ParquetExecBuilder::with_metadata_size_hint`] for more details.
///
/// * External indexes: you can use custom external indexes to control exactly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new feature of ParquetExec: provide a custom external index

@alamb alamb force-pushed the alamb/advanced_parquet_index branch from 7ff0bbd to 96ee03c Compare June 13, 2024 11:02
@alamb alamb marked this pull request as ready for review June 13, 2024 11:21
@@ -127,6 +127,13 @@ impl Column {
})
}

/// return the column's name.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some small access APIs to make the example easier to read. I can revert these or put them in a different PR if reviewers prefer

@alamb alamb changed the title Add parquet_index_advanced.rs example of index in into parquet files Add advanced_parquet_index.rs example of index in into parquet files Jun 13, 2024
Comment on lines 175 to 177
/// Create a new IndexTableProvider
/// * `dir` - the directory containing the parquet files
/// * `object_store` - the object store implementation to use for reading the files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is only taking an object store now and creating the directory internally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call -- fixed in 79e1476

Copy link
Member

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice example! Thanks @alamb 👍

datafusion-examples/README.md Outdated Show resolved Hide resolved
datafusion-examples/examples/advanced_parquet_index.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/advanced_parquet_index.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/advanced_parquet_index.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/advanced_parquet_index.rs Outdated Show resolved Hide resolved
alamb and others added 4 commits June 14, 2024 06:29
Thank you @Weijun-H

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
/// [`SessionContext::read_parquet`] or [`ListingTable`], which also do file
/// pruning based on parquet statistics (using the same underlying APIs)
///
/// # Diagram
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a diagram here to try and explain what is going on visually for those so inclined

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, do you generate these by hand? I was making them by hand with auto complete assistance, but wondering if there's a better tool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use this https://monodraw.helftone.com/

I think there are other online ascii art diagrammers, but that monodraw works well for me

@alamb
Copy link
Contributor Author

alamb commented Jun 18, 2024

This PR is just waiting for a committer to approve it so I can merge it in. @crepererum or @tustvold might you have a moment to do so?

@alamb alamb added the documentation Improvements or additions to documentation label Jun 18, 2024
Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example, the ascii doc is pure art.

Comment on lines 189 to 195
// in this case, the access plan specifies skipping 8 row groups
// and scanning 2 of them. The skipped row groups are not read at all
//
// [Skip, Skip, Scan, Skip, Skip, Skip, Skip, Scan, Skip, Skip]
//
// Note that the parquet reader only does 2 IOs - one for the data from each
// row group.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think it might be more clear to move these comments in L184 and L185?

It seems a bit unclear and weird considering the following comments in L197.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree having the comments about what happened after the statement that does it is a big strange. I updated it in 58b2ba6

WHile doing that I realized the comments are old -- this final query makes only a single IO. I updated that as well

Comment on lines 206 to 213
// In this case, the access plan specifies skipping all but the last row group
// and within the last row group, reading only the row with id 950
//
// [Skip, Skip, Skip, Skip, Skip, Skip, Skip, Skip, Skip, Selection(skip 49, select 1, skip 50)]
//
// In order to prune pages, the Page Index must be loaded. This PageIndex is
// loaded in a separate IO request, so the parquet reader makes 2 IO
// requests for this query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jun 19, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 20, 2024

@crepererum do you by any chance have a few minutes to review and approve this PR?

@alamb
Copy link
Contributor Author

alamb commented Jun 21, 2024

I can't merge this PR until a committer approves it:

Screenshot 2024-06-21 at 12 47 52 PM

@Dandandan or @thinkharderdev any chance you have some time to review an example (mostly comments)?

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice!

/// which will skip unneeded data pages:
///
/// ```text
/// ┌───────────────────────┐ If the RowSelection does not include any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a note here that this can only happen if the parquet file has a PageIndex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call -- done in d2f6477

@alamb alamb merged commit ea46e82 into apache:main Jun 22, 2024
23 checks passed
@alamb
Copy link
Contributor Author

alamb commented Jun 22, 2024

Thank you everyone for the great comments and feedback

@alamb alamb deleted the alamb/advanced_parquet_index branch June 22, 2024 12:17
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jun 22, 2024
apache#10701)

* Add `advanced_parquet_index.rs` example of indexing into parquet files

* pre-load page index

* fix comment

* Apply suggestions from code review

Thank you @Weijun-H

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Add ASCII ART

* Update datafusion-examples/README.md

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Update datafusion-examples/examples/advanced_parquet_index.rs

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Improve / clarify comments based on review

* Add page index caveat

---------

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jun 22, 2024
apache#10701)

* Add `advanced_parquet_index.rs` example of indexing into parquet files

* pre-load page index

* fix comment

* Apply suggestions from code review

Thank you @Weijun-H

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Add ASCII ART

* Update datafusion-examples/README.md

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Update datafusion-examples/examples/advanced_parquet_index.rs

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Improve / clarify comments based on review

* Add page index caveat

---------

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
apache#10701)

* Add `advanced_parquet_index.rs` example of indexing into parquet files

* pre-load page index

* fix comment

* Apply suggestions from code review

Thank you @Weijun-H

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Add ASCII ART

* Update datafusion-examples/README.md

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Update datafusion-examples/examples/advanced_parquet_index.rs

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Improve / clarify comments based on review

* Add page index caveat

---------

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Advanced example for building an external index for Row Groups *within* parquet files
6 participants