Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHORE] Pass in file size and num rows to Rust query planner #1282

Merged
merged 2 commits into from
Aug 21, 2023

Conversation

xcharleslin
Copy link
Contributor

@xcharleslin xcharleslin commented Aug 18, 2023

Currently, file metadata (size, numrows) is computed before constructing the query plan. Then it's passed into the Python query planner. This PR passes it into the Rust query planner too.

Eventually, we will likely want to dynamically get metadata at query execution time; this PR is a feature parity stopgap only.

Manually verified that source scan tasks are being emitted with memory requirements now.

@github-actions github-actions bot added the chore label Aug 18, 2023
@@ -140,23 +140,20 @@ impl ExternalInfo {
#[derive(Debug, Serialize, Deserialize)]
pub struct FileInfo {
pub file_paths: Vec<String>,
pub file_sizes: Option<Vec<i64>>,
pub num_rows: Option<Vec<i64>>,
pub file_formats: Option<Vec<FileFormat>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out the "file format" data Daft currently ingests is probably file vs directory. I only see instances of the string "file"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I believe the other value returned by the S3 API is "directory"

@xcharleslin xcharleslin marked this pull request as ready for review August 18, 2023 20:43
@codecov
Copy link

codecov bot commented Aug 18, 2023

Codecov Report

Merging #1282 (5e9beed) into main (9e4e20f) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1282      +/-   ##
==========================================
+ Coverage   87.60%   87.62%   +0.02%     
==========================================
  Files          61       61              
  Lines        6026     6028       +2     
==========================================
+ Hits         5279     5282       +3     
+ Misses        747      746       -1     
Files Changed Coverage Δ
daft/logical/rust_logical_plan.py 90.12% <100.00%> (+0.12%) ⬆️

... and 1 file with indirect coverage changes

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

If we were interested in doing a larger refactor, another option would be exposing InputFileInfo to Python via pyo3, having RunnerIO.glob_paths_details() return an InputFileInfo instead of a PartitionSet, and pass runner_io.glob_paths_details(...) directly to _LogicalPlanBuilder.table_scan(). This would have a few benefits:

  1. We would be consolidating the glob_paths_details() output --> Arrow table logic, which we currently duplicate in two places.
  2. It narrows the LogicalPlanBuilder.table_scan() API.
  3. We'd elide 2 unnecessary copies of the data that currently happens, via ps.to_pydict() and passing Python lists to Rust over pyo3.
  4. Once we have Rust-native path globbing, we'll have less to refactor.

daft.from_glob_path() could still be supported by calling InputFileInfo::to_arrow() and constructing the PartitionSet, so this would really be a pure win.

However, I think we can just do this refactor later once we have the Rust-native path globbing.

@xcharleslin xcharleslin merged commit 3c49bb5 into main Aug 21, 2023
@xcharleslin xcharleslin deleted the charles/filemetadata branch August 21, 2023 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants