-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CHORE] Pass in file size and num rows to Rust query planner #1282
Conversation
@@ -140,23 +140,20 @@ impl ExternalInfo { | |||
#[derive(Debug, Serialize, Deserialize)] | |||
pub struct FileInfo { | |||
pub file_paths: Vec<String>, | |||
pub file_sizes: Option<Vec<i64>>, | |||
pub num_rows: Option<Vec<i64>>, | |||
pub file_formats: Option<Vec<FileFormat>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out the "file format" data Daft currently ingests is probably file vs directory. I only see instances of the string "file"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I believe the other value returned by the S3 API is "directory"
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1282 +/- ##
==========================================
+ Coverage 87.60% 87.62% +0.02%
==========================================
Files 61 61
Lines 6026 6028 +2
==========================================
+ Hits 5279 5282 +3
+ Misses 747 746 -1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
If we were interested in doing a larger refactor, another option would be exposing InputFileInfo
to Python via pyo3, having RunnerIO.glob_paths_details()
return an InputFileInfo
instead of a PartitionSet
, and pass runner_io.glob_paths_details(...)
directly to _LogicalPlanBuilder.table_scan()
. This would have a few benefits:
- We would be consolidating the
glob_paths_details()
output --> Arrow table logic, which we currently duplicate in two places. - It narrows the
LogicalPlanBuilder.table_scan()
API. - We'd elide 2 unnecessary copies of the data that currently happens, via
ps.to_pydict()
and passing Python lists to Rust over pyo3. - Once we have Rust-native path globbing, we'll have less to refactor.
daft.from_glob_path()
could still be supported by calling InputFileInfo::to_arrow()
and constructing the PartitionSet
, so this would really be a pure win.
However, I think we can just do this refactor later once we have the Rust-native path globbing.
Currently, file metadata (size, numrows) is computed before constructing the query plan. Then it's passed into the Python query planner. This PR passes it into the Rust query planner too.
Eventually, we will likely want to dynamically get metadata at query execution time; this PR is a feature parity stopgap only.
Manually verified that source scan tasks are being emitted with memory requirements now.