[CHORE] Pass in file size and num rows to Rust query planner #1282

xcharleslin · 2023-08-18T20:40:22Z

Currently, file metadata (size, numrows) is computed before constructing the query plan. Then it's passed into the Python query planner. This PR passes it into the Rust query planner too.

Eventually, we will likely want to dynamically get metadata at query execution time; this PR is a feature parity stopgap only.

Manually verified that source scan tasks are being emitted with memory requirements now.
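For readers following along, here is a minimal sketch of the idea, not the code in this PR: per-file metadata travels with the file paths so the planner can attach a memory requirement to each source scan task. The names `FileInfoSketch` and `memory_request_bytes` are hypothetical. The sketch also assumes, purely for illustration, that the memory request equals the file size in bytes.

```rust
/// Hypothetical stand-in for the planner's per-source file metadata.
/// Not the real `FileInfo` from src/daft-plan/src/source_info.rs.
#[derive(Debug, Clone)]
pub struct FileInfoSketch {
    pub file_paths: Vec<String>,
    /// Size in bytes for each file, if known at plan-construction time.
    pub file_sizes: Option<Vec<i64>>,
    /// Row count for each file, if known at plan-construction time.
    pub num_rows: Option<Vec<i64>>,
}

impl FileInfoSketch {
    /// Builds the metadata, rejecting columns whose length differs from `file_paths`.
    pub fn new(
        file_paths: Vec<String>,
        file_sizes: Option<Vec<i64>>,
        num_rows: Option<Vec<i64>>,
    ) -> Result<Self, String> {
        for (name, col) in [("file_sizes", &file_sizes), ("num_rows", &num_rows)] {
            if let Some(values) = col {
                if values.len() != file_paths.len() {
                    return Err(format!(
                        "{} has {} entries but there are {} paths",
                        name,
                        values.len(),
                        file_paths.len()
                    ));
                }
            }
        }
        Ok(Self { file_paths, file_sizes, num_rows })
    }

    /// Assumption for illustration: the memory request for scanning file `idx`
    /// is its size in bytes. Returns None when sizes are unknown or `idx` is out of range.
    pub fn memory_request_bytes(&self, idx: usize) -> Option<i64> {
        self.file_sizes.as_ref()?.get(idx).copied()
    }
}
```

With sizes supplied, each scan task can carry a concrete memory request; without them, the lookup returns `None` and the scheduler has nothing to go on, which is the gap this PR closes for the Rust planner.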

xcharleslin · 2023-08-18T20:42:12Z

src/daft-plan/src/source_info.rs

@@ -140,23 +140,20 @@ impl ExternalInfo {
 #[derive(Debug, Serialize, Deserialize)]
 pub struct FileInfo {
    pub file_paths: Vec<String>,
-    pub file_sizes: Option<Vec<i64>>,
-    pub num_rows: Option<Vec<i64>>,
-    pub file_formats: Option<Vec<FileFormat>>,


It turns out the "file format" data Daft currently ingests is probably file vs directory. I only see instances of the string "file".

jaychia · 2023-08-18

Yes - I believe the other value returned by the S3 API is "directory".

codecov · 2023-08-18T20:48:50Z

Codecov Report

Merging #1282 (5e9beed) into main (9e4e20f) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1282      +/-   ##
==========================================
+ Coverage   87.60%   87.62%   +0.02%     
==========================================
  Files          61       61              
  Lines        6026     6028       +2     
==========================================
+ Hits         5279     5282       +3     
+ Misses        747      746       -1

Files Changed	Coverage Δ
daft/logical/rust_logical_plan.py	`90.12% <100.00%> (+0.12%)`	⬆️

... and 1 file with indirect coverage changes

clarkzinzow

LGTM!

If we were interested in doing a larger refactor, another option would be exposing InputFileInfo to Python via pyo3, having RunnerIO.glob_paths_details() return an InputFileInfo instead of a PartitionSet, and passing runner_io.glob_paths_details(...) directly to _LogicalPlanBuilder.table_scan(). This would have a few benefits:

- We would consolidate the glob_paths_details() output --> Arrow table logic, which we currently duplicate in two places.
- It narrows the LogicalPlanBuilder.table_scan() API.
- We'd elide 2 unnecessary copies of the data that currently happen, via ps.to_pydict() and passing Python lists to Rust over pyo3.
- Once we have Rust-native path globbing, we'll have less to refactor.

daft.from_glob_path() could still be supported by calling InputFileInfo::to_arrow() and constructing the PartitionSet, so this would really be a pure win.

However, I think we can just do this refactor later once we have the Rust-native path globbing.
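A rough sketch of what that consolidation could look like on the Rust side, under stated assumptions: `GlobEntry`, `InputFileInfoSketch`, and `from_glob_entries` are hypothetical names, and the pyo3 binding and the Arrow conversion are left out so the example stays standard-library only. The point is just that the row-to-column conversion lives in one constructor instead of being duplicated.

```rust
/// Hypothetical per-path record, standing in for one row of glob output.
#[derive(Debug, Clone, PartialEq)]
pub struct GlobEntry {
    pub path: String,
    pub size: Option<i64>,
    pub num_rows: Option<i64>,
}

/// Hypothetical columnar container, loosely modeled on the InputFileInfo
/// mentioned above. The real type's fields may differ.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct InputFileInfoSketch {
    pub file_paths: Vec<String>,
    pub file_sizes: Vec<Option<i64>>,
    pub num_rows: Vec<Option<i64>>,
}

impl InputFileInfoSketch {
    /// The single place where row-oriented glob results become columns.
    /// Entries are moved, not cloned, so no extra copy of the paths is made.
    pub fn from_glob_entries(entries: Vec<GlobEntry>) -> Self {
        let mut info = Self::default();
        for entry in entries {
            info.file_paths.push(entry.path);
            info.file_sizes.push(entry.size);
            info.num_rows.push(entry.num_rows);
        }
        info
    }

    /// Number of files described.
    pub fn len(&self) -> usize {
        self.file_paths.len()
    }

    /// True when no files are described.
    pub fn is_empty(&self) -> bool {
        self.file_paths.is_empty()
    }
}
```

A table-scan builder could then accept this one value rather than separate path, size, and row-count lists, which is the API narrowing described above.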

Temporarily ingest filesize and numrows at file scan builder time

8398896

xcharleslin requested a review from clarkzinzow August 18, 2023 20:40

Remove print

5e9beed

github-actions bot added the chore label Aug 18, 2023

xcharleslin commented Aug 18, 2023

xcharleslin marked this pull request as ready for review August 18, 2023 20:43

clarkzinzow approved these changes Aug 21, 2023

xcharleslin merged commit 3c49bb5 into main Aug 21, 2023

xcharleslin deleted the charles/filemetadata branch August 21, 2023 18:23
