Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): read ipc file from network #14861

Conversation

mickvangelderen
Copy link
Contributor

Draft to discuss ongoing work from #14839

@mickvangelderen mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 66301f9 to 47bc620 Compare March 5, 2024 15:19
@mickvangelderen mickvangelderen force-pushed the read-ipc-file-from-object-store branch 2 times, most recently from ed741ae to 63813bc Compare March 6, 2024 16:31
Copy link

codecov bot commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 88.54167% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 81.00%. Comparing base (78dc628) to head (f655407).

Files Patch % Lines
crates/polars-io/src/ipc/object_store_reader.rs 86.84% 10 Missing ⚠️
...olars-lazy/src/physical_plan/executors/scan/ipc.rs 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14861      +/-   ##
==========================================
+ Coverage   80.99%   81.00%   +0.01%     
==========================================
  Files        1333     1334       +1     
  Lines      173175   173267      +92     
  Branches     2458     2458              
==========================================
+ Hits       140260   140354      +94     
+ Misses      32448    32446       -2     
  Partials      467      467              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mickvangelderen mickvangelderen force-pushed the read-ipc-file-from-object-store branch 2 times, most recently from d4e4b18 to 65a446c Compare March 7, 2024 19:50
@mickvangelderen
Copy link
Contributor Author

@ritchie46 is it correct that we read the schema during the construction of the logical plan, and then only later during execution of the physical plan read the entire file?

If that is the case, I think I should provide a function that reads just the schema and one that reads the necessary data. These functions will take different options. For example, reading the schema wouldn't require projection names. Reading the data can take the projection indices computed from the schema + indices. There are probably other differences. Some options are probably be shared (some of the cloud-related options?).

@mickvangelderen mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 65a446c to 36f18a3 Compare March 8, 2024 09:07
@mickvangelderen mickvangelderen changed the title Read ipc file from object store feat(rust, python): read ipc file from network Mar 8, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 8, 2024
@mickvangelderen mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 36f18a3 to 4174ca1 Compare March 8, 2024 09:11

#[cfg(feature = "cloud")]
{
// TODO: This will block the current thread until the data has been
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine as the default engine cannot work on batches anyway. What our goal here will be is downloading parts of the file (e.g. Recordbatches/ Chunks) concurrently.

fn read_ipc<R: MmapBytesReader>(reader: R, options: IpcReadOptions) -> PolarsResult<DataFrame> {
let mut reader = IpcReader::new(reader);

if let Some(columns) = options.columns {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should only accept column indices. This is not user-facing code, so the extra option only complicates it here.

}

async fn read_bytes<O: ObjectStore>(store: &O, path: &ObjectPath) -> PolarsResult<Bytes> {
// TODO: Is `to_compute_err` appropriate? It is used in the Parquet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compute error is fine for now.

async fn read_bytes<O: ObjectStore>(store: &O, path: &ObjectPath) -> PolarsResult<Bytes> {
// TODO: Is `to_compute_err` appropriate? It is used in the Parquet
// reader as well but I am not sure it is what we want.
let get_result = store.get(path).await.map_err(to_compute_err)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should split up the work. We should be able download concurrently and the first chunks should be downloaded first. We should have a batched interface that can already get DataFrames of the chunks that are already downloaded.

@mickvangelderen
Copy link
Contributor Author

Closing this in favor of #14984

@mickvangelderen mickvangelderen deleted the read-ipc-file-from-object-store branch March 12, 2024 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants