feat(rust, python): read ipc file from network #14861

mickvangelderen · 2024-03-05T15:18:26Z

Draft to discuss ongoing work from #14839

crates/polars-io/src/ipc/object_store_reader.rs

codecov · 2024-03-06T17:41:58Z

Codecov Report

Attention: Patch coverage is 88.54167% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 81.00%. Comparing base (78dc628) to head (f655407).

Files	Patch %	Lines
crates/polars-io/src/ipc/object_store_reader.rs	86.84%	10 Missing ⚠️
...olars-lazy/src/physical_plan/executors/scan/ipc.rs	94.11%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14861      +/-   ##
==========================================
+ Coverage   80.99%   81.00%   +0.01%     
==========================================
  Files        1333     1334       +1     
  Lines      173175   173267      +92     
  Branches     2458     2458              
==========================================
+ Hits       140260   140354      +94     
+ Misses      32448    32446       -2     
  Partials      467      467

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mickvangelderen · 2024-03-07T20:22:20Z

@ritchie46 is it correct that we read the schema during the construction of the logical plan, and then only later during execution of the physical plan read the entire file?

If that is the case, I think I should provide a function that reads just the schema and one that reads the necessary data. These functions will take different options. For example, reading the schema wouldn't require projection names. Reading the data can take the projection indices computed from the schema + indices. There are probably other differences. Some options are probably be shared (some of the cloud-related options?).

ritchie46 · 2024-03-08T14:12:21Z

crates/polars-lazy/src/physical_plan/executors/scan/ipc.rs

+
+            #[cfg(feature = "cloud")]
+            {
+                // TODO: This will block the current thread until the data has been


This is fine as the default engine cannot work on batches anyway. What our goal here will be is downloading parts of the file (e.g. Recordbatches/ Chunks) concurrently.

ritchie46 · 2024-03-08T14:14:41Z

crates/polars-io/src/ipc/object_store_reader.rs

+fn read_ipc<R: MmapBytesReader>(reader: R, options: IpcReadOptions) -> PolarsResult<DataFrame> {
+    let mut reader = IpcReader::new(reader);
+
+    if let Some(columns) = options.columns {


Here we should only accept column indices. This is not user-facing code, so the extra option only complicates it here.

ritchie46 · 2024-03-08T14:14:57Z

crates/polars-io/src/ipc/object_store_reader.rs

+}
+
+async fn read_bytes<O: ObjectStore>(store: &O, path: &ObjectPath) -> PolarsResult<Bytes> {
+    // TODO: Is `to_compute_err` appropriate? It is used in the Parquet


Compute error is fine for now.

ritchie46 · 2024-03-08T14:17:18Z

crates/polars-io/src/ipc/object_store_reader.rs

+async fn read_bytes<O: ObjectStore>(store: &O, path: &ObjectPath) -> PolarsResult<Bytes> {
+    // TODO: Is `to_compute_err` appropriate? It is used in the Parquet
+    // reader as well but I am not sure it is what we want.
+    let get_result = store.get(path).await.map_err(to_compute_err)?;


Here we should split up the work. We should be able download concurrently and the first chunks should be downloaded first. We should have a batched interface that can already get DataFrames of the chunks that are already downloaded.

mickvangelderen · 2024-03-12T13:24:15Z

Closing this in favor of #14984

mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 66301f9 to 47bc620 Compare March 5, 2024 15:19

ritchie46 reviewed Mar 5, 2024

View reviewed changes

crates/polars-io/src/ipc/object_store_reader.rs Outdated Show resolved Hide resolved

mickvangelderen force-pushed the read-ipc-file-from-object-store branch 2 times, most recently from ed741ae to 63813bc Compare March 6, 2024 16:31

mickvangelderen force-pushed the read-ipc-file-from-object-store branch 2 times, most recently from d4e4b18 to 65a446c Compare March 7, 2024 19:50

mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 65a446c to 36f18a3 Compare March 8, 2024 09:07

mickvangelderen changed the title ~~Read ipc file from object store~~ feat(rust, python): read ipc file from network Mar 8, 2024

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 8, 2024

mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 36f18a3 to 4174ca1 Compare March 8, 2024 09:11

ritchie46 reviewed Mar 8, 2024

View reviewed changes

Start implementing async ipc reader

c1a1a37

mickvangelderen force-pushed the read-ipc-file-from-object-store branch from 4174ca1 to c1a1a37 Compare March 11, 2024 09:33

WIP

f655407

mickvangelderen closed this Mar 12, 2024

mickvangelderen deleted the read-ipc-file-from-object-store branch March 12, 2024 13:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rust, python): read ipc file from network #14861

feat(rust, python): read ipc file from network #14861

mickvangelderen commented Mar 5, 2024

codecov bot commented Mar 6, 2024 •

edited

Loading

mickvangelderen commented Mar 7, 2024

ritchie46 Mar 8, 2024

ritchie46 Mar 8, 2024

ritchie46 Mar 8, 2024

ritchie46 Mar 8, 2024

mickvangelderen commented Mar 12, 2024

feat(rust, python): read ipc file from network #14861

feat(rust, python): read ipc file from network #14861

Conversation

mickvangelderen commented Mar 5, 2024

codecov bot commented Mar 6, 2024 • edited Loading

Codecov Report

mickvangelderen commented Mar 7, 2024

ritchie46 Mar 8, 2024

Choose a reason for hiding this comment

ritchie46 Mar 8, 2024

Choose a reason for hiding this comment

ritchie46 Mar 8, 2024

Choose a reason for hiding this comment

ritchie46 Mar 8, 2024

Choose a reason for hiding this comment

mickvangelderen commented Mar 12, 2024

codecov bot commented Mar 6, 2024 •

edited

Loading