polars.LazyPolarsDataset .collect() streaming #519

butterlyn · 2024-01-21T06:33:23Z

Description

Enable Polars streaming by default when saving polars.LazyPolarsDataset

Context

Enables larger-than-memory data processing, one of the main advantages of using Polars LazyFrames.

Possible Implementation

Use .collect(streaming=True) when saving in polars.LazyPolarsDataset.

If streaming cannot be performed for a given query, Polars falls back to non-streaming execution automatically at runtime, so having streaming as the default behaviour should be okay.

Possible Alternatives

Add a flag to enable/disable streaming through data catalog load_args. However, this may be problematic given streaming is not an argument of LazyFrame.sink_csv(), but rather an argument of LazyFrame.collect().

astrojuanlu · 2024-01-24T11:10:21Z

Thanks @butterlyn for this feature request!

I'm going to suggest a third alternative, which is adding a dataset-level property, like this:

ds:
    type: polars.LazyPolarsDataset
    streaming: true

how does that sound?

butterlyn · 2024-01-24T11:17:32Z

@astrojuanlu Love the idea! That'd be perfect

astrojuanlu · 2024-01-24T12:47:20Z

This one is actually easy I'd say :) It requires adding a new argument to the initialiser:

kedro-plugins/kedro-datasets/kedro_datasets/polars/lazy_polars_dataset.py

Lines 76 to 81 in a88ad7f

    
    def __init__(  # noqa: PLR0913
        self,
        *,
        filepath: str,
        file_format: str,
        load_args: Optional[dict[str, Any]] = None,

And then storing it in an internal property, and using it where appropriate.
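A minimal sketch of what that could look like, assuming a hypothetical stand-in class (SketchLazyDataset, its save method and its save_args parameter are illustrative names, not the actual kedro-datasets code). It takes a streaming argument, stores it, and forwards it to .collect() on save. The streaming keyword is the one discussed in this thread; check that your Polars version still accepts it.

```python
from typing import Any, Optional


class SketchLazyDataset:
    """Hypothetical stand-in for a lazy dataset; not the kedro-datasets class."""

    def __init__(
        self,
        *,
        filepath: str,
        file_format: str,
        save_args: Optional[dict[str, Any]] = None,
        streaming: bool = False,  # assumed new dataset-level option
    ) -> None:
        self._filepath = filepath
        self._file_format = file_format
        self._save_args = dict(save_args or {})
        # Stored once so the save path can consult it later.
        self._streaming = streaming

    def save(self, lazy_frame: Any) -> None:
        # Materialise the lazy query, forwarding the stored flag to collect().
        frame = lazy_frame.collect(streaming=self._streaming)
        # Look up the eager writer for the format, e.g. write_csv or write_parquet.
        writer = getattr(frame, f"write_{self._file_format}")
        writer(self._filepath, **self._save_args)
```

Defaulting the flag to False keeps the existing behaviour unless a catalog entry opts in with streaming: true.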

astrojuanlu · 2024-02-06T08:24:08Z

In fact, I'm thinking - rather than using .collect() and then .write_*, shouldn't we use .sink_* directly? cc @cpinon-grd (comes from https://linen-slack.kedro.org/t/16374083/hey-team-is-there-any-way-to-store-a-lazypolarsdataframe-wit#76e13870-de5a-4a1d-86c3-f0c30f2ebf25)
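To make the idea concrete, here is a rough sketch of the dispatch, assuming a hypothetical helper named save_lazy (not part of kedro-datasets or Polars). It looks for a sink_<format> method on the lazy frame, such as sink_parquet or sink_csv, and only falls back to .collect() plus write_<format> when no sink exists for that format.

```python
from typing import Any


def save_lazy(lazy_frame: Any, filepath: str, file_format: str, **save_args: Any) -> str:
    """Write a lazy query to disk, preferring a sink_* method when one exists.

    Returns "sink" or "collect" to report which path was taken.
    """
    # Hypothetical dispatch: look up sink_<format> on the lazy frame.
    sink = getattr(lazy_frame, f"sink_{file_format}", None)
    if callable(sink):
        # The query is executed and written without returning a DataFrame here.
        sink(filepath, **save_args)
        return "sink"
    # No sink for this format: materialise first, then use the eager writer.
    frame = lazy_frame.collect()
    getattr(frame, f"write_{file_format}")(filepath, **save_args)
    return "collect"
```

Returning which path was taken is only there to make the sketch easy to check; a real dataset would more likely log it.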

MatthiasRoels · 2024-06-27T17:45:09Z

As per my comment here, I wouldn't recommend using streaming or sink_* methods. Even when using .collect(streaming=True), it is explicitly mentioned in the docs that streaming mode is considered unstable.

astrojuanlu · 2024-06-27T20:10:40Z

Streaming functionality is indeed considered unstable: pola-rs/polars#13948

But as far as I understand, sink_* methods in non-streaming mode are okay?

astrojuanlu · 2024-06-27T20:14:48Z

Let's close this issue in favour of #702: no streaming=True, but let's continue the discussion on using the lazy methods for LazyPolarsDataset.

cpinon-grd · 2024-06-28T07:18:43Z

Hey! If I'm not wrong, processing larger than memory datasets is one of the key features of Polars. Polars docs state:

With the lazy API Polars doesn't run each query line-by-line but instead processes the full query end-to-end. To get the most out of Polars it is important that you use the lazy API because:

the lazy API allows Polars to apply automatic query optimization with the query optimizer

the lazy API allows you to work with larger than memory datasets using streaming

the lazy API can catch schema errors before processing the data

Isn't it a bit weird that in order to "get the most out of Polars", the Polars team recommends an unstable solution? If using streaming mode is unstable, what is the "recommended"/"your go to" solution?

MatthiasRoels · 2024-06-28T07:35:35Z

If you use the Lazy API, you already get some optimisations such as predicate and projection pushdown. This means that you only read into memory the rows/columns that you need (as opposed to the full dataset).

astrojuanlu added the enhancement, Community, and datasets labels Jan 24, 2024

astrojuanlu added the good first issue label Jan 24, 2024

astrojuanlu mentioned this issue Mar 19, 2024

polars.EagerPolarsDataset fails to read parquet #590

Open

astrojuanlu added a commit that referenced this issue Mar 19, 2024

Use lazy sink instead of collecting data

f8893b6

Fix #519. Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

astrojuanlu mentioned this issue Mar 19, 2024

fix(datasets): Use lazy .sink_* instead of collecting data #619

Closed

4 tasks

astrojuanlu closed this as not planned Jun 27, 2024
