polars.LazyPolarsDataset .collect() streaming #519
Comments
Thanks @butterlyn for this feature request! I'm going to suggest a third alternative, which is adding a dataset-level property, like this:
ds:
  type: polars.LazyPolarsDataset
  streaming: true
How does that sound? |
@astrojuanlu Love the idea! That'd be perfect |
This one is actually easy I'd say :) It requires adding a new argument to the initialiser:
And then storing it in an internal property, and using it where appropriate. |
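A minimal sketch of that approach, under stated assumptions: `LazyDatasetSketch`, `_materialise`, and `FakeLazyFrame` are hypothetical names for illustration and not the real `polars.LazyPolarsDataset` code, and the `streaming=False` default is an assumption. The only Polars call mirrored here is `.collect(streaming=...)`, as proposed in the issue description.

```python
from typing import Any, Protocol


class SupportsCollect(Protocol):
    """Anything with a LazyFrame-like ``collect(streaming=...)`` method."""

    def collect(self, *, streaming: bool = False) -> Any: ...


class LazyDatasetSketch:
    """Hypothetical stand-in for polars.LazyPolarsDataset (not the real class)."""

    def __init__(self, filepath: str, streaming: bool = False) -> None:
        self._filepath = filepath
        # Store the new initialiser argument in an internal property.
        self._streaming = streaming

    def _materialise(self, data: SupportsCollect) -> Any:
        # Use the stored flag at the point where the lazy frame is collected.
        return data.collect(streaming=self._streaming)


class FakeLazyFrame:
    """Tiny stub so the sketch runs without Polars installed."""

    def __init__(self) -> None:
        self.calls = []  # records the streaming flag of each collect() call

    def collect(self, *, streaming: bool = False) -> str:
        self.calls.append(streaming)
        return "collected"


frame = FakeLazyFrame()
dataset = LazyDatasetSketch("data.csv", streaming=True)
result = dataset._materialise(frame)
```

With a catalog entry that sets `streaming: true`, the flag would presumably reach the initialiser as a keyword argument, in the same way as the other dataset-level properties.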
In fact, I'm thinking - rather than using |
Fix #519. Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
As per my comment here, I wouldn't recommend using streaming or |
Streaming functionality is indeed considered unstable pola-rs/polars#13948 But as far as I understand, |
Let's close this issue in favour of #702, therefore no |
Hey! If I'm not wrong, processing larger than memory datasets is one of the key features of Polars. Polars docs state:
Isn't it a bit weird that in order to "get the most out of Polars", the Polars team recommends an unstable solution? If using streaming mode is unstable, what is the "recommended"/"your go to" solution? |
If you use the Lazy API, you already get some optimisations such as predicate and projection pushdown. This means that you only read into memory the rows/columns that you need (as opposed to the full dataset). |
Description
Enable Polars streaming by default when saving polars.LazyPolarsDataset
Context
Enables larger-than-memory data processing, one of the main advantages of using Polars LazyFrames.
Possible Implementation
.collect(streaming=True) in polars.LazyPolarsDataset.
If streaming cannot be performed for whatever reason, Polars disables streaming automatically at runtime, so having streaming as the default behaviour should be okay.
Possible Alternatives
Add a flag to enable/disable streaming through data catalog load_args. However, this may be problematic given streaming is not an argument of LazyFrame.sink_csv(), but rather an argument of LazyFrame.collect().