-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Polars Lazyframe Support #775
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @buggtb for driving this!
I think we can simplify a few things actually.
polars_extensions.py
: All existing polarsDataSaver
s can have the followingapplicable_types()
:
@classmethod
def applicable_types(cls) -> Collection[Type]:
return [DATAFRAME_TYPE, pl.LazyFrame]
Then in the save_data()
method, add union type annotation with pl.LazyFrame, and then in the body:
if isinstance(data, pl.LazyFrame):
data = data.collect()
# continue with rest of function
-
h_polars.py
should be modified to:
(a) include a PolarsLazyDataFrameResult -- this won't call collect, it will just return a LazyFrame. (though not sure if it makes sense)
(b) the existing PolarsDataFrameResult should have isinstance checks for lazyframe, and if found, should collect and move on.
This means we can deleteh_polars_lazyframe.py
. -
polars_lazyframe_extensions.py
: I would usespark_extensions.py
as the template. No need to copy all of whatpolars_extesions.py
has. Otherwise it should house thescan_*
andsink_*
DataLoaders and DataSavers, e.g.scan_csv
. Let me know if you need more guidance here. -
Remove
polars_shared.py
.
I think this is a little over engineering. I don't see the value in the abstraction right now -- I think a little duplication is more straightforward at this time. -
io/utils.py
. I think the changes here are okay. @elijahbenizzy and I can think about whether there's a nicer way to do this.
Thanks for the review @skrawcz A couple of follow up questions: the save function is fair enough, the load is more problematic in terms of design, if I want to load into a lazy frame instead of a dataframe, there's nothing in there currently to allow that. I could extend that function to default to dataframe and let a user request a lazyframe I guess. I'm unsure how you'd want that pattern to look perhaps something like this, or something else:
also there are some minor differences in kwargs in the lazyframe when compared to the dataframe and so I'd have to add some conditional logic to the get_kwargs() functions to return the valid sets there. Not entirely sure how the spark extensions example applies here as I suspect we'll need loaders and savers for each type, unless we can overload an existing class but that doesn't appear to be what the spark_extensions example does, but I'm probably missing something. |
Yep, so I'm thinking don't couple them at all precisely for the reasons you mentioned. So
I just mean to use it in terms of structure. Since pyspark doesn't make use of the column stuff. Lazyframes wouldn't make use of it either. |
Okay so I removed Polars shared, I then removed most from polars_lazyframe_exensions but kept the loader. Can you check to make sure it's as you imaged, before I go fill out the other save/loaders etc. I see how the save_to and load_from annotations work it makes sense now, I wasn't sure how you could blend loaders and savers across classes, all good. In h_polars.py I took the same logic I had in the lazyframe version and I do run collect and return a dataframe. When I was running it, to me at least it made sense to collect at the end and return a dataframe as the pipeline is done, but now I think about it there are times, perhaps if you're running chained pipelines, subdags or whatever that you may not want to, I don't know, the jury is out there! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep looking good. Added code for what I would change.
Yep! So that's a vote for a second one specific to LazyFrames IMO. |
The applicable types for PolarsCSVWriter, PolarsParquetWriter, and PolarsFeatherWriter have been extended to include pl.LazyFrame in addition to the existing pl.DataFrame. This change allows these writer classes to handle both eager and lazy data frames from the polars library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great. A few minor things.
We should also have a lazyframe test for each of the savers. If code duplication is the simplest here, I think that's fine; not sure there's a DRY way that would be worth the time/effort since this is a one off.
The PolarsLazyFrameResult class now uses the PolarsLazyFrameResult instead of the PolarsDataFrameResult. The logging statement in register_types() has been removed. DataSaver classes have been updated to handle both DATAFRAME_TYPE and pl.LazyFrame types, with a check added to collect data if it's a LazyFrame before saving. Tests have been updated and expanded to cover these changes, including checks for applicable types and correct handling of LazyFrames.
The applicable_types method in the PolarsSpreadsheetWriter class and corresponding test assertions have been updated to include pl.LazyFrame, along with the existing DATAFRAME_TYPE. This change extends the functionality of our Polars extensions to handle LazyFrames as well as DataFrames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @buggtb for the work! Looking good I think we just need an example to help people show how to use the new functionality you just added!
So two things:
- Can you add an example exercising the scan variants? E.g. add something to
examples/polars/lazyframe
? I think we should add this to complete this PR. - There are two minor things (see suggested commits), I can do the commit before merging, or you can. Let me know.
Otherwise question, should I create an issue for the sink_*
variants for lazyframe writing to csv, etc. to track that? No need to add it in this PR, but I can see that be a natural progression if we have scan_* support.
Thanks!
This update introduces a new example demonstrating the use of Polars LazyFrame. The changes include: - Creation of two new Python scripts: one defining functions for loading data and calculating spend per signup, and another script to execute these functions. - Addition of a README file explaining how to run the example, visualize execution, and detailing some caveats with Polars. - Inclusion of a requirements.txt file specifying necessary dependencies. - Addition of sample CSV data for testing purposes.
The test methods for PolarsScanParquetReader and PolarsScanFeatherReader have been updated. Instead of using pl.DataFrame to load data, they now use pl.LazyFrame. This change aligns with the applicable types for these readers.
Added a test which loads a CSV using the annotation, does a simple sum and dumps out the result as a LazyFrame, added a comment in my_script.py explaining how you could use either LazyFrame or DataFrame as a resultset depending on what you needed. Added #792 and #791 on the issues lists for some missing sources and sinks |
Looks great let's 🚢 it. But, argh, I see there's a rebase that's needed. Mind doing that? Then we'll be able to squash merge. |
Pulled the latest main branch into my fork, should be good, I think. |
Done. Thanks @buggtb 🍾 . This will go out next week. |
This PR is to aid for support of Polars LazyFrames in Hamilton.
Changes
I've currently stubbed out the CSV Reader / Writer to work both on Eager and Lazy mode in Polars.
How I tested this
There is an accompanying test that mimics the Eager test but using Lazyframes instead
Notes
I've also had to update the get_dataframe_metadata in utils.py to allow it to work with Lazyframes that don't have a row count. I abstracted all the lookups so that if others passed/failed in the future for support of other read/writers they would return what they can.
Checklist