Polars Lazyframe Support #775

buggtb · 2024-03-22T11:04:02Z

This PR is a step towards supporting Polars LazyFrames in Hamilton.

Changes

So far I've stubbed out the CSV reader/writer so that it works in both eager and lazy mode in Polars.

How I tested this

There is an accompanying test that mirrors the eager test but uses LazyFrames instead.

Notes

I've also had to update get_dataframe_metadata in utils.py so that it works with LazyFrames, which don't have a row count. I abstracted the lookups so that each one is attempted independently: if some of them fail for other readers/writers in the future, the function still returns whatever metadata it can.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

skrawcz

Thanks @buggtb for driving this!

I think we can simplify a few things actually.

polars_extensions.py: All existing polars DataSavers can have the following applicable_types():

    @classmethod
    def applicable_types(cls) -> Collection[Type]:
        return [DATAFRAME_TYPE, pl.LazyFrame]

Then in the save_data() method, add a Union type annotation that includes pl.LazyFrame, and then in the body:

    if isinstance(data, pl.LazyFrame):
        data = data.collect()
    # continue with rest of function

h_polars.py should be modified so that:
(a) it includes a PolarsLazyDataFrameResult -- this won't call collect, it will just return a LazyFrame (though I'm not sure it makes sense).
(b) the existing PolarsDataFrameResult has isinstance checks for LazyFrame and, if one is found, collects it and moves on.
This means we can delete h_polars_lazyframe.py.
polars_lazyframe_extensions.py: I would use spark_extensions.py as the template. No need to copy everything that polars_extensions.py has. Otherwise it should house the scan_* and sink_* DataLoaders and DataSavers, e.g. scan_csv. Let me know if you need more guidance here.
Remove polars_shared.py.
I think this is a little over-engineered. I don't see the value in the abstraction right now -- I think a little duplication is more straightforward at this time.
io/utils.py: I think the changes here are okay. @elijahbenizzy and I can think about whether there's a nicer way to do this.

buggtb · 2024-03-25T12:42:19Z

Thanks for the review @skrawcz

A couple of follow up questions:

https://github.com/DAGWorks-Inc/hamilton/pull/775/files#diff-fbd89fcde3b2c949a6f81da8ecfd52a163490f1645f1eca8f9451047cf7785faR179

The save function is fair enough. The load is more problematic in terms of design: if I want to load into a LazyFrame instead of a DataFrame, there's nothing in there currently to allow that. I could extend that function to default to a DataFrame and let the user request a LazyFrame, I guess. I'm unsure how you'd want that pattern to look; perhaps something like this, or something else:

    def load_data(self, type_: Type, frametype: Union[DATAFRAME_TYPE, pl.LazyFrame] = pl.DataFrame) -> Tuple[DATAFRAME_TYPE, Dict[str, Any]]:
        if isinstance(type_, pl.LazyFrame):
            df = pl.scan_csv(self.file, **self._get_loading_kwargs())
        else:
            df = pl.read_csv(self.file, **self._get_loading_kwargs())

        metadata = utils.get_file_and_dataframe_metadata(self.file, df)
        return df, metadata

Also, there are some minor differences in the kwargs for the LazyFrame compared to the DataFrame, so I'd have to add some conditional logic to the get_kwargs() functions to return the valid set for each.

Not entirely sure how the spark extensions example applies here, as I suspect we'll need loaders and savers for each type, unless we can overload an existing class. That doesn't appear to be what the spark_extensions example does, but I'm probably missing something.

skrawcz · 2024-03-25T17:01:32Z

> Thanks for the review @skrawcz
>
> A couple of follow up questions:
>
> https://github.com/DAGWorks-Inc/hamilton/pull/775/files#diff-fbd89fcde3b2c949a6f81da8ecfd52a163490f1645f1eca8f9451047cf7785faR179
>
> The save function is fair enough. The load is more problematic in terms of design: if I want to load into a LazyFrame instead of a DataFrame, there's nothing in there currently to allow that. I could extend that function to default to a DataFrame and let the user request a LazyFrame, I guess. I'm unsure how you'd want that pattern to look; perhaps something like this, or something else:
>
>     def load_data(self, type_: Type, frametype: Union[DATAFRAME_TYPE, pl.LazyFrame] = pl.DataFrame) -> Tuple[DATAFRAME_TYPE, Dict[str, Any]]:
>         if isinstance(type_, pl.LazyFrame):
>             df = pl.scan_csv(self.file, **self._get_loading_kwargs())
>         else:
>             df = pl.read_csv(self.file, **self._get_loading_kwargs())
>
>         metadata = utils.get_file_and_dataframe_metadata(self.file, df)
>         return df, metadata
>
> Also, there are some minor differences in the kwargs for the LazyFrame compared to the DataFrame, so I'd have to add some conditional logic to the get_kwargs() functions to return the valid set for each.

Yep, so I'm thinking don't couple them at all, precisely for the reasons you mentioned. So polars_lazyframe_extensions.py would contain the "load" ones specific to creating LazyFrames, i.e. those that pertain to scan_csv, scan_parquet, etc. Does that clarify it more?

> Not entirely sure how the spark extensions example applies here, as I suspect we'll need loaders and savers for each type, unless we can overload an existing class. That doesn't appear to be what the spark_extensions example does, but I'm probably missing something.

I just mean to use it as a template for the structure: PySpark doesn't make use of the column stuff, and LazyFrames wouldn't make use of it either.

buggtb · 2024-03-26T12:24:46Z

Okay, so I removed polars_shared, then removed most of polars_lazyframe_extensions but kept the loader. Can you check that it's as you imagined before I go and fill out the other savers/loaders? I see how the save_to and load_from annotations work and it makes sense now. I wasn't sure how you could blend loaders and savers across classes; all good.

In h_polars.py I took the same logic I had in the lazyframe version, so I do run collect and return a DataFrame. When I was running it, it made sense to me to collect at the end and return a DataFrame, since the pipeline is done. But now that I think about it, there are times (chained pipelines, subdags, or whatever) when you may not want to. I don't know, the jury is out there!

skrawcz

Yep looking good. Added code for what I would change.

hamilton/plugins/polars_extensions.py

hamilton/plugins/polars_lazyframe_extensions.py

skrawcz · 2024-03-26T16:56:35Z

> In h_polars.py I took the same logic I had in the lazyframe version, so I do run collect and return a DataFrame. When I was running it, it made sense to me to collect at the end and return a DataFrame, since the pipeline is done. But now that I think about it, there are times (chained pipelines, subdags, or whatever) when you may not want to. I don't know, the jury is out there!

Yep! So that's a vote for a second one specific to LazyFrames IMO.

hamilton/plugins/polars_extensions.py

The applicable types for PolarsCSVWriter, PolarsParquetWriter, and PolarsFeatherWriter have been extended to include pl.LazyFrame in addition to the existing pl.DataFrame. This change allows these writer classes to handle both eager and lazy data frames from the polars library.

skrawcz

Looking great. A few minor things.

We should also have a lazyframe test for each of the savers. If code duplication is the simplest here, I think that's fine; not sure there's a DRY way that would be worth the time/effort since this is a one off.

hamilton/plugins/h_polars_lazyframe.py

hamilton/plugins/polars_extensions.py

The PolarsLazyFrameResult class now uses the PolarsLazyFrameResult instead of the PolarsDataFrameResult. The logging statement in register_types() has been removed. DataSaver classes have been updated to handle both DATAFRAME_TYPE and pl.LazyFrame types, with a check added to collect data if it's a LazyFrame before saving. Tests have been updated and expanded to cover these changes, including checks for applicable types and correct handling of LazyFrames.

The applicable_types method in the PolarsSpreadsheetWriter class and corresponding test assertions have been updated to include pl.LazyFrame, along with the existing DATAFRAME_TYPE. This change extends the functionality of our Polars extensions to handle LazyFrames as well as DataFrames.

skrawcz

Thanks @buggtb for the work! Looking good. I think we just need an example to show people how to use the new functionality you just added!

So two things:

Can you add an example exercising the scan variants, e.g. something under examples/polars/lazyframe? I think we should add this to complete this PR.
There are two minor things (see suggested commits). I can do the commit before merging, or you can. Let me know.

Otherwise, a question: should I create an issue to track the sink_* variants for LazyFrame writing to CSV, etc.? No need to add it in this PR, but I can see that being a natural progression now that we have scan_* support.

Thanks!

tests/plugins/test_polars_lazyframe_extensions.py

This update introduces a new example demonstrating the use of Polars LazyFrame. The changes include:
- Creation of two new Python scripts: one defining functions for loading data and calculating spend per signup, and another script to execute these functions.
- Addition of a README file explaining how to run the example, visualize execution, and detailing some caveats with Polars.
- Inclusion of a requirements.txt file specifying necessary dependencies.
- Addition of sample CSV data for testing purposes.

The test methods for PolarsScanParquetReader and PolarsScanFeatherReader have been updated. Instead of using pl.DataFrame to load data, they now use pl.LazyFrame. This change aligns with the applicable types for these readers.

buggtb · 2024-03-28T11:59:47Z

Added a test which loads a CSV using the annotation, does a simple sum, and dumps out the result as a LazyFrame. I also added a comment in my_script.py explaining how you could use either LazyFrame or DataFrame as the result set, depending on what you need.

Added #792 and #791 to the issues list for some missing sources and sinks.

skrawcz · 2024-03-28T15:08:59Z

Looks great let's 🚢 it.

But, argh, I see there's a rebase that's needed. Mind doing that? Then we'll be able to squash merge.

buggtb · 2024-03-28T15:11:18Z

Pulled the latest main branch into my fork, should be good, I think.

skrawcz · 2024-03-28T17:03:18Z

hmm -- 🤔

buggtb · 2024-03-28T17:05:50Z

I'm not sure why I don't see the same

skrawcz · 2024-03-28T18:42:30Z

Done. Thanks @buggtb 🍾 . This will go out next week.

Tom Barber added 3 commits March 21, 2024 16:41

temp updates

51b875a

initial prototype migration to shared polars for lazyframe support

ec37b32

clean up log lines

ef90eb3

buggtb marked this pull request as draft March 22, 2024 11:04

Tom Barber added 10 commits March 22, 2024 11:15

fix function refactor

604d136

Implement materializer so that it returns a dataframe after processing

ae70b41

fix linting

760fe21

fix linting

ec8a11d

fix linting

bf34df2

update linting

1eb3929

update linting

3eaa10a

update linting

a32c04a

fix linting

5cb77d4

fix linting

3ff84b0

skrawcz requested changes Mar 24, 2024

update for example

cfbc088

update PR prototype code

7a9eb83

skrawcz reviewed Mar 26, 2024

skrawcz reviewed Mar 26, 2024

Tom Barber and others added 7 commits March 27, 2024 14:56

update tests

7513060

update tests

25662b2

update tests

0abb830

finish tests for other parsers

bed412a

Merge branch 'main' of github.com:buggtb/hamilton

f302b14

Merge branch 'DAGWorks-Inc:main' into main

de9cd80

Add lazyframe implementation

5d48c6e

skrawcz reviewed Mar 27, 2024

buggtb added 3 commits March 27, 2024 17:18

fix test

fd1c011

buggtb marked this pull request as ready for review March 27, 2024 18:12

buggtb changed the title from "DRAFT: Polars Lazyframe Support" to "Polars Lazyframe Support" Mar 27, 2024

skrawcz reviewed Mar 27, 2024

buggtb added 2 commits March 28, 2024 11:05

Updated data loading method in tests

557ec54

skrawcz approved these changes Mar 28, 2024

Merge branch 'DAGWorks-Inc:main' into main

4a8c27b

update to force new build cause why not

2c32470

skrawcz merged commit 39ce9e0 into DAGWorks-Inc:main Mar 28, 2024
23 checks passed
