feat(python): Support pytorch Tensor and Dataset export with new to_torch DataFrame/Series method #15931

Merged (6 commits) on May 3, 2024

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Apr 27, 2024

This adds streamlined DataFrame (and Series) export to PyTorch Tensor and Dataset.

(Note: a torch.utils.data.IterableDataset [1] option would likely be useful for particularly large DataFrames. This needs some experimentation, so it will be left to a subsequent PR 🤔).

Features

Supported PyTorch export types:

  • df.to_torch(): export the entire frame to a single 2D Tensor (equivalent to df.to_torch("tensor")).
  • df.to_torch("dict"): export frame to a dictionary of Tensors.
  • df.to_torch("dataset"): export frame to a PolarsDataset (inheriting from TensorDataset, but additionally offering clean frame integration and a handful of other niceties; can also be imported independently from polars.ml.torch).

The PolarsDataset object:

  • Inherits from TensorDataset and, once initialised, is drop-in compatible with it.
  • Supports designation of label/features via Polars selectors [2].
  • Offers convenience half() method for experimenting with float16 data early.
  • Offers convenience schema attribute, showing the current feature/label dtypes.
  • Offers convenience features and labels attributes (in addition to tensors).
  • More informative repr, for example:
    <PolarsDataset [len:20640, features:8, labels:1] at 0x3301B1EE0>.

Examples

  • As 2D Tensor:

    import polars as pl
    df = pl.DataFrame(
        data=[(0, 1, 1.5), (1, 0, -0.5), (2, 0, 0.0), (3, 1, -2.0)],
        schema=["lbl", "feat1", "feat2"],
    )
    df.to_torch()  # or df.to_torch("tensor")
    # tensor([[0.0000,  1.0000,  1.5000],
    #         [1.0000,  0.0000, -0.5000],
    #         [2.0000,  0.0000,  0.0000],
    #         [3.0000,  1.0000, -2.0000]], dtype=torch.float64)
  • As dict of Tensors:

    df.to_torch("dict")
    # {'lbl': tensor([0, 1, 2, 3]),
    #  'feat1': tensor([1, 0, 0, 1]),
    #  'feat2': tensor([1.5000, -0.5000,  0.0000, -2.0000], dtype=torch.float64)}

    # the dict values can also be passed straight to a standard TensorDataset
    from torch.utils.data import TensorDataset
    ds = TensorDataset(*df.to_torch("dict").values())
    ds.tensors
    # (tensor([0, 1, 2, 3]),
    #  tensor([1, 0, 0, 1]),
    #  tensor([ 1.5000, -0.5000,  0.0000, -2.0000], dtype=torch.float64))
  • Demonstrate PolarsDataset usage with some scikit-learn data:

    Establish a DataFrame from the sklearn datasets...

    from sklearn.datasets import fetch_california_housing
    import polars as pl
    
    housing = fetch_california_housing()
    
    df = pl.DataFrame(
        data=housing.data,
        schema=housing.feature_names,
    ).with_columns(
        target=housing.target,
    )

    ...trivially export a float32 Dataset with features/labels...

    train = df.to_torch("dataset", label="target", dtype=pl.Float32)
    train.schema
    # {'features': torch.float32, 'labels': torch.float32}

    ...and pass to a DataLoader:

    from torch.utils.data import DataLoader
    train_iter = iter(DataLoader(
        train,
        shuffle=True,
        batch_size=64,
    ))
    next(train_iter)
    # [tensor([[2.1764e+00,  2.4000e+01,  4.4074e+00,  1.0397e+00,  3.9960e+03,
    #           2.2640e+00,  3.4020e+01, -1.1751e+02],
    #          [2.2467e+00,  4.6000e+01,  5.9407e+00,  1.1045e+00,  1.3390e+03,
    #           3.7825e+00,  3.7740e+01, -1.2218e+02],
    # ...

Follow-up

Feedback on this one is welcome! It has been marked as unstable (though likely only for a limited time) to allow for quick iteration/tweaks if necessary.

Likely upcoming additions:

  • A PolarsIterableDataset could be useful for constraining peak memory usage (e.g. by not materialising all frame data to Tensor format up-front).
  • A sequence_length parameter for Dataset export has been suggested, which could be helpful for transformer use-cases (vs linear regression).

Note

The associated unit tests provide 100% line coverage of the new code 🎯

Footnotes

  1. https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset

  2. https://docs.pola.rs/py-polars/html/reference/selectors.html

@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels Apr 27, 2024
@alexander-beedie alexander-beedie force-pushed the torch-tensor-export branch 7 times, most recently from 86287c7 to 082f662 Compare April 27, 2024 11:19

codecov bot commented Apr 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.31%. Comparing base (2e28176) to head (e14a539).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15931      +/-   ##
==========================================
+ Coverage   81.26%   81.31%   +0.04%     
==========================================
  Files        1381     1382       +1     
  Lines      176636   176953     +317     
  Branches     3034     3056      +22     
==========================================
+ Hits       143549   143882     +333     
+ Misses      32606    32589      -17     
- Partials      481      482       +1     
Flag     Coverage Δ
python   74.76% <100.00%> (+0.03%) ⬆️
rust     78.37% <9.75%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

@alexander-beedie alexander-beedie force-pushed the torch-tensor-export branch 6 times, most recently from db05dc3 to a7838c2 Compare April 28, 2024 19:37
@alexander-beedie alexander-beedie added the highlight (Highlight this PR in the changelog) label Apr 28, 2024
@ritchie46
Member

It has been marked as unstable

Good. I think we should keep that for third party exports as it feels like much out of our control and not our core. Otherwise I'd be much more hesitant.

On the testing, I'd like to have a new pytest mark for larger third party integrations. That way we can test this in CI and not be bothered with the extra testing overhead when we are developing local to the core. I don't anticipate any of this breaking when I do large refactors.

@alexander-beedie
Collaborator Author

alexander-beedie commented Apr 29, 2024

Good. I think we should keep that for third party exports as it feels like much out of our control and not our core. Otherwise I'd be much more hesitant.

Yup. While it's a large commit, the functionality itself is actually straightforward. Unless PyTorch undergoes some fairly radical restructuring we won't have any trouble. Definitely should leave it 'unstable' for now, but we can look to soften/drop that message once we have a decent number of releases without incident/modification ✌️

On the testing, I'd like to have a new pytest mark for larger third party integrations.

Done! Added a new @pytest.mark.third_party_integration marker, grouped all of the new tests into a class, and set the new marker on that to tag them all in one go. Won't run locally by default (unless -m ""), but will still participate in CI.


Also, this started out in polars.io.torch (which didn't really make much sense) but I moved it into polars.ml.torch; adding a dedicated ml subdirectory provides a natural top-level home for this and any other ml-related integrations (I already have several follow-ups in mind). If anyone has a better idea though, I'm all ears 🤔

Member

@ritchie46 ritchie46 left a comment

Thanks Alex. Looks good. Especially given the "unstable", let's give it a spin!

@ritchie46 ritchie46 merged commit 8322323 into pola-rs:main May 3, 2024
24 checks passed
@alexander-beedie alexander-beedie deleted the torch-tensor-export branch May 3, 2024 13:44
@alexander-beedie alexander-beedie added the A-interop (Area: interoperability with other libraries) label May 17, 2024