feat(python): Support `pytorch` Tensor and Dataset export with new `to_torch` DataFrame/Series method #15931

alexander-beedie · 2024-04-27T10:53:27Z

This adds streamlined DataFrame (and Series) export to PyTorch Tensor and Dataset.

(Note: a torch.IterableDataset¹ option would likely be useful for particularly large DataFrames. Needs some experimentation, so will be left to a subsequent PR 🤔).

Features

Supported PyTorch export types:

df.to_torch(): export the entire frame to a single 2D Tensor (equivalent to df.to_torch("tensor")).
df.to_torch("dict"): export frame to a dictionary of Tensors.
df.to_torch("dataset"): export frame to a PolarsDataset (inheriting from TensorDataset, but additionally offering clean frame integration and a handful of other niceties; can also be imported independently from polars.ml.torch).

The `PolarsDataset` object:

Inherits from TensorDataset and, once initialised, is drop-in compatible with it.
Supports designation of label/features via Polars selectors².
Offers convenience half() method for experimenting with float16 data early.
Offers convenience schema attribute, showing the current feature/label dtypes.
Offers convenience features and labels attributes (in addition to tensors).
More informative repr, for example:
<PolarsDataset [len:20640, features:8, labels:1] at 0x3301B1EE0>.

Examples

As 2D Tensor:

import polars as pl
df = pl.DataFrame(
    data=[(0, 1, 1.5), (1, 0, -0.5), (2, 0, 0.0), (3, 1, -2.0)],
    schema=["lbl", "feat1", "feat2"],
)
df.to_torch()  # or df.to_torch("tensor")
# tensor([[0.0000,  1.0000,  1.5000],
#         [1.0000,  0.0000, -0.5000],
#         [2.0000,  0.0000,  0.0000],
#         [3.0000,  1.0000, -2.0000]], dtype=torch.float64)

As dict of Tensors:

df.to_torch("dict")
# {'lbl': tensor([0, 1, 2, 3]),
#  'feat1': tensor([1, 0, 0, 1]),
#  'feat2': tensor([1.5000, -0.5000,  0.0000, -2.0000], dtype=torch.float64)}

from torch.utils.data import TensorDataset
ds = TensorDataset(*df.to_torch("dict").values())
ds.tensors
# (tensor([0, 1, 2, 3]),
#  tensor([1, 0, 0, 1]),
#  tensor([ 1.5000, -0.5000,  0.0000, -2.0000], dtype=torch.float64))

Demonstrate `PolarsDataset` usage with some `scikit-learn` data:

Establish a DataFrame from the sklearn datasets...

from sklearn.datasets import fetch_california_housing
import polars as pl

housing = fetch_california_housing()

df = pl.DataFrame(
    data=housing.data,
    schema=housing.feature_names,
).with_columns(
    target=housing.target,
)

...trivially export a float32 Dataset with features/labels...

train = df.to_torch("dataset", label="target", dtype=pl.Float32)
train.schema
# {'features': torch.float32, 'labels': torch.float32}

...and pass to a DataLoader:

from torch.utils.data import DataLoader
train_iter = iter(DataLoader(
    train,
    shuffle=True,
    batch_size=64,
))
next(train_iter)
# [tensor([[2.1764e+00,  2.4000e+01,  4.4074e+00,  1.0397e+00,  3.9960e+03,
#           2.2640e+00,  3.4020e+01, -1.1751e+02],
#          [2.2467e+00,  4.6000e+01,  5.9407e+00,  1.1045e+00,  1.3390e+03,
#           3.7825e+00,  3.7740e+01, -1.2218e+02],
# // ...

Follow-up

Feedback on this one is welcome! It has been marked as unstable (though likely only for a limited time) to allow for quick iteration/tweaks if necessary.

Likely upcoming additions:

A PolarsIterableDataset could be useful for constraining peak memory usage (eg: don't materialise all frame data to Tensor format up-front).
A sequence_length parameter for Dataset export has been suggested, which could be helpful for transformer use-cases (vs linear regression).

Note

The associated unit tests provide 100% line coverage of the new code 🎯

codecov · 2024-04-27T11:35:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.31%. Comparing base (2e28176) to head (e14a539).
Report is 12 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15931      +/-   ##
==========================================
+ Coverage   81.26%   81.31%   +0.04%     
==========================================
  Files        1381     1382       +1     
  Lines      176636   176953     +317     
  Branches     3034     3056      +22     
==========================================
+ Hits       143549   143882     +333     
+ Misses      32606    32589      -17     
- Partials      481      482       +1

Flag	Coverage Δ
python	`74.76% <100.00%> (+0.03%)`	⬆️
rust	`78.37% <9.75%> (-0.02%)`	⬇️

ritchie46 · 2024-04-29T06:29:05Z

> It has been marked as unstable

Good. I think we should keep that for third-party exports, as they feel very much out of our control and are not our core. Otherwise I'd be much more hesitant.

On the testing, I'd like to have a new pytest mark for larger third-party integrations. That way we can test this in CI and not be bothered with the extra testing overhead when we are developing locally on the core. I don't anticipate any of this breaking when I do large refactors.

alexander-beedie · 2024-04-29T07:16:40Z

> Good. I think we should keep that for third-party exports, as they feel very much out of our control and are not our core. Otherwise I'd be much more hesitant.

Yup. While it's a large commit, the functionality itself is actually straightforward. Unless PyTorch undergoes some fairly radical restructuring we won't have any trouble. Definitely should leave it 'unstable' for now, but we can look to soften/drop that message once we have a decent number of releases without incident/modification ✌️

> On the testing, I'd like to have a new pytest mark for larger third-party integrations.

Done! Added a new @pytest.mark.third_party_integration marker, grouped all of the new tests into a class, and set the new marker on that to tag them all in one go. Won't run locally by default (unless -m ""), but will still participate in CI.
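
For reference, a sketch of how a class-level marker tags every test in one go. The class and test names below are illustrative, not the actual test module from this PR.

```python
import pytest


# Marking the class applies the marker to every test method inside it
@pytest.mark.third_party_integration
class TestTorchIntegration:
    def test_to_torch_tensor(self) -> None:
        # Skip cleanly when the optional dependencies are not installed
        pl = pytest.importorskip("polars")
        torch = pytest.importorskip("torch")

        df = pl.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
        assert df.to_torch().shape == torch.Size([2, 2])


# The marker is stored on the class and inherited by its tests
marker_names = [m.name for m in TestTorchIntegration.pytestmark]
```

The local-by-default exclusion is presumably handled by a default `-m "not third_party_integration"` in the pytest options (an assumption about the config, not shown here), which is why passing `-m ""` re-enables these tests.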

Also, this started out in polars.io.torch (which didn't really make much sense) but I moved it into polars.ml.torch; adding a dedicated ml subdirectory provides a natural top-level home for this and any other ml-related integrations (I already have several follow-ups in mind). If anyone has a better idea though, I'm all ears 🤔

ritchie46

Thanks Alex. Looks good. Especially given the "unstable", let's give it a spin!

alexander-beedie requested review from ritchie46, stinodego, c-peters, MarcoGorelli and reswqa as code owners April 27, 2024 10:53

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Apr 27, 2024

alexander-beedie mentioned this pull request Apr 27, 2024

Request for Docs: torch #3685

Open

alexander-beedie force-pushed the torch-tensor-export branch 7 times, most recently from 86287c7 to 082f662 Compare April 27, 2024 11:19

alexander-beedie force-pushed the torch-tensor-export branch 6 times, most recently from db05dc3 to a7838c2 Compare April 28, 2024 19:37

alexander-beedie added the highlight Highlight this PR in the changelog label Apr 28, 2024

alexander-beedie force-pushed the torch-tensor-export branch from a7838c2 to d051f9c Compare April 28, 2024 19:38

alexander-beedie added 2 commits April 29, 2024 10:42

feat(python): Support pytorch Tensor and Dataset export with new `t…

e7315e4

…o_torch` DataFrame/Series method

additional test coverage, lint, docs

5937e81

alexander-beedie force-pushed the torch-tensor-export branch from d051f9c to 854a843 Compare April 29, 2024 07:05

add a new "third_party_integration" pytest marker, and create an ml…

c672849

… subdir

alexander-beedie force-pushed the torch-tensor-export branch from 854a843 to c672849 Compare April 29, 2024 07:16

alexander-beedie added 3 commits April 29, 2024 16:17

streamline torch 'features' selection/ordering

71cd5aa

support use of selectors for designating label/feature cols

dfa6d9d

selector test coverage

e14a539

ritchie46 approved these changes May 3, 2024

ritchie46 merged commit 8322323 into pola-rs:main May 3, 2024
24 checks passed

alexander-beedie deleted the torch-tensor-export branch May 3, 2024 13:44

alexander-beedie mentioned this pull request May 17, 2024

feat(python): Add to_jax methods to support Jax Array export from DataFrame and Series #16294

Merged

alexander-beedie added the A-interop Area: interoperability with other libraries label May 17, 2024

uditrana mentioned this pull request Oct 3, 2024

to_torch doesn't support list/array types #19092

Closed