Add Polars DataFrame Support for Dataset #2029

Merged
merged 24 commits into from
Jul 31, 2024

Conversation

Contributor

@ragyabraham ragyabraham commented Jul 17, 2024

This PR enhances the library's flexibility by allowing users to leverage Polars DataFrames for training purposes.

Checklist

  • Executed run-checks all script
  • Updated documentation to reflect changes in this PR

Related Issues/PRs

Fixes #2015

Changes

  • Added ability to create a burn dataset directly from a Polars DataFrame for training
  • Added a deserializer that converts a DataFrame row into a user-provided struct
  • Implemented the dataframe.rs module in the burn-dataset crate under the dataframe feature flag
  • Updated the dataset documentation
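
To make the row-to-struct idea concrete, here is a minimal, std-only sketch. It does not use the Polars or Burn APIs: `Frame`, `FromRow`, `RowDataset`, and `Sample` are hypothetical stand-ins, and the hand-written trait takes the place of the deserializer described above.

```rust
use std::collections::HashMap;
use std::marker::PhantomData;

/// Hypothetical stand-in for a column-oriented frame (not the Polars type).
pub struct Frame {
    columns: HashMap<String, Vec<f64>>,
    height: usize,
}

impl Frame {
    /// Builds a frame from named columns; every column must have the same length.
    pub fn new(cols: Vec<(&str, Vec<f64>)>) -> Result<Self, String> {
        let height = cols.first().map(|(_, v)| v.len()).unwrap_or(0);
        let mut columns = HashMap::new();
        for (name, values) in cols {
            if values.len() != height {
                return Err(format!("column {} has the wrong length", name));
            }
            columns.insert(name.to_string(), values);
        }
        Ok(Self { columns, height })
    }

    /// Reads one cell, or None if the column or row does not exist.
    pub fn cell(&self, column: &str, row: usize) -> Option<f64> {
        self.columns.get(column)?.get(row).copied()
    }
}

/// Hypothetical trait: how a user-provided struct is built from one row.
pub trait FromRow: Sized {
    fn from_row(frame: &Frame, row: usize) -> Option<Self>;
}

/// Dataset view over a frame that yields one typed item per row.
pub struct RowDataset<I> {
    frame: Frame,
    _item: PhantomData<I>,
}

impl<I: FromRow> RowDataset<I> {
    pub fn new(frame: Frame) -> Self {
        Self { frame, _item: PhantomData }
    }

    /// Returns the item at `index`, or None when the index is out of range.
    pub fn get(&self, index: usize) -> Option<I> {
        I::from_row(&self.frame, index)
    }

    pub fn len(&self) -> usize {
        self.frame.height
    }
}

/// Example of a user-provided row struct.
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub feature: f64,
    pub label: f64,
}

impl FromRow for Sample {
    fn from_row(frame: &Frame, row: usize) -> Option<Self> {
        Some(Sample {
            feature: frame.cell("feature", row)?,
            label: frame.cell("label", row)?,
        })
    }
}
```

The point of the sketch is the shape of the interface: the dataset owns the columnar data and converts one row at a time on `get`, so nothing has to be materialised up front.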

Testing

  • Added additional unit tests to cover new functionality

Collaborator

@antimora antimora left a comment

I have a few inlined comments and a question.

Polars has serde (feature) support for dataframes (see code). Probably you can bypass using JSON to deserialize row altogether. Have you looked into this?

@ragyabraham

Hey @antimora, I did, but I found that the serialize method is not fully implemented in polars yet. I've opened an issue on the Polars repo.

@antimora

If I understood correctly, not all column types are supported? Can we support a subset of the possible types? That would still be highly useful. Otherwise it seems one needs to encode each row into JSON, which won't be much more useful than having JSON-encoded rows as lines in a plain text file. Or maybe I am misunderstanding the use of JSON here.
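
One way to picture "support a subset of types" is a converter that handles a few cell types and returns an explicit error for everything else. The sketch below is std-only; `Cell`, `CellError`, and the `as_*` helpers are hypothetical names chosen for illustration, not the Polars `AnyValue` API or the code in this PR.

```rust
/// Hypothetical cell type standing in for a single dataframe value.
#[derive(Debug, Clone, PartialEq)]
pub enum Cell {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(String),
    /// Anything this sketch does not handle (lists, structs, dates, ...).
    Unsupported(&'static str),
}

#[derive(Debug, PartialEq)]
pub enum CellError {
    /// The column type is outside the supported subset.
    UnsupportedType(&'static str),
    /// The column type is supported but does not match the struct field.
    TypeMismatch { expected: &'static str, found: Cell },
}

impl Cell {
    pub fn as_i64(&self) -> Result<i64, CellError> {
        match self {
            Cell::Int(v) => Ok(*v),
            Cell::Unsupported(name) => Err(CellError::UnsupportedType(*name)),
            other => Err(CellError::TypeMismatch { expected: "i64", found: other.clone() }),
        }
    }

    /// Integers are widened to f64 so that numeric columns are interchangeable.
    pub fn as_f64(&self) -> Result<f64, CellError> {
        match self {
            Cell::Float(v) => Ok(*v),
            Cell::Int(v) => Ok(*v as f64),
            Cell::Unsupported(name) => Err(CellError::UnsupportedType(*name)),
            other => Err(CellError::TypeMismatch { expected: "f64", found: other.clone() }),
        }
    }

    /// Nulls map to None, which lets a struct field be declared as Option<String>.
    pub fn as_opt_string(&self) -> Result<Option<String>, CellError> {
        match self {
            Cell::Null => Ok(None),
            Cell::Text(s) => Ok(Some(s.clone())),
            Cell::Unsupported(name) => Err(CellError::UnsupportedType(*name)),
            other => Err(CellError::TypeMismatch { expected: "string", found: other.clone() }),
        }
    }
}
```

Failing loudly on unsupported column types keeps the supported subset usable without a JSON detour, and makes it obvious which types still need work.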

@ragyabraham

It seems that the pull request to fix the todo! issue in Polars might get through. If that's the case, we can just use that, which will be cleaner.

@antimora

Yeah, I think we should definitely use DataFrame's native serialization/deserialization methods. Otherwise this Dataset will be of little use.

A cool use case I envision as follows:

  1. One can open any Polars data source (CSV, JSON, binaries, SQL DBs, etc.) into a DataFrame.
  2. Apply dataframe filters to filter (
  3. Define struct for a row
  4. Wrap with DataFrame Dataset (can also be passed to SqliteDataset or MemoryDataset to consume and cache locally)
  5. Pass it to training/testing
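
The steps above can be sketched end to end without any external crates. This is only an illustration of the flow: `load_rows` stands in for Polars reading a data source, `InMemoryDataset` stands in for an in-memory cache, and `Review` and `mean_stars` are made-up names; none of them are Burn or Polars APIs.

```rust
/// Step 3: the user-defined struct for one row.
#[derive(Debug, Clone, PartialEq)]
pub struct Review {
    pub stars: u8,
    pub text: String,
}

/// Step 1 (stand-in): parse "stars,text" lines, skipping the header and
/// any malformed line. Real code would let the dataframe library do this.
pub fn load_rows(csv: &str) -> Vec<Review> {
    csv.lines()
        .skip(1)
        .filter_map(|line| {
            let (stars, text) = line.split_once(',')?;
            Some(Review {
                stars: stars.trim().parse().ok()?,
                text: text.trim().to_string(),
            })
        })
        .collect()
}

/// Steps 2 and 4: filter the rows, then cache the survivors in memory.
pub struct InMemoryDataset<I> {
    items: Vec<I>,
}

impl<I: Clone> InMemoryDataset<I> {
    pub fn from_filtered(rows: impl IntoIterator<Item = I>, keep: impl Fn(&I) -> bool) -> Self {
        Self { items: rows.into_iter().filter(|r| keep(r)).collect() }
    }

    pub fn get(&self, index: usize) -> Option<I> {
        self.items.get(index).cloned()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}

/// Step 5 (stand-in): a "training loop" that simply visits every item.
pub fn mean_stars(ds: &InMemoryDataset<Review>) -> Option<f64> {
    if ds.len() == 0 {
        return None;
    }
    let total: u32 = (0..ds.len())
        .filter_map(|i| ds.get(i))
        .map(|r| r.stars as u32)
        .sum();
    Some(total as f64 / ds.len() as f64)
}
```

The consumer in step 5 only sees `get` and `len`, so it does not need to know whether the items came from a CSV file, a database, or a cached copy.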

BTW, your PR is waiting to fix this. Not sure if you noticed:

image

@antimora antimora left a comment

We spoke offline about using Polars' serde feature to deserialize a row. I am not sure whether the new code (which you mentioned was working) wasn't pushed, or whether there is still a misunderstanding.

@ragyabraham

The code that I showed you working is the code in this PR as it stands.

@antimora

OK. I removed serde_json and implemented the native deserializer. Please finish up and clean up the code. Also please add more tests and data types.

One test is passing:

[burn-dataset]% cargo test dataframe --features dataframe
    Finished `test` profile [unoptimized] target(s) in 0.12s
     Running unittests src/lib.rs (/Users/dilshod/Projects/burn/target/debug/deps/burn_dataset-c350f307979d05bd)

running 1 test
test dataset::dataframe::tests::test_dataframe_dataset ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 25 filtered out; finished in 0.00s

@ragyabraham

@antimora I've added support for LazyFrame, added support for deserializing a few more data types, and added some tests. So far, there's no serialization for complex types, but this is a good starting point. Let me know what you think and whether there's anything urgent I need to fix or add.

Two tests are passing

image

@antimora antimora left a comment

Let's remove LazyFrame support for now. I thought it would be less complex and quicker to add, but as it stands it's not complete, and I don't want to burden you with this feature, sorry.

We need a couple of minor, quick fixes. It should be ready to merge afterwards.

Also please rebase your branch to get the latest lock file.

@antimora antimora self-assigned this Jul 31, 2024
@antimora antimora changed the title from "[DRAFT] Dataset from polars dataframe" to "Add Polars DataFrame Support for Training" Jul 31, 2024
@antimora antimora marked this pull request as ready for review July 31, 2024 05:00
@antimora

@laggui @nathanielsimard I took over this PR to complete it. I made the final changes, and it is ready for your review.

@antimora antimora changed the title from "Add Polars DataFrame Support for Training" to "Add Polars DataFrame Support for Dataset" Jul 31, 2024

codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 92.97125% with 22 lines in your changes missing coverage. Please review.

Project coverage is 86.19%. Comparing base (f673721) to head (3f81d53).

Files Patch % Lines
crates/burn-dataset/src/dataset/dataframe.rs 92.97% 22 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2029      +/-   ##
==========================================
+ Coverage   86.16%   86.19%   +0.02%     
==========================================
  Files         686      687       +1     
  Lines       87871    88184     +313     
==========================================
+ Hits        75717    76008     +291     
- Misses      12154    12176      +22     
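
The figures in the report are consistent with one another, which a short calculation can confirm. The helper name `coverage_pct` is made up for this sketch, and the report does not state how Codecov rounds its percentages, so the values are best compared with a tolerance rather than exactly.

```rust
/// Coverage as a percentage: hits / (hits + misses) * 100.
/// Returns 0.0 for an empty input to avoid dividing by zero.
pub fn coverage_pct(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        return 0.0;
    }
    hits as f64 / total as f64 * 100.0
}

/// Patch coverage from the diff rows: +291 hits and +22 misses (313 new lines).
pub fn patch_coverage() -> f64 {
    coverage_pct(291, 22)
}

/// Project coverage at head: 76008 hits and 12176 misses (88184 lines).
pub fn head_coverage() -> f64 {
    coverage_pct(76008, 12176)
}
```

With the numbers from the table, `patch_coverage()` lands at roughly 92.971% and `head_coverage()` at roughly 86.19%, matching the figures quoted in the report.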


Member

@nathanielsimard nathanielsimard left a comment

LGTM from my end. The deserializer logic is a bit complex, but the tests comparing with a vector dataset are reassuring. And I guess it's necessary with Polars, though I'm not familiar with the library.

@antimora antimora left a comment

Since I made the actual changes and fixes, I approve it.

@antimora antimora merged commit 04d7ff2 into tracel-ai:main Jul 31, 2024
14 checks passed
@wangjiawen2013
Contributor

Good!

Successfully merging this pull request may close these issues.

Enable dataset creation from a Polars Dataframe
4 participants