-
Notifications
You must be signed in to change notification settings - Fork 440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Polars DataFrame Support for Dataset #2029
Add Polars DataFrame Support for Dataset #2029
Conversation
…t polars rows to burn dataset values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few inlined comments and a question.
Polars has serde
(feature) support for dataframes (see code). Probably you can bypass using JSON to deserialize row altogether. Have you looked into this?
Hey @antimora, I did, but I found that the serialize method is not fully implemented in polars yet. I've opened an issue on the Polars repo. |
If I understood correctly not all column types supported? Can we support a subset of possible types? This still would be highly useful. Otherwise it seems one needs to encode row into JSON, which won't much more useful than having json row encoded lines in a plain text file. Or maybe I am misunderstanding the use of JSON here. |
it seems that the pull request to fix the todo! issue in polars might get through. if that's the case then we can just utilise that. will be cleaner |
Yeah I think we should definitely use Dataframe's native serialization/deserialization methods. Otherwise this Dataset will be little of use. A cool use case I envision as follows:
BTW, your PR is waiting to fix this. Not sure if you noticed: |
… map to json string. added comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We spoke offline about using Polar's serde feature to deserialize a row. I am not sure if new code wasn't pushed (that you mentioned working), or there is still misunderstanding.
The code that I showed you working is the code in this PR as it stands |
…. no longer convert to object before converting to string. no longer using serde_json to_string method
OK. I removed serde_json and implemented the native deserializer. Please finish up and clean up the code. Also please add more tests and data types. One test is passing:
|
… data. added a few more tests
@antimora I've added support for Two tests are passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove Lazy for now. I thought it would be less complex and quick to add it but as it stands it's not complete and I don't want to burden you with this feature, sorry.
We need a couple minor/quick fixes. It should be ready to merge afterwards.
Also please rebase your branch to get the latest lock file.
@laggui @nathanielsimard I took over this PR to complete it. I made final changes and ready for your review. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2029 +/- ##
==========================================
+ Coverage 86.16% 86.19% +0.02%
==========================================
Files 686 687 +1
Lines 87871 88184 +313
==========================================
+ Hits 75717 76008 +291
- Misses 12154 12176 +22 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my end. The deserializer logic is a bit complex, but the tests comparing with a vector dataset are reassuring. And I guess it's necessary with Polar, though I'm not familiar with the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I made the actual changes and fixes, I approve it.
Good! |
This PR enhances the library's flexibility by allowing users to leverage Polars DataFrames for training purposes.
Checklist
run-checks all
scriptRelated Issues/PRs
Fixes #2015
Changes
dataframe.rs
module in theburn-dataset
crate under thedataframe
feature flagTesting