Add Parquet loader + from_parquet and to_parquet #2537
Also I still need to add dummy data to test the parquet builder.
I had to bump the minimum pyarrow version to 3.0.0 to properly support parquet. Everything is ready for review now :)
Thanks @lhoestq.
Only some small comments/questions...
setup.py
Outdated
  # pyarrow 4.0.0 introduced segfault bug, see: https://github.com/huggingface/datasets/pull/2268
- "pyarrow>=1.0.0,!=4.0.0",
+ "pyarrow>=3.0.0,!=4.0.0",
Are we sure that it is a good idea to stop supporting pyarrow < 3.0.0? Just to be sure of this choice... Maybe a softer option would be to set this requirement only for the users that want to use Parquet files...
Good point. I just checked, and many projects still use pyarrow 2.0.0 and 1.0.1.
I'll make the change.
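For illustration, here is a minimal sketch of the softer option: keep the global pyarrow requirement unchanged and check the version only when a Parquet feature is used. The names `parse_version`, `check_parquet_support`, and `MIN_PYARROW_FOR_PARQUET` are hypothetical and are not taken from this PR; the installed version is passed in as a string so the sketch stays self-contained.

```python
import re

# Assumption: 3.0.0 is the minimum pyarrow version for Parquet support,
# as stated earlier in this thread.
MIN_PYARROW_FOR_PARQUET = "3.0.0"


def parse_version(version):
    """Turn a version string such as '3.0.0' into a comparable tuple (3, 0, 0).

    Parsing stops at the first component that does not start with a digit,
    so suffixes such as 'dev0' are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_parquet_support(installed_version, minimum=MIN_PYARROW_FOR_PARQUET):
    """Raise ImportError if the installed pyarrow is too old for Parquet (hypothetical helper)."""
    if parse_version(installed_version) < parse_version(minimum):
        raise ImportError(
            f"Parquet support needs pyarrow>={minimum}, "
            f"but pyarrow=={installed_version} is installed."
        )
```

In practice the installed version would come from `pyarrow.__version__`, and the check would run at the top of the Parquet code paths, so users on pyarrow 1.x or 2.x are affected only when they actually use Parquet.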
Some little fixes.
Done!
Continuation of #2247
I added a "parquet" dataset builder, as well as the methods Dataset.from_parquet and Dataset.to_parquet.
As usual, the data are converted to arrow in a batched way to avoid loading everything in memory.