Add Parquet loader + from_parquet and to_parquet #2537

Merged
merged 11 commits into from
Jun 30, 2021

Conversation

lhoestq
Member

@lhoestq lhoestq commented Jun 22, 2021

Continuation of #2247

I added a "parquet" dataset builder, as well as the methods Dataset.from_parquet and Dataset.to_parquet.
As usual, the data is converted to Arrow in batches to avoid loading everything into memory.

@lhoestq
Member Author

lhoestq commented Jun 22, 2021

pyarrow 1.0.0 doesn't support some types in Parquet, so we'll have to bump its minimum version.

Also, I still need to add dummy data to test the Parquet builder.

@lhoestq lhoestq marked this pull request as ready for review June 23, 2021 16:32
@lhoestq
Member Author

lhoestq commented Jun 24, 2021

I had to bump the minimum pyarrow version to 3.0.0 to properly support Parquet.

Everything is ready for review now :)
I reused pretty much the same tests we had for CSV.

Member

@albertvillanova albertvillanova left a comment

Thanks @lhoestq.

Only some small comments/questions...

setup.py (outdated)
  # pyarrow 4.0.0 introduced segfault bug, see: https://github.com/huggingface/datasets/pull/2268
- "pyarrow>=1.0.0,!=4.0.0",
+ "pyarrow>=3.0.0,!=4.0.0",
Member

Are we sure that it is a good idea to stop supporting pyarrow < 3.0.0? Just to be sure of this choice... Maybe a softer option would be to set this requirement only for the users that want to use Parquet files...

Member Author

Good point. I just checked, and there are still many projects that use pyarrow 2.0.0 and 1.0.1.
I'll make the change.

tests/io/test_parquet.py (outdated, resolved)
src/datasets/dataset_dict.py (outdated, resolved)
tests/test_arrow_dataset.py (outdated, resolved)
tests/test_dataset_dict.py (outdated, resolved)
src/datasets/arrow_dataset.py (outdated, resolved)
src/datasets/arrow_dataset.py (outdated, resolved)
src/datasets/dataset_dict.py (outdated, resolved)
Member

@albertvillanova albertvillanova left a comment

Some little fixes.

tests/utils.py (outdated, resolved)
.circleci/config.yml (outdated, resolved)
.circleci/config.yml (outdated, resolved)
@lhoestq
Member Author

lhoestq commented Jun 30, 2021

Done!
We still allow pyarrow>=1.0.0, but users who want to use Parquet features are asked to update to pyarrow>=3.0.0.
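
As an illustration of this kind of feature-level requirement, here is a minimal sketch of a version gate that only fires when Parquet functionality is requested. It is not the check merged in this PR: the function names `parse_version` and `require_pyarrow_for_parquet` and the error message are hypothetical, and the version parsing is deliberately simplified to plain numeric releases such as "3.0.0".

```python
def parse_version(version):
    """Turn a plain release string such as "3.0.0" into a comparable tuple (3, 0, 0)."""
    # Simplification: pre-release suffixes such as "rc1" are not handled.
    return tuple(int(part) for part in version.split(".")[:3])


def require_pyarrow_for_parquet(installed_version, minimum="3.0.0"):
    """Raise ImportError if installed_version is too old for Parquet features."""
    if parse_version(installed_version) < parse_version(minimum):
        raise ImportError(
            f"Parquet support requires pyarrow>={minimum}, "
            f"but pyarrow=={installed_version} is installed. "
            "Please upgrade pyarrow to use Parquet features."
        )
```

A Parquet entry point would call something like `require_pyarrow_for_parquet(pyarrow.__version__)` before doing any work, so users on pyarrow 1.x or 2.x are only affected when they actually touch Parquet files.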

@lhoestq lhoestq merged commit 13434ae into master Jun 30, 2021
@lhoestq lhoestq deleted the parquet branch June 30, 2021 16:31