Add Parquet loader + from_parquet and to_parquet #2537

lhoestq · 2021-06-22T17:28:23Z

Continuation of #2247

I added a "parquet" dataset builder, as well as the methods Dataset.from_parquet and Dataset.to_parquet.
As usual, the data are converted to Arrow in batches to avoid loading everything into memory.

lhoestq · 2021-06-22T17:35:50Z

pyarrow 1.0.0 doesn't support some types in Parquet, so we'll have to bump its minimum version.

Also, I still need to add dummy data to test the parquet builder.

lhoestq · 2021-06-24T13:13:05Z

I had to bump the minimum pyarrow version to 3.0.0 to properly support parquet.

Everything is ready for review now :)
I reused pretty much the same tests we had for CSV.

albertvillanova

Thanks @lhoestq.

Only some small comments/questions...

albertvillanova · 2021-06-29T14:50:05Z

setup.py

    # pyarrow 4.0.0 introduced segfault bug, see: https://github.com/huggingface/datasets/pull/2268
-    "pyarrow>=1.0.0,!=4.0.0",
+    "pyarrow>=3.0.0,!=4.0.0",


Are we sure that it is a good idea to stop supporting pyarrow < 3.0.0? Just to be sure of this choice... Maybe a softer option would be to set this requirement only for the users that want to use Parquet files...

lhoestq

Good point. I just checked, and many projects still use pyarrow 2.0.0 and 1.0.1.
I'll make the change.

tests/io/test_parquet.py

src/datasets/dataset_dict.py

tests/test_arrow_dataset.py

tests/test_dataset_dict.py

src/datasets/arrow_dataset.py

src/datasets/dataset_dict.py

albertvillanova

Some little fixes.

tests/utils.py

.circleci/config.yml

lhoestq · 2021-06-30T16:30:23Z

Done!
We still allow pyarrow>=1.0.0, but users who want to use the Parquet features are asked to update to pyarrow>=3.0.0.
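The idea of keeping the base requirement low and checking the version only when Parquet is used can be sketched as follows. This is not the code from the PR: the function names, the error type, and the message are assumptions chosen for illustration, and only the 3.0.0 threshold comes from the discussion.

```python
from typing import Tuple

# Threshold taken from the discussion above.
MIN_PYARROW_FOR_PARQUET = (3, 0, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.0.0' into (3, 0, 0). Only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def check_parquet_support(installed_version: str) -> None:
    """Raise ImportError if the given pyarrow version is too old for Parquet.

    Hypothetical helper: the real check in the library may look different.
    """
    if parse_version(installed_version) < MIN_PYARROW_FOR_PARQUET:
        raise ImportError(
            "Parquet read/write requires pyarrow>=3.0.0, "
            f"but pyarrow=={installed_version} is installed."
        )


# Versions at or above the threshold pass silently.
check_parquet_support("3.0.0")
```

In real code the argument would typically be `pyarrow.__version__`, and a dedicated parser such as `packaging.version` would cope with pre-release suffixes that this sketch ignores.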

add parquet loader, from_parquet, to_parquet

59068ed

lhoestq added 4 commits June 23, 2021 18:14

bump pyarrow to 2.0.0

e94a563

add dummy parquet data

d3eadd1

bump to 3.0.0

0c537ea

don't look for dataset cards for packaged dataset builders

17301d8

lhoestq marked this pull request as ready for review June 23, 2021 16:32

lhoestq added 2 commits June 23, 2021 18:41

update CI to use pyarrow 3.0.0

ebeb9c1

update benchmarks

a877d8b

lhoestq requested a review from albertvillanova June 24, 2021 13:11

albertvillanova approved these changes Jun 29, 2021

lhoestq added 2 commits June 30, 2021 12:20

albert's comments in docstrings

077648b

back to pyarrow 1.0.0 + raise error if using old pyarrow for parquet read/write

11a2c9f

albertvillanova requested changes Jun 30, 2021

tests/utils.py (outdated)

.circleci/config.yml (outdated)

.circleci/config.yml (outdated)

lhoestq added 2 commits June 30, 2021 17:51

fix CI + message

33af6d5

typo

00b43fa

lhoestq merged commit 13434ae into master Jun 30, 2021

lhoestq deleted the parquet branch June 30, 2021 16:31

lhoestq mentioned this pull request Jul 26, 2021

Implement Dataset from Parquet #2247

Closed
