Sharded save_to_disk + multiprocessing #5268

lhoestq · 2022-11-18T18:50:01Z

Added num_shards= and num_proc= to save_to_disk()

EDIT: also added max_shard_size= to save_to_disk(), and num_shards= to push_to_hub()
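To make the relationship between num_shards, max_shard_size and num_proc concrete, here is a minimal sketch of how a shard count could be resolved. It is not the code from this PR: the helper names (parse_size, resolve_num_shards, shard_bounds), the "500MB" fallback, and the rule that the two options are mutually exclusive are assumptions made for illustration.

```python
import re

# Decimal units only; an assumption made to keep the sketch short.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size):
    """Convert an int or a string such as "500MB" into a number of bytes."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)", size.strip().upper())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def resolve_num_shards(dataset_nbytes, num_shards=None, max_shard_size=None, num_proc=None):
    """Pick a shard count from either an explicit count or a size limit."""
    if num_shards is not None and max_shard_size is not None:
        raise ValueError("pass either num_shards or max_shard_size, not both")
    if num_shards is None:
        # Hypothetical default limit, used when neither option is given.
        limit = parse_size(max_shard_size or "500MB")
        # Ceiling division, so the average shard size stays at or below the limit.
        num_shards = max(1, -(-dataset_nbytes // limit))
    # Every worker process needs at least one shard to write.
    return max(num_shards, num_proc or 1)


def shard_bounds(num_rows, num_shards):
    """Split num_rows into contiguous (start, stop) ranges of near-equal size."""
    base, extra = divmod(num_rows, num_shards)
    bounds, start = [], 0
    for index in range(num_shards):
        stop = start + base + (1 if index < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds
```

For example, under these assumptions a 1.2 GB dataset with max_shard_size="500MB" resolves to 3 shards, and asking for 2 shards with num_proc=4 is bumped up to 4 so that no process sits idle.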

I also:

- deprecated the fs parameter in favor of storage_options (for consistency with the rest of the lib) in save_to_disk and load_from_disk
- always embed the image/audio data in arrow when doing save_to_disk
- added a tqdm bar in save_to_disk
- used the MockFileSystem in tests for save_to_disk and load_from_disk
- removed the unused integration tests with S3, since we can now test with mockfs instead of s3fs
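As an illustration of the fs deprecation, the sketch below shows one common way to keep accepting an old argument while steering callers to the new one. The function name resolve_storage_options and the "deprecated" sentinel default are hypothetical, and the sketch assumes the filesystem object exposes a storage_options attribute; the PR's actual handling may differ.

```python
import warnings


def resolve_storage_options(fs="deprecated", storage_options=None):
    """Return the storage options to use, honoring the deprecated `fs` argument."""
    if fs != "deprecated":
        warnings.warn(
            "'fs' is deprecated in favor of 'storage_options'; "
            "pass storage_options=fs.storage_options instead.",
            FutureWarning,
        )
        # Assumption: the filesystem object exposes the options it was built with.
        storage_options = fs.storage_options
    return storage_options or {}
```

A string sentinel rather than None as the default lets the function tell "argument not passed" apart from an explicit fs=None.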

TODO:

- implement save_to_disk for dataset dict
- add save_to_disk tests for dataset dict
- deprecate fs in dataset dict load_from_disk as well
- update docs

Close #5263
Close #4196
Close #4351

HuggingFaceDocBuilderDev · 2022-11-18T18:58:07Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq · 2022-11-21T18:07:59Z

src/datasets/arrow_dataset.py

+            if config.PYARROW_VERSION.major >= 8:
+                for pa_table in table_iter(shard.data.table, batch_size=batch_size):
+                    writer.write_table(pa_table)
+                    num_examples_progress_update += len(pa_table)
+                    if time.time() > _time + refresh_rate:
+                        _time = time.time()
+                        yield job_id, False, num_examples_progress_update
+                        num_examples_progress_update = 0
+            else:
+                for i in range(0, shard.num_rows, batch_size):
+                    pa_table = shard.data.slice(i, batch_size)
+                    writer.write_table(pa_table)
+                    num_examples_progress_update += len(pa_table)
+                    if time.time() > _time + refresh_rate:
+                        _time = time.time()
+                        yield job_id, False, num_examples_progress_update
+                        num_examples_progress_update = 0


I iterate on batches here to update the tqdm bar, but for old versions of pyarrow this may be too slow since table_iter only works for pyarrow>=8.

I think we may have to implement table_iter even on old versions for performance reasons. It can be based on pa.Table.to_batches - lmk what you think

Ok I just implemented pa.Table.to_reader for pyarrow < 8 for our datasets.table.Table. This way we don't have to check the pyarrow version anymore
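The progress-reporting pattern in the diff above can be sketched without pyarrow. This is a simplified stand-in, not the PR's code: write_with_progress, its write callback and the injectable clock are hypothetical names, and batches are plain lists rather than Arrow tables. It shows the idea of writing batch by batch while emitting progress updates at most once per refresh interval.

```python
import time


def write_with_progress(job_id, batches, write, refresh_rate=0.05, clock=time.monotonic):
    """Write every batch and yield (job_id, done, payload) tuples.

    While writing, payload is the number of rows written since the last
    update; the final tuple has done=True and carries the total row count.
    """
    last_update = clock()
    pending = 0  # rows written but not reported yet
    total = 0
    for batch in batches:
        write(batch)
        pending += len(batch)
        total += len(batch)
        now = clock()
        # Throttle updates so the consumer is not flooded with tiny increments.
        if now > last_update + refresh_rate:
            last_update = now
            yield job_id, False, pending
            pending = 0
    # Flush rows not reported yet so a progress bar can reach its total.
    yield job_id, False, pending
    yield job_id, True, total
```

A consumer would add each non-final payload to its progress bar and treat the done=True tuple as the signal that the shard is complete.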

mariosasko

Nice job!

docs/source/filesystems.mdx

src/datasets/arrow_dataset.py

lhoestq · 2022-12-08T17:38:56Z

Added both num_shards and max_shard_size in push_to_hub/save_to_disk. Will take care of updating the tests later

lhoestq · 2022-12-12T18:16:27Z

It's ready for a final review @mariosasko and @albertvillanova, let me know what you think :)

mariosasko

Some nits.

src/datasets/arrow_dataset.py

src/datasets/dataset_dict.py

lhoestq · 2022-12-14T17:31:11Z

Took your comments into account, and also changed iflatmap_unordered to take an iterable of kwargs to make the code more readable :)
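To show what "taking an iterable of kwargs" can look like, here is a minimal sketch of an unordered flat-map. It deliberately swaps in threads and a queue for the process pool used by the library, and the name iflatmap_unordered_sketch and its signature are hypothetical; it only illustrates the calling convention, not the real implementation.

```python
import queue
import threading


def iflatmap_unordered_sketch(func, kwargs_iterable):
    """Run func(**kwargs) for each kwargs dict in its own thread.

    func must return an iterable (typically a generator). Items are yielded
    as soon as any worker produces them, so output order is not guaranteed.
    Errors raised inside workers are not propagated in this sketch.
    """
    results = queue.Queue()
    finished = object()  # sentinel a worker enqueues once it is done

    def worker(kwargs):
        try:
            for item in func(**kwargs):
                results.put(item)
        finally:
            results.put(finished)

    threads = [threading.Thread(target=worker, args=(kwargs,)) for kwargs in kwargs_iterable]
    for thread in threads:
        thread.start()
    remaining = len(threads)
    while remaining:
        item = results.get()
        if item is finished:
            remaining -= 1
        else:
            yield item
    for thread in threads:
        thread.join()
```

Passing one dict per job keeps each call site self-describing, e.g. [{"job_id": 0, "n": 2}, {"job_id": 1, "n": 3}], instead of relying on positional argument tuples.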

mariosasko

Thanks, LGTM!

lhoestq and others added 7 commits November 17, 2022 19:34

add num_shards, num_proc, storage_options to save_to_disk

beef55e

minor

d1b7fb5

add tests

2d59bb6

remove old s3fs integreation tests

2e270dc

style

532ae18

Merge branch 'main' into sharded-save_to_disk

05436cc

style

f548e01

lhoestq added 6 commits November 21, 2022 17:33

Update DatasetDict.save_to_disk

dcd6363

test dataset dict

26a3e15

update dataset dict load_from_disk

291a883

minor

c55028b

update test

f122f6d

update docs

8305f8c

lhoestq marked this pull request as ready for review November 21, 2022 18:04

lhoestq requested review from albertvillanova and mariosasko November 21, 2022 18:04

lhoestq commented Nov 21, 2022

lhoestq added 2 commits November 22, 2022 19:02

backport to_reader to pyarrow < 8

d1d8ef8

typo

7057792

mariosasko reviewed Nov 23, 2022

docs/source/filesystems.mdx

src/datasets/arrow_dataset.py

src/datasets/arrow_dataset.py

src/datasets/arrow_dataset.py

lhoestq added 4 commits December 7, 2022 16:36

support both max_shard_size and num_shards

5e737c0

style

598b9da

docstrings

24e24bf

Merge branch 'main' into sharded-save_to_disk

16bb14c

lhoestq added 4 commits December 9, 2022 13:26

Merge branch 'main' into sharded-save_to_disk

917f921

test _estimate_nbytes

75347aa

Merge branch 'main' into sharded-save_to_disk

f86ed9f

add test for num_shards

fc39b83

lhoestq and others added 2 commits December 13, 2022 12:18

Merge branch 'main' into sharded-save_to_disk

a103ff0

style

d004f58

mariosasko reviewed Dec 14, 2022

lhoestq added 5 commits December 14, 2022 18:00

Merge branch 'main' into sharded-save_to_disk

5b36d97

mario's comment

c1db7bd

add config.PBAR_REFRESH_TIME_INTERVAL

f3562d2

fix docstrings

c2b38fa

use kwargs_iterable in iflatmap_unordered

ce66732

fix tests

44e5156

mariosasko approved these changes Dec 14, 2022

lhoestq merged commit 232a439 into main Dec 14, 2022

lhoestq deleted the sharded-save_to_disk branch December 14, 2022 18:22

mattdeeperinsights mentioned this pull request Dec 22, 2022

Problems after upgrading to 2.6.1 #5150

Open
