-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sharded save_to_disk + multiprocessing #5268
Conversation
The documentation is not available anymore as the PR was closed or merged. |
src/datasets/arrow_dataset.py
Outdated
if config.PYARROW_VERSION.major >= 8: | ||
for pa_table in table_iter(shard.data.table, batch_size=batch_size): | ||
writer.write_table(pa_table) | ||
num_examples_progress_update += len(pa_table) | ||
if time.time() > _time + refresh_rate: | ||
_time = time.time() | ||
yield job_id, False, num_examples_progress_update | ||
num_examples_progress_update = 0 | ||
else: | ||
for i in range(0, shard.num_rows, batch_size): | ||
pa_table = shard.data.slice(i, batch_size) | ||
writer.write_table(pa_table) | ||
num_examples_progress_update += len(pa_table) | ||
if time.time() > _time + refresh_rate: | ||
_time = time.time() | ||
yield job_id, False, num_examples_progress_update | ||
num_examples_progress_update = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I iterate on batches here to update the tqdm bar, but for old versions of pyarrow this may be too slow since table_iter
only works for pyarrow>=8.
I think we may have to implement table_iter
even on old versions for performance reasons. It can be based on pa.Table.to_record_batches
- lmk what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I just implemented pa.Table.to_reader
for pyarrow < 8 for our datasets.table.Table
. This way we don't have to check the pyarrow version anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job!
Added both num_shards and max_shard_size in push_to_hub/save_to_disk. Will take care of updating the tests later |
It's ready for a final review @mariosasko and @albertvillanova, let me know what you think :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits.
Took your comments into account, and also changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
Added
num_shards=
andnum_proc=
tosave_to_disk()
EDIT: also added
max_shard_size=
tosave_to_disk()
, and alsonum_shards=
topush_to_hub
I also:
save_to_disk
save_to_disk
save_to_disk
andload_from_disk
mockfs
instead ofs3fs
TODO:
Close #5263
Close #4196
Close #4351