[load_dataset] shard and parallelize the process #2650

Closed
stas00 opened this issue Jul 14, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@stas00
Contributor

stas00 commented Jul 14, 2021

  • Some huge datasets (e.g. oscar/en) take forever to build the first time, because the build runs on a single CPU core.
  • If the build crashes, everything done up to that point is lost.

Request: Shard the build over multiple Arrow files, which would enable:

  • a much faster build, by running the shard builds in parallel
  • resuming after a crash: the completed Arrow files would not need to be rebuilt
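
To make the request concrete, here is a minimal sketch of a sharded, resumable build. It is not the `datasets` implementation. It uses only the standard library, and JSON Lines files stand in for Arrow files. The names `shard_path`, `build_shard`, `build_sharded`, `executor_cls` and the `data-XXXXX-of-XXXXX` file pattern are hypothetical and chosen for illustration only.

```python
import json
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def shard_path(out_dir, shard_id, num_shards):
    # Hypothetical naming scheme: one output file per shard.
    return Path(out_dir) / f"data-{shard_id:05d}-of-{num_shards:05d}.jsonl"


def build_shard(job):
    """Write one shard; return (path, built) where built is False if it was skipped."""
    examples, out_dir, shard_id, num_shards = job
    final = shard_path(out_dir, shard_id, num_shards)
    if final.exists():
        # Shard finished in an earlier run: nothing to rebuild.
        return str(final), False
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a partial file under the final name.
    tmp = final.with_suffix(".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
    os.replace(tmp, final)
    return str(final), True


def build_sharded(examples, out_dir, num_shards, num_proc,
                  executor_cls=ProcessPoolExecutor):
    """Split examples into num_shards slices and build them with num_proc workers."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    jobs = [
        (examples[i::num_shards], out_dir, i, num_shards)
        for i in range(num_shards)
    ]
    with executor_cls(max_workers=num_proc) as pool:
        return list(pool.map(build_shard, jobs))
```

Calling `build_sharded(examples, "out", num_shards=8, num_proc=4)` a second time after a crash would skip every shard whose final file already exists and only build the missing ones. When using the default process pool from a script, the call should sit under an `if __name__ == "__main__":` guard.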

Thank you!

@lhoestq

@stas00 stas00 added the enhancement New feature or request label Jul 14, 2021
@stas00 stas00 changed the title [load_dataset] parallelize [load_dataset] shard and parallelize the process Jul 14, 2021
@vanpersie32

I need the same feature for distributed training

@huggingface huggingface deleted a comment from vanpersie32 Oct 25, 2021
@lhoestq
Member

lhoestq commented Oct 7, 2022

I think @TevenLeScao is exploring adding multiprocessing in GeneratorBasedBuilder._prepare_split - feel free to post updates here :)

@TevenLeScao
Contributor

Posted a PR to address the building side. It still needs something to load sharded Arrow files, plus tests.

@mariosasko
Collaborator

Closing as this feature has been implemented in #5107


5 participants