-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiprocessed dataset builder #5107
Multiprocessed dataset builder #5107
Conversation
I would also like to add a test, but am not sure whether it should go into
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
Nice ! I think the test can go in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing it for both the GeneratorBasedBuilder and the ArrowBasedBuilder !
I added a comment and a question about the dataset order:
src/datasets/builder.py
Outdated
# should rename everything at the end, scheme still TBD | ||
def _rename_shard(shard_id_and_rank: Tuple[int]): | ||
shard_id, rank = shard_id_and_rank | ||
global_shard_id = sum(shards_per_rank[:rank]) + shard_id | ||
self._rename( | ||
fpath.replace("SSSSS", f"{shard_id:05d}").replace("RRRRR", f"{rank:05d}"), | ||
fpath.replace("RRRRR-SSSSS", f"{global_shard_id:05d}").replace("NNNNN", f"{total_shards:05d}"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this preserve the order of the original dataset ? If so that's amazing :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does! Or at least, this preserves the order of the shards in split_generator.gen_kwargs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it doesn't after testing, but I can't quite figure out why :/
I've added sharded arrow dataset loading. Two WIP items in the PR:
Also @lvwerra if you don't care about order, as I do, it's functional for now but I'd still quite like to get to the bottom of this. |
Found the ordering bug ! ( |
I fixed the tqdm to be less misleading, but it can't tell where to stop. I am a bit hesitant to add a top-level tqdm (on the shard iterator) since for most intents it will do 0 -> N shards straight, but I am not sure what is the best way to present that info here. |
I'm continuing the PR :) |
Alright this is ready for review - sorry it ended up so big ^^' If I can do anything to make it easier for your to review this PR @mariosasko let me know |
Multiprocessing is disabled by default but we may show a warning to encourage users to pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Do we have some benchmarks to see the speed-up?
Some nits:
Hey, is this error seems to you guys natural? The package built from >>> datasets.__version__
'2.6.1.dev0'
>>> >>> data = load_dataset('dataset_loaders/rfw2latentplay', num_proc=14)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/load.py", line 1719, in load_dataset
builder_instance = load_dataset_builder(
File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/load.py", line 1523, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 1292, in __init__
super().__init__(*args, **kwargs)
File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 303, in __init__
self.config, self.config_id = self._create_builder_config(
File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 456, in _create_builder_config
builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_proc' Let me know if I can help fixing this ... |
On my machine running |
I don't know where you got the |
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
Splits vs ShardsMaybe it's a good idea to add some documentation on the I had to read the whole dataset generation source code to find this out ... |
This is part of this PR :) you can check the changes in docs/source/dataset_script.mdx |
I took your comments into account @mariosasko thanks ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good now (besides the failing "PR docs" job, but not sure if we need to address this)!
The doc CI should be fixed by now hopefully, merging ! |
This PR adds the multiprocessing part of #2650 (but not the caching of already-computed arrow files). On the other side, loading of sharded arrow files still needs to be implemented (sharded parquet files can already be loaded).