
Multiprocessed dataset builder #5107

Merged (46 commits) on Nov 9, 2022

Conversation

TevenLeScao (Contributor)

This PR adds the multiprocessing part of #2650 (but not the caching of already-computed arrow files). On the loading side, reading sharded arrow files still needs to be implemented (sharded parquet files can already be loaded).
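
As a usage sketch (assuming the final API exposes num_proc on load_dataset and download_and_prepare as this PR intends; the dataset path is a placeholder):

from datasets import load_dataset

# Hypothetical usage: preparation work is dispatched across processes,
# one or more shards (items of the gen_kwargs lists) per process.
ds = load_dataset("path/to/my_dataset", num_proc=4)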

TevenLeScao (Contributor, Author)

I would also like to add a test, but I am not sure whether it should go into test_builder (more natural imo) or test_load (which already contains a lot of the things I have to import for my current testing setup). For reference, the script I run to check that it works looks like this:

import os
from pathlib import Path
import shutil

import datasets
from datasets.builder import DatasetBuilder
from datasets.features import Features, Value

DATASET_LOADING_SCRIPT_NAME = "__dummy_dataset1__"

DATASET_LOADING_SCRIPT_CODE = """
import os

import datasets
from datasets import DatasetInfo, Features, Split, SplitGenerator, Value


class __DummyDataset1__(datasets.GeneratorBasedBuilder):

    def _info(self) -> DatasetInfo:
        return DatasetInfo(features=Features({"text": Value("string")}))

    def _split_generators(self, dl_manager):
        return [
            SplitGenerator(Split.TRAIN, gen_kwargs={"filepaths": [os.path.join(dl_manager.manual_dir, "train1.txt"), os.path.join(dl_manager.manual_dir, "train2.txt")]}),
            SplitGenerator(Split.TEST, gen_kwargs={"filepaths": [os.path.join(dl_manager.manual_dir, "test.txt")]}),
        ]

    def _generate_examples(self, filepaths, **kwargs):
        idx = 0
        for filepath in filepaths:
            with open(filepath, "r", encoding="utf-8") as f:
                for line in f:
                    yield idx, {"text": line.strip()}
                    idx += 1
"""


def dataset_loading_script_dir(tmp_path):
    # Write the dummy dataset loading script to a temporary directory.
    script_name = DATASET_LOADING_SCRIPT_NAME
    script_dir = tmp_path / script_name
    script_dir.mkdir()
    script_path = script_dir / f"{script_name}.py"
    with open(script_path, "w") as f:
        f.write(DATASET_LOADING_SCRIPT_CODE)
    return str(script_dir)


def data_dir(tmp_path):
    # Create the manual data files: two train shards and one test shard.
    data_dir = tmp_path / "data_dir"
    data_dir.mkdir()
    with open(data_dir / "train1.txt", "w") as f:
        f.write("foo\n" * 10)
    with open(data_dir / "train2.txt", "w") as f:
        f.write("foo\n" * 10)
    with open(data_dir / "test.txt", "w") as f:
        f.write("bar\n" * 10)
    return str(data_dir)


def load_dataset_builder_multiprocessed(tmp_path):
    # Load the builder from the dummy script and prepare it with two processes.
    builder = datasets.load_dataset_builder(
        os.path.join(dataset_loading_script_dir(tmp_path), DATASET_LOADING_SCRIPT_NAME + ".py"),
        data_dir=data_dir(tmp_path),
    )
    assert isinstance(builder, DatasetBuilder)
    assert builder.name == DATASET_LOADING_SCRIPT_NAME
    assert builder.info.features == Features({"text": Value("string")})
    # A small max_shard_size forces several shards per process; num_proc=2 exercises multiprocessing.
    builder.download_and_prepare(tmp_path / "prepare_target", max_shard_size=500, num_proc=2)

if __name__ == "__main__":
    tmp_path = "tmp"
    if os.path.exists(tmp_path):
        raise FileExistsError(f"path {tmp_path} already exists")
    os.makedirs(tmp_path)
    try:
        load_dataset_builder_multiprocessed(Path(tmp_path))
    finally:
        shutil.rmtree(tmp_path)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq (Member) commented Oct 13, 2022

Nice! I think the test can go in test_builder.py :)
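
A minimal pytest version could look like this (a sketch, assuming the helpers from the snippet above are available in the test module; the test name is hypothetical):

# test_builder.py (sketch): reuses DATASET_LOADING_SCRIPT_NAME,
# dataset_loading_script_dir and data_dir from the snippet above.
import os

import datasets
from datasets.builder import DatasetBuilder
from datasets.features import Features, Value


def test_download_and_prepare_multiprocessed(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    builder = datasets.load_dataset_builder(
        os.path.join(dataset_loading_script_dir(tmp_path), DATASET_LOADING_SCRIPT_NAME + ".py"),
        data_dir=data_dir(tmp_path),
    )
    assert isinstance(builder, DatasetBuilder)
    assert builder.info.features == Features({"text": Value("string")})
    # A small max_shard_size forces several shards per worker process.
    builder.download_and_prepare(tmp_path / "prepare_target", max_shard_size=500, num_proc=2)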

lhoestq (Member) left a comment

Thanks for doing it for both the GeneratorBasedBuilder and the ArrowBasedBuilder!

I added a comment and a question about the dataset order:

src/datasets/iterable_dataset.py (outdated, resolved)
Comment on lines 1617 to 1624
# should rename everything at the end, scheme still TBD
def _rename_shard(shard_id_and_rank: Tuple[int]):
    shard_id, rank = shard_id_and_rank
    # Flatten (rank, shard_id) into a global shard index by counting all
    # shards written by lower-ranked processes.
    global_shard_id = sum(shards_per_rank[:rank]) + shard_id
    # Rename the per-process file (RRRRR = rank, SSSSS = shard id within
    # that rank) to the final flat scheme (global shard id out of NNNNN
    # total shards).
    self._rename(
        fpath.replace("SSSSS", f"{shard_id:05d}").replace("RRRRR", f"{rank:05d}"),
        fpath.replace("RRRRR-SSSSS", f"{global_shard_id:05d}").replace("NNNNN", f"{total_shards:05d}"),
    )
lhoestq (Member), Oct 13, 2022

Does this preserve the order of the original dataset? If so that's amazing :)

TevenLeScao (Contributor, Author)

It does! Or at least, this preserves the order of the shards in split_generator.gen_kwargs.

TevenLeScao (Contributor, Author)

Actually, after testing, it doesn't, but I can't quite figure out why :/

TevenLeScao (Contributor, Author) commented Oct 19, 2022

I've added sharded arrow dataset loading. Two WIP items remain in the PR:

  • Order is not preserved (it seems the sharded files are read in the wrong order)
  • The tqdm for preparing the splits is misleading (it compares against the size of the whole split rather than the size of the multiprocessing shard, and I am not sure how to access the latter)

Also, naming.filenames_for_dataset_split is not very elegant imo.

@lvwerra it's functional for now if you don't care about order; I do care, though, so I'd still quite like to get to the bottom of this.

TevenLeScao (Contributor, Author)

Found the ordering bug! (glob.glob returns files in arbitrary order.)
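
A minimal sketch of the resulting fix (assuming the loader lists shard files with glob; the pattern below is a placeholder):

import glob

# glob.glob makes no ordering guarantee, so shard files can come back in
# an arbitrary, filesystem-dependent order. Sorting the paths restores
# the deterministic numbering encoded in the filenames.
shard_files = sorted(glob.glob("output_dir/dataset-train-*-of-*.arrow"))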

TevenLeScao (Contributor, Author)

I fixed the tqdm to be less misleading, but it can't tell where to stop. I am a bit hesitant to add a top-level tqdm (on the shard iterator) since for most intents and purposes it will jump straight from 0 to N shards, and I am not sure of the best way to present that info here.

lhoestq (Member) commented Oct 20, 2022

I'm continuing the PR :)

lhoestq (Member) commented Nov 2, 2022

Alright, this is ready for review - sorry it ended up so big ^^'

If I can do anything to make this PR easier for you to review, @mariosasko, let me know.

lhoestq (Member) commented Nov 3, 2022

Multiprocessing is disabled by default, but we may want to show a warning to encourage users to pass num_proc if the dataset is split into many files. Let me know what you think.
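
A sketch of what such a warning might look like (hypothetical: the threshold, wording, and variable names are placeholders, not the PR's actual code):

import logging

logger = logging.getLogger(__name__)

# Hypothetical check inside the builder, run before sequential preparation.
num_shards = 64   # placeholder: number of items in the split's gen_kwargs lists
num_proc = None   # placeholder: the value the user passed (None by default)
if num_proc is None and num_shards > 1:
    logger.warning(
        f"This dataset is split into {num_shards} files. "
        "Pass num_proc to load_dataset/download_and_prepare to prepare it with multiprocessing."
    )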

mariosasko (Collaborator) left a comment

Looks great!

Do we have some benchmarks to see the speed-up?

Some nits:

Resolved review threads: docs/source/dataset_script.mdx (×4), src/datasets/arrow_writer.py, src/datasets/builder.py (×4)
parsa-ra commented Nov 6, 2022

Hey, does this error seem normal to you guys?

I built the package from commit 0d4e3907; here is the version displayed after import ...

>>> datasets.__version__
'2.6.1.dev0'
>>> 
>>> data = load_dataset('dataset_loaders/rfw2latentplay', num_proc=14)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/load.py", line 1719, in load_dataset
    builder_instance = load_dataset_builder(
  File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/load.py", line 1523, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 1292, in __init__
    super().__init__(*args, **kwargs)
  File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 303, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/somewhere//mambaforge/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 456, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_proc'

Let me know if I can help fix this ...

lhoestq (Member) commented Nov 7, 2022

Do we have some benchmarks to see the speed-up?

On my machine, running load_dataset("oscar-corpus/OSCAR-2201", "br") (which is split into shards) goes from 2-3k examples per sec to 4-5k examples per sec with num_proc=2 😉
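
For reference, the benchmark call would be something like this sketch (access requirements for the OSCAR repo, if any, are omitted):

from datasets import load_dataset

# OSCAR-2201 "br" is distributed as multiple shards, so two worker
# processes can prepare roughly half of the shards each.
ds = load_dataset("oscar-corpus/OSCAR-2201", "br", num_proc=2)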

lhoestq (Member) commented Nov 7, 2022

Hey, does this error seem normal to you guys?

I built the package from commit 0d4e390; here is the version displayed after import ...

I don't know where you got the 0d4e3907 commit from; it doesn't seem to be in this PR. You should try installing from this PR, or wait for it to be merged into main.

parsa-ra commented Nov 9, 2022

Splits vs Shards

Maybe it's a good idea to add some documentation on the sharding that can be achieved by passing list-based arguments to the SplitGenerator's gen_kwargs (see the sketch below) ...

I had to read the whole dataset generation source code to figure this out ...
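
A minimal sketch of the pattern in question (hypothetical script; each item of the filepaths list becomes one shard that can be dispatched to a worker process):

import datasets
from datasets import DatasetInfo, Features, Split, SplitGenerator, Value


class MyShardedDataset(datasets.GeneratorBasedBuilder):
    def _info(self) -> DatasetInfo:
        return DatasetInfo(features=Features({"text": Value("string")}))

    def _split_generators(self, dl_manager):
        # Passing a list in gen_kwargs is what enables sharding: with
        # num_proc > 1, the list is split across worker processes, and
        # each process generates examples from its own subset of items.
        return [
            SplitGenerator(
                Split.TRAIN,
                gen_kwargs={"filepaths": ["shard0.txt", "shard1.txt", "shard2.txt"]},
            )
        ]

    def _generate_examples(self, filepaths):
        idx = 0
        for filepath in filepaths:
            with open(filepath, encoding="utf-8") as f:
                for line in f:
                    yield idx, {"text": line.strip()}
                    idx += 1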

lhoestq (Member) commented Nov 9, 2022

Maybe it's a good idea to add some documentation on the sharding that can be achieved by passing list-based arguments to the SplitGenerator's gen_kwargs ...

This is part of this PR :) You can check the changes in docs/source/dataset_script.mdx.

lhoestq (Member) commented Nov 9, 2022

I took your comments into account, @mariosasko, thanks!
Let me know if it's good for you now ;)

mariosasko (Collaborator) left a comment

Looks all good now (besides the failing "PR docs" job, but I'm not sure we need to address that)!

lhoestq (Member) commented Nov 9, 2022

The doc CI should be fixed by now, hopefully. Merging!

lhoestq merged commit 2945690 into huggingface:main on Nov 9, 2022
lhoestq changed the title Multiprocessed dataset builder → [WIP] Multiprocessed dataset builder on Dec 1, 2022