Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid stuck map operation when subprocesses crashes #5976

Merged
merged 7 commits into from
Jul 10, 2023

Conversation

pappacena
Copy link
Contributor

@pappacena pappacena commented Jun 21, 2023

I've been using Dataset.map() with num_proc=os.cpu_count() to leverage multicore processing for my datasets, but from time to time I get stuck processes waiting forever. Apparently, when one of the subprocesses is abruptly killed (OOM killer, segfault, SIGKILL, etc), the main process keeps waiting for the async task sent to that child process to finish.

It seems to be easy to reproduce the issue with the following script:

import os
from datasets import Dataset, Features, Value


def do_stuck(item):
    os.kill(os.getpid(), 9)

data = {
    "col1": list(range(5)),
    "col2": list(range(5)),
}

ds = Dataset.from_dict(
    data,
    features=Features({
        "col1": Value("int64"),
        "col2": Value("int64"),
    }),
)

print(ds.map(do_stuck, num_proc=4))

This is an old behavior in Python, which apparently was fixed a few years ago in concurrent.futures.ProcessPoolExecutor (ref), but not in multiprocessing.pool.Pool / multiprocess.pool.Pool, which is used by Dataset.map (ref).

This PR is an idea to try to detect when a child process gets killed, and raises a RuntimeError warning the dataset.map() caller.

EDIT: Related proposal for future improvement: #5977

@lhoestq
Copy link
Member

lhoestq commented Jun 22, 2023

Hi ! Do you think this can be fixed at the Pool level ? Ideally it should be the Pool responsibility to handle this, not the map code. We could even subclass Pool if needed (at least the one from multiprocess)

@pappacena
Copy link
Contributor Author

@lhoestq it makes sense to me. Just pushed a refactoring creating a class ProcessPool(multiprocess.pool.Pool) to keep track of the PID changes.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 26, 2023

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ! Some additional suggestions:

def iflatmap_unordered(
pool: Union[multiprocessing.pool.Pool, multiprocess.pool.Pool],
pool: ProcessPool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to keep Union[multiprocessing.pool.Pool, multiprocess.pool.Pool] support in iflatmap_unordered

src/datasets/utils/py_utils.py Outdated Show resolved Hide resolved
@lhoestq
Copy link
Member

lhoestq commented Jun 26, 2023

I managed to raise an error without subclassing Pool with two additions to iflatmap_unordered:

  1. at the beggining
original_pool = list(pool._pool)
  1. in the loop
if any(async_result._pool != original_pool for async_result in async_results) and queue.empty():
    raise RuntimeError(
        "One of the subprocesses has abruptly died during map operation."
        "To debug the error, disable multiprocessing."
    )

It's still a fix that only works for iflatmap_unordered (so not for map, imap etc) but is maybe simpler that subclassing. It also works for both multiprocessing.Pool and multiprocess.Pool

@pappacena pappacena force-pushed the tfp/avoid-stuck-map-operation branch from e6a19f3 to 45a5ee0 Compare July 5, 2023 22:09
@pappacena pappacena force-pushed the tfp/avoid-stuck-map-operation branch from 45a5ee0 to 2ad4253 Compare July 5, 2023 22:14
@pappacena pappacena force-pushed the tfp/avoid-stuck-map-operation branch from 2ad4253 to 095d5ea Compare July 5, 2023 22:14
@pappacena
Copy link
Contributor Author

@lhoestq sorry for the delay. Busy weeks here.

I just pushed the change you requested. It looks closer to the original proposal, actually.

It seems that map actually uses iflatmap_unordered (here). I think this solution works fine for the map method (which is the one being tested by the new tests/test_arrow_dataset.py::BaseDatasetTest::test_map_crash_subprocess, right?).

@lhoestq
Copy link
Member

lhoestq commented Jul 6, 2023

Yes fixing iflatmap_unordered does fix Dataset.map, but it won't fix any Pool.map that we may use elsewhere so we'll have to keep this in mind.

@lhoestq
Copy link
Member

lhoestq commented Jul 6, 2023

It looks all good to me, feel free to fix code formatting by running make style and we can merge :)

@pappacena
Copy link
Contributor Author

Yes fixing iflatmap_unordered does fix Dataset.map, but it won't fix any Pool.map that we may use elsewhere so we'll have to keep this in mind.

Right, I agree. The best way moving forward is probably not using the buggy multiprocess.Pool anymore, and replace it with concurrent.futures.ProcessPoolExecutor as much as possible.

Anyway, I've run make style now. Thanks for the support!

@lhoestq
Copy link
Member

lhoestq commented Jul 6, 2023

It looks like checking the async_result._pool doesn't always work - sorry about that. We might just go back to your original solution then. Would also be cool to open an issue in multiprocess to ask if they have a solution or if they plan to fix this.

@pappacena
Copy link
Contributor Author

@lhoestq no problem! Reverted to the previous version.

TBH, given the discussions in this python issue, I don't think the error in multiprocess will be merged upstream any time soon...

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

@lhoestq lhoestq merged commit aca4cdc into huggingface:main Jul 10, 2023
@github-actions
Copy link

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006060 / 0.011353 (-0.005293) 0.003695 / 0.011008 (-0.007313) 0.080484 / 0.038508 (0.041976) 0.061894 / 0.023109 (0.038785) 0.312510 / 0.275898 (0.036612) 0.352398 / 0.323480 (0.028918) 0.004638 / 0.007986 (-0.003348) 0.002918 / 0.004328 (-0.001410) 0.062932 / 0.004250 (0.058681) 0.050859 / 0.037052 (0.013807) 0.316812 / 0.258489 (0.058323) 0.357684 / 0.293841 (0.063843) 0.027622 / 0.128546 (-0.100924) 0.008012 / 0.075646 (-0.067634) 0.260970 / 0.419271 (-0.158302) 0.045807 / 0.043533 (0.002275) 0.321235 / 0.255139 (0.066096) 0.343162 / 0.283200 (0.059962) 0.021136 / 0.141683 (-0.120547) 1.465886 / 1.452155 (0.013731) 1.500216 / 1.492716 (0.007500)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.187286 / 0.018006 (0.169279) 0.428724 / 0.000490 (0.428235) 0.003029 / 0.000200 (0.002829) 0.000063 / 0.000054 (0.000008)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.022703 / 0.037411 (-0.014708) 0.072740 / 0.014526 (0.058215) 0.083436 / 0.176557 (-0.093120) 0.144559 / 0.737135 (-0.592577) 0.083958 / 0.296338 (-0.212380)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.435729 / 0.215209 (0.220520) 4.351146 / 2.077655 (2.273491) 2.316627 / 1.504120 (0.812508) 2.144587 / 1.541195 (0.603393) 2.209182 / 1.468490 (0.740692) 0.501131 / 4.584777 (-4.083646) 3.077085 / 3.745712 (-0.668627) 4.353706 / 5.269862 (-0.916156) 2.621523 / 4.565676 (-1.944154) 0.058976 / 0.424275 (-0.365299) 0.006467 / 0.007607 (-0.001141) 0.506690 / 0.226044 (0.280646) 5.085787 / 2.268929 (2.816858) 2.731336 / 55.444624 (-52.713289) 2.419451 / 6.876477 (-4.457025) 2.583649 / 2.142072 (0.441577) 0.589869 / 4.805227 (-4.215359) 0.131040 / 6.500664 (-6.369624) 0.061332 / 0.075469 (-0.014137)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.220542 / 1.841788 (-0.621245) 18.169643 / 8.074308 (10.095335) 13.251704 / 10.191392 (3.060312) 0.142952 / 0.680424 (-0.537472) 0.016639 / 0.534201 (-0.517562) 0.334851 / 0.579283 (-0.244432) 0.361865 / 0.434364 (-0.072499) 0.380933 / 0.540337 (-0.159404) 0.527374 / 1.386936 (-0.859562)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006319 / 0.011353 (-0.005034) 0.003778 / 0.011008 (-0.007231) 0.062388 / 0.038508 (0.023880) 0.062228 / 0.023109 (0.039119) 0.373727 / 0.275898 (0.097829) 0.399442 / 0.323480 (0.075962) 0.005434 / 0.007986 (-0.002551) 0.003020 / 0.004328 (-0.001308) 0.062774 / 0.004250 (0.058524) 0.052784 / 0.037052 (0.015732) 0.376428 / 0.258489 (0.117939) 0.405039 / 0.293841 (0.111198) 0.027884 / 0.128546 (-0.100662) 0.008086 / 0.075646 (-0.067561) 0.067078 / 0.419271 (-0.352194) 0.042927 / 0.043533 (-0.000606) 0.372142 / 0.255139 (0.117003) 0.389604 / 0.283200 (0.106405) 0.021582 / 0.141683 (-0.120101) 1.473332 / 1.452155 (0.021177) 1.536018 / 1.492716 (0.043302)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.184729 / 0.018006 (0.166723) 0.421065 / 0.000490 (0.420575) 0.002681 / 0.000200 (0.002481) 0.000070 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.026067 / 0.037411 (-0.011344) 0.077138 / 0.014526 (0.062612) 0.085178 / 0.176557 (-0.091379) 0.139681 / 0.737135 (-0.597454) 0.087528 / 0.296338 (-0.208810)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.444899 / 0.215209 (0.229690) 4.459168 / 2.077655 (2.381513) 2.408792 / 1.504120 (0.904672) 2.237243 / 1.541195 (0.696048) 2.296298 / 1.468490 (0.827808) 0.498508 / 4.584777 (-4.086269) 3.067064 / 3.745712 (-0.678648) 4.470577 / 5.269862 (-0.799284) 2.701972 / 4.565676 (-1.863705) 0.057711 / 0.424275 (-0.366564) 0.006443 / 0.007607 (-0.001164) 0.524046 / 0.226044 (0.298002) 5.229928 / 2.268929 (2.961000) 2.862101 / 55.444624 (-52.582523) 2.545972 / 6.876477 (-4.330504) 2.606459 / 2.142072 (0.464387) 0.593285 / 4.805227 (-4.211942) 0.124913 / 6.500664 (-6.375751) 0.061942 / 0.075469 (-0.013527)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.322162 / 1.841788 (-0.519625) 18.745796 / 8.074308 (10.671488) 13.955443 / 10.191392 (3.764051) 0.145610 / 0.680424 (-0.534814) 0.016817 / 0.534201 (-0.517384) 0.331180 / 0.579283 (-0.248103) 0.343019 / 0.434364 (-0.091345) 0.379459 / 0.540337 (-0.160878) 0.526403 / 1.386936 (-0.860533)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants