Avoid deadlocks in tests that use `popen` #6483

gjoseph92 · 2022-06-01T01:47:53Z

I believe tests using popen may be occasionally failing with subprocess.TimeoutExpired errors because they're deadlocking in the way the subprocess docs warn you to avoid.

Instead of using Popen.wait() to wait for subprocesses to shut down, we should use Popen.communicate(). If the subprocess writes a bunch of stuff to stdout/stderr as it's shutting down, the stdout pipe may get filled up, blocking further writes and preventing the subprocess from shutting down.
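To make the failure mode concrete, here is a minimal sketch of the deadlock described above. It is not the code from this PR: the `CHILD` script and both function names are hypothetical, and the child simply writes more data than an OS pipe buffer can hold.

```python
import subprocess
import sys

# Hypothetical child: writes far more than a typical pipe buffer
# (64 KiB by default on Linux), then exits.
CHILD = "import sys; sys.stdout.write('x' * 1_000_000); sys.stdout.flush()"


def wait_deadlocks(timeout: float = 2.0) -> bool:
    """Return True if Popen.wait() times out because nobody drains the pipe."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    try:
        # The child blocks in write() once the pipe is full, so it never exits.
        proc.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        proc.kill()
        proc.communicate()  # drain the pipe and reap the child


def communicate_finishes(timeout: float = 30.0) -> int:
    """Return the number of bytes read; communicate() drains while it waits."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    out, _ = proc.communicate(timeout=timeout)
    return len(out)
```

With nothing reading the pipe, `wait()` hits its timeout, whereas `communicate()` reads the output as it waits and returns once the child exits.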

I can't confirm this is actually what's happening. I just see these tracebacks pointing to a wait() call, a warning in the docs about wait deadlocking, and my new test confirming that if this did happen, the current implementation would fail with TimeoutExpired. So this seems like the right thing to do regardless. But it's entirely possible this isn't the problem, and that the scheduler/worker is instead failing to respond to SIGINT and never shutting down.

c4737b6 is the important change.

In 6a8ad6e, I refactored our popen helper to not even capture stdout/stderr if we weren't going to use it (very few tests do). This may not be strictly necessary, but it just seems much simpler and more reliable.

Previously, we were launching Popen.communicate in a background thread to flush the pipe. This is complicated, and may not have actually worked reliably. Popen.communicate, like all interactions with Popen or file objects, is not thread-safe. Tests like distributed.cli.tests.test_dask_scheduler.test_hostport were timing out despite using flush_output=True, which should in principle have made them immune to this problem. So I'm wondering if Popen.communicate in one thread and Popen.wait in another could intermittently cause some internal Popen state to break.

I'm hoping this will fix the flakiness in:

distributed.cli.tests.test_dask_scheduler.test_hostport
distributed.cli.tests.test_dask_scheduler.test_preload_command
distributed.cli.tests.test_dask_scheduler.test_preload_command_default
distributed.cli.tests.test_dask_scheduler.test_preload_module
distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero #6395
distributed.cli.tests.test_dask_worker.test_error_during_startup[--nanny]
distributed.cli.tests.test_client.test_quiet_close_process[True]

Tests added / passed
Passes pre-commit run --all-files

cc @crusaderky @fjetter @graingert

The subprocess writes a bunch of output when it terminates. Using `Popen.wait()` here will deadlock, as the Python docs loudly warn you in numerous places.

Not a huge fan of this; it's a weird argument to pass in. Maybe should just inline the function.

Our `popen` helper would always capture stdout/stderr. Redirecting output via pipes carries the risk of deadlock (see the admonition under https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr), so we would run `Popen.communicate` in a background thread to keep draining the pipe. If the test isn't actually using stdout/stderr (most don't), it's simpler not to redirect it and to let it print as normal. As usual, pytest will hide the output if the test passes and print it if it fails. This change isn't strictly necessary; it's just a simplification. It also makes it a little easier to implement the terminate-communicate logic for the `capture_output=True` case, since you don't have to worry about a background thread already running `communicate`.
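The terminate-then-communicate pattern can be sketched roughly as follows. This is an illustration under assumptions, not the actual `popen` helper in `distributed/utils_test.py`: the name `managed_process` and its `timeout` parameter are hypothetical.

```python
from __future__ import annotations

import subprocess
import sys
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def managed_process(
    args: list[str], capture_output: bool = False, timeout: float = 10.0
) -> Iterator[subprocess.Popen]:
    """Hypothetical helper: start a subprocess and shut it down on exit."""
    kwargs = {}
    if capture_output:
        # Only redirect output when the caller actually wants to read it.
        kwargs["stdout"] = subprocess.PIPE
        kwargs["stderr"] = subprocess.STDOUT
    proc = subprocess.Popen(args, **kwargs)
    try:
        yield proc
    finally:
        proc.terminate()
        try:
            # communicate() keeps draining stdout while waiting, so output
            # written during shutdown cannot fill the pipe and block the child.
            proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()
```

Because no background thread is involved, only the context manager ever calls `communicate`, which sidesteps the thread-safety question entirely.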

github-actions · 2022-06-01T02:48:43Z

Unit Test Results

      15 files ±  0       15 suites ±0 6h 37m 25s ⏱️ + 28m 3s
  2 829 tests +10   2 714 ✔️ - 24   82 💤 +  2 30 ❌ +29 3 🔥 +3
20 966 runs +64 19 986 ✔️ +81 944 💤 - 52 33 ❌ +32 3 🔥 +3

For more details on these failures and errors, see this check.

Results for commit 90fe1b5. ± Comparison against base commit c2b28cf.

♻️ This comment has been updated with latest results.

crusaderky

Cosmetic notes only

github-actions · 2022-06-08T18:51:32Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±    0       15 suites ±0 6h 28m 24s ⏱️ + 19m 2s
  2 854 tests +  35   2 769 ✔️ +  31   82 💤 +  2   3 ❌ +  2
21 144 runs +242 20 182 ✔️ +277 947 💤 - 49 15 ❌ +14

For more details on these failures, see this check.

Results for commit 89669f6. ± Comparison against base commit c2b28cf.

crusaderky

Failing tests:
All 7 runs failed: test_dashboard_non_standard_ports (distributed.cli.tests.test_dask_scheduler)
All 7 runs failed: test_scheduler_port_zero (distributed.cli.tests.test_dask_scheduler)

graingert · 2022-06-09T09:44:34Z

https://github.com/dask/distributed/runs/6798102761?check_suite_focus=true#step:11:1744

______________________ test_dashboard_non_standard_ports _______________________
loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x7f855755d730>
    def test_dashboard_non_standard_ports(loop):
        pytest.importorskip("bokeh")
>       with popen(
            ["dask-scheduler", "--port", "23448", "--dashboard-address", ":24832"],
            flush_output=False,
        ) as proc:
distributed/cli/tests/test_dask_scheduler.py:101: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
args = ['/usr/share/miniconda3/envs/dask-distributed/bin/dask-scheduler', '--port', '23448', '--dashboard-address', ':24832']
capture_output = False, kwargs = {'flush_output': False}
    @contextmanager
    def popen(
        args: list[str], capture_output: bool = False, **kwargs
    ) -> Iterator[subprocess.Popen[bytes]]:
        """Start a shell command in a subprocess.
        Yields a subprocess.Popen object.
        On exit, the subprocess is terminated.
        Parameters
        ----------
        args: list[str]
            Command line arguments
        capture_output: bool, default False
            Set to True if you need to read output from the subprocess.
            Stdout and stderr will both be piped to ``proc.stdout``.
            If False, the subprocess will write to stdout/stderr normally.
            When True, the test could deadlock if the stdout pipe's buffer gets full
            (Linux default is 65536 bytes; macOS and Windows may be smaller).
            Therefore, you may need to periodically read from ``proc.stdout``, or
            use ``proc.communicate``. All the deadlock warnings apply from
            https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr.
            Note that ``proc.communicate`` is called automatically when the
            contextmanager exits. Calling code must not call ``proc.communicate``
            in a separate thread, since it's not thread-safe.
        kwargs: optional
            optional arguments to subprocess.Popen
        """
        if capture_output:
            kwargs["stdout"] = subprocess.PIPE
            kwargs["stderr"] = subprocess.STDOUT
        if sys.platform.startswith("win"):
            # Allow using CTRL_C_EVENT / CTRL_BREAK_EVENT
            kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
        args = list(args)
        if sys.platform.startswith("win"):
            args[0] = os.path.join(sys.prefix, "Scripts", args[0])
        else:
            args[0] = os.path.join(
                os.environ.get("DESTDIR", "") + sys.prefix, "bin", args[0]
            )
>       with subprocess.Popen(args, **kwargs) as proc:
E       TypeError: __init__() got an unexpected keyword argument 'flush_output'
distributed/utils_test.py:1344: TypeError
___________________________ test_scheduler_port_zero ___________________________
loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x7f85577c77c0>
    def test_scheduler_port_zero(loop):
        with tmpfile() as fn:
>           with popen(
                ["dask-scheduler", "--no-dashboard", "--scheduler-file", fn, "--port", "0"],
                flush_output=False,
            ) as proc:
distributed/cli/tests/test_dask_scheduler.py:211: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
args = ['/usr/share/miniconda3/envs/dask-distributed/bin/dask-scheduler', '--no-dashboard', '--scheduler-file', '/tmp/tmpintk7143.', '--port', '0']
capture_output = False, kwargs = {'flush_output': False}
    @contextmanager
    def popen(
        args: list[str], capture_output: bool = False, **kwargs
    ) -> Iterator[subprocess.Popen[bytes]]:
        """Start a shell command in a subprocess.
        Yields a subprocess.Popen object.
        On exit, the subprocess is terminated.
        Parameters
        ----------
        args: list[str]
            Command line arguments
        capture_output: bool, default False
            Set to True if you need to read output from the subprocess.
            Stdout and stderr will both be piped to ``proc.stdout``.
            If False, the subprocess will write to stdout/stderr normally.
            When True, the test could deadlock if the stdout pipe's buffer gets full
            (Linux default is 65536 bytes; macOS and Windows may be smaller).
            Therefore, you may need to periodically read from ``proc.stdout``, or
            use ``proc.communicate``. All the deadlock warnings apply from
            https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr.
            Note that ``proc.communicate`` is called automatically when the
            contextmanager exits. Calling code must not call ``proc.communicate``
            in a separate thread, since it's not thread-safe.
        kwargs: optional
            optional arguments to subprocess.Popen
        """
        if capture_output:
            kwargs["stdout"] = subprocess.PIPE
            kwargs["stderr"] = subprocess.STDOUT
        if sys.platform.startswith("win"):
            # Allow using CTRL_C_EVENT / CTRL_BREAK_EVENT
            kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
        args = list(args)
        if sys.platform.startswith("win"):
            args[0] = os.path.join(sys.prefix, "Scripts", args[0])
        else:
            args[0] = os.path.join(
                os.environ.get("DESTDIR", "") + sys.prefix, "bin", args[0]
            )
>       with subprocess.Popen(args, **kwargs) as proc:
E       TypeError: __init__() got an unexpected keyword argument 'flush_output'
distributed/utils_test.py:1344: TypeError

graingert · 2022-06-09T09:45:44Z

looks like 781af78 introduced a new use of flush_output=
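For context on why a leftover `flush_output=` breaks: a helper that forwards `**kwargs` straight to `subprocess.Popen` rejects any keyword that `Popen` itself does not define. A minimal sketch, where `launch` is a hypothetical stand-in for the real `popen` helper:

```python
import subprocess
import sys


def launch(args, capture_output=False, **kwargs):
    """Hypothetical stand-in: unknown keywords are passed through to Popen."""
    if capture_output:
        kwargs["stdout"] = subprocess.PIPE
        kwargs["stderr"] = subprocess.STDOUT
    return subprocess.Popen(args, **kwargs)


def try_launch_with_removed_flag() -> str:
    """Return the TypeError message raised for the stale keyword, if any."""
    try:
        proc = launch([sys.executable, "-c", "pass"], flush_output=False)
    except TypeError as exc:
        return str(exc)
    proc.wait()
    return ""
```

Since the helper no longer names `flush_output` as a parameter, the keyword lands in `kwargs` and `Popen` raises before any process is started.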

gjoseph92 · 2022-06-09T16:24:56Z

Blocked by #6547 (I could incorporate the revert into here, but feels cleaner to have a separate history). This PR supersedes #6502.

crusaderky · 2022-06-09T17:29:55Z

please merge from main

gjoseph92 · 2022-06-09T17:44:13Z

thanks @crusaderky

github-actions · 2022-06-09T19:32:29Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files +        8       15 suites +8 6h 14m 22s ⏱️ + 3h 55m 25s
  2 856 tests +        2   2 772 ✔️ +      12   82 💤 -   12 2 ❌ +2
21 158 runs +11 403 20 211 ✔️ +10 821 945 💤 +580 2 ❌ +2

For more details on these failures, see this check.

Results for commit 79f3bcb. ± Comparison against base commit 81e237b.

mrocklin · 2022-06-09T20:20:29Z

If this stops the popen tests then Gabe, you have my undying gratitude.

mrocklin · 2022-06-09T20:20:46Z

*popen test failures

gjoseph92 · 2022-06-09T21:38:37Z

Let's see after a few days. I expect some of these tests will still fail, but for typical reasons (port already in use, OSError timed out connecting to scheduler, etc.).

crusaderky · 2022-06-11T09:42:54Z

It doesn't seem to be effective: https://github.com/dask/distributed/runs/6825736211?check_suite_focus=true

gjoseph92 · 2022-06-11T23:35:19Z

Yup. We're slowly getting closer though. Now we can get more information: #6567

gjoseph92 added 4 commits May 31, 2022 17:55

Test for write-on-terminate deadlock

679b36a

The subprocess writes a bunch of output when it terminates. Using `Popen.wait()` here will deadlock, as the Python docs loudly warn you in numerous places.

remove experimental code

1e22bb6

communicate instead of wait in _terminate

c4737b6

Not a huge fan of this; it's a weird argument to pass in. Maybe should just inline the function.

graingert reviewed Jun 1, 2022

distributed/utils_test.py (outdated, resolved)

crusaderky assigned gjoseph92 Jun 1, 2022

crusaderky self-requested a review June 1, 2022 15:05

full type annotations

90fe1b5

Co-authored-by: Thomas Grainger <tagrain@gmail.com>

gjoseph92 marked this pull request as ready for review June 1, 2022 15:10

crusaderky reviewed Jun 7, 2022

distributed/tests/test_utils_test.py (outdated, resolved)

crusaderky reviewed Jun 7, 2022

distributed/tests/test_utils_test.py (outdated, resolved)

crusaderky requested changes Jun 7, 2022

gjoseph92 and others added 3 commits June 8, 2022 10:43

Cosmetics

8f07c3d

Co-authored-by: crusaderky <crusaderky@gmail.com>

import textwrap

b0af220

88 character comments

89669f6

gjoseph92 mentioned this pull request Jun 8, 2022

Fix Scheduler.restart logic #6504

Merged

2 tasks

crusaderky approved these changes Jun 9, 2022

crusaderky requested changes Jun 9, 2022

This was referenced Jun 9, 2022

Fix CLI Scheduler Tests #6502

Merged

Revert "Fix CLI Scheduler Tests" #6547

Merged

Merge remote-tracking branch 'upstream/main' into popen-terminate-dea…

79f3bcb

…dlock

crusaderky approved these changes Jun 9, 2022

crusaderky merged commit 43ca938 into dask:main Jun 9, 2022

gjoseph92 deleted the popen-terminate-deadlock branch June 9, 2022 20:17

gjoseph92 mentioned this pull request Jun 17, 2022

Do not log in signal handler #6590

Merged
