Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid deadlocks in tests that use popen #6483

Merged
merged 9 commits into from
Jun 9, 2022

Conversation

gjoseph92
Copy link
Collaborator

I believe tests using popen may be occasionally failing with subprocess.TimeoutExpired errors because they're deadlocking in the way the subprocess docs warn you to avoid.

Instead of using Popen.wait() to wait for subprocesses to shut down, we should use Popen.communicate(). If the subprocess writes a bunch of stuff to stdout/stderr as it's shutting down, the stdout pipe may get filled up, blocking further writes and preventing the subprocess from shutting down.

I can't confirm this is actually what's happening. I just see these tracebacks pointing to a wait() call, a warning in the docs about wait deadlocking, and my new test confirming that if this did happen, the current implementation would fail with TimeoutExpired. So this seems like the right thing to do regardless. But it's entirely possible this isn't the problem (and it's actually something where the scheduler/worker isn't responding to SIGINT well and isn't shutting down).

c4737b6 is the important change.

In 6a8ad6e, I refactored our popen helper to not even capture stdout/stderr if we weren't going to use it (very few tests do). This may not be strictly necessary, but it just seems much simpler and more reliable.

Previously, we were launching Popen.communicate in a background thread to flush the pipe. This is complicated, and may not have actually worked reliably. Popen.communicate, like all interactions with Popen or file objects, is not thread-safe. Tests like distributed.cli.tests.test_dask_scheduler.test_hostport were timing out despite using flush_output=True, which should in principle have made them immune to this problem. So I'm wondering if Popen.communicate in one thread and Popen.wait in another could intermittently cause some internal Popen state to break.

I'm hoping this will fix the flakiness in:

  • distributed.cli.tests.test_dask_scheduler.test_hostport
  • distributed.cli.tests.test_dask_scheduler.test_preload_command
  • distributed.cli.tests.test_dask_scheduler.test_preload_command_default
  • distributed.cli.tests.test_dask_scheduler.test_preload_module
  • distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero #6395
  • distributed.cli.tests.test_dask_worker.test_error_during_startup[--nanny]
  • distributed.cli.tests.test_client.test_quiet_close_process[True]
  • Tests added / passed
  • Passes pre-commit run --all-files

cc @crusaderky @fjetter @graingert

gjoseph92 added 4 commits May 31, 2022 17:55
The subprocess writes a bunch of output when it terminates. Using `Popen.wait()` here will deadlock, as the Python docs loudly warn you in numerous places.
Not a huge fan of this; it's a weird argument to pass in. Maybe should just inline the function.
Our `popen` helper would always capture stdout/stderr. Redirecting output via pipes carries the risk of deadlock (see admonition under https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr), so we would run `Popen.communicate` in a background thread to always be draining the pipe.

If the test wasn't actually using stdout/stderr (most don't), it's just simpler to just not redirect it and let it print out as normal. As usual, pytest will hide the output if the test passes, and print it if it fails.

This change isn't strictly necessary, it's just a simplification. And it makes it a little easier to implement the terminate-communicate logic for the `capture_output=True` case, since you don't have to worry about a background thread already running `communicate`.
@github-actions
Copy link
Contributor

github-actions bot commented Jun 1, 2022

Unit Test Results

       15 files  ±  0         15 suites  ±0   6h 37m 25s ⏱️ + 28m 3s
  2 829 tests +10    2 714 ✔️  - 24    82 💤 +  2  30 +29  3 🔥 +3 
20 966 runs  +64  19 986 ✔️ +81  944 💤  - 52  33 +32  3 🔥 +3 

For more details on these failures and errors, see this check.

Results for commit 90fe1b5. ± Comparison against base commit c2b28cf.

♻️ This comment has been updated with latest results.

distributed/utils_test.py Outdated Show resolved Hide resolved
@crusaderky crusaderky self-requested a review June 1, 2022 15:05
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
@gjoseph92 gjoseph92 marked this pull request as ready for review June 1, 2022 15:10
Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic notes only

gjoseph92 and others added 3 commits June 8, 2022 10:43
Co-authored-by: crusaderky <crusaderky@gmail.com>
@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±    0         15 suites  ±0   6h 28m 24s ⏱️ + 19m 2s
  2 854 tests +  35    2 769 ✔️ +  31    82 💤 +  2    3 +  2 
21 144 runs  +242  20 182 ✔️ +277  947 💤  - 49  15 +14 

For more details on these failures, see this check.

Results for commit 89669f6. ± Comparison against base commit c2b28cf.

@gjoseph92 gjoseph92 mentioned this pull request Jun 8, 2022
2 tasks
Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing tests:
All 7 runs failed: test_dashboard_non_standard_ports (distributed.cli.tests.test_dask_scheduler)
All 7 runs failed: test_scheduler_port_zero (distributed.cli.tests.test_dask_scheduler)

@graingert
Copy link
Member

https://github.com/dask/distributed/runs/6798102761?check_suite_focus=true#step:11:1744

______________________ test_dashboard_non_standard_ports _______________________
loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x7f855755d730>
deftest_dashboard_non_standard_ports(loop):
        pytest.importorskip("bokeh")
>       with popen(
            ["dask-scheduler", "--port", "23448", "--dashboard-address", ":24832"],
            flush_output=False,
        ) as proc:
distributed/cli/tests/test_dask_scheduler.py:101: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:113: in __enter__
returnnext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
args = ['/usr/share/miniconda3/envs/dask-distributed/bin/dask-scheduler', '--port', '23448', '--dashboard-address', ':24832']
capture_output = False, kwargs = {'flush_output': False}
@contextmanager
defpopen(
        args: list[str], capture_output: bool = False, **kwargs
    ) -> Iterator[subprocess.Popen[bytes]]:
"""Start a shell command in a subprocess.
    Yields a subprocess.Popen object.
    On exit, the subprocess is terminated.
    Parameters
    ----------
    args: list[str]
        Command line arguments
    capture_output: bool, default False
        Set to True if you need to read output from the subprocess.
        Stdout and stderr will both be piped to ``proc.stdout``.
        If False, the subprocess will write to stdout/stderr normally.
        When True, the test could deadlock if the stdout pipe's buffer gets full
        (Linux default is 65536 bytes; macOS and Windows may be smaller).
        Therefore, you may need to periodically read from ``proc.stdout``, or
        use ``proc.communicate``. All the deadlock warnings apply from
       [ https://docs.python.org/3/library/subprocess.html#subprocess.Popen](https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr.)
        Note that ``proc.communicate`` is called automatically when the
        contextmanager exits. Calling code must not call ``proc.communicate``
        in a separate thread, since it's not thread-safe.
    kwargs: optional
        optional arguments to subprocess.Popen
    """
if capture_output:
            kwargs["stdout"] = subprocess.PIPE
            kwargs["stderr"] = subprocess.STDOUT
if sys.platform.startswith("win"):
# Allow using CTRL_C_EVENT / CTRL_BREAK_EVENT
            kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
        args = list(args)
if sys.platform.startswith("win"):
            args[0] = os.path.join(sys.prefix, "Scripts", args[0])
else:
            args[0] = os.path.join(
                os.environ.get("DESTDIR", "") + sys.prefix, "bin", args[0]
            )
>       with subprocess.Popen(args, **kwargs) as proc:
E       TypeError: __init__() got an unexpected keyword argument 'flush_output'
distributed/utils_test.py:1344: TypeError
___________________________ test_scheduler_port_zero ___________________________
loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x7f85577c77c0>
deftest_scheduler_port_zero(loop):
with tmpfile() as fn:
>           with popen(
                ["dask-scheduler", "--no-dashboard", "--scheduler-file", fn, "--port", "0"],
                flush_output=False,
            ) as proc:
distributed/cli/tests/test_dask_scheduler.py:211: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:113: in __enter__
returnnext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
args = ['/usr/share/miniconda3/envs/dask-distributed/bin/dask-scheduler', '--no-dashboard', '--scheduler-file', '/tmp/tmpintk7143.', '--port', '0']
capture_output = False, kwargs = {'flush_output': False}
@contextmanager
defpopen(
        args: list[str], capture_output: bool = False, **kwargs
    ) -> Iterator[subprocess.Popen[bytes]]:
"""Start a shell command in a subprocess.
    Yields a subprocess.Popen object.
    On exit, the subprocess is terminated.
    Parameters
    ----------
    args: list[str]
        Command line arguments
    capture_output: bool, default False
        Set to True if you need to read output from the subprocess.
        Stdout and stderr will both be piped to ``proc.stdout``.
        If False, the subprocess will write to stdout/stderr normally.
        When True, the test could deadlock if the stdout pipe's buffer gets full
        (Linux default is 65536 bytes; macOS and Windows may be smaller).
        Therefore, you may need to periodically read from ``proc.stdout``, or
        use ``proc.communicate``. All the deadlock warnings apply from
       [ https://docs.python.org/3/library/subprocess.html#subprocess.Popen](https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr.)
        Note that ``proc.communicate`` is called automatically when the
        contextmanager exits. Calling code must not call ``proc.communicate``
        in a separate thread, since it's not thread-safe.
    kwargs: optional
        optional arguments to subprocess.Popen
    """
if capture_output:
            kwargs["stdout"] = subprocess.PIPE
            kwargs["stderr"] = subprocess.STDOUT
if sys.platform.startswith("win"):
# Allow using CTRL_C_EVENT / CTRL_BREAK_EVENT
            kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
        args = list(args)
if sys.platform.startswith("win"):
            args[0] = os.path.join(sys.prefix, "Scripts", args[0])
else:
            args[0] = os.path.join(
                os.environ.get("DESTDIR", "") + sys.prefix, "bin", args[0]
            )
>       with subprocess.Popen(args, **kwargs) as proc:
E       TypeError: __init__() got an unexpected keyword argument 'flush_output'
distributed/utils_test.py:1344: TypeError

@graingert
Copy link
Member

looks like 781af78 introduced a new use of flush_output=

@gjoseph92
Copy link
Collaborator Author

Blocked by #6547 (I could incorporate the revert into here, but feels cleaner to have a separate history). This PR supersedes #6502.

@crusaderky
Copy link
Collaborator

please merge from main

@gjoseph92
Copy link
Collaborator Author

thanks @crusaderky

@github-actions
Copy link
Contributor

github-actions bot commented Jun 9, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  +         8         15 suites  +8   6h 14m 22s ⏱️ + 3h 55m 25s
  2 856 tests +         2    2 772 ✔️ +       12    82 💤  -   12  2 +2 
21 158 runs  +11 403  20 211 ✔️ +10 821  945 💤 +580  2 +2 

For more details on these failures, see this check.

Results for commit 79f3bcb. ± Comparison against base commit 81e237b.

@crusaderky crusaderky merged commit 43ca938 into dask:main Jun 9, 2022
@gjoseph92 gjoseph92 deleted the popen-terminate-deadlock branch June 9, 2022 20:17
@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2022 via email

@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2022 via email

@gjoseph92
Copy link
Collaborator Author

Let's see after a few days. I expect some of these tests will still fail, but for typical reasons (port already in use, OSError timed out connecting to scheduler, etc.).

@crusaderky
Copy link
Collaborator

@gjoseph92
Copy link
Collaborator Author

Yup. We're slowly getting closer though. Now we can get more information: #6567

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants