Avoid deadlocks in tests that use popen
#6483
Conversation
The subprocess writes a bunch of output when it terminates. Using `Popen.wait()` here will deadlock, as the Python docs loudly warn you in numerous places.
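For illustration, here is a minimal sketch of that failure mode (an assumed repro, not code from this PR): a child that writes more than a pipe buffer's worth of output blocks on its final writes if the parent holds the read end without draining it, so `wait()` never returns, while `communicate()` reads and waits at the same time.

```python
import subprocess
import sys
import textwrap

# Hypothetical repro of the deadlock: the child floods stdout as it exits.
# With stdout=PIPE and nothing reading, the child blocks once the OS pipe
# buffer (typically ~64 KiB) is full, so wait() would hang and time out.
child_code = textwrap.dedent(
    """
    import sys
    sys.stdout.write("x" * 1_000_000)  # far more than a typical pipe buffer
    """
)
proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)

# proc.wait(timeout=10)                # would raise subprocess.TimeoutExpired
out, _ = proc.communicate(timeout=10)  # drains stdout while waiting, so no deadlock
assert len(out) == 1_000_000
```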
Not a huge fan of this; it's a weird argument to pass in. Maybe should just inline the function.
Our `popen` helper would always capture stdout/stderr. Redirecting output via pipes carries a risk of deadlock (see the admonition under https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr), so we would run `Popen.communicate` in a background thread to keep draining the pipe. If the test isn't actually using stdout/stderr (most don't), it's simpler to just not redirect it and let it print as normal. As usual, pytest will hide the output if the test passes and print it if it fails. This change isn't strictly necessary; it's just a simplification. And it makes it a little easier to implement the terminate-communicate logic for the `capture_output=True` case, since you don't have to worry about a background thread already running `communicate`.
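To make that shape concrete, here is a rough sketch of a helper along these lines (a simplified, POSIX-only illustration with a made-up name, not the actual helper in `distributed`'s test utilities): pipes are only attached when the caller asks for captured output, and shutdown always goes through `communicate()`, which drains any pipes while waiting.

```python
import contextlib
import signal
import subprocess

# Simplified sketch (hypothetical `popen_sketch`, not the real implementation):
# only redirect the child's output when the test wants it, and always finish
# with communicate() so any pipes are drained while we wait for shutdown.
@contextlib.contextmanager
def popen_sketch(args, capture_output=False, **kwargs):
    if capture_output:
        kwargs["stdout"] = subprocess.PIPE
        kwargs["stderr"] = subprocess.STDOUT
    proc = subprocess.Popen(args, **kwargs)
    try:
        yield proc
    finally:
        proc.send_signal(signal.SIGINT)   # ask the child to shut down cleanly
        try:
            proc.communicate(timeout=30)  # drains stdout/stderr if piped
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()
```

When `capture_output` is false, the child simply inherits the test process's stdout/stderr, and pytest's usual capturing shows it only if the test fails.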
Unit Test Results: 15 files ±0, 15 suites ±0, 6h 37m 25s ⏱️ (+28m 3s). For more details on these failures and errors, see this check. Results for commit 90fe1b5. ± Comparison against base commit c2b28cf. ♻️ This comment has been updated with latest results.
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
Cosmetic notes only
Unit Test Results (see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests): 15 files ±0, 15 suites ±0, 6h 28m 24s ⏱️ (+19m 2s). For more details on these failures, see this check. Results for commit 89669f6. ± Comparison against base commit c2b28cf.
Failing tests:
- All 7 runs failed: `test_dashboard_non_standard_ports` (distributed.cli.tests.test_dask_scheduler)
- All 7 runs failed: `test_scheduler_port_zero` (distributed.cli.tests.test_dask_scheduler)

https://github.com/dask/distributed/runs/6798102761?check_suite_focus=true#step:11:1744
Looks like 781af78 introduced a new use of `flush_output=`.
Please merge from main.
Thanks @crusaderky
Unit Test Results (see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests): 15 files +8, 15 suites +8, 6h 14m 22s ⏱️ (+3h 55m 25s). For more details on these failures, see this check. Results for commit 79f3bcb. ± Comparison against base commit 81e237b.
If this stops the popen tests then Gabe, you have my undying gratitude.
*popen test failures
Let's see after a few days. I expect some of these tests will still fail, but for typical reasons (port already in use, OSError timed out connecting to scheduler, etc.).
It doesn't seem to be effective: https://github.com/dask/distributed/runs/6825736211?check_suite_focus=true
Yup. We're slowly getting closer though. Now we can get more information: #6567
I believe tests using `popen` may be occasionally failing with `subprocess.TimeoutExpired` errors because they're deadlocking in the way the `subprocess` docs warn you to avoid.

Instead of using `Popen.wait()` to wait for subprocesses to shut down, we should use `Popen.communicate()`. If the subprocess writes a bunch of stuff to stdout/stderr as it's shutting down, the stdout pipe may get filled up, blocking further writes and preventing the subprocess from shutting down.

I can't confirm this is actually what's happening. I just see these tracebacks pointing to a `wait()` call, a warning in the docs about `wait` deadlocking, and my new test confirming that if this did happen, the current implementation would fail with `TimeoutExpired`. So this seems like the right thing to do regardless. But it's entirely possible this isn't the problem (and it's actually something where the scheduler/worker isn't responding to SIGINT well and isn't shutting down).

c4737b6 is the important change.
In 6a8ad6e, I refactored our `popen` helper to not even capture stdout/stderr if we weren't going to use it (very few tests do). This may not be strictly necessary, but it just seems much simpler and more reliable.

Previously, we were launching `Popen.communicate` in a background thread to flush the pipe. This is complicated, and may not have actually worked reliably. `Popen.communicate`, like all interactions with `Popen` or file objects, is not thread-safe. Tests like `distributed.cli.tests.test_dask_scheduler.test_hostport` were timing out despite using `flush_output=True`, which should in principle have made them immune to this problem. So I'm wondering if `Popen.communicate` in one thread and `Popen.wait` in another could intermittently cause some internal `Popen` state to break.
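For context, here is a rough sketch of the pattern described above (a reconstruction for illustration, not the previous helper verbatim): `communicate()` runs in a background thread to keep the pipes drained, while the main thread later calls `wait()` on the same `Popen` object, which is the mix of concurrent calls being questioned here.

```python
import subprocess
import sys
import threading

# Illustration of the old approach being described (reconstructed, hypothetical):
# a background thread drains the pipes via communicate() while the main thread
# independently calls wait() on the same Popen object; this is the concurrent
# use that the author suspects can leave internal Popen state inconsistent.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('output from the child process')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
drainer = threading.Thread(target=proc.communicate, daemon=True)
drainer.start()

# ... the test body would interact with the running process here ...

proc.wait(timeout=30)  # main thread waits while the other thread is in communicate()
drainer.join(timeout=30)
```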
I'm hoping this will fix the flakiness in:

- `distributed.cli.tests.test_dask_scheduler.test_hostport`
- `distributed.cli.tests.test_dask_scheduler.test_preload_command`
- `distributed.cli.tests.test_dask_scheduler.test_preload_command_default`
- `distributed.cli.tests.test_dask_scheduler.test_preload_module`
- `distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero` (#6395)
- `distributed.cli.tests.test_dask_worker.test_error_during_startup[--nanny]`
- `distributed.cli.tests.test_client.test_quiet_close_process[True]`
- `pre-commit run --all-files`
cc @crusaderky @fjetter @graingert