gh-117293: Fix race condition in run_workers.py #117298

colesbury · 2024-03-27T19:04:08Z

The worker thread may still be alive after it enqueues its last result, which can lead to a delay of 30 seconds after the test finishes while waiting on an empty queue. This happens much more frequently in the free-threaded build with the GIL disabled.

This changes run_workers.py to keep track of live workers by enqueueing a WorkerExited() instance before the worker exits and using those instances to decrement the count of live workers.
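
A minimal sketch of the sentinel-counting idea described above, not the actual run_workers.py code. Only `WorkerExited` and `live_worker_count` come from this PR; `worker`, `collect_results`, and the string results are hypothetical stand-ins.

```python
import queue
import threading


class WorkerExited:
    """Sentinel a worker enqueues as the last thing it does."""


def worker(out: queue.Queue, items: list[str]) -> None:
    try:
        for item in items:
            out.put(f"result:{item}")
    finally:
        # Enqueued even if the loop raises, so the consumer's count stays accurate.
        out.put(WorkerExited())


def collect_results(jobs: list[list[str]]) -> list[str]:
    out: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(out, job)) for job in jobs]
    live_worker_count = len(threads)
    for t in threads:
        t.start()

    results = []
    # Terminate on the sentinel count, not on Thread.is_alive().
    while live_worker_count > 0:
        item = out.get()
        if isinstance(item, WorkerExited):
            live_worker_count -= 1
        else:
            results.append(item)

    for t in threads:
        t.join()
    return results


print(sorted(collect_results([["a", "b"], ["c"]])))
```

Because each sentinel travels through the same queue as the results, the consumer sees it only after every result that worker enqueued earlier, so the loop ends without waiting on an empty queue.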

Issue: test.libregrtest race condition in runtest_mp leads to 30 second delay in free-threaded build #117293

AlexWaygood

Nice! Feels quite Rusty.

Lib/test/libregrtest/run_workers.py

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>

vstinner

Overall, it's a nice fix, thanks. Here is my review.

vstinner · 2024-03-28T08:27:25Z

Lib/test/libregrtest/run_workers.py

@@ -511,14 +518,18 @@ def _get_result(self) -> QueueOutput | None:

        # bpo-46205: check the status of workers every iteration to avoid
        # waiting forever on an empty queue.
-        while any(worker.is_alive() for worker in self.workers):
+        while self.live_worker_count > 0:


worker.is_alive() is used in other places, and I would prefer to use the same logic to list alive workers everywhere. I would prefer that you remove this live_worker_count attribute.

If any(worker.is_alive() for worker in self.workers) is inefficient, we can design something else, like a list of alive workers that is trimmed when a worker exits. But the number of workers should be less than 1000, so it should be fine.

colesbury · 2024-03-28

I don't understand what design you'd like. We can't continue to rely on the is_alive() method for the termination condition because that is the source of the race condition: a worker's exit may be observed before or after its results have been processed from the queue.
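
The race can be sketched as follows. This is an illustrative reduction, not the libregrtest code: `lingering_worker`, `count_empty_waits`, the 0.5 second sleep, and the 0.05 second poll timeout are all assumptions chosen to make the window visible (the PR description reports a 30 second delay in the real runner).

```python
import queue
import threading
import time


def lingering_worker(out: queue.Queue) -> None:
    out.put("last result")
    # Stand-in for the window between the final put() and the thread
    # actually finishing; the real window is small and varies.
    time.sleep(0.5)


def count_empty_waits(poll_timeout: float = 0.05) -> int:
    out: queue.Queue = queue.Queue()
    t = threading.Thread(target=lingering_worker, args=(out,))
    t.start()

    empty_waits = 0
    while t.is_alive():
        try:
            out.get(timeout=poll_timeout)
        except queue.Empty:
            # All results were already consumed, yet the worker still looks alive.
            empty_waits += 1
    t.join()
    return empty_waits


print("timed-out waits on an empty queue:", count_empty_waits())
```

Every timed-out `get()` here is wasted time: the result has already been received, but the loop condition cannot tell that the worker has nothing left to send.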

There are only two other uses of is_alive(). It makes sense in those places, but not here.

In __repr__, where we use it to display the worker status:

cpython/Lib/test/libregrtest/run_workers.py

Lines 118 to 119 in 9a1e55b

    if self.is_alive():
        info.append("running")

In wait_stopped() it's paired with join(). Only is_alive() makes sense here because of the use of join().

cpython/Lib/test/libregrtest/run_workers.py

Lines 428 to 430 in 9a1e55b

    self.join(1.0)
    if not self.is_alive():
        break
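
As a sketch of why is_alive() belongs next to join(): `Thread.join()` with a timeout always returns None, so the caller has to ask is_alive() whether the thread actually finished. The function name `wait_until_stopped` and the `max_wait` parameter are hypothetical and are not taken from wait_stopped().

```python
import threading
import time


def wait_until_stopped(thread: threading.Thread, max_wait: float = 5.0) -> bool:
    """Poll join() until the thread ends or max_wait seconds elapse."""
    deadline = time.monotonic() + max_wait
    while True:
        # join() with a timeout always returns None, so it cannot report
        # whether the thread finished; is_alive() answers that.
        thread.join(0.1)
        if not thread.is_alive():
            return True
        if time.monotonic() >= deadline:
            return False
```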

vstinner · 2024-04-05

You can ignore my comment; your change is fine.

I was thinking of the get_running() function, but I was wrong: it doesn't call the is_alive() method and relies only on the worker.test_name attribute to decide whether a thread is "running". There is already a nested finally: self.test_name = None, which makes sure that the attribute is cleared in all cases.
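
The attribute-based check can be sketched like this. It is a simplified, hypothetical model: only `test_name` and `get_running` are names from the discussion, while the `Worker` class, `run_one`, and the simulated failure are assumptions for illustration.

```python
import threading


class Worker(threading.Thread):
    def __init__(self, tests):
        super().__init__()
        self.tests = list(tests)
        self.test_name = None  # name of the test in progress, or None

    def run_one(self, name):
        self.test_name = name
        try:
            if name == "test_boom":
                raise RuntimeError("simulated failure")
        finally:
            # Cleared on success and on error alike.
            self.test_name = None

    def run(self):
        for name in self.tests:
            try:
                self.run_one(name)
            except RuntimeError:
                pass  # keep going; a real runner would record the failure


def get_running(workers):
    """Report workers by their test_name attribute, without calling is_alive()."""
    running = []
    for w in workers:
        name = w.test_name  # read once; the worker thread may clear it concurrently
        if name is not None:
            running.append(name)
    return running
```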

Lib/test/libregrtest/run_workers.py

vstinner

LGTM. Nice enhancement.

vstinner · 2024-04-05T08:05:12Z

colesbury requested review from vstinner and AlexWaygood March 27, 2024 19:04

bedevere-app bot added the awaiting core review label Mar 27, 2024

bedevere-app bot mentioned this pull request Mar 27, 2024

test.libregrtest race condition in runtest_mp leads to 30 second delay in free-threaded build #117293

Closed

colesbury added the skip news label Mar 27, 2024

colesbury mentioned this pull request Mar 27, 2024

Make the Python test suite pass with the GIL disabled #116749

Closed

AlexWaygood approved these changes Mar 27, 2024

bedevere-app bot added awaiting merge and removed awaiting core review labels Mar 27, 2024

Update Lib/test/libregrtest/run_workers.py

eb50687

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>

vstinner reviewed Mar 28, 2024

Rename WorkerExited to WorkerThreadExited

561016c

erlend-aasland approved these changes Mar 31, 2024

vstinner approved these changes Apr 5, 2024

colesbury merged commit 26a680a into python:main Apr 8, 2024
38 checks passed

colesbury deleted the gh-117293-runtest-mp branch April 8, 2024 14:47

bedevere-app bot removed the awaiting merge label Apr 8, 2024
