gh-117293: Fix race condition in run_workers.py #117298

Merged
merged 3 commits into from
Apr 8, 2024

Contributor

@colesbury colesbury commented Mar 27, 2024

The worker thread may still be alive after it enqueues its last result, which can lead to a delay of 30 seconds after the test finishes while waiting on an empty queue. This happens much more frequently in the free-threaded build with the GIL disabled.

This changes run_workers.py to keep track of live workers by enqueueing a WorkerExited() instance before the worker exits and using those instances to decrement the count of live workers.

The worker thread may still be alive after it enqueues its last result,
which can lead to a delay of 30 seconds after the test finishes. This
happens much more frequently in the free-threaded build with the GIL
disabled.

This changes run_workers.py to keep track of live workers by enqueueing a
`WorkerExited()` instance before the worker exits.
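
A minimal sketch of the approach described above, not the actual run_workers.py code: each worker puts a sentinel on the queue as its final item, and the consumer counts sentinels instead of polling is_alive(). The names `worker`, `collect`, and `run` are hypothetical, and the blocking `get()` without a timeout is a simplification made for this example.

```python
import queue
import threading


class WorkerExited:
    """Sentinel a worker enqueues as its very last item."""


def worker(output: queue.Queue, items: list[str]) -> None:
    # Hypothetical worker: "runs" each item and reports a result.
    try:
        for item in items:
            output.put(item.upper())  # stand-in for a real test result
    finally:
        # Always announce the exit, even if the loop above raised.
        output.put(WorkerExited())


def collect(output: queue.Queue, live_worker_count: int) -> list[str]:
    # The loop ends as soon as every worker has announced its exit,
    # whether or not the threads are still alive at that moment.
    results = []
    while live_worker_count > 0:
        item = output.get()
        if isinstance(item, WorkerExited):
            live_worker_count -= 1
            continue
        results.append(item)
    return results


def run(batches: list[list[str]]) -> list[str]:
    output: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(output, batch))
        for batch in batches
    ]
    for thread in threads:
        thread.start()
    results = collect(output, len(threads))
    for thread in threads:
        thread.join()
    return results
```

Because the sentinel is the last thing a worker enqueues, every real result from that worker is already on the queue by the time the count is decremented.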
Member

@AlexWaygood AlexWaygood left a comment

Nice! Feels quite Rusty.

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Member

@vstinner vstinner left a comment

Overall, it's a nice fix, thanks. Here is my review.

@@ -511,14 +518,18 @@ def _get_result(self) -> QueueOutput | None:

 # bpo-46205: check the status of workers every iteration to avoid
 # waiting forever on an empty queue.
-while any(worker.is_alive() for worker in self.workers):
+while self.live_worker_count > 0:
Member

worker.is_alive() is used in other places. I would prefer to use the same logic to list alive workers in all places, so I would prefer that you remove this live_worker_count attribute.

If any(worker.is_alive() for worker in self.workers) is inefficient, we can design something else, like a list of alive workers that is trimmed when a worker exits. But the number of workers should be less than 1000, so it should be fine.

Contributor Author

I don't understand what design you'd like. We can't continue to rely on the is_alive() method for the termination condition because that is the source of the race condition: it may happen before or after the processing of the results queue.
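
To illustrate the race, here is a small self-contained sketch, not the regrtest code. The hypothetical `lingering_worker` enqueues its last result and then stays alive until it is told to exit, which stands in for a thread that is slow to finish. A consumer that loops on is_alive() then has to sit through at least one full queue timeout even though no more results are coming. The 0.05 second timeout is an assumption chosen to keep the example fast.

```python
import queue
import threading


def lingering_worker(output: queue.Queue, may_exit: threading.Event) -> None:
    output.put("last result")
    # The thread is still alive here, after its last result is enqueued.
    may_exit.wait()


def demo() -> tuple[list[str], int]:
    output: queue.Queue = queue.Queue()
    may_exit = threading.Event()
    worker = threading.Thread(target=lingering_worker, args=(output, may_exit))
    worker.start()

    results = []
    empty_waits = 0
    # Termination condition based on is_alive(), as before the change.
    while worker.is_alive():
        try:
            results.append(output.get(timeout=0.05))
        except queue.Empty:
            # A whole timeout was spent waiting on an empty queue.
            empty_waits += 1
            may_exit.set()  # only now let the worker finish
    worker.join()

    # Pick up anything that was enqueued right before the worker died.
    while True:
        try:
            results.append(output.get_nowait())
        except queue.Empty:
            break
    return results, empty_waits
```

In this sketch `demo()` always records at least one wasted timeout, because the worker cannot exit until the consumer has already timed out once.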

There are only two other uses of is_alive(). It makes sense in those places, but not here.

In __repr__, where we use it for displaying the worker status:

if self.is_alive():
    info.append("running")

In wait_stopped() it's paired with join(). Only is_alive() makes sense here because of the use of join().

self.join(1.0)
if not self.is_alive():
    break
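
As a hedged illustration of why join() and is_alive() go together: `Thread.join(timeout)` returns None whether or not the thread finished, so is_alive() is the only way to tell the two cases apart. The function below is a simplified stand-in, and its `poll` and `limit` parameters are assumptions added for this example.

```python
import threading
import time


def wait_stopped(thread: threading.Thread,
                 poll: float = 0.1,
                 limit: float = 5.0) -> bool:
    """Return True once the thread has stopped, False if `limit` expires."""
    deadline = time.monotonic() + limit
    while True:
        thread.join(poll)  # returns None on both completion and timeout
        if not thread.is_alive():
            return True
        if time.monotonic() >= deadline:
            return False
```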

Member

You can ignore my comment, your change is fine.

I was thinking of the get_running() function, but I was wrong: this function doesn't call the is_alive() method; it relies only on the worker.test_name attribute to decide whether a thread is "running" or not. There is already a nested finally: self.test_name = None, which makes sure that the attribute is cleared in all cases.
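
The pattern described here can be sketched roughly as follows. `SketchWorker` and `run_test` are hypothetical names, and a single try/finally stands in for the nested one in the real code. The point is that "running" is derived from an attribute that is cleared on every exit path, not from is_alive().

```python
from typing import Callable, Optional


class SketchWorker:
    def __init__(self) -> None:
        self.test_name: Optional[str] = None

    def run_test(self, test_name: str, func: Callable[[], object]) -> object:
        self.test_name = test_name
        try:
            return func()
        finally:
            # Cleared whether func() returned or raised.
            self.test_name = None


def get_running(workers: list[SketchWorker]) -> list[str]:
    # A worker counts as "running" only while test_name is set.
    return [w.test_name for w in workers if w.test_name is not None]
```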

Member

@vstinner vstinner left a comment

LGTM. Nice enhancement.

@colesbury colesbury merged commit 26a680a into python:main Apr 8, 2024
38 checks passed
@colesbury colesbury deleted the gh-117293-runtest-mp branch April 8, 2024 14:47
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024