gh-117293: Fix race condition in run_workers.py #117298
The worker thread may still be alive after it enqueues its last result, which can lead to a delay of 30 seconds after the test finishes. This happens much more frequently in the free-threaded build with the GIL disabled. This changes run_workers.py to keep track of live workers by enqueueing a `WorkerExited()` instance before the worker exits.
Nice! Feels quite Rusty.
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Overall, it's a nice fix, thanks. Here is my review.
@@ -511,14 +518,18 @@ def _get_result(self) -> QueueOutput | None:

         # bpo-46205: check the status of workers every iteration to avoid
         # waiting forever on an empty queue.
-        while any(worker.is_alive() for worker in self.workers):
+        while self.live_worker_count > 0:
worker.is_alive() is used in other places, and I would prefer to use the same logic to list alive workers in all places. I would prefer that you remove this live_worker_count attribute.
If any(worker.is_alive() for worker in self.workers) is inefficient, we can design something else, like a list of alive workers and trim this list when a worker exits. But the number of workers should be less than 1000, so it should be fine.
I don't understand what design you'd like. We can't continue to rely on the is_alive() method for the termination condition because that is the source of the race condition: the worker's exit may happen before or after the processing of the results queue.
There are only two other uses of is_alive(). It makes sense in those places, but not here.
In __repr__, where we use it for displaying the worker status.
cpython/Lib/test/libregrtest/run_workers.py
Lines 118 to 119 in 9a1e55b
if self.is_alive():
    info.append("running")
In wait_stopped() it's paired with join(). Only is_alive() makes sense here because of the use of join().
cpython/Lib/test/libregrtest/run_workers.py
Lines 428 to 430 in 9a1e55b
self.join(1.0)
if not self.is_alive():
    break
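To make the race described above concrete, here is a minimal standalone sketch. It is not taken from run_workers.py: the `results` queue, the `worker()` function and the `may_exit` event are hypothetical names, and the event only stands in for the time a real thread needs to finish after its last `put()`.

```python
import queue
import threading

results = queue.Queue()
# Hypothetical stand-in for thread teardown time: the worker cannot
# finish until the main thread sets this event.
may_exit = threading.Event()


def worker():
    results.put("last result")
    # Nothing is left to enqueue, but the thread is still alive here.
    may_exit.wait()


thread = threading.Thread(target=worker)
thread.start()

last = results.get(timeout=5.0)

# The queue is drained, yet is_alive() still reports True, so a loop
# conditioned on is_alive() would go back to waiting on an empty queue.
alive_after_last_result = thread.is_alive()
queue_empty = results.empty()

may_exit.set()
thread.join()
```

In this sketch the window is held open on purpose; in a real run it is only as wide as the thread's exit path, which is presumably why the delay shows up intermittently.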
You can ignore my comment, your change is fine.
I was thinking of the get_running() function, but I was wrong: this function doesn't call the is_alive() method but only relies on the worker.test_name attribute to decide if a thread is "running" or not. There is already a nested finally: self.test_name = None which makes sure that the attribute is cleared in all cases.
LGTM. Nice enhancement.
The worker thread may still be alive after it enqueues its last result, which can lead to a delay of 30 seconds after the test finishes while waiting on an empty queue. This happens much more frequently in the free-threaded build with the GIL disabled.
This changes run_workers.py to keep track of live workers by enqueueing a `WorkerExited()` instance before the worker exits and using those instances to decrement the count of live workers.
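The approach can be sketched in isolation as follows. This is a simplified illustration, not the code in run_workers.py: only the `WorkerExited` name and the idea of a live worker count come from the description above, while `worker()`, `collect()`, the `output` queue and the result strings are assumptions made for the example.

```python
import queue
import threading


class WorkerExited:
    """Sentinel that a worker enqueues as its very last action."""


def worker(output, tests):
    # Hypothetical worker: "runs" each test and reports a result string.
    try:
        for name in tests:
            output.put(f"result:{name}")
    finally:
        # Enqueued after the last result, so the consumer sees it last.
        output.put(WorkerExited())


def collect(output, live_worker_count):
    # Hypothetical consumer: stops once every worker has announced its
    # exit, without ever asking whether a thread is still alive.
    results = []
    while live_worker_count > 0:
        item = output.get()
        if isinstance(item, WorkerExited):
            live_worker_count -= 1
            continue
        results.append(item)
    return results


output = queue.Queue()
jobs = [["test_a", "test_b"], ["test_c"]]
threads = [
    threading.Thread(target=worker, args=(output, tests)) for tests in jobs
]
for t in threads:
    t.start()

results = collect(output, live_worker_count=len(threads))

for t in threads:
    t.join()
```

Because the termination condition depends only on what has been read from the queue, it does not matter whether a thread is still winding down when its sentinel is processed.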