-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert idle classification when worker-saturation is set #7278
Revert idle classification when worker-saturation is set #7278
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files + 12 15 suites +12 6h 26m 31s ⏱️ + 5h 32m 51s For more details on these failures, see this check. Results for commit 7215427. ± Comparison against base commit 88515db. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update test_queued_paused_new_worker
and test_queued_paused_unpaused
in test_scheduler.py
, and any other tests which make explicit assertions about Scheduler.idle
.
saturated.discard(ws) | ||
if self.is_unoccupied(ws, occ, p): | ||
if ws.status == Status.running: | ||
idle[ws.address] = ws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saturated.discard(ws) | |
if self.is_unoccupied(ws, occ, p): | |
if ws.status == Status.running: | |
idle[ws.address] = ws | |
if self.is_unoccupied(ws, occ, p): | |
if ws.status == Status.running: | |
idle[ws.address] = ws | |
saturated.discard(ws) |
This is more consistent with previous behavior. Notice that before, if a worker was occupied, but not saturated, it wouldn't be removed from the saturated
set. This is probably not intentional or correct, but we're trying to match previous behavior here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- saturated.discard was always called unless the worker was truly classified as such, see
distributed/distributed/scheduler.py
Lines 2602 to 2616 in 0dd00be
idle = self.idle saturated = self.saturated if p < nc or occ < nc * avg / 2: idle[ws.address] = ws saturated.discard(ws) else: idle.pop(ws.address, None) if p > nc: pending: float = occ * (p - nc) / (p * nc) if 0.4 < pending > 1.9 * avg: saturated.add(ws) return saturated.discard(ws) - Other than dashboard visuals, saturated is only used in stealing to avoid sorting over all workers, https://github.com/fjetter/distributed/blob/a5d686572e3289e9d7ce71c063205cc35d4a06c2/distributed/stealing.py#L422-L431 and I'm not too concerned about this since stealing is a bit erratic either way
else not _worker_full(ws, self.WORKER_SATURATION) | ||
): | ||
saturated.discard(ws) | ||
if self.is_unoccupied(ws, occ, p): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite concerned that we're now calling is_unoccupied
every time, even when queuing is enabled. This significantly slows down the scheduler: #7256. The urgency of fixing that was diminished by queuing being on by default and getting to skip that slow code path.
I'm not sure that a known and large scheduler performance degradation is worth avoiding hypothetical small changes to work-stealing behavior due to the changed definition of idle
when queuing is on.
If we can fix #7256 before a release, then I'm happy with this change, otherwise I'd be concerned by this tradeoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After running some benchmarks, it looks like occupancy might not have as much of an effect on end-to-end runtime as I'd expected: #7256 (comment). So I'm happy with this if we want to go with it.
For performance reasons and practicality though, I'd like to consider #7280 as another solution to #7085.
Edit: that uses occupancy too, so there's a similar performance cost. I think doing both PRs would be a good idea.
The performance regression around occupancy was already introduced in release |
This restores the behavior of the idle set to the behavior before #6614 when worker-saturation is set.
All code paths in the new queuing path will use a different idleness set based on a different definition of idle.
Closes #7085
ref #7191