worker-saturation impacts balancing in work-stealing (#7085)
When worker-saturation is not inf, workers are only classified as idle if they are not full:

distributed/scheduler.py, lines 2899 to 2903 at 482941e

While this behavior is desired for withholding root tasks (it was introduced in #6614), work-stealing also relies on the idle classification of workers to identify thieves. Limiting this to workers that are not saturated according to worker-saturation delays balancing decisions until workers are almost out of work, and it reduces our ability to interleave the computation of remaining tasks with gathering the dependencies of stolen ones.
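For orientation, the shape of that check is roughly the following. This is an illustrative paraphrase, not the actual scheduler source; the function names and the exact saturation formula are assumptions made for this sketch.

```python
import math


def worker_is_full(n_processing: int, nthreads: int, saturation: float) -> bool:
    # Stand-in for the saturation check: with the default worker-saturation
    # of inf, a worker is never considered full.
    if math.isinf(saturation):
        return False
    return n_processing >= max(math.ceil(nthreads * saturation), 1)


def is_idle(
    occupancy: float,
    avg_occupancy: float,
    n_processing: int,
    nthreads: int,
    saturation: float,
) -> bool:
    # A worker is under-occupied when its occupancy is below half the
    # cluster-wide average occupancy...
    under_occupied = occupancy < avg_occupancy / 2
    # ...but with a finite worker-saturation it must additionally not be
    # full. This extra clause is what removes potential thieves from
    # work-stealing's view.
    return under_occupied and not worker_is_full(n_processing, nthreads, saturation)
```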
Reproducer

Add the following test case to test_steal.py:

```python
@pytest.mark.parametrize("queue", [True, False])
@pytest.mark.parametrize(
    "inp,expected",
    [
        (
            [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0]],
            [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]],
        ),  # balance many tasks
    ],
)
def test_balance_interacts_with_worker_saturation(inp, expected, queue):
    async def test_balance_(*args, **kwargs):
        await assert_balanced(inp, expected, *args, **kwargs)

    config = {
        "distributed.scheduler.default-task-durations": {str(i): 1 for i in range(10)},
        "distributed.scheduler.worker-saturation": 1.0 if queue else float("inf"),
    }
    gen_cluster(client=True, nthreads=[("", 1)] * len(inp), config=config)(
        test_balance_
    )()
```
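To make the failure concrete, here is the arithmetic for the input above, worked by hand under the assumptions of the sketch earlier (each task is configured via default-task-durations to take 1s):

```python
# Worker A starts with 12 processing tasks, worker B with 1, so with 1s
# per task the occupancies are 12s and 1s.
occ_a, occ_b = 12.0, 1.0
avg = (occ_a + occ_b) / 2  # 6.5s average occupancy

# Without queuing (worker-saturation = inf): B counts as idle because
# 1.0 < 6.5 / 2 = 3.25, so stealing has a thief and can rebalance.
assert occ_b < avg / 2

# With queuing (worker-saturation = 1.0): B has 1 processing task on its
# single thread, so it is "full", never classified as idle, and stealing
# finds no thief; the 12-versus-1 imbalance persists.
n_processing_b, nthreads_b, saturation = 1, 1, 1.0
assert n_processing_b >= max(int(nthreads_b * saturation), 1)
```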
cc @gjoseph92

Comment from @gjoseph92:

Yeah, looks like what's happening here is that because both workers are single-threaded and have processing tasks, neither of them counts as idle once worker-saturation is finite. Without queuing, the worker with 1 task is still considered idle because its occupancy is less than half the average occupancy, even though all its threads are in use.

Stepping back though, how realistic is this situation? What you've created here is sort of root task overproduction: many more tasks are in processing than there are threads to run them.

I feel like this test is trying to simulate a scale-up case: you start with one worker, submit lots of tasks, then another one joins right at the end of submission. If queuing were on (and there were no worker restrictions), the first worker wouldn't have gotten all the tasks in the first place. There'd be no need to rebalance, since most tasks would sit in the scheduler-side queue and would naturally be assigned to the new worker evenly. (Modulo #7274 and #7273, of course.)

So yes, if someone had exactly this use case (using worker restrictions but wanting tasks to get stolen anyway, starting with 1 worker and scaling up), then it won't rebalance when queuing is on. But if we care about the broad case (what happens when you submit lots of tasks to a 1-worker cluster, then scale up), I don't think this is a relevant test with queuing on.

Here's a test showing queuing balances when scaling up: #7284.
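For reference, a scale-up test in that spirit might look roughly like the sketch below. This is an illustration, not the actual test from #7284: gen_cluster, slowinc, and Worker are the usual distributed test utilities, but the specific structure, the s.queued wait, and the assertion are assumptions.

```python
import asyncio

from distributed import Worker
from distributed.utils_test import gen_cluster, slowinc


@gen_cluster(
    client=True,
    nthreads=[("", 1)],
    config={"distributed.scheduler.worker-saturation": 1.0},
)
async def test_queuing_balances_on_scale_up(c, s, a):
    # Submit far more tasks than the lone worker has threads. With queuing
    # on, the excess stays queued on the scheduler instead of piling up on
    # worker `a`.
    futures = c.map(slowinc, range(20), delay=0.1)
    while not s.queued:  # assumes the scheduler exposes its queue as `s.queued`
        await asyncio.sleep(0.01)

    # Scale up: a second worker joins only after all tasks were submitted.
    async with Worker(s.address, nthreads=1) as b:
        await c.gather(futures)
        # The late worker should have been handed queued tasks directly,
        # without any work-stealing involved.
        assert b.state.tasks
```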