Deadlock fetching key from retiring worker, when scheduler thinks we already have the key #6244
I'm curious, what is the end state of this? How does this present to the user? Or rather, what does the state look like that makes things seem odd? A worker task in a fetch state, and in some while loop asking the scheduler who has it? Is there some unhandled exception? The intent of my question is to ask: "is there an obvious failure on the worker or scheduler side where we can just throw up our hands and say 'screw it, worker.close'?"
Something like this is all I saw in worker logs:
This presents to the user just like any other deadlock: tasks in
diff --git a/distributed/worker.py b/distributed/worker.py
index e5563ed0..35c5e800 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -3232,6 +3232,9 @@ class Worker(ServerNode):
             )
             # Do not mutate the input dict. That's rude
             workers = set(workers) - {self.address}
+            if not workers:
+                # something is broken, shut down desperately!
+                sys.exit(1)
             dep_ts.who_has.update(workers)
             for worker in workers:

would probably do it. But there might be a better place for it (like in ...). If you're asking, "could we ..."
Re-posting here from #6245 to consolidate. Similar circumstances to #6223 and #6198, but different symptoms. In particular, the workload is identical, but this run includes the changes from #6234, which perhaps addressed that specific symptom but uncovered another. Now the stuck workers (in this case there are 2) are responsive and show up as "running", but their tasks never execute.
We were having issues pulling an actual cluster dump, but @gjoseph92 suggested this alternative:
which I've uploaded here.
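The alternative itself isn't shown above. For reference only (this may not be what was suggested), recent distributed releases also expose Client.dump_cluster_state, which writes scheduler and worker state to a file; the address and filename below are placeholders:

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
# Writes a combined scheduler/worker state dump (msgpack output by default).
client.dump_cluster_state("cluster-dump")
```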
@bnaul I hope that you're having a good time with this. If not, you could consider a WorkerPlugin that tracked ...
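The comment is cut off above; one plausible reading is a plugin that records per-key task-state transitions on each worker. A minimal sketch under that assumption (class name, log format, and usage comments are mine, not from the thread):

```python
from distributed.diagnostics.plugin import WorkerPlugin


class TransitionTracker(WorkerPlugin):
    """Record every task-state transition seen by the worker."""

    name = "transition-tracker"

    def setup(self, worker):
        self.worker = worker
        self.log = []  # (key, start_state, finish_state) tuples

    def transition(self, key, start, finish, **kwargs):
        # Called by the worker for each task-state transition.
        self.log.append((key, start, finish))


# Hypothetical usage:
# client.register_worker_plugin(TransitionTracker())
# client.run(lambda dask_worker: dask_worker.plugins["transition-tracker"].log)
```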
Based on the above story, two things are happening here:
which should instead be:
TL;DR: the worker is being told to compute something but ignores the request and instead forgets everything about the key.
I should be able to reproduce the first transition problem.
What version did this occur on?
I suspect that it was run on #6234
^ correct
Got a reproducer of the issue with the same story:

@gen_cluster(client=True)
async def test_fetch_via_amm_to_compute(c, s, a, b):
    # Block ensure_communicating to ensure we indeed know that the task is in
    # fetch and doesn't leave it accidentally
    old_out_connections, b.total_out_connections = b.total_out_connections, 0
    old_comm_threshold, b.comm_threshold_bytes = b.comm_threshold_bytes, 0

    f1 = c.submit(inc, 1, workers=[a.address], key="f1", allow_other_workers=True)
    await f1
    s.request_acquire_replicas(b.address, [f1.key], stimulus_id=f"test")

    await wait_for_state(f1.key, "fetch", b)
    await a.close()

    b.total_out_connections = old_out_connections
    b.comm_threshold_bytes = old_comm_threshold

    await f1
This is definitely not right and it will cause a race condition.
This is deliberate, and designed specifically to avoid race conditions. The worker will lose the key soon, but the scheduler can't know when, so it should not rely on it. If for whatever reason the worker refuses, it will re-add the key. See ...
Correction: this is not true. There is nothing in
If this whole round-trip doesn't happen within 2 seconds, the AMM will repeat the suggestion - which is somewhat problematic if the transfer is genuinely slow, as described in #5759, but won't cause a deadlock.
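As a loose illustration of that retry-until-confirmed design (my own sketch with injected callables, not actual AMM code): the controller never assumes a suggestion was acted on; it just re-checks observed state on a period and repeats the suggestion until the replica is confirmed.

```python
import asyncio


async def replicate_until_confirmed(key, target_worker, observed_holders, suggest,
                                    interval=2.0):
    """Keep re-suggesting that target_worker acquire key until it is observed there.

    observed_holders(key) is an async callable returning the set of workers
    confirmed to hold the key; suggest(worker, key) fires an acquire-replica
    suggestion without waiting for it to be acted on.
    """
    while target_worker not in await observed_holders(key):
        await suggest(target_worker, key)
        # The suggestion is idempotent, so repeating it is safe even if the
        # previous one is still in flight.
        await asyncio.sleep(interval)
```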
This is somewhat conjecture based on a partial cluster dump from @bnaul (full one wasn't working for some reason). Hopefully we can attach the dump at some point so others can see. I only got to see the dump of a worker. If we could see the scheduler state, we could confirm this theory.
This was the worker story of the bad key:
In #6112, we fixed gather_dep logic to transition all keys to missing that a particular worker holds, if communication with that worker fails. We assume the find_missing PeriodicCallback will talk to the scheduler and ask where to fetch them from next. In the case of other errors, we would inform the scheduler of this:

distributed/worker.py, lines 3156 to 3167 in b837003
However, in the OSError case, we don't send the missing-data message (because we've already added recommendations for those keys to missing).
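A self-contained sketch of that error-handling split (an illustration with injected stand-ins, not the actual gather_dep code):

```python
async def gather_dep_sketch(worker_addr, keys, fetch, recommend_missing,
                            send_to_scheduler):
    """Fetch keys from worker_addr, mirroring the behaviour described above."""
    try:
        return await fetch(worker_addr, keys)
    except OSError:
        # The remote worker looks unreachable: mark every expected key as
        # missing and let the periodic find_missing check ask the scheduler
        # where to fetch them from later. No message is sent in this branch.
        for key in keys:
            recommend_missing(key)
    except Exception:
        # Any other failure: explicitly tell the scheduler the data is missing.
        await send_to_scheduler("missing-data", list(keys))
```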
Simultaneously, AMM is now used for retire_workers. When retiring a worker, we try to move all its keys onto other workers. To do this, AMM calls Scheduler.request_acquire_replicas:

distributed/scheduler.py, lines 7209 to 7226 in b837003
request_acquire_replicas optimistically adds the key to the worker's who_has, before the worker has actually confirmed it's received it. So imagine this flow (a toy sketch of the resulting bookkeeping race follows after the list):

1. Worker A, which holds key-foo, is retiring, so the AMM calls request_acquire_replicas("worker-b-addr", ["key-foo"]).
2. key-foo is immediately assigned to worker B from the scheduler's perspective.
3. Worker B is told to fetch key-foo from worker A.
4. The AMM sees "worker B holds key-foo now" (not actually true). "I'll tell worker A to drop its copy." request_remove_replicas also eagerly updates scheduler state so worker A no longer holds the key (this is also not true, but it's more reasonable).
5. Worker B doesn't get around to fetching key-foo from worker A until too late. Worker A is already shut down. The key is lost.
6. An OSError happens trying to fetch the key. The key goes to missing on worker B.
7. In find_missing, worker B asks the scheduler, "who has key-foo?"

Presumably the scheduler, which thinks worker B already has the key, answers with worker B itself. I can't quite figure out what happens from here. I would have hoped that eventually worker B would try to gather_dep from itself, it would reply to itself "I don't have that key", and then it would finally send the missing-data signal to the scheduler, letting the scheduler realize the key is actually missing. That doesn't seem to be happening, from the cluster dump I'm looking at where this situation happened.
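A toy sketch of the bookkeeping race described in the flow above, using plain dicts and made-up worker names rather than anything from distributed:

```python
scheduler_who_has = {"key-foo": {"worker-a"}}                # scheduler's view
actual_data = {"worker-a": {"key-foo"}, "worker-b": set()}   # ground truth

# Steps 1-2: the AMM asks worker B to acquire a replica and the scheduler
# records the new holder immediately, before any transfer has happened.
scheduler_who_has["key-foo"].add("worker-b")

# Step 4: believing worker B now holds the key, the AMM tells worker A to drop
# its copy, again updating the bookkeeping eagerly.
scheduler_who_has["key-foo"].discard("worker-a")
actual_data["worker-a"].discard("key-foo")   # worker A really drops it and shuts down

# Step 5: worker B finally tries to fetch, but no real copy is left anywhere.
holders = {w for w, keys in actual_data.items() if "key-foo" in keys}
print("scheduler thinks:", scheduler_who_has["key-foo"])   # {'worker-b'}
print("actually held by:", holders)                        # set()
```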
A few things to consider changing:

1. request_acquire_replicas should not be so optimistic, assuming the key is on a new worker before we've confirmed it's there. This creates a race condition, where we may drop the key from the sender before the receiver has gotten it. This means RetireWorker (and all AMM policies) may lead to lost data, when they shouldn't. cc @crusaderky
2. Should we send a missing-data message to the scheduler here-ish (not in a for-loop though)? cc @mrocklin
3. Worker.find_missing should definitely take action if the scheduler reports nobody (besides it) holds the key that's missing. That's a clear state mismatch, and we shouldn't assume anything else will resolve it. cc @fjetter @mrocklin
4. Include Worker._missing_dep_flight in cluster dumps (#6243).

I think maybe a change like this would be a good approach (note: this doesn't plumb stimulus_id through in other calls to update_who_has, so it will fail as is; also not sure if update_who_has is always the right place to do this):
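The change the author had in mind isn't included above. Purely as a loose illustration of the third suggestion (made-up helper names; not the author's patch and not the real Worker.find_missing), the check could look roughly like this:

```python
async def check_missing_keys(missing_keys, self_address, scheduler_who_has,
                             report_missing_data):
    """Ask the scheduler about missing keys and flag impossible answers."""
    who_has = await scheduler_who_has(missing_keys)  # {key: set of worker addresses}
    for key in missing_keys:
        holders = set(who_has.get(key, ())) - {self_address}
        if not holders:
            # The scheduler thinks only we hold this key, yet we are the ones
            # missing it: a clear state mismatch, so report it rather than
            # retrying the fetch forever.
            await report_missing_data(key)
```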