KeyError in Worker.handle_compute_task (causes deadlock) #5482
Probably. What you're describing is basically what motivated #5046; see specifically #5046 (comment) and, to some extent, #4413 (comment), although the latter is a bit outdated. I'll need to update both (xref #5413). The code should be much more robust to this situation, but I guess I missed an edge case. Therefore, a transition log of the deadlocked worker would be very valuable.
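If it helps anyone gathering that log: below is a minimal sketch of one way to pull a key's transition history off a live worker, assuming Worker.story and client.run behave as in recent distributed releases; the scheduler address and the key name are placeholders, not values from this report.

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
key = "stuck-task-key"                           # placeholder: the key that never finishes

def transition_log(key, dask_worker):
    # Worker.story(...) filters the worker's transition log down to entries
    # that mention the given key; dask_worker is injected by client.run.
    return dask_worker.story(key)

logs = client.run(transition_log, key)  # {worker_address: [log entries]}
for addr, story in logs.items():
    print(addr)
    for entry in story:
        print("  ", entry)
```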
Explanation below
1.) distributed/distributed/scheduler.py, lines 7977 to 7980 (at 11c41b5)
2.) distributed/distributed/worker.py, lines 1907 to 1916 (at 7649596)
3.) distributed/distributed/worker.py, line 1937 (at 7649596)
4.) distributed/distributed/worker.py, lines 1943 to 1944 (at 7649596)
Since 1.) ensures …, the transition logic for that broken key could be either of

A) released -> waiting -> ready -> executing -> cancelled(executing)

…

Above, 2.) should then be a no-op, and 3.) should then trigger a ….

For A), I think this is some not-well-covered edge case in https://github.com/dask/distributed/blame/76495965cf8d3fb5f54bb4b8d20279ae402e0957/distributed/worker.py#L2965-L2966

As I said, a transition log would be helpful since it would remove some of the hypothetical parts of this argument. It should basically just show us the series of decisions leading up to the event.
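Purely as an illustration of the kind of edge case meant here (this is not distributed's actual transition table, and the states and handler names are made up): a toy state machine where an unanticipated (start, finish) pair surfaces as a KeyError at the lookup itself.

```python
# Toy illustration only; does not mirror distributed's real transition logic.
TRANSITIONS = {
    ("released", "waiting"): "transition_released_waiting",
    ("waiting", "ready"): "transition_waiting_ready",
    ("ready", "executing"): "transition_ready_executing",
    ("executing", "cancelled"): "transition_executing_cancelled",
    # No entry for ("cancelled", "waiting"): an unexpected compute-task
    # message for an already-cancelled key has no registered edge.
}

def transition(key, start, finish):
    # The failure mode under discussion: looking up a (start, finish) pair
    # that was never anticipated raises KeyError instead of being handled.
    handler = TRANSITIONS[(start, finish)]
    return f"{key}: {handler}"

print(transition("x", "ready", "executing"))    # works
print(transition("x", "cancelled", "waiting"))  # KeyError: ('cancelled', 'waiting')
```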
At the request of @gjoseph92, I am posting my transition logs as a result of a deadlock:
^ For background: @bennnym ran into this same error. This error then caused a deadlock (#5480). We confirmed it was also due to #5481 by checking for queued messages in the worker's …. So we then got transition logs for the key in question, which Ben has attached above.

@bennnym, I forgot: could you also post the output of …?
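For reference, the "queued messages" check mentioned above can be done roughly as sketched below. This assumes the elided attribute is the worker's batched_stream send buffer, which is an assumption on my part rather than a quote from the original comment; the scheduler address is a placeholder.

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def queued_messages(dask_worker):
    # Assumption: the check refers to the worker's BatchedSend to the
    # scheduler; messages that could not be flushed sit in .buffer.
    return list(dask_worker.batched_stream.buffer)

print(client.run(queued_messages))  # {worker_address: [unsent messages]}
```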
@gjoseph92 my code was running in a notebook and my kernel died, sorry. That is all I have.
Do you know offhand what versions of dask and distributed you were running? It may show in your Coiled software environment.
I was just installing ….
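For the record, the version information being asked about can be collected in one call; a minimal sketch using Client.get_versions, with a placeholder scheduler address.

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

# Collects package versions from the client, scheduler, and every worker;
# check=True raises if they disagree with each other.
versions = client.get_versions(check=True)
print(versions)
```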
I may be able to reproduce this if necessary. I was running a stackstac example notebook on Binder against a Coiled cluster over wss, where the particular versions of things were causing a lot of errors (unrelated to dask). So I was frequently rerunning the same tasks, cancelling them, restarting the client, rerunning, etc. Perhaps this cancelling, restarting, rerunning is related? (A rough sketch of that workflow follows below.)

@fjetter says: …
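If anyone attempts that reproduction, the cancel/restart/rerun workflow described above corresponds roughly to the loop below; inc, the task counts, and the scheduler address are placeholders standing in for the original stackstac workload, not the actual reproducer.

```python
from distributed import Client

def inc(x):  # stand-in workload; the original was a stackstac notebook
    return x + 1

client = Client("tcp://scheduler-address:8786")  # placeholder address

for attempt in range(10):
    futures = client.map(inc, range(1000))
    client.cancel(futures)   # cancel mid-flight, as in the original workflow
    client.restart()         # restart, then rerun the same tasks
    futures = client.map(inc, range(1000))
    client.gather(futures)
```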
Relevant code, ending at the line where the error occurs:
distributed/distributed/worker.py, lines 1852 to 1937 (at 11c41b5)
Scheduler code producing the message which causes this error:
distributed/distributed/scheduler.py, lines 7953 to 7994 (at 11c41b5)