Worker <-> Worker Communication Failures bring Cluster in inconsistent State #5951
Just wanted to confirm that I am also seeing this with version 2022.3.0 (even though the links to the code in the description above are slightly off, the overall logic is the same). @fjetter (as I saw you did some changes to this logic :-)), do you think it would make sense to not mark all of the dependencies of a worker as removed in the OSError handling of gather_dep? I am wondering if there is something special in our setup (large scale, many memory-hungry tasks) which triggers this bug so often.
Indeed, gather_dep was a very frequent source of deadlocks due to its exception handling. Thank you for the reproducer. I can confirm this is happening on my end as well.
Maybe. The system should be robust enough to recover from us removing too many tasks. I suspect something else is not working as intended, but this may be a hotfix. We're currently investing in some larger refactoring which should allow us to test these edge cases more thoroughly, see also #5736.
Thanks for the reproducer and explanation @nils-braun. I think I've somewhat minimized this to:

```python
import dask
import distributed
from distributed.deploy.local import LocalCluster


class BreakingWorker(distributed.Worker):
    broke_once = False

    def get_data(self, comm, **kwargs):
        if not self.broke_once:
            self.broke_once = True
            raise OSError("fake error")
        return super().get_data(comm, **kwargs)


if __name__ == "__main__":
    print(distributed.__version__)
    df = dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-01-15",
    )
    s = df.shuffle("id", shuffle="tasks")
    cluster = LocalCluster(
        n_workers=2, threads_per_worker=1, processes=True, worker_class=BreakingWorker
    )
    client = distributed.Client(cluster)
    f = s.persist()
    distributed.wait(f, timeout=5)
```

Interestingly, this also makes the reproducer succeed (not that I think it's a real fix though):

```diff
diff --git a/distributed/worker.py b/distributed/worker.py
index 13e5adef..f1e1563d 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2960,14 +2960,7 @@ class Worker(ServerNode):
         except OSError:
             logger.exception("Worker stream died during communication: %s", worker)
-            has_what = self.has_what.pop(worker)
             self.pending_data_per_worker.pop(worker)
-            self.log.append(
-                ("receive-dep-failed", worker, has_what, stimulus_id, time())
-            )
-            for d in has_what:
-                ts = self.tasks[d]
-                ts.who_has.remove(worker)
         except Exception as e:
             logger.exception(e)
```
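One way to see the resulting inconsistency is to compare the scheduler's view of data locations with each worker's local view once the wait times out. A rough sketch (not part of the original comment): it assumes the worker internals of distributed ~2022.x (`dask_worker.tasks` mapping keys to task states with a `who_has` set), `worker_view` is an illustrative helper, and it would replace the final `distributed.wait(f, timeout=5)` line of the reproducer above:

```python
def worker_view(dask_worker):
    # Runs inside each worker via Client.run: report unfinished keys and the
    # peers this worker currently believes hold their data.
    return {
        key: sorted(ts.who_has)
        for key, ts in dask_worker.tasks.items()
        if ts.state not in ("memory", "released")
    }


try:
    distributed.wait(f, timeout=5)
except Exception:
    # On a deadlock, the scheduler may still list a worker for a key
    # whose who_has is empty in another worker's local view.
    print("scheduler's view:", client.who_has())
    print("workers' views:", client.run(worker_view))
```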
A few thoughts:

There's a fundamental question here: does an OSError while communicating with a worker mean that the worker, and all of the data it holds, is gone?

But regardless of that, as @nils-braun has pointed out, we're not correctly dealing with the TaskStates we mutate. If we end up making a task's `who_has` empty, we should probably recommend transitioning it to missing:

```diff
diff --git a/distributed/worker.py b/distributed/worker.py
index 7a062876..5e72f007 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2991,6 +2991,9 @@ class Worker(ServerNode):
             for d in has_what:
                 ts = self.tasks[d]
                 ts.who_has.remove(worker)
+                if not ts.who_has:
+                    # TODO send `missing-data` to scheduler?
+                    recommendations[ts] = "missing"
         except Exception as e:
             logger.exception(e)
```

However, you'll then get an error from the transition machinery. Should we just add a transition function for that? Or is there a good reason why that isn't currently a valid transition?

Another thought: in `ensure_communicating`, I think these assertions are the assumptions we're making:

```diff
diff --git a/distributed/worker.py b/distributed/worker.py
index 7a062876..f8401a80 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2699,6 +2699,9 @@ class Worker(ServerNode):
             )

             for el in skipped_worker_in_flight:
+                # Assuming something else has already dealt with making these `missing`
+                assert el.state == "missing", el.state
+                assert el in self._missing_dep_flight
                 self.data_needed.push(el)
```

Of course, in the reproducer, they fail. But arguably, I think it comes down to:
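To make the failure mode concrete, here is a toy model in plain Python (illustration only; the dictionaries and the `on_gather_failure` helper are made up, not distributed's data structures): once a key's `who_has` becomes empty, something has to explicitly mark it missing and ask the scheduler again, otherwise nothing will ever re-request that key.

```python
# Toy illustration only -- not distributed's API or internals.
tasks = {
    "T2": {"state": "fetch", "who_has": {"worker-B"}},   # never actually requested yet
    "T4": {"state": "flight", "who_has": {"worker-B"}},  # the fetch that just failed
}


def on_gather_failure(failed_worker):
    # Mimic "forget everything the failed peer was believed to hold".
    for ts in tasks.values():
        ts["who_has"].discard(failed_worker)
        if not ts["who_has"]:
            # Without an explicit follow-up (mark missing, ask the scheduler),
            # the key is stranded: this is the deadlock described in this issue.
            ts["state"] = "missing"


on_gather_failure("worker-B")
print(tasks)  # both T2 and T4 now need the scheduler to say who has their data
```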
For the record, @fjetter and I discussed offline, and we're going to add back a
Thank you for the detailed summary @gjoseph92
This is done here: #6112
My apologies for being absent with historical context here. We went through this decision several years ago. It's not a correct assumption, but it is an assumption which prioritizes stability over performance, and so is a win. We had odd issues when we tried to get smart around this, and found that it was just better to let clusters think that workers were gone, and celebrate when they magically returned. (I may not be fully understanding the situation though)
Maybe that was the case then, but these days I think it actually prioritizes performance over stability. It's short-circuiting other tasks from trying and failing to fetch from that worker in the future. It assumes the current error is predictive of what will happen if we try again, so we preemptively choose not to try again. This adds complexity compared to just letting those fetches fail sometime in the future, and just dealing with the failure per-key when it happens.
Thanks @gjoseph92, @mrocklin and @fjetter! |
Due to network issues, overloaded workers, or just bad luck, the worker-to-worker communication needed for getting task dependencies from other workers might fail (`distributed.worker.get_data_from_worker`). This is (in principle) caught by the worker and the scheduler and reacted upon. During this, however, a race condition can be triggered which brings the cluster into an inconsistent and stuck state.

Minimal Complete Verifiable Example:

The following code introduces a random failure in `distributed.worker.get_data_from_worker` with a higher probability than a real-world use case would have, just to demonstrate the problem. Running this code will sometimes not finish the computation; instead, one (or multiple) workers get stuck while waiting to fetch the dependencies of a specific task.
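The reporter's original snippet is not shown above; purely as an illustration (not the code from the report), a failure of this kind could be injected by wrapping `distributed.worker.get_data_from_worker` in a worker preload. The module name, failure probability, and preload mechanism below are assumptions for this sketch, which targets the ~2022.x layout of distributed where `gather_dep` calls the module-level `get_data_from_worker`:

```python
# flaky_preload.py -- hypothetical worker preload that randomly breaks
# worker-to-worker transfers (illustration only, not the reporter's code).
import random

import distributed.worker

_original_get_data_from_worker = distributed.worker.get_data_from_worker


async def _flaky_get_data_from_worker(*args, **kwargs):
    # Fail a fraction of transfers to mimic flaky worker<->worker communication.
    if random.random() < 0.3:
        raise OSError("simulated communication failure")
    return await _original_get_data_from_worker(*args, **kwargs)


def dask_setup(worker):
    # Run inside each worker process, e.g. via:
    #   dask-worker <scheduler-address> --preload flaky_preload.py
    distributed.worker.get_data_from_worker = _flaky_get_data_from_worker
```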
Note: it is a race condition, so you might need to run the code multiple times until it gets stuck...
Here are some observations I have made using the worker's and scheduler's properties and the transition log.
Let's say worker `A` is stuck while processing task `T1`, which depends on task `T2` owned by worker `B`:

- `T2` is present on `B`: `B` has the task in its memory (`dask_worker.data`).
- `A`, however, thinks that no one owns the task: `dask_worker.tasks[T2].who_has` is empty, so `A` does not even start asking `B` for the data.
From the transition log, this is what I think happens (but I am happy if someone with more knowledge of how the worker works could confirm my observations):

1. `A` wants to process another task, `T3`, which needs a dependency `T4`, also from `B`. This triggers `gather_dep`, which calls `get_data_from_worker`. This call fails (either due to a real network issue or due to our patched function above).
2. In the meantime, `A` is asked to do `T1`, which depends on `T2` (owned by `B`).
3. In the `OSError` handling of `gather_dep`, the local state of worker `A` is changed so that all tasks owned by `B` are marked as not owned by `B` anymore. In our case, that is `T2` and `T4`. (see here)

The final state is that worker `A` thinks no one owns the data for `T2`, while the scheduler will not re-distribute the task (as it was never marked as missing).

One last comment: using the setting `DASK_DISTRIBUTED__COMM__RETRY__COUNT` it is possible to make failures of the `get_data_from_worker` function less likely. But unfortunately, this will just decrease the probability, not fix the problem.

Environment:
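A side note on that `DASK_DISTRIBUTED__COMM__RETRY__COUNT` workaround: the environment variable maps to the `distributed.comm.retry.count` configuration key, so the same retry count can also be set through Dask's config. A small sketch (the value 3 is arbitrary):

```python
# Equivalent ways to raise the comm retry count (illustrative value only):
#   export DASK_DISTRIBUTED__COMM__RETRY__COUNT=3
# or, programmatically, before starting the cluster:
import dask

dask.config.set({"distributed.comm.retry.count": 3})
print(dask.config.get("distributed.comm.retry.count"))  # -> 3
```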