update_who_has can now remove workers #6435
Conversation
This works, but the timings can be very slow on a busy cluster:
- A task which loses its last worker from who_has will not be included in the find_missing query until it reaches the very top of data_needed. On a busy node, this can take tens of seconds (see the sketch below).
- A task which gains a worker in who_has while all other workers are in flight or busy will not transition from fetch to flight until something else kicks off ensure_communicating. The realistic worst case is that nothing will until a worker goes out of flight, which may take tens of seconds.
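To illustrate the first point, here is a minimal, hypothetical sketch of the fetch machinery (names loosely follow distributed/worker.py but are illustrative only) showing why an empty who_has is only noticed once the task surfaces from the queue:

```python
import heapq

# Hypothetical, simplified model of the worker's fetch queue;
# the real logic lives in distributed/worker.py.
data_needed: list[tuple[int, str]] = []  # heap of (priority, key)

def ensure_communicating(
    who_has: dict[str, set[str]], recommendations: dict[str, str]
) -> None:
    """Pop tasks to fetch; an empty who_has is only noticed here."""
    while data_needed:
        _, key = heapq.heappop(data_needed)
        if not who_has.get(key):
            # The task lost its last replica, but it is flagged for the
            # find_missing query only once it reaches the top of the
            # heap; on a busy node that can take tens of seconds.
            recommendations[key] = "missing"
            continue
        # ... otherwise pick a worker from who_has[key] and gather ...
```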
```python
if not ts.who_has:
    recommendations[ts] = "missing"
    continue
```
Must remove the assertion in validate_task_fetch (and add a comment explaining why the assertion isn't there).
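Something like this (a sketch only; the actual invariants checked in validate_task_fetch may differ):

```python
def validate_task_fetch(self, ts):
    assert ts.key not in self.data
    assert self.address not in ts.who_has
    # No `assert ts.who_has` here: update_who_has may have just removed
    # the last known replica, leaving the task briefly in state "fetch"
    # with an empty who_has until it is transitioned to "missing".
    for worker in ts.who_has:
        assert ts.key in self.has_what[worker]
```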
"which is not true.", | ||
self.address, | ||
ts, | ||
) |
Must indent one more level
```python
for worker in del_workers:
    self.has_what[worker].discard(key)
    # Can't remove from self.data_needed_per_worker; there is logic
    # in _select_keys_for_gather to deal with this
```
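The logic being referred to could look roughly like this (a hypothetical sketch, not the actual implementation in distributed/worker.py):

```python
def _select_keys_for_gather(self, worker: str, candidates: set[str]) -> list[str]:
    """Pick keys to actually fetch from `worker`, skipping stale entries."""
    keys = []
    for key in candidates:
        ts = self.tasks.get(key)
        # data_needed_per_worker may hold stale entries: update_who_has
        # can remove `worker` from ts.who_has without purging this queue.
        # Skip such keys rather than fetching from a worker that no
        # longer holds them.
        if ts is None or ts.state != "fetch" or worker not in ts.who_has:
            continue
        keys.append(key)
    return keys
```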
The mentioned logic from my PR is missing:

```diff
diff --git a/distributed/worker.py b/distributed/worker.py
index 1f086df6e..1b1c412ab 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2679,7 +2679,7 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore
 
-        if ts is None or ts.state == finish:
+        if ts is None:
             return {}, []
```

I don't know yet if this has any unforeseen side effects, though.
It doesn't work properly: the task is transitioned to "missing" and is "rescued" 1 second later by find_missing.
Alternative to #6342
This includes all the logic of #6342 in update_who_has that reduces the information in has_what/who_has, i.e. it ensures that the state aligns with the most recent information the scheduler provided, but it does not include the transitions as part of update_who_has. This allows for an overall much less invasive change.

It also includes the tests proposed over there. The only modifications are that I removed the update-who-has log and reduced the story events in the log a bit.
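For context, a minimal sketch of that reconciliation (hypothetical and simplified; the real method in distributed/worker.py differs in detail, and has_what is assumed to be a defaultdict of sets):

```python
def update_who_has(self, who_has: dict[str, list[str]]) -> None:
    """Reconcile has_what / ts.who_has with the scheduler's latest info.

    Unlike #6342, this only shrinks or extends the replica bookkeeping;
    it does not issue any task transitions itself.
    """
    for key, workers in who_has.items():
        ts = self.tasks.get(key)
        if ts is None:
            continue
        new_workers = set(workers)
        # Drop workers the scheduler no longer lists for this key...
        for worker in ts.who_has - new_workers:
            self.has_what[worker].discard(key)
        # ...and record the newly reported ones.
        for worker in new_workers - ts.who_has:
            self.has_what[worker].add(key)
        ts.who_has = new_workers
```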
@crusaderky You still have a few other cleanup things in your PR. I was merely curious to get your thoughts on this, because I would prefer a smaller footprint for this change, and it seems to be doable.