
update_who_has can now remove workers #6435

Closed

Conversation


@fjetter (Member) commented May 24, 2022

Alternative to #6342

This includes all the logic of #6342 in update_who_has that reduces the information in has_what/who_has, i.e. it ensures that the state aligns with the most recent information the scheduler provided, but it does not perform the transitions as part of update_who_has. This allows for a much less invasive change overall.
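As a rough sketch of that reconciliation idea (illustrative only, not the actual distributed.worker implementation; names and signatures are simplified), the update aligns both mappings with the scheduler's latest view, removing workers as well as adding them:

from collections import defaultdict

def update_who_has(who_has, has_what, scheduler_who_has):
    # who_has: key -> set of worker addresses (the worker's local view)
    # has_what: worker address -> set of keys (inverse mapping, defaultdict(set))
    # scheduler_who_has: key -> workers, per the scheduler's latest info
    for key, workers in scheduler_who_has.items():
        new = set(workers)
        old = who_has.get(key, set())
        for worker in old - new:
            # Workers the scheduler no longer lists are dropped too;
            # this is the new "can now remove workers" behaviour
            has_what[worker].discard(key)
        for worker in new - old:
            has_what[worker].add(key)
        who_has[key] = new

For example (hypothetical addresses):

who_has = {"x": {"tcp://a:1234"}}
has_what = defaultdict(set, {"tcp://a:1234": {"x"}})
update_who_has(who_has, has_what, {"x": ["tcp://b:1234"]})
# who_has["x"] == {"tcp://b:1234"}; has_what["tcp://a:1234"] == set()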

It also includes the tests proposed there. The only modifications are that I removed the update-who-has log and trimmed the story events in the log a bit.

@crusaderky You still have a few other cleanup items in your PR. I was merely curious to get your thoughts on this, because I would prefer this change to have a smaller footprint, and that seems to be doable.

@github-actions (Contributor)

Unit Test Results

15 files ±0    15 suites ±0    6h 18m 47s ⏱️ −14m 56s
2 814 tests +4:    2 734 ✔️ +6    79 💤 ±0    1 ❌ −2
20 861 runs +28:    19 932 ✔️ +35    928 💤 −4    1 ❌ −3

For more details on these failures, see this check.

Results for commit bd8b883. ± Comparison against base commit d32f4b0.

@crusaderky (Collaborator) left a comment


This works, but the timings can be very slow on a busy cluster:

  1. a task which loses its last worker from who_has will not be included in the find_missing query until it reaches the very top of data_needed. On a busy node, this can take tens of seconds.
  2. a task which gains a worker in who_has while all other workers are in flight or busy will not transition from fetch to flight until something else kicks off ensure_communicating. The realistic worst case is that nothing will until a worker goes out of flight, which may take tens of seconds.
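To illustrate the failure mode, here is a toy model only (data_needed, ensure_communicating, and gather_from are simplified stand-ins, not the real worker code):

import heapq

data_needed = []  # priority heap of (priority, key); lowest priority pops first

def gather_from(workers, key):
    # Stand-in for the real fetch logic; illustration only
    print(f"fetching {key} from {sorted(workers)}")

def ensure_communicating(who_has, in_flight_workers):
    while data_needed:
        _prio, key = heapq.heappop(data_needed)
        workers = who_has.get(key, set()) - in_flight_workers
        if not workers:
            # Case 1: a key whose who_has went empty is only noticed here,
            # once it bubbles up to the top of the heap.
            continue
        gather_from(workers, key)
    # Case 2: if who_has[key] gains a worker after the loop has skipped the
    # key, nothing re-runs ensure_communicating until some unrelated event
    # (e.g. a worker leaving flight) triggers it.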

if not ts.who_has:
    recommendations[ts] = "missing"
    continue

@crusaderky (Collaborator)

Must remove the assertion (and add a comment explaining why it isn't there) in validate_task_fetch
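Something along these lines (a sketch of the suggested shape only; the actual assertions in validate_task_fetch may differ):

def validate_task_fetch(self, ts):
    assert ts.key not in self.data
    assert self.address not in ts.who_has
    # Deliberately no `assert ts.who_has` here: update_who_has may now
    # empty who_has for a fetch task, which only becomes `missing` on the
    # next transition pass.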

"which is not true.",
self.address,
ts,
)
@crusaderky (Collaborator)

Must indent one more level

for worker in del_workers:
    self.has_what[worker].discard(key)
# Can't remove from self.data_needed_per_worker; there is logic
# in _select_keys_for_gather to deal with this
@crusaderky (Collaborator)

This is missing the logic mentioned in the comment, which exists in my PR.
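For reference, the kind of guard that comment alludes to could look like this (a hedged sketch adapted from the idea in #6342, simplified to a free function; not the actual _select_keys_for_gather):

def select_keys_for_gather(worker, candidates, who_has, task_states):
    selected = []
    for key in candidates:
        if task_states.get(key) != "fetch" or worker not in who_has.get(key, ()):
            # Stale entry: update_who_has removed the worker from who_has
            # (or the task moved on), but data_needed_per_worker was not
            # purged, so skip it here instead.
            continue
        selected.append(key)
    return selected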

@fjetter (Member, Author) commented May 25, 2022

diff --git a/distributed/worker.py b/distributed/worker.py
index 1f086df6e..1b1c412ab 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2679,7 +2679,7 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore

-        if ts is None or ts.state == finish:
+        if ts is None:
             return {}, []

This makes test_new_replica_while_all_workers_in_flight of #6342 pass by allowing a transition chain fetch->released->fetch.
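Spelled out (illustrative pseudocode in comments, not actual distributed code):

# ts.state == "fetch" and every worker in ts.who_has is in flight or busy.
# Fresh who_has info then arrives for ts:
#
#   transition(ts, "released")   # drop the stale fetch
#   transition(ts, "fetch")      # re-enqueue now that a new worker is known
#
# Before the diff, the second call hit `ts.state == finish` and returned
# ({}, []), so the new replica was never acted upon.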

I don't know yet whether this has any unforeseen side effects, though.

@crusaderky (Collaborator)


The diff above doesn't work properly: the task is transitioned to missing and is "rescued" 1 second later by find_missing.
I'm also very anxious about this change; it introduces potential new transitions more or less everywhere.

@fjetter closed this Jun 1, 2022