
update_who_has can now remove workers #6435

Closed

Conversation


@fjetter (Member) commented May 24, 2022

Alternative to #6342

This includes all the logic of #6342 in update_who_has that reduces the information in has_what/who_has, i.e. it ensures that the state aligns with the most recent information the scheduler provided, but it does not perform the transitions as part of update_who_has. This allows for a much less invasive change overall.
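As a rough sketch of that reconciliation idea (illustrative only, not the actual distributed.worker implementation; names and signatures are simplified), the update aligns both mappings with the scheduler's latest view, removing workers as well as adding them:

from collections import defaultdict

def update_who_has(who_has, has_what, scheduler_who_has):
    # who_has: key -> set of worker addresses (the worker's local view)
    # has_what: worker address -> set of keys (inverse mapping, defaultdict(set))
    # scheduler_who_has: key -> workers, per the scheduler's latest info
    for key, workers in scheduler_who_has.items():
        new = set(workers)
        old = who_has.get(key, set())
        for worker in old - new:
            # Workers the scheduler no longer lists are dropped too;
            # this is the new "can now remove workers" behaviour
            has_what[worker].discard(key)
        for worker in new - old:
            has_what[worker].add(key)
        who_has[key] = new

For example (hypothetical addresses):

who_has = {"x": {"tcp://a:1234"}}
has_what = defaultdict(set, {"tcp://a:1234": {"x"}})
update_who_has(who_has, has_what, {"x": ["tcp://b:1234"]})
# who_has["x"] == {"tcp://b:1234"}; has_what["tcp://a:1234"] == set()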

It also includes the tests proposed there. The only modifications are that I removed the update-who-has log and trimmed the story events in the log a bit.

@crusaderky You still have a few other cleanup items in your PR. I was merely curious to get your thoughts on this, because I would prefer this change to have a smaller footprint, and that seems to be doable.

@github-actions (Contributor)

Unit Test Results

15 files ±0    15 suites ±0    6h 18m 47s ⏱️ −14m 56s
2 814 tests +4:    2 734 ✔️ +6    79 💤 ±0    1 ❌ −2
20 861 runs +28:    19 932 ✔️ +35    928 💤 −4    1 ❌ −3

For more details on these failures, see this check.

Results for commit bd8b883. ± Comparison against base commit d32f4b0.

@crusaderky (Collaborator) left a comment


This works, but the timings can be very slow on a busy cluster:

  1. a task which loses its last worker from who_has will not be included in the find_missing query until it reaches the very top of data_needed. On a busy node, this can take tens of seconds.
  2. a task which gains a worker in who_has while all other workers are in flight or busy will not transition from fetch to flight until something else kicks off ensure_communicating. The realistic worst case is that nothing will until a worker goes out of flight, which may take tens of seconds.
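To illustrate the failure mode, here is a toy model only (data_needed, ensure_communicating, and gather_from are simplified stand-ins, not the real worker code):

import heapq

data_needed = []  # priority heap of (priority, key); lowest priority pops first

def gather_from(workers, key):
    # Stand-in for the real fetch logic; illustration only
    print(f"fetching {key} from {sorted(workers)}")

def ensure_communicating(who_has, in_flight_workers):
    while data_needed:
        _prio, key = heapq.heappop(data_needed)
        workers = who_has.get(key, set()) - in_flight_workers
        if not workers:
            # Case 1: a key whose who_has went empty is only noticed here,
            # once it bubbles up to the top of the heap.
            continue
        gather_from(workers, key)
    # Case 2: if who_has[key] gains a worker after the loop has skipped the
    # key, nothing re-runs ensure_communicating until some unrelated event
    # (e.g. a worker leaving flight) triggers it.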

if not ts.who_has:
    recommendations[ts] = "missing"
    continue

@crusaderky (Collaborator)

Must remove the assertion (and add a comment explaining why it isn't there) in validate_task_fetch
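Something along these lines (a sketch of the suggested shape only; the actual assertions in validate_task_fetch may differ):

def validate_task_fetch(self, ts):
    assert ts.key not in self.data
    assert self.address not in ts.who_has
    # Deliberately no `assert ts.who_has` here: update_who_has may now
    # empty who_has for a fetch task, which only becomes `missing` on the
    # next transition pass.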

"which is not true.",
self.address,
ts,
)
@crusaderky (Collaborator)

Must indent one more level

for worker in del_workers:
    self.has_what[worker].discard(key)
# Can't remove from self.data_needed_per_worker; there is logic
# in _select_keys_for_gather to deal with this
@crusaderky (Collaborator)

This is missing the logic mentioned in the comment, which exists in my PR.
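For reference, the kind of guard that comment alludes to could look like this (a hedged sketch adapted from the idea in #6342, simplified to a free function; not the actual _select_keys_for_gather):

def select_keys_for_gather(worker, candidates, who_has, task_states):
    selected = []
    for key in candidates:
        if task_states.get(key) != "fetch" or worker not in who_has.get(key, ()):
            # Stale entry: update_who_has removed the worker from who_has
            # (or the task moved on), but data_needed_per_worker was not
            # purged, so skip it here instead.
            continue
        selected.append(key)
    return selected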

@fjetter (Member, Author) commented May 25, 2022

diff --git a/distributed/worker.py b/distributed/worker.py
index 1f086df6e..1b1c412ab 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2679,7 +2679,7 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore

-        if ts is None or ts.state == finish:
+        if ts is None:
             return {}, []

This makes test_new_replica_while_all_workers_in_flight of #6342 pass by allowing a transition chain fetch->released->fetch.
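Spelled out (illustrative pseudocode in comments, not actual distributed code):

# ts.state == "fetch" and every worker in ts.who_has is in flight or busy.
# Fresh who_has info then arrives for ts:
#
#   transition(ts, "released")   # drop the stale fetch
#   transition(ts, "fetch")      # re-enqueue now that a new worker is known
#
# Before the diff, the second call hit `ts.state == finish` and returned
# ({}, []), so the new replica was never acted upon.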

I don't know yet whether this has any unforeseen side effects, though.

@crusaderky (Collaborator)


The diff above doesn't work properly: the task is transitioned to missing and is "rescued" 1 second later by find_missing.
I'm also very anxious about this change; it introduces potential new transitions more or less everywhere.

@fjetter closed this Jun 1, 2022