
Remove EnsureCommunicatingAfterTransitions #6462

Merged

Conversation

crusaderky (Collaborator) commented May 26, 2022

Partially closes #6497

#6165 introduced this hack for the sake of remaining functionally identical to the previous code.
This PR is a cleaner redesign that is conceptually equivalent.

@crusaderky crusaderky self-assigned this May 26, 2022
crusaderky (Collaborator, Author) commented May 26, 2022

[EDIT] this comment refers to a previous version of the PR. The condition described below is still there, but only triggered by multiple events in short sequence.

This PR introduces an O(n^2 * log n) condition where:

  1. task y has 100 dependencies x1, ..., x100, all on the same few workers (nworkers << ntasks)
  2. x1 transitions released->fetch->_ensure_communicating->flight. len(data_needed) == 0.
  3. x2 transitions released->fetch->_ensure_communicating and remains in fetch. len(data_needed) == 1; we had to skip 0 tasks before it in data_needed.
  4. ...x100 transitions released->fetch->_ensure_communicating and remains in fetch. len(data_needed) == 99; we had to skip 98 tasks before it in data_needed.

So we perform ~100^2/2 ≈ 5000 pure-CPU iterations, each of which pops a task out of data_needed, records it in skipped_worker_in_flight_or_busy, and then pushes it back onto data_needed at the end of _ensure_communicating. As data_needed is a heap, each push is O(log n), where n is the number of tasks it contains.

This should be negligible most of the time. I'll write another PR later on to make running _ensure_communicating twice in a row truly negligible (blocked by #6388).

To clarify: this condition already exists when there are many compute-task and acquire-replicas requests, which cause the data_needed queue to grow rapidly. This PR specifically extends the condition to tasks fetched within the same event.
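The skip-and-push-back pattern described above can be sketched as follows. This is a minimal model with invented names, not the actual distributed code; the real _ensure_communicating does considerably more bookkeeping:

```python
import heapq

def ensure_communicating_sketch(data_needed, in_flight_workers):
    """Toy model of the quadratic pattern: each call pops every queued task,
    skips those whose source worker is already in flight, and pushes the
    skipped ones back at the end. Calling this once per newly-added task
    yields ~n^2/2 pops overall, each push-back costing O(log n) on the heap.
    """
    skipped_worker_in_flight_or_busy = []
    started = []
    while data_needed:
        priority, key, worker = heapq.heappop(data_needed)
        if worker in in_flight_workers:
            # Can't fetch from this worker right now; remember the task
            skipped_worker_in_flight_or_busy.append((priority, key, worker))
        else:
            in_flight_workers.add(worker)
            started.append(key)
    # Put the skipped tasks back; each push is O(log n)
    for item in skipped_worker_in_flight_or_busy:
        heapq.heappush(data_needed, item)
    return started
```

With 100 tasks all pointing at one worker, 100 successive calls perform roughly 100^2/2 pops in total, matching the estimate above.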

github-actions bot (Contributor) commented May 26, 2022

Unit Test Results

    15 files (+12)    15 suites (+12)    6h 40m 57s ⏱️ (+5h 53m 11s)
    2,831 tests (+1,635):  2,748 passed (+1,586), 81 skipped (+47), 2 failed (+2)
    20,979 runs (+17,394): 20,032 passed (+16,549), 945 skipped (+843), 2 failed (+2)

For more details on these failures, see this check.

Results for commit e1058fe. ± Comparison against base commit 69b798d.

♻️ This comment has been updated with latest results.

crusaderky (Collaborator, Author) commented May 27, 2022

[EDIT] this comment refers to a previous version of the PR and is now obsolete.

The transition log has changed from

           - ('x', 'ensure-task-exists', 'released')
           - ('x', 'released', 'fetch', 'fetch', {})
           - ('gather-dependencies', 'tcp://127.0.0.1:53985', {'x'})
           - ('x', 'fetch', 'flight', 'flight', {})

to

           - ('x', 'ensure-task-exists', 'released'),
           - ('gather-dependencies', 'tcp://127.0.0.1:53985', {'x'}),
           - ('x', 'released', 'fetch', 'fetch', {'x': ('flight', 'tcp://127.0.0.1:53985')}),
           - ('x', 'fetch', 'flight', 'flight', {}),

This... is correct, but it's very counter-intuitive.
What's happening is that

  1. transition_released_fetch starts. It sets ts.state = "fetch", adds it to data_needed, and internally invokes _ensure_communicating.
  2. _ensure_communicating removes ts from data_needed, prints its own log line gather-dependencies, and returns a recommendation to transition to flight, together with a GatherDep instruction
  3. transition_released_fetch returns
  4. _transition logs the released->fetch transition and returns the recommendations and instructions generated by _ensure_communicating
  5. _transitions calls _transition again for fetch->flight and logs the outcome.

Again, all this is correct, but it's confusing; it took me an unhealthy amount of time to figure out why the gather-dependencies log line appeared before ('x', 'released', 'fetch', 'fetch'). Not sure if we can/want to do anything about this?
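The call nesting in the five steps above can be reproduced with a toy model (all names simplified; this is not the actual distributed code). It shows why the gather-dependencies line lands in the log before the released->fetch line:

```python
log = []

def ensure_communicating():
    # Step 2: logs its own line and recommends fetch -> flight
    log.append(("gather-dependencies", "tcp://127.0.0.1:53985", {"x"}))
    return {"x": "flight"}

def transition_released_fetch():
    # Step 1: the transition handler itself invokes _ensure_communicating...
    return ensure_communicating()

def transition(key, finish):
    if finish == "fetch":
        recs = transition_released_fetch()
        # Step 4: ...so released->fetch is only logged after the handler returns
        log.append((key, "released", "fetch", "fetch", recs))
        for k, state in recs.items():
            transition(k, state)  # Step 5: fetch -> flight
    elif finish == "flight":
        log.append((key, "fetch", "flight", "flight", {}))

transition("x", "fetch")
# log order: gather-dependencies, then released->fetch, then fetch->flight
```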

@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch 3 times, most recently from 2c32233 to b7a4538 Compare May 30, 2022 14:30
@crusaderky crusaderky linked an issue May 30, 2022 that may be closed by this pull request
@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from 201bd67 to 17ce3ef Compare June 1, 2022 13:12
@crusaderky crusaderky marked this pull request as ready for review June 2, 2022 18:37
fjetter (Member) commented Jun 3, 2022

This... is correct, but it's very counter-intuitive.
What's happening is that

As I'm arguing in #6442 (comment) I believe this log should simply be removed

fjetter (Member) commented Jun 3, 2022

I could reproduce the issue with the test's assumptions about which keys get fetched. This is not at all obvious from the tests themselves, and I find it a bit concerning.

It's also not about priorities, ordering, etc., but rather that we're using an unordered set for ts.dependencies, which then triggers transitions here:

    for dep_ts in ts.dependencies:
        if dep_ts.state != "memory":
            ts.waiting_for_data.add(dep_ts)
            dep_ts.waiters.add(ts)
            recommendations[dep_ts] = "fetch"

This randomness was previously masked by the delayed _ensure_communicating. I don't feel great about introducing randomness into our scheduling (e.g. there is not even a key-based tie breaker). This entire refactoring effort is supposed to make things more deterministic.
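To illustrate the nondeterminism (with hypothetical keys): iteration order over a set of strings is not stable across interpreter runs, because str hashing is randomized via PYTHONHASHSEED; sorting with a key-based tie breaker restores determinism:

```python
# Iterating a set of task keys yields an arbitrary order that can change
# between interpreter runs, so recommendations get issued in random order.
deps = {"x3", "x1", "x2"}
priority = {"x1": 1, "x2": 1, "x3": 1}  # equal priorities: ties everywhere

# A (priority, key) sort key breaks ties deterministically by task key,
# regardless of hash seed.
ordered = sorted(deps, key=lambda k: (priority[k], k))
```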

fjetter (Member) commented Jun 3, 2022

I opened #6497 with a suggestion on how to move forward with ensure_communicating. Maybe it's worth putting this PR on ice until #6497 is settled

crusaderky (Collaborator, Author) commented Jun 9, 2022

This has been parked until a new design is agreed upon in #6497.

@crusaderky crusaderky marked this pull request as draft June 10, 2022 16:32
@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch 5 times, most recently from aa5273d to 1190bcc Compare June 16, 2022 14:12
github-actions bot (Contributor) commented Jun 16, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    15 files (±0)    15 suites (±0)    10h 10m 51s ⏱️ (+27s)
    2,896 tests (+2):  2,811 passed (+3), 84 skipped (±0), 1 failed (−1)
    21,451 runs (+14): 20,486 passed (+15), 964 skipped (±0), 1 failed (−1)

For more details on these failures, see this check.

Results for commit 9051170. ± Comparison against base commit 88e1fe0.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from 1190bcc to ab0e9a1 Compare June 17, 2022 12:00
@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from ab0e9a1 to 5715261 Compare June 17, 2022 12:30
@crusaderky crusaderky marked this pull request as ready for review June 17, 2022 15:39
crusaderky (Collaborator, Author) commented:

The PR has been rewritten from scratch and is now ready for review and merge.
It exacerbates the O(n^2*logn) condition described above; this is fixed in #6587.

@jsignell jsignell mentioned this pull request Jun 20, 2022
9 tasks
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 22, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022
fjetter (Member) left a comment:

This is great!

Comment on lines -2550 to -2560

    return merge_recs_instructions(
        (recommendations, []),
        self._ensure_communicating(stimulus_id=ev.stimulus_id),
    )
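For context, a minimal sketch of the call-site pattern being deleted (names other than merge_recs_instructions are invented for illustration, and this merge helper is a guess at its behaviour, not the actual distributed implementation): every event handler had to explicitly merge its own recommendations with _ensure_communicating's output.

```python
def merge_recs_instructions(*pairs):
    # Assumed behaviour: combine (recommendations, instructions) pairs into
    # one merged dict and one concatenated instruction list.
    recommendations, instructions = {}, []
    for recs, instr in pairs:
        recommendations.update(recs)
        instructions.extend(instr)
    return recommendations, instructions

def handle_event_old_style(recommendations, ensure_communicating):
    # Old shape: each handler repeats this merge at its own call site;
    # this PR removes the need for the boilerplate.
    return merge_recs_instructions((recommendations, []), ensure_communicating())
```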
Member:

I love that this is not everywhere anymore ❤️

@crusaderky crusaderky merged commit 4b24753 into dask:main Jun 26, 2022
@crusaderky crusaderky deleted the WSMR/EnsureCommunicatingAfterTransitions branch June 26, 2022 08:03
Development

Successfully merging this pull request may close these issues:

  - Alternatives for current ensure_communicating