
Remove EnsureCommunicatingAfterTransitions #6462

Merged

Conversation

crusaderky (Collaborator) commented May 26, 2022

Partially closes #6497

#6165 introduced this hack for the sake of remaining functionally identical to the previous code.
This PR is a cleaner redesign that is conceptually equivalent.

@crusaderky crusaderky self-assigned this May 26, 2022
crusaderky (Collaborator, Author) commented May 26, 2022

[EDIT] this comment refers to a previous version of the PR. The condition described below is still there, but only triggered by multiple events in short sequence.

This PR introduces an O(n^2 * log n) condition where:

  1. task y has 100 dependencies x1, ..., x100, all on the same few workers (nworkers << ntasks)
  2. x1 transitions released->fetch->_ensure_communicating->flight. len(data_needed) == 0.
  3. x2 transitions released->fetch->_ensure_communicating and remains in fetch. len(data_needed) == 1; we had to skip 0 tasks before it in data_needed.
  4. ...x100 transitions released->fetch->_ensure_communicating and remains in fetch. len(data_needed) == 99; we had to skip 98 tasks before it in data_needed.

So we perform ~100^2/2 ≈ 5000 pure-CPU iterations, each of which pops a task out of data_needed, records it in skipped_worker_in_flight_or_busy, and then pushes it back onto data_needed at the end of _ensure_communicating. As data_needed is a heap, each push is O(log n), where n is the number of tasks it contains.

This should be negligible most of the time. I'll write another PR later on to make running _ensure_communicating twice in a row truly negligible (blocked by #6388).

To clarify: this condition already exists when there are many compute-task and acquire-replicas requests, which cause the data_needed queue to grow rapidly. This PR specifically extends the condition to tasks fetched within the same event.
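The skip-and-push-back pattern described above can be sketched as follows. This is a minimal model with invented names, not the actual distributed code; the real _ensure_communicating does considerably more bookkeeping:

```python
import heapq

def ensure_communicating_sketch(data_needed, in_flight_workers):
    """Toy model of the quadratic pattern: each call pops every queued task,
    skips those whose source worker is already in flight, and pushes the
    skipped ones back at the end. Calling this once per newly-added task
    yields ~n^2/2 pops overall, each push-back costing O(log n) on the heap.
    """
    skipped_worker_in_flight_or_busy = []
    started = []
    while data_needed:
        priority, key, worker = heapq.heappop(data_needed)
        if worker in in_flight_workers:
            # Can't fetch from this worker right now; remember the task
            skipped_worker_in_flight_or_busy.append((priority, key, worker))
        else:
            in_flight_workers.add(worker)
            started.append(key)
    # Put the skipped tasks back; each push is O(log n)
    for item in skipped_worker_in_flight_or_busy:
        heapq.heappush(data_needed, item)
    return started
```

With 100 tasks all pointing at one worker, 100 successive calls perform roughly 100^2/2 pops in total, matching the estimate above.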

github-actions bot (Contributor) commented May 26, 2022

Unit Test Results

    15 files (+12)    15 suites (+12)    6h 40m 57s ⏱️ (+5h 53m 11s)
    2,831 tests (+1,635):  2,748 passed (+1,586), 81 skipped (+47), 2 failed (+2)
    20,979 runs (+17,394): 20,032 passed (+16,549), 945 skipped (+843), 2 failed (+2)

For more details on these failures, see this check.

Results for commit e1058fe. ± Comparison against base commit 69b798d.

♻️ This comment has been updated with latest results.

crusaderky (Collaborator, Author) commented May 27, 2022

[EDIT] this comment refers to a previous version of the PR and is now obsolete.

The transition log has changed from

           - ('x', 'ensure-task-exists', 'released')
           - ('x', 'released', 'fetch', 'fetch', {})
           - ('gather-dependencies', 'tcp://127.0.0.1:53985', {'x'})
           - ('x', 'fetch', 'flight', 'flight', {})

to

           - ('x', 'ensure-task-exists', 'released'),
           - ('gather-dependencies', 'tcp://127.0.0.1:53985', {'x'}),
           - ('x', 'released', 'fetch', 'fetch', {'x': ('flight', 'tcp://127.0.0.1:53985')}),
           - ('x', 'fetch', 'flight', 'flight', {}),

This... is correct, but it's very counter-intuitive.
What's happening is that

  1. transition_released_fetch starts. It sets ts.state = "fetch", adds it to data_needed, and internally invokes _ensure_communicating.
  2. _ensure_communicating removes ts from data_needed, prints its own log line gather-dependencies, and returns a recommendation to transition to flight, together with a GatherDep instruction
  3. transition_released_fetch returns
  4. _transition logs the released->fetch transition and returns the recommendations and instructions generated by _ensure_communicating
  5. _transitions calls _transition again for fetch->flight and logs the outcome.

Again, all this is correct, but it's confusing; it took me an unhealthy amount of time to figure out why the gather-dependencies log line appeared before ('x', 'released', 'fetch', 'fetch'). Not sure if we can/want to do anything about this?
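The call nesting in the five steps above can be reproduced with a toy model (all names simplified; this is not the actual distributed code). It shows why the gather-dependencies line lands in the log before the released->fetch line:

```python
log = []

def ensure_communicating():
    # Step 2: logs its own line and recommends fetch -> flight
    log.append(("gather-dependencies", "tcp://127.0.0.1:53985", {"x"}))
    return {"x": "flight"}

def transition_released_fetch():
    # Step 1: the transition handler itself invokes _ensure_communicating...
    return ensure_communicating()

def transition(key, finish):
    if finish == "fetch":
        recs = transition_released_fetch()
        # Step 4: ...so released->fetch is only logged after the handler returns
        log.append((key, "released", "fetch", "fetch", recs))
        for k, state in recs.items():
            transition(k, state)  # Step 5: fetch -> flight
    elif finish == "flight":
        log.append((key, "fetch", "flight", "flight", {}))

transition("x", "fetch")
# log order: gather-dependencies, then released->fetch, then fetch->flight
```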

@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch 3 times, most recently from 2c32233 to b7a4538 Compare May 30, 2022 14:30
@crusaderky crusaderky linked an issue May 30, 2022 that may be closed by this pull request
@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from 201bd67 to 17ce3ef Compare June 1, 2022 13:12
@crusaderky crusaderky marked this pull request as ready for review June 2, 2022 18:37
fjetter (Member) commented Jun 3, 2022

This... is correct, but it's very counter-intuitive.
What's happening is that

As I'm arguing in #6442 (comment) I believe this log should simply be removed

fjetter (Member) commented Jun 3, 2022

I could reproduce the issue with the test's assumptions about which keys get fetched. This is not at all obvious from the tests themselves, and I find it a bit concerning.

It's also not about priorities, ordering, etc., but rather that we're using an unordered set for ts.dependencies, which then triggers transitions here:

    for dep_ts in ts.dependencies:
        if dep_ts.state != "memory":
            ts.waiting_for_data.add(dep_ts)
            dep_ts.waiters.add(ts)
            recommendations[dep_ts] = "fetch"

This randomness was previously masked by the delayed _ensure_communicating. I don't feel great about introducing randomness into our scheduling (e.g. there is not even a key-based tie breaker). This entire refactoring effort is supposed to make things more deterministic.
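To illustrate the nondeterminism (with hypothetical keys): iteration order over a set of strings is not stable across interpreter runs, because str hashing is randomized via PYTHONHASHSEED; sorting with a key-based tie breaker restores determinism:

```python
# Iterating a set of task keys yields an arbitrary order that can change
# between interpreter runs, so recommendations get issued in random order.
deps = {"x3", "x1", "x2"}
priority = {"x1": 1, "x2": 1, "x3": 1}  # equal priorities: ties everywhere

# A (priority, key) sort key breaks ties deterministically by task key,
# regardless of hash seed.
ordered = sorted(deps, key=lambda k: (priority[k], k))
```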

fjetter (Member) commented Jun 3, 2022

I opened #6497 with a suggestion on how to move forward with ensure_communicating. Maybe it's worth putting this PR on ice until #6497 is settled

crusaderky (Collaborator, Author) commented Jun 9, 2022

This has been parked until a new design is agreed upon in #6497.

@crusaderky crusaderky marked this pull request as draft June 10, 2022 16:32
@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch 5 times, most recently from aa5273d to 1190bcc Compare June 16, 2022 14:12
github-actions bot (Contributor) commented Jun 16, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    15 files (±0)    15 suites (±0)    10h 10m 51s ⏱️ (+27s)
    2,896 tests (+2):  2,811 passed (+3), 84 skipped (±0), 1 failed (−1)
    21,451 runs (+14): 20,486 passed (+15), 964 skipped (±0), 1 failed (−1)

For more details on these failures, see this check.

Results for commit 9051170. ± Comparison against base commit 88e1fe0.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from 1190bcc to ab0e9a1 Compare June 17, 2022 12:00
@crusaderky crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from ab0e9a1 to 5715261 Compare June 17, 2022 12:30
@crusaderky crusaderky marked this pull request as ready for review June 17, 2022 15:39
crusaderky (Collaborator, Author) commented:

The PR has been rewritten from scratch and is now ready for review and merge.
It exacerbates the O(n^2*logn) condition described above; this is fixed in #6587.

@jsignell jsignell mentioned this pull request Jun 20, 2022
9 tasks
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 22, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022
fjetter (Member) left a comment:

This is great!

Comment on lines -2550 to -2560

    return merge_recs_instructions(
        (recommendations, []),
        self._ensure_communicating(stimulus_id=ev.stimulus_id),
    )
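For context, a minimal sketch of the call-site pattern being deleted (names other than merge_recs_instructions are invented for illustration, and this merge helper is a guess at its behaviour, not the actual distributed implementation): every event handler had to explicitly merge its own recommendations with _ensure_communicating's output.

```python
def merge_recs_instructions(*pairs):
    # Assumed behaviour: combine (recommendations, instructions) pairs into
    # one merged dict and one concatenated instruction list.
    recommendations, instructions = {}, []
    for recs, instr in pairs:
        recommendations.update(recs)
        instructions.extend(instr)
    return recommendations, instructions

def handle_event_old_style(recommendations, ensure_communicating):
    # Old shape: each handler repeats this merge at its own call site;
    # this PR removes the need for the boilerplate.
    return merge_recs_instructions((recommendations, []), ensure_communicating())
```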
Member:

I love that this is not everywhere anymore ❤️

@crusaderky crusaderky merged commit 4b24753 into dask:main Jun 26, 2022
@crusaderky crusaderky deleted the WSMR/EnsureCommunicatingAfterTransitions branch June 26, 2022 08:03
Development

Successfully merging this pull request may close these issues:

  - Alternatives for current ensure_communicating