Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate ensure_communicating transitions to new WorkerState event mechanism #5896

Closed
6 of 8 tasks
Tracked by #5895 ...
fjetter opened this issue Mar 3, 2022 · 2 comments · Fixed by #6165, #6332, #6347, #6371 or #6385
Closed
6 of 8 tasks
Tracked by #5895 ...

Migrate ensure_communicating transitions to new WorkerState event mechanism #5896

fjetter opened this issue Mar 3, 2022 · 2 comments · Fixed by #6165, #6332, #6347, #6371 or #6385
Assignees

Comments

@fjetter
Copy link
Member

fjetter commented Mar 3, 2022

The ensure_communicating transitions should be migrated to the WorkerState event mechanism as outlined in #5736 (comment)

  • The Worker._gather_dep method is broken up into
    • A method Worker._gather_remote_data which accepts an instruction indicating what keys to fetch, the remote worker address, etc. It performs the actual communication and returns a list of events about the outcome of the remote fetch (e.g. Success, DataMissing, RemoteBusy, Error, ...)
    • Handlers, e.g. Worker(State)._handle_gather_remote_busy that deal with the result of the fetch attempt
  • The ensure_communicating method is removed from the list of handle_comm every_cycle callbacks
  • The logic of ensure_communicating is moved to a private method (perspectively a method of the new class WorkerState)
  • The new private method will not perform any transitions but rather return a list of recommendations and/or instructions as necessary
  • This private method is then called as part of the transition system, e.g.
    • transition_flight_fetch
    • unpause
  • The implicit goal is to remove all invocations of self.loop.add_callback(self.gather_dep, ...) and replace this with the new callback method

Subtasks

Blocked by

Related work

@crusaderky crusaderky assigned crusaderky and unassigned fjetter Apr 19, 2022
mrocklin pushed a commit that referenced this issue Apr 22, 2022
Apparently some of these tests are not rock solid. 

E.g. in #6166

https://github.com/dask/distributed/runs/6101034638?check_suite_focus=true

Based on the traceback, it looks like the mock.patch was not properly removed after leaving the context manager. To rule this out and see if there is something else ongoing, I removed the mocks in favour of some event/lock/subclass magic.

I realise that these tests are not great and are extremely difficult to maintain. We can have a conversation about removing some of them but I would like to not have this conversation right now. I consider this new approach easier to reason about and at least slightly easier to maintain. 

As part of #5896 I very much hope that we can remove some, if not all of these tests and replace them with some much easier unit tests. I would like to have the conversation about removing code once we are in a position to replace them with something simpler. This is not in scope for this PR
@gjoseph92
Copy link
Collaborator

Is cancellation of gather_deps in scope for this? How does that fit in? Just curious.

@crusaderky
Copy link
Collaborator

crusaderky commented May 1, 2022

Is cancellation of gather_deps in scope for this? How does that fit in? Just curious.

It's out of scope.
gather_dep will become a task that is explicitly tracked by the worker. It should be straightforward to cancel the task when that happens. Of course many other things need to happen (sender-side cancellation, ignore spurious incoming data, etc. etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment