Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yank state machine out of Worker class #6566

Merged
merged 3 commits into from
Jun 14, 2022

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Jun 11, 2022

I've decided to break the PR in two: there will be a follow-up after this that enables the deprecation warning in the attributes, which in turn will require repetitive changes to thousands of unit test.

This PR is divided into three commits:

  1. Trivial cut-paste of methods and attributes from worker.py to worker_state_machine.py. Nothing intelligent happening here - everything was lifted as-is. You don't need to squint. Either a method/attribute has been moved as-is or it's untouched.
  2. Non-trivial changes to the same two modules. You should review this commit carefully.
  3. All other modules. You should review this as usual through the 'Files changed' tab.

Changes of note

  • All WorkerState data attributes are labelled as implementation details, meaning they may change without a deprecation cycle. Please discuss if you believe this is unreasonable for third-party developers.
  • Added new attribute WorkerState.running, which is a simplified mirror of Worker.status. WorkerState.running == (Worker.status == Status.running). Added new PauseEvent, symmetrical to the already existing UnpauseEvent.

Out of scope

@crusaderky crusaderky force-pushed the WSMR/the_great_yank branch 2 times, most recently from 62ffa83 to 4e882ab Compare June 11, 2022 23:04
@crusaderky crusaderky self-assigned this Jun 11, 2022
@crusaderky crusaderky force-pushed the WSMR/the_great_yank branch from 4e882ab to f953262 Compare June 11, 2022 23:35
@github-actions
Copy link
Contributor

github-actions bot commented Jun 12, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   6h 1m 54s ⏱️ - 16m 13s
  2 865 tests +  2    2 785 ✔️ +  2    80 💤 ±0  0 ±0 
21 224 runs  +14  20 284 ✔️ +15  940 💤  - 1  0 ±0 

Results for commit 82bcf8b. ± Comparison against base commit 0fcc724.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the WSMR/the_great_yank branch 4 times, most recently from 0eeffe1 to d45916f Compare June 13, 2022 13:28
@crusaderky crusaderky force-pushed the WSMR/the_great_yank branch from d45916f to ac28a32 Compare June 13, 2022 13:34
}
info["data"] = list(self.data)
info = {k: v for k, v in self.__dict__.items() if not k.startswith("_")}
info["data"] = dict.fromkeys(self.data)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For coherence with various equivalent operations in worker_state_machine:

  • WorkerState._to_dict
  • GatherDepSuccessEvent
  • ExecuteSuccessEvent

Comment on lines +165 to +174
.. currentmodule:: distributed.worker_state_machine

.. autoclass:: distributed.worker_state_machine.TaskState
:members:

.. autoclass:: distributed.worker_state_machine.WorkerState
:members:

.. autoclass:: distributed.worker_state_machine.BaseWorker
:members:
Copy link
Collaborator Author

@crusaderky crusaderky Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably be moved to a separate document as part of #5413.

@@ -609,7 +545,7 @@ def __init__(
profile_cycle_interval = parse_timedelta(profile_cycle_interval, default="ms")
assert profile_cycle_interval

self._setup_logging(logger)
self._setup_logging(logger, wsm_logger)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also add distributed.worker_memory.logger?

@@ -609,7 +545,7 @@ def __init__(
profile_cycle_interval = parse_timedelta(profile_cycle_interval, default="ms")
assert profile_cycle_interval

self._setup_logging(logger)
self._setup_logging(logger, wsm_logger)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also add distributed.worker_memory.logger?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest a follow up PR. I have thoughts on logging and would prefer this to be in a dedicated PR

if TYPE_CHECKING:
# TODO import from typing (requires Python >=3.10)
from typing_extensions import TypeAlias

# Circular imports
from distributed.actor import Actor
from distributed.diagnostics.plugin import WorkerPlugin
from distributed.worker import Worker
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only for the sake of the deprecation machinery


.. note::
The data attributes of this class are implementation details and may be
changed without a deprecation cycle.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMPORTANT

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this but if we encounter too many problems with this, we may need to reconsider

# assert self.waiting_for_data_count == waiting_for_data_count
for worker, keys in self.has_what.items():
def validate_state(self) -> None:
assert len(self.executing) >= 0
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this method are just a de-indentation.
Error control has remained in Worker.validate_state.

@crusaderky crusaderky force-pushed the WSMR/the_great_yank branch from ac28a32 to 82bcf8b Compare June 13, 2022 14:29
@crusaderky crusaderky marked this pull request as ready for review June 13, 2022 14:29
@crusaderky
Copy link
Collaborator Author

Ready for review and merge!

distributed/node.py Show resolved Hide resolved
Comment on lines +544 to +546
@dataclass
class PauseEvent(StateMachineEvent):
__slots__ = ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up topic: I'm wondering if we should move the definition of the events into another file to keep file sizes smaller. The event definition + import is already at almost 1k lines of code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?
Instructions + events definition amounts to 525 lines.
I explored changing the imports in worker.py to

import distributed.worker_state_machine as wsm

and add wsm. in front of everything, but I found it rather unpleasing to the eye.

Copy link
Collaborator Author

@crusaderky crusaderky Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we could do is move all the @_handle_event.register methods of WorkerState to methods of their respective events and then have them self-register to the WorkerState and Worker classes in a plugin style:

class StateMachineEvent:
    @abc.abstractmethod
    def handle(self, state: Workerstate) -> RecsInstrs:
        ...

class WorkerState:
    def _handle_event(self, ev: StateMachineEvent) -> RecsInstrs:
        return ev.handle(self)

At that point it would totally make sense to move the events to a separate file.
Should I investigate it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instructions + events definition amounts to 525 lines.
My counting also included the TaskState, exceptions, imports, etc. never mind

Should I investigate it?

I don't hate the idea but I am not sure if it is worth it. After all, the handlers are modifying the WorkerState more or less directly. I think we should not overdo it and stop at this point and see how it feels for a while before engaging in the next iteration of refactoring.
Right now, the events are mostly dataclasses and I think that's good for now


.. note::
The data attributes of this class are implementation details and may be
changed without a deprecation cycle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this but if we encounter too many problems with this, we may need to reconsider

async def set_resources(self, **resources) -> None:
for r, quantity in resources.items():
if r in self.total_resources:
self.available_resources[r] += quantity - self.total_resources[r]
self.state.available_resources[r] += quantity - self.total_resources[r]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up: this should be encapsulated and communicated via an event. After all, this could / should trigger transitions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""
if self.status not in WORKER_ANY_RUNNING:
return None

try:
self.log.append(("request-dep", worker, to_gather, stimulus_id, time()))
self.state.log.append(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up: I don't think the worker class should log anythign on the state. I think these messages should be removed entirely, see also #6442 (comment)

Comment on lines +93 to +141
expect = {
"address": "127.0.0.1.1234",
"busy_workers": [],
"constrained": [],
"data": {"y": None},
"data_needed": ["x"],
"data_needed_per_worker": {"127.0.0.1:1235": ["x"]},
"executing": [],
"in_flight_tasks": [],
"in_flight_workers": {},
"log": [
["x", "ensure-task-exists", "released", "s1"],
["x", "released", "fetch", "fetch", {}, "s1"],
["y", "put-in-memory", "s2"],
["y", "receive-from-scatter", "s2"],
],
"long_running": [],
"nthreads": 8,
"ready": [],
"running": True,
"stimulus_log": [
{
"cls": "AcquireReplicasEvent",
"stimulus_id": "s1",
"who_has": {"x": ["127.0.0.1:1235"]},
},
{
"cls": "UpdateDataEvent",
"data": {"y": None},
"report": False,
"stimulus_id": "s2",
},
],
"tasks": {
"x": {
"key": "x",
"priority": [1],
"state": "fetch",
"who_has": ["127.0.0.1:1235"],
},
"y": {
"key": "y",
"nbytes": 16,
"state": "memory",
},
},
"transition_counter": 1,
}
assert actual == expect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to change this to be less verbose. no need to block the PR, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this test though. It gives the reader a very good impression of what a much larger real-life dump looks like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but we'll need to change it for many unrelated reasons. I would like it better to have a developer documentation rendered by sphinx that would print this example instead of having it hard coded in place.

@fjetter fjetter merged commit 344868a into dask:main Jun 14, 2022
@crusaderky crusaderky deleted the WSMR/the_great_yank branch June 14, 2022 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Yank state machine out of Worker class
2 participants