Yank state machine out of Worker class #6566

crusaderky · 2022-06-11T22:59:13Z

Partially closes Yank state machine out of Worker class #6476

I've decided to split the PR in two: a follow-up PR will enable the deprecation warning in the attributes, which in turn will require repetitive changes to thousands of unit tests.

This PR is divided into three commits:

Trivial cut-paste of methods and attributes from worker.py to worker_state_machine.py. Nothing intelligent happening here - everything was lifted as-is. You don't need to squint. Either a method/attribute has been moved as-is or it's untouched.
Non-trivial changes to the same two modules. You should review this commit carefully.
All other modules. You should review this as usual through the 'Files changed' tab.

Changes of note

All WorkerState data attributes are labelled as implementation details, meaning they may change without a deprecation cycle. Please discuss if you believe this is unreasonable for third-party developers.
Added new attribute WorkerState.running, which is a simplified mirror of Worker.status: WorkerState.running == (Worker.status == Status.running).
Added new PauseEvent, symmetrical to the already existing UnpauseEvent.
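
The relationship between WorkerState.running and Worker.status can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Status` enum below is a hypothetical subset, and keeping the flag in sync through a property setter is an assumption made for brevity (the PR itself introduces PauseEvent / UnpauseEvent for this purpose).

```python
from enum import Enum


class Status(Enum):
    # Hypothetical subset of worker statuses, for illustration only
    running = "running"
    paused = "paused"
    closing = "closing"


class WorkerState:
    def __init__(self) -> None:
        # Simplified mirror of Worker.status
        self.running = False


class Worker:
    def __init__(self) -> None:
        self.state = WorkerState()
        self._status = Status.closing

    @property
    def status(self) -> Status:
        return self._status

    @status.setter
    def status(self, value: Status) -> None:
        # Assumption of this sketch: sync the flag whenever status changes
        self._status = value
        self.state.running = value == Status.running


w = Worker()
w.status = Status.running
print(w.state.running == (w.status == Status.running))
```

Whatever mechanism is used, the invariant stated above should hold after every status change.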

Out of scope

Enable deprecations (read above; in scope for Yank state machine out of Worker class #6476)
High level documentation (in scope for [DEV DOCS] Documentation of Scheduler and Worker state machine #5413)

github-actions · 2022-06-12T01:08:22Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±  0       15 suites ±0 6h 1m 54s ⏱️ - 16m 13s
  2 865 tests +  2   2 785 ✔️ +  2   80 💤 ±0 0 ❌ ±0
21 224 runs +14 20 284 ✔️ +15 940 💤 - 1 0 ❌ ±0

Results for commit 82bcf8b. ± Comparison against base commit 0fcc724.

♻️ This comment has been updated with latest results.

crusaderky · 2022-06-13T13:48:12Z

distributed/worker_memory.py

-        }
-        info["data"] = list(self.data)
+        info = {k: v for k, v in self.__dict__.items() if not k.startswith("_")}
+        info["data"] = dict.fromkeys(self.data)


For coherence with various equivalent operations in worker_state_machine:

WorkerState._to_dict

GatherDepSuccessEvent

ExecuteSuccessEvent
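
A minimal sketch of what the two added lines do, assuming a stand-in class: `MemoryManagerSketch` and its attributes are hypothetical and are not the real class in distributed/worker_memory.py. Only the two statements in `to_dict` are taken from the diff above.

```python
class MemoryManagerSketch:
    """Hypothetical stand-in used only to demonstrate the dump logic."""

    def __init__(self) -> None:
        self.memory_limit = 1024
        self._private = "hidden"
        self.data = {"x": b"payload", "y": [1, 2, 3]}

    def to_dict(self) -> dict:
        # Public attributes only: names starting with "_" are skipped
        info = {k: v for k, v in self.__dict__.items() if not k.startswith("_")}
        # Keep the keys of the stored data, but replace every value with None
        info["data"] = dict.fromkeys(self.data)
        return info


info = MemoryManagerSketch().to_dict()
print(info)
```

Compared with `list(self.data)`, the dump keeps the mapping shape of `data` while still leaving out the stored values.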

crusaderky · 2022-06-13T13:49:30Z

docs/source/worker.rst

+.. currentmodule:: distributed.worker_state_machine
+
 .. autoclass:: distributed.worker_state_machine.TaskState
   :members:

+.. autoclass:: distributed.worker_state_machine.WorkerState
+   :members:
+
+.. autoclass:: distributed.worker_state_machine.BaseWorker
+   :members:


These should probably be moved to a separate document as part of #5413.

crusaderky · 2022-06-13T14:11:50Z

distributed/worker.py

@@ -609,7 +545,7 @@ def __init__(
        profile_cycle_interval = parse_timedelta(profile_cycle_interval, default="ms")
        assert profile_cycle_interval

-        self._setup_logging(logger)
+        self._setup_logging(logger, wsm_logger)


Should I also add distributed.worker_memory.logger?

I suggest a follow-up PR. I have thoughts on logging and would prefer this to be in a dedicated PR.

crusaderky · 2022-06-13T14:22:04Z

distributed/worker_state_machine.py

 if TYPE_CHECKING:
    # TODO import from typing (requires Python >=3.10)
    from typing_extensions import TypeAlias

    # Circular imports
    from distributed.actor import Actor
    from distributed.diagnostics.plugin import WorkerPlugin
+    from distributed.worker import Worker


This is only for the sake of the deprecation machinery.
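
For readers unfamiliar with the pattern in the hunk above, here is a minimal, self-contained sketch of an import that exists only for type checkers. `fractions.Fraction` is an arbitrary stand-in for a class living in a module that would otherwise cause a circular import; `describe` is a hypothetical function.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by static type checkers only, never at runtime, so it
    # cannot trigger a circular import.
    from fractions import Fraction


def describe(value: "Fraction") -> str:
    # The annotation is a string, so the name is never looked up at runtime
    return f"{type(value).__name__}({value})"


print(describe.__annotations__["value"])
print("Fraction" in globals())
```

The quoted annotation is what allows the name to be absent at runtime.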

crusaderky · 2022-06-13T14:22:41Z

distributed/worker_state_machine.py

+
+    .. note::
+       The data attributes of this class are implementation details and may be
+       changed without a deprecation cycle.


IMPORTANT

I'm fine with this, but if we encounter too many problems with it, we may need to reconsider.

crusaderky · 2022-06-13T14:27:20Z

distributed/worker_state_machine.py

-            # assert self.waiting_for_data_count == waiting_for_data_count
-            for worker, keys in self.has_what.items():
+    def validate_state(self) -> None:
+        assert len(self.executing) >= 0


Changes in this method are just a de-indentation.
Error control has remained in Worker.validate_state.
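
The split described above can be sketched like this, under stated assumptions: the classes are reduced to the bare minimum, the `issubset` check is a made-up invariant for illustration, and the exact form of the error control in Worker.validate_state is not shown in this thread.

```python
import logging

logger = logging.getLogger(__name__)


class WorkerState:
    def __init__(self) -> None:
        self.executing: set = set()
        self.tasks: dict = {}

    def validate_state(self) -> None:
        # Pure invariant checks: no logging, no exception handling
        assert len(self.executing) >= 0
        # Hypothetical invariant, added only for this sketch
        assert self.executing.issubset(self.tasks)


class Worker:
    def __init__(self) -> None:
        self.state = WorkerState()

    def validate_state(self) -> None:
        try:
            self.state.validate_state()
        except Exception:
            # Error control stays on the Worker side (assumed form)
            logger.exception("Invalid worker state")
            raise


Worker().validate_state()  # a fresh state passes validation
```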

crusaderky · 2022-06-13T14:30:35Z

Ready for review and merge!

distributed/node.py

fjetter · 2022-06-13T15:51:51Z

distributed/worker_state_machine.py

+@dataclass
+class PauseEvent(StateMachineEvent):
+    __slots__ = ()


follow up topic: I'm wondering if we should move the definition of the events into another file to keep file sizes smaller. The event definition + import is already at almost 1k lines of code

?
Instructions + events definition amounts to 525 lines.
I explored changing the imports in worker.py to

import distributed.worker_state_machine as wsm

and add wsm. in front of everything, but I found it rather unpleasing to the eye.

What we could do is move all the @_handle_event.register methods of WorkerState to methods of their respective events and then have them self-register to the WorkerState and Worker classes in a plugin style:

    class StateMachineEvent:
        @abc.abstractmethod
        def handle(self, state: WorkerState) -> RecsInstrs:
            ...

    class WorkerState:
        def _handle_event(self, ev: StateMachineEvent) -> RecsInstrs:
            return ev.handle(self)

At that point it would totally make sense to move the events to a separate file.
Should I investigate it?
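
A runnable sketch of the proposal above, for discussion only. The `RecsInstrs` alias is a simplification, and the bodies of the two `handle` methods are invented for illustration; they do not reproduce the real handlers.

```python
import abc
from dataclasses import dataclass, field

# Hypothetical simplification of the real RecsInstrs type
RecsInstrs = tuple


class StateMachineEvent(abc.ABC):
    @abc.abstractmethod
    def handle(self, state: "WorkerState") -> RecsInstrs:
        ...


@dataclass
class PauseEvent(StateMachineEvent):
    stimulus_id: str

    def handle(self, state: "WorkerState") -> RecsInstrs:
        # Invented behaviour: flip the flag, no recommendations/instructions
        state.running = False
        return {}, []


@dataclass
class UnpauseEvent(StateMachineEvent):
    stimulus_id: str

    def handle(self, state: "WorkerState") -> RecsInstrs:
        state.running = True
        return {}, []


@dataclass
class WorkerState:
    running: bool = True
    stimulus_log: list = field(default_factory=list)

    def _handle_event(self, ev: StateMachineEvent) -> RecsInstrs:
        # The state no longer dispatches on the event type;
        # each event knows how to apply itself.
        self.stimulus_log.append(ev)
        return ev.handle(self)


ws = WorkerState()
ws._handle_event(PauseEvent(stimulus_id="s1"))
print(ws.running)
```

With this layout the events carry their own logic, so moving them to a separate module would not leave the handlers behind in WorkerState.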

Instructions + events definition amounts to 525 lines.
My count also included TaskState, exceptions, imports, etc. Never mind.

Should I investigate it?

I don't hate the idea but I am not sure if it is worth it. After all, the handlers are modifying the WorkerState more or less directly. I think we should not overdo it and stop at this point and see how it feels for a while before engaging in the next iteration of refactoring.
Right now, the events are mostly dataclasses, and I think that's good for now.

fjetter · 2022-06-13T15:53:06Z

distributed/worker_state_machine.py

+
+    .. note::
+       The data attributes of this class are implementation details and may be
+       changed without a deprecation cycle.


I'm fine with this, but if we encounter too many problems with it, we may need to reconsider.

fjetter · 2022-06-13T15:58:17Z

distributed/worker.py

    async def set_resources(self, **resources) -> None:
        for r, quantity in resources.items():
            if r in self.total_resources:
-                self.available_resources[r] += quantity - self.total_resources[r]
+                self.state.available_resources[r] += quantity - self.total_resources[r]


Follow-up: this should be encapsulated and communicated via an event. After all, this could / should trigger transitions.

fjetter · 2022-06-13T16:02:04Z

distributed/worker.py

        """
        if self.status not in WORKER_ANY_RUNNING:
            return None

        try:
-            self.log.append(("request-dep", worker, to_gather, stimulus_id, time()))
+            self.state.log.append(


Follow-up: I don't think the Worker class should log anything on the state. I think these messages should be removed entirely; see also #6442 (comment).

fjetter · 2022-06-14T08:41:14Z

distributed/tests/test_worker_state_machine.py

+    expect = {
+        "address": "127.0.0.1.1234",
+        "busy_workers": [],
+        "constrained": [],
+        "data": {"y": None},
+        "data_needed": ["x"],
+        "data_needed_per_worker": {"127.0.0.1:1235": ["x"]},
+        "executing": [],
+        "in_flight_tasks": [],
+        "in_flight_workers": {},
+        "log": [
+            ["x", "ensure-task-exists", "released", "s1"],
+            ["x", "released", "fetch", "fetch", {}, "s1"],
+            ["y", "put-in-memory", "s2"],
+            ["y", "receive-from-scatter", "s2"],
+        ],
+        "long_running": [],
+        "nthreads": 8,
+        "ready": [],
+        "running": True,
+        "stimulus_log": [
+            {
+                "cls": "AcquireReplicasEvent",
+                "stimulus_id": "s1",
+                "who_has": {"x": ["127.0.0.1:1235"]},
+            },
+            {
+                "cls": "UpdateDataEvent",
+                "data": {"y": None},
+                "report": False,
+                "stimulus_id": "s2",
+            },
+        ],
+        "tasks": {
+            "x": {
+                "key": "x",
+                "priority": [1],
+                "state": "fetch",
+                "who_has": ["127.0.0.1:1235"],
+            },
+            "y": {
+                "key": "y",
+                "nbytes": 16,
+                "state": "memory",
+            },
+        },
+        "transition_counter": 1,
+    }
+    assert actual == expect


I think we'll need to change this to be less verbose. No need to block the PR, though.

I really like this test though. It gives the reader a very good impression of what a much larger real-life dump looks like.

Yes, but we'll need to change it for many unrelated reasons. I would prefer developer documentation rendered by Sphinx that prints this example, instead of having it hard-coded in place.

crusaderky force-pushed the WSMR/the_great_yank branch 2 times, most recently from 62ffa83 to 4e882ab Compare June 11, 2022 23:04

crusaderky mentioned this pull request Jun 11, 2022

Yank state machine out of Worker class #6476

Closed

crusaderky self-assigned this Jun 11, 2022

crusaderky force-pushed the WSMR/the_great_yank branch from 4e882ab to f953262 Compare June 11, 2022 23:35

crusaderky force-pushed the WSMR/the_great_yank branch 4 times, most recently from 0eeffe1 to d45916f Compare June 13, 2022 13:28

Trivial cut-paste changes to worker and worker_state_machine

d757215

crusaderky force-pushed the WSMR/the_great_yank branch from d45916f to ac28a32 Compare June 13, 2022 13:34

crusaderky commented Jun 13, 2022

crusaderky added 2 commits June 13, 2022 15:28

Non-trivial changes to worker and worker_state_machine

c2e9448

Everything else

82bcf8b

crusaderky force-pushed the WSMR/the_great_yank branch from ac28a32 to 82bcf8b Compare June 13, 2022 14:29

crusaderky marked this pull request as ready for review June 13, 2022 14:29

crusaderky mentioned this pull request Jun 13, 2022

Cosmetic review of story() #6442

Merged

fjetter mentioned this pull request Jun 13, 2022

Alternatives for current ensure_communicating #6497

Closed

fjetter reviewed Jun 13, 2022

fjetter approved these changes Jun 14, 2022

fjetter reviewed Jun 14, 2022

fjetter merged commit 344868a into dask:main Jun 14, 2022

crusaderky deleted the WSMR/the_great_yank branch June 14, 2022 10:52

orf mentioned this pull request Jun 16, 2022

TypeError: not enough arguments for format string #6588

Closed
