(Worker) State Machine determinism and replayability #5736
A next step on this iteration with a bit of pseudo code. The expectation should be that the worker state is indeed its own synchronous class. Pulling all the state out into its own class allows us to write targeted unit tests without invoking any concurrent or asynchronous code. It also allows us to write more specific fakes or mocks as outlined above.

Interface to WorkerState

This will require us to specify the API in which the server and the state communicate, since the state will make decisions (execute/compute a result or fetch data from a remote worker) the server needs to act on. Similarly, responses from remote servers or the result of an execution/fetch need to be fed into the state to trigger transitions and calculate a new decision/instruction. In the following I will mark these events with classes called StateMachineEvent and the decisions with classes called Instruction. The following uses dataclasses to denote these events to improve readability of the signatures, but this is not a requirement for the implementation.

Selection of Instructions/Events

@dataclass
class Instruction:
instruction_id: str
@dataclass
class GatherDep(Instruction):
worker: str
to_gather: Collection[TaskState]
@dataclass
class FindMissing(Instruction):
...
@dataclass
class Execute(Instruction):
...
@dataclass
class SendMsg(Instruction):
payload: dict
# One of the following remote actions
# Messages emitted by the executor
# - task-erred; This is technically not emitted by the state machine but by the executor
# - reschedule; This one is also emitted by the executor
# - task-finished; Executor
# - release-worker-data; State machine / remove ACK/confirm
# - long-running; Basically a user trigger during execution
# - add-keys; Manual update_data, fix scheduler state if remove-replica fails (StateMachine),
@dataclass
class StateMachineEvent:
stimulus_id: str
@dataclass
class HandleComputeTask(StateMachineEvent):
key: str
who_has: dict
priority: tuple
@dataclass
class RemoveReplicas(StateMachineEvent):
keys: Collection[str]
@dataclass
class GatherDepSuccess(StateMachineEvent):
data: object
@dataclass
class GatherDepError(StateMachineEvent):
worker: str
exc: Exception
@dataclass
class GatherDepMissing(StateMachineEvent):
key: str
@dataclass
class Pause(StateMachineEvent):
...
@dataclass
class UnPause(StateMachineEvent):
...
@dataclass
class GatherBusy(StateMachineEvent):
attempt: int
@dataclass
class RemoteDead(StateMachineEvent):
worker: str
exception: Exception
@dataclass
class ExecuteSuccess(StateMachineEvent):
ts: TaskState
data: object
@dataclass
class ExecuteFailure(StateMachineEvent):
ts: TaskState
exception: Exception
@dataclass
class Reschedule(StateMachineEvent):
    ...

Embedding of WorkerState into Worker server / Asyncio dispatch

Given the set of events and instructions above, the state machine itself can be implemented as a purely synchronous class while the server dispatches its instructions.

WorkerState pseudo code

_TRANSITION_RETURN = tuple[dict, Collection[Instruction]]
class WorkerState:
data: SpillBuffer
state_event_log: list[StateMachineEvent]
tasks: dict[str, TaskState]
def handle_stimulus(self, event: StateMachineEvent) -> Collection[Instruction]:
self.state_event_log.append(event)
if isinstance(event, RemoveReplicas):
return self._handle_remove_replicas(event)
else:
raise RuntimeError(f"Unknown stimulus {event}")
def _handle_remove_replicas(self, event: RemoveReplicas) -> Collection[Instruction]:
# This is where the current transition logic would reside and the
# transition chain would be triggered. No new logic other than
# unwrapping the event dataclasses if we were to use them
...
def handle_compute_task(self, event: HandleComputeTask) -> Collection[Instruction]:
...
# This is effectively the logic of ``ensure_communicating`` but the
# transition logic is separated from coroutine scheduling. It is also only
# called during transition OR when unpausing
def _ensure_communicating(self, stimulus_id: str) -> _TRANSITION_RETURN:
recommendations = {}
instructions = []
# This is pseudo/simplified code of what is currently in
# ensure_communicating
while self.data_needed and (
len(self.in_flight_workers) < self.total_out_connections
or self.comm_nbytes < self.comm_threshold_bytes
):
next_ = self.data_needed.pop()
worker = self.worker_to_fetch_from(next_)
to_gather = self.select_keys_for_gather(worker, next_)
recommendations.update({k: ("flight", worker) for k in to_gather})
instructions.append(GatherDep(stimulus_id, worker, to_gather))
return recommendations, instructions
def transition_released_fetch(
self, ts: TaskState, *, stimulus_id: str
) -> _TRANSITION_RETURN:
for w in ts.who_has:
self.pending_data_per_worker[w].push(ts)
ts.state = "fetch"
ts.done = False
self.data_needed.push(ts)
# The current released->fetch does never return any content
return self._ensure_communicating(stimulus_id)
async def memory_monitor(self):
def check_pause(memory):
if unpaused: # type: ignore
self.handle_stimulus(UnPause())
        check_pause(42)

As can be seen from the pseudo code above, the logic currently implemented would only be shifted around slightly but not fundamentally changed. Since every event is logged and this is the only way to modify the state, this allows us to deterministically reconstruct any state given the input events.

Worker server code

Embedding this API into a working server class will require us to mix sync and async tasks and to deal with calling the state API and dispatching the received instructions. The ambition should be to isolate any interaction with an asynchronous actor as well as possible and hide it behind an internal API to allow for easier mocking/faking, e.g. by defining abstract methods for these interactions.

Pseudo code for Worker server code (base class)

class WorkerBase(abc.ABC):
batched_stream: BatchedSend
state: WorkerState
instruction_history: list[StateMachineEvent]
def _handle_stimulus_from_future(self, fut):
try:
stim = fut.result()
except Exception:
# This must never happen and the handlers must implement exception handling.
# If we implement error handling here, this should raise some exception that
# can be immediately filed as a bug report
raise
for s in stim:
self._handle_stimulus(s)
def _handle_stimulus(self, stim: StateMachineEvent):
self.instruction_history.append(stim)
instructions = self.state.handle_stimulus(stim)
for inst in instructions:
task = None
# TODO: collect all futures and await/cancel when closing?
if isinstance(inst, GatherDep):
task = asyncio.create_task(self._gather_data(inst))
elif isinstance(inst, Execute):
task = asyncio.create_task(self.execute(inst))
elif isinstance(inst, SendMsg):
self.batched_stream.send(inst.payload)
else:
raise RuntimeError("Unknown instruction")
if task:
task.add_done_callback(self._handle_stimulus_from_future)
@abc.abstractmethod
async def execute(self, inst: Execute) -> Collection[StateMachineEvent]:
raise NotImplementedError
@abc.abstractmethod
async def _gather_data(self, inst: GatherDep) -> Collection[StateMachineEvent]:
        raise NotImplementedError

Given the above base class, we can then implement the remote interaction or a fake that emulates the test condition we're looking at.

class Worker(WorkerBase):
async def _gather_data(self, inst: GatherDep) -> Collection[StateMachineEvent]:
try:
            response = await get_data_from_worker(inst)
# We might want to update *server* related state, like bandwidth measurements, etc. here
# *no* task state specific updates are allowed here, i.e. no interaction to `WorkerBase.state`
if response["status"] == "busy":
return [GatherBusy("new-stim", inst.worker)]
except Exception as exc:
return [GatherDepError("new-stim", inst.worker, exc)]
return [GatherDepSuccess("new-stim", inst.worker, response)]
class FakeWorker(WorkerBase):
"""This class can be used for testing to simulate a busy worker A"""
async def _gather_data(self, inst: GatherDep) -> Collection[StateMachineEvent]:
if inst.worker == "A":
return [GatherBusy("foo", 2)]
else:
            return [GatherDepSuccess("foo", "bar") for _ in inst.to_gather]

Reconstruct state based on history

Given all of the above, it should also easily be possible to reconstruct a WorkerState from its recorded event history.

async def get_worker_from_history(history: Collection[StateMachineEvent]):
state = WorkerState()
for event in history:
_ = state.handle_stimulus(event) # ignore instructions
# This _should_ now be a functional worker at the given time
    return await Worker(..., state=state)  # type: ignore
A few thoughts:
If we are going to do something in the scheduler, it would be interesting to think about designing both at the same time, or at least making sure that what happens with one can also be applied to the other. It would be nice to have fewer systems if that's possible.
My intention is to establish such a protocol / interface. Right now we are forced to work with mocks. For instance, below is a quite horrible but necessary example, since we do not have a clean interface: locks, events, mocks, fragile waits. All of this structure is attempting to create a situation where the internals are in a very specific intermediate state (see distributed/distributed/tests/test_worker.py, lines 3579 to 3633 at 363d2bc).
With this proposal we can instead provide a list of input events and assert on output events. I do not want to advocate for widespread mocking. Instead I want to establish an internal, narrow API which can be used to implement instrumentation and, if necessary, mocking. Mocking is harmful if done for unstable or ill-defined APIs, but as long as the behaviour of this API is well understood, mocking is relatively safe. FWIW, the gather_dep implementation was connected to most deadlocks I debugged over the past months.
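To make that concrete, here is a minimal sketch of what such a test could look like. It uses only names from the pseudo code above (WorkerState, HandleComputeTask, GatherDep are the proposed interface, not an existing API, and the exact constructor arguments are assumptions):

def test_compute_task_with_missing_dependency_triggers_gather():
    # Sketch only: feed one input event into the proposed synchronous state
    # machine and assert on the emitted instructions. No event loop, no
    # mocking, no sleeps.
    state = WorkerState()
    instructions = state.handle_stimulus(
        HandleComputeTask(
            stimulus_id="compute-1",
            key="y",
            who_has={"x": ["ws://worker-a"]},  # dependency x lives on another worker
            priority=(0,),
        )
    )
    assert any(
        isinstance(inst, GatherDep) and inst.worker == "ws://worker-a"
        for inst in instructions
    )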
I agree that I haven't properly put down requirements, constraints, etc., but I would argue that my pseudo code in the drop-downs of #5736 (comment) is close to working, assuming the logic were filled in. The primary reason why this is not a PR is the many tests we have that currently mock internals; dealing with them will require a bit of time.
What I'm proposing is that the worker state machine has the following public interface:

class WorkerState:
    all_possible_events: list[type[StateMachineEvent]] = [
GatherDep,
ComputeTask,
RemoveReplicas,
ExecuteSuccess,
Pause,
...
]
    all_instructions: list[type[Instruction]] = [
Execute,
GatherDep,
SendMsg,
FindMissing,
...
]
data: SpillBuffer
state_event_log: list[StateMachineEvent]
tasks: dict[str, TaskState] # Maybe private?
def handle_stimulus(self, event: StateMachineEvent) -> Collection[Instruction]:
        ...

All current transitions, stimulus handlers, etc. will be encapsulated. We might want to expose a bit more for instrumentation, but in terms of functionality everything else should be a black box. In fact, I'm not even sure the rest of the worker even needs to know about the tasks dict. An important thing I might not have stressed enough: this design would effectively allow us to replace big parts of the cluster_dump functionality in favour of a dump of the recorded state machine events.
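As a rough illustration of that point, a sketch under the assumption that the event log is plain, picklable data (none of these helpers exist today):

import pickle

def dump_events(state: WorkerState, path: str) -> None:
    # The whole "state dump" shrinks down to serializing the event log
    with open(path, "wb") as f:
        pickle.dump(state.state_event_log, f)

def rebuild_state(path: str) -> WorkerState:
    # Replaying the events deterministically reconstructs the state
    with open(path, "rb") as f:
        events = pickle.load(f)
    state = WorkerState()
    for event in events:
        state.handle_stimulus(event)  # returned instructions are ignored
    return state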
The control flow of the scheduler is already much better organized and implemented using the stimulus/event handler pattern; that's basically the prerequisite of this proposed design. I haven't thought much more about the internals, but I assume we will stick to the transitions/recommendations approach, and the worker/scheduler may share this piece of logic if we want to. However, everything else will very likely need to be different since the state is described fundamentally differently.
I'm happy to iterate further. Do you have suggestions on how, i.e. on the format? Would you prefer a PR?
Ah, I missed the summary blocks before. Another question: I was surprised to see data: SpillBuffer as part of the synchronous state.
Edit: I think keeping a synchronous interface is possible for setitem since a setter should not care if the data is on disk or in memory. Having the same for getitem is not possible for obvious reasons; we'd still need to unblock the event loop 🤦 xref #4424
Sync setitem might also not work if you want backpressure. If all you're doing is contains checks, though, then that could make sense. No need to block on this, it just seemed off. I agree though that this isn't likely to block 99% of the design and so we should move forward.
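For reference, the event-loop concern could be worked around without changing the mapping interface by offloading the potentially blocking read to a thread. A minimal sketch, not what the SpillBuffer does today:

import asyncio

async def get_value(data, key):
    # data is the opaque MutableMapping (e.g. a SpillBuffer); __getitem__ may
    # block on disk I/O, so run it in a thread to keep the event loop responsive.
    return await asyncio.to_thread(data.__getitem__, key)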
If we want to implement something like this incrementally, my suggestion on how to split this effort up is as follows
A bunch of observations in advance of our meeting later today:

Rebuild from log

I don't think it's possible to rebuild a WorkerState from a list of StateMachineEvent, unless such a list never expires (which would be a memory leak). For Worker-wide events, this is not a trivial operation. Even for single-key events, this is far from trivial. Example:
The only way I can think of to track the above is, in the handler of RemoveReplicas, to note down that the existence of b changed the decision for a, therefore from now on all the events of b are relevant to a for the purpose of rebuilding the WorkerState. This can quickly translate into large branches of the whole dask graph being kept on the worker for as long as any of the keys of that graph are on the worker. The alternative to all this is to educate our users that, before their computation hangs, they have to start the cluster with a config flag.

memory_monitor and SpillBuffer

I think that memory_monitor and the construction of the SpillBuffer should be moved to a WorkerExtension, and that could be a first (simple) PR as the whole thing is fairly self-contained. This would highlight all the contact points between the spill mechanism and the worker state (namely, switching between Status.paused and Status.running). After we break out the WorkerState, data should just be an opaque MutableMapping created by the Worker and passed to it. Regarding the possibility of asynchronous spilling: as @fjetter was pointing out, spill-specific calls could still reach the concrete type through a cast, e.g.

cast(SpillBuffer, self.state.data).clear_all_dangling_tasks()

Info from the Worker

In the mock above, WorkerState._ensure_communicating is accessing a wealth of attributes that have no right to be in WorkerState:

len(self.in_flight_workers) < self.total_out_connections
or self.comm_nbytes < self.comm_threshold_bytes

Is this info copied over from the Worker?
I understand the limitation of this reconstruction attempt. Right now we're logging 100k transitions on the worker side, and I also don't mind telling users that these exceptional situations require additional configuration. Also, where I consider this reconstruction to actually be used is in unit tests.
To address this we chose to go for a small refactoring that moves the memory monitor and spill buffer into an extension to encapsulate the logic there, see #5891. This extension will eventually also emit events to the state machine, like Pause and UnPause.
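A rough sketch of how such an extension could translate memory pressure into state machine events. This is illustrative only; MemoryMonitorExtension and its threshold handling are assumptions, not the actual #5891 implementation:

class MemoryMonitorExtension:
    def __init__(self, worker, pause_fraction: float = 0.8):
        self.worker = worker
        self.pause_fraction = pause_fraction
        self.paused = False

    def check(self, memory_fraction: float) -> None:
        # Emit events to the state machine instead of mutating worker state directly
        if memory_fraction >= self.pause_fraction and not self.paused:
            self.paused = True
            self.worker._handle_stimulus(Pause(stimulus_id="memory-pressure"))
        elif memory_fraction < self.pause_fraction and self.paused:
            self.paused = False
            self.worker._handle_stimulus(UnPause(stimulus_id="memory-relieved"))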
These are actually great examples of where our current attributes are semantically slightly misleading. Taking as an example the tuple in_flight_workers / total_out_connections:
They technically do not limit any connections, nor do they track "in-flight workers"; rather, they control how many transitions we are performing, so the phrasing is misleading. Renaming them should already be sufficient to sharpen their meaning and definition and would allow us to define them as internal attributes of the WorkerState.
The same exercise can be done with comm_nbytes / comm_threshold_bytes.
Again, these are not actually connected to the comm, or even to the real sizes we're submitting, but are rather a measure of what we estimate and make a decision on. This is a great example of how this refactoring can help with readability, since in the end our concurrency control is not coupled to a ThreadPool or a CommPool limit but rather based exclusively on what kind of task transitions we are allowing.
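To illustrate the renaming idea (the names below are suggestions only, nothing decided; the point is that these counters live inside the synchronous WorkerState and merely throttle how many fetch transitions may happen at once):

class WorkerState:
    # formerly total_out_connections / in_flight_workers
    transfer_incoming_count_limit: int = 10
    transfer_incoming_count: int = 0
    # formerly comm_threshold_bytes / comm_nbytes
    transfer_incoming_bytes_limit: int = 100_000_000
    transfer_incoming_bytes: int = 0

    def _allow_more_fetch_transitions(self) -> bool:
        # Pure bookkeeping: no comm pool or connection limit involved
        return (
            self.transfer_incoming_count < self.transfer_incoming_count_limit
            or self.transfer_incoming_bytes < self.transfer_incoming_bytes_limit
        )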
This is now in. Thanks everyone for participating in the design and development. Special thanks to @crusaderky who did most of the heavy lifting! The primary effort was to address deadlocks. Over the course of the implementation we expanded test coverage significantly. We don't expect any more deadlocks to pop up, but we'll now be in a much better position to deal with them. If anybody is still struggling with this, please open a ticket and we will do our best to help you with it. We didn't implement any utilities for the actual "replayability", but for those who are interested in how this looks, see it in action in coiled/benchmarks#295 (comment) ❤️
Connected tasks

- ensure_computing transitions to new WorkerState event mechanism #5895
- ensure_communicating transitions to new WorkerState event mechanism #5896
- Client.story - Support collecting cluster-wide story for a key or stimulus ID #5872

The dask scheduling logic on the scheduler and worker side uses the model of a finite state machine to calculate decisions. A few definitions about a finite state machine first:
- A finite state machine is a mathematical model of computation describing an abstract machine that has a finite number of distinct states.
- Given a stimulus S_i and a state W_i, there is a function F such that a new state W_i+1 can be calculated as F(W_i, S_i) -> W_i+1.
- The only way to modify a state W_i is to apply the transformation F with a stimulus S_i.
- Given the initial state W_0 and the sequence of all stimuli S_i, it is possible to calculate state W_i by applying the transformation F sequentially for all i.
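As a minimal, dask-independent illustration of these definitions:

from typing import Callable, Iterable, TypeVar

W = TypeVar("W")  # state
S = TypeVar("S")  # stimulus

def replay(f: Callable[[W, S], W], w0: W, stimuli: Iterable[S]) -> W:
    # W_i+1 = F(W_i, S_i); folding over all stimuli reconstructs any W_i
    w = w0
    for s in stimuli:
        w = f(w, s)
    return w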
How does this model apply to us and where do we violate it?
Most of these arguments can be made for the scheduler as well but we'll restrict
ourselves to the Worker to keep the scope contained.
The worker state is defined primarily by the Worker.tasks dictionary, including TaskState objects with various task specific attributes. On top of this, there are a few Worker attributes which hold global or remote-worker specific attributes. A few examples include Worker.data_needed, Worker.has_what, Worker.pending_data_per_worker, Worker.ready, but the list goes on. We currently do not properly distinguish the state machine attributes from the server / networking code / other code.
The function F is a bit more difficult to define. Naively one would expect this to be Worker.transitions, but this is not the case since it does not accept stimuli. Worker.transitions, in short T, accepts a set of already made decisions we call recommendations. The recommendations are generated by stimulus handlers H, like Worker.handle_compute_task or Worker.handle_free_keys. Therefore, to define the state transition function we need a combination of H and T, M ~ T * H, such that W_i+1 = M(W_i, S_i) = T(W_i, H(W_i, S_i)). Our implementation of handlers introduces a certain ambiguity since it is not entirely clear whether a piece of logic should reside on the side of the handler or the transition function.
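Schematically (H and T below are placeholders for a stimulus handler and Worker.transitions, passed in only to keep the sketch self-contained; this is not runnable dask code):

def M(state, stimulus, H, T):
    recommendations = H(state, stimulus)  # e.g. handle_compute_task -> recommendations
    return T(state, recommendations)      # apply recommendations -> W_i+1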
However, every decision should be the result of the stimulus and the current state
such that, given all stimuli in order and the initial state, we can reconstruct every iteration.
There are three (actually four) places where this pattern is violated
and the stimulus generation is not only tightly coupled to the handling and transition itself
but also coupled to asynchronous actors.
Specifically, dependency gathering (Worker.gather_dep) but also, to a lesser extent, task execution (Worker.execute) break the above pattern since they simultaneously generate stimuli and handle them while interacting with an async actor (i.e. a remote worker or a threadpool). There is no way to inspect, assert on or pause the state machine naturally. This prohibits writing effective tests, increases instability and renders deterministic replayability impossible.
Even worse are the ensure_communicating and ensure_computing methods, which are triggered in various places of our code to work off the queue of ready-to-compute / ready-to-be-fetched tasks. This pattern effectively delays state transitions and performs a set of these transitions in bulk, which is beneficial to dependency fetching. However, they are called recursively (actually rather pseudo-recursively, in the context of ensure_communicating -> async gather_dep -> ensure_communicating). Pseudo code below.
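A heavily simplified sketch of this current pattern, for illustration only; helper names like has_capacity and select_keys_for_gather are stand-ins and do not match the real signatures:

def ensure_communicating(self):
    while self.data_needed and self.has_capacity():
        worker, to_gather = self.select_keys_for_gather()
        # schedule a coroutine which will eventually call back into this method
        self.loop.add_callback(self.gather_dep, worker, to_gather)

async def gather_dep(self, worker, to_gather):
    response = await get_data_from_worker(worker, to_gather)
    self.update_state(response)   # stimuli generated and handled in place
    self.ensure_communicating()   # pseudo-recursive re-entry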
Problems
- The current pattern requires us to have the infinite callback ensure_* to check periodically if something changed that would allow us to fetch a new dependency. Instead, this should be a stimulus to the state machine as well, since this change in something is always connected to a stimulus somehow.
- It is impossible to inspect the state machine at a given iteration i since it is always moving. We frequently need to deal with an intermediate state (one of the motivations for the states resumed and cancelled), ultimately resulting in an unstable implementation.
- Since we only log the outcome of stimuli as part of the transition log, we can sometimes only guess what the trigger was. For some stimulus handlers we do log the incoming stimulus as part of the transition log, e.g. the only stimulus body of handle_free_keys is a list of keys, which is what we append to our log. For more complicated handlers like handle_compute_task we do not do this. If we start doing this, we should ensure not to log too much information and restrict the logged info to what is relevant to the state machine; e.g. TaskState.runspec is not relevant to the state machine and we should therefore not remember it, to reduce the memory footprint.
- Without the ability to halt the state machine at a given iteration i, it is impossible to write proper deterministic tests for the state machine. Instead, we rely on sophisticated mocking to construct a very timing sensitive intermediate state.
Given the state machine is extended sufficiently with the information to make the capacity decision, the ensure_* loops can be removed such that the entire state machine can be calculated deterministically and synchronously. Every interaction with an asynchronous actor will be mapped as a set of out-/inbound stimuli.
This will allow us to write deterministic, synchronous unit tests for the state machine and to reconstruct the state at any iteration i by replaying the logged signals, e.g. for debugging.
[1] Logging all stimuli is actually cheaper than logging all transitions since one stimulus usually triggers many, many recommendations and transitions