RFC Sign every compute task with a unique counter to correlated responses #7372

fjetter · 2022-12-06T14:31:39Z

This was proposed in the context of shuffle resilience, see #6105 and primarily #7353

This introduces a new unique counter in the TaskState objects that counts attempts of assigning the task to a worker. This attempt counter is unique and is used to sign the compute-task message. The Worker is expected to respond sign the response accordingly. The scheduler in turn can use this to correlate task-finished messages with the sent compute messages and can intentionally reject/accept a response.

In an ancient version of this code base we had a similar check at this position where we verified that a task-finished message would originate from the worker the task is processing_on. I don't exactly remember what happened to this check but I cannot find it any longer. This is a different take on this but practically the same functionality.

fjetter · 2022-12-06T14:35:16Z

I realize this may not be entirely sufficient for the purposes of shuffling since the attempt counter on worker side may be bumped by a ComputeTaskEvent after the task finished on the threadpool but before the response was generated. Will need to think a bit more about this

fjetter · 2022-12-06T14:55:48Z

The current implementation would not be sufficient for my shuffle requirements due to the cancelled->resumed logic. Interestingly, we may be able to drop cancelled(exceuting) and the resumed state entirely with this counter. Will follow shortly with some explanation...

github-actions · 2022-12-06T15:34:22Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      19 files -         3       19 suites - 3 13h 45m 16s ⏱️ + 3h 57m 12s
  3 295 tests +        3   2 930 ✔️ -     276     90 💤 +    5   267 ❌ +  267   8 🔥 +  7
21 781 runs - 14 352 19 116 ✔️ - 15 456 1 055 💤 - 505 1 571 ❌ +1 571 39 🔥 +38

For more details on these failures and errors, see this check.

Results for commit 4ef8975. ± Comparison against base commit 9d8e18c.

♻️ This comment has been updated with latest results.

fjetter · 2022-12-06T16:31:29Z

The gist of removing cancelled/resumed:

At the very least for the execute call this attempt counter would allow us to immediately release a task even if it is still on the threadpool. The threadpool would still be blocked but the task state can be released immediately assuming the Worker.execute verifies that the attempt counter is identical to what it was before it started executing.

If it is identical, the task would be transitioned to successful as usual. If the counter shows any drift, the worker throws away the result, regardless of success or failure and simply issues a "new" ExecuteReleaseSlot event suggesting to the state machine that something new can be scheduled. Of course, this assumes that a executing->released transition would specifically not do this.
I believe this would allow us to replace the cancelled->resumed transition branch entirely in favor of a semantically much simpler approach.

Regarding the cancelled(flight) a similar pattern could be implemented but the WSM would need to manage the same counters.

(FYI instead of counters, any arbitrary unique identifiers can be used. ordering is not important but counters are simple and fast)

fjetter · 2022-12-08T14:56:16Z

#7356 suggested to use the transition counter to sign the task. That would also work, we don't necessarily need to know about the "attempt", i.e. that's a bit less state on the scheduler

This fixes `test_deadlock_resubmit_queued_tasks_fast`. The problem: - an "old" TaskState `f-0` is forgotten and removed from `queued` HeapSet. Internally, this removes it from a set, but a weakref is still on the heap - a new TaskState `f-0` is added to the HeapSet. It's pushed onto the heap, but since its priority is higher (newer generation), it comes after the old `f-0` object that's still on the front of the heap - `pop` pops old `f-0` off the heap. The weakref is still alive (for whatever reason). `value in self._data` (the set) is True, because the _hash_ of old `f-0` and new `f-0` are the same (same key). So we return the stale, old TaskState object. Much like `WorkerState`s, there should be exactly 1 `TaskState` instance per task. If there are multiple instances with the same key, they are different tasks, and have different state. xref dask#7372, dask#7356.

hendrikmakait · 2023-01-04T17:23:14Z

distributed/scheduler.py

@@ -1354,6 +1362,7 @@ def __init__(self, key: str, run_spec: object, state: TaskStateState):
        self.metadata = {}
        self.annotations = {}
        self.erred_on = set()
+        self._attempt = 0


IIUC, this means that if we re-create a task, e.g. by re-running a workload after it has been forgotten, its attempt count would be reset to 0. This would mean that the attempt counter is not unique. We should use a global counter instead that we also actively increment. transition_counter would be one possible candidate if we make sure to increment it every time we update _attempt.

sure but I'd use a different name for this then since it's no longer an attempt

fjetter mentioned this pull request Dec 6, 2022

Automatically restart P2P shuffles that were aborted due to leaving workers #7353

Closed

fjetter mentioned this pull request Dec 8, 2022

Issues with tasks completing on workers after being released and re-submitted #7356

Open

hendrikmakait mentioned this pull request Jan 4, 2023

Reset state of ShuffleSchedulerExtension on restart #7446

Merged

2 tasks

fjetter added 2 commits January 4, 2023 16:39

Sign every compute task with a unique counter to correlated responses

a6d9cc7

Fix attempt before dispatching to threadpool

4ef8975

fjetter force-pushed the compute_task_unique_counter branch from 79d9e65 to 4ef8975 Compare January 4, 2023 15:39

hendrikmakait self-assigned this Jan 4, 2023

hendrikmakait reviewed Jan 4, 2023

View reviewed changes

hendrikmakait mentioned this pull request Jan 9, 2023

Sign every compute task with run ID to correlate response #7463

Merged

2 tasks

fjetter closed this Jan 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC Sign every compute task with a unique counter to correlated responses #7372

RFC Sign every compute task with a unique counter to correlated responses #7372

fjetter commented Dec 6, 2022

fjetter commented Dec 6, 2022

fjetter commented Dec 6, 2022

github-actions bot commented Dec 6, 2022 •

edited

Loading

fjetter commented Dec 6, 2022

fjetter commented Dec 8, 2022

hendrikmakait Jan 4, 2023

fjetter Jan 5, 2023

RFC Sign every compute task with a unique counter to correlated responses #7372

RFC Sign every compute task with a unique counter to correlated responses #7372

Conversation

fjetter commented Dec 6, 2022

fjetter commented Dec 6, 2022

fjetter commented Dec 6, 2022

github-actions bot commented Dec 6, 2022 • edited Loading

Unit Test Results

fjetter commented Dec 6, 2022

fjetter commented Dec 8, 2022

hendrikmakait Jan 4, 2023

Choose a reason for hiding this comment

fjetter Jan 5, 2023

Choose a reason for hiding this comment

github-actions bot commented Dec 6, 2022 •

edited

Loading