Document Scheduler and Worker state machine #6948
6751e13 to 47e775f
47e775f to 3ff0cfc
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 6h 17m 26s ⏱️ - 11m 42s
For more details on these failures, see this check.
Results for commit cdbcac5. ± Comparison against base commit 6a1b089.
♻️ This comment has been updated with latest results.
@martindurant @jakirkham you might be interested
Excellent! Very clear and useful documentation; this would be great to read as a new developer interested in working on the Worker. All comments are just naming/grammar nits.
@@ -5,15 +5,12 @@ digraph{
];
released1 [label=released];
released2 [label=released];
new -> released1;
released1 -> waiting;
FYI, merge conflict with #6614 for this file
and only when the message reaches the worker it will be released there too.

Flow control
Reading this section of the docs makes me so happy with the design of the worker state machine. Having state transformation strongly separated out from IO and concurrency like this is so nice, and such a big improvement. Nice work!
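To illustrate the separation praised above, here is a minimal sketch of a state machine that only transforms state and returns instructions, leaving all IO to the caller. `ToyState`, `handle_event`, and the event and instruction names are hypothetical and simplified for illustration; they are not the API of `distributed`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Hypothetical, simplified worker state: task key -> state name."""

    tasks: dict = field(default_factory=dict)

    def handle_event(self, event):
        """Apply one (kind, key) event and return a list of instructions.

        This method never performs IO; the caller decides how and when
        to act on the returned instructions.
        """
        kind, key = event
        if kind == "compute":
            self.tasks[key] = "executing"
            return [("execute", key)]
        if kind == "done":
            self.tasks[key] = "memory"
            return [("notify-scheduler", key)]
        if kind == "free":
            self.tasks.pop(key, None)
            return []
        raise ValueError(f"unknown event kind: {kind}")


state = ToyState()
first = state.handle_event(("compute", "x"))
second = state.handle_event(("done", "x"))
```

Because `handle_event` is synchronous and free of side effects outside `ToyState`, a test can feed it events and compare the returned instructions without starting an event loop or a network connection.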
This is a great document to have available for reference. I have a couple of high-level thoughts before getting into detail. None of these mean I am requesting changes in the implementation.
Some diagrams are disjoint, which makes them confusing to follow. For example, the big diagram at the top of Computing shows rescheduled->released->forgotten, but two diagrams later we see that ERROR and MEMORY have exactly the same paths.
I would change some names to make them clearer, if more verbosity is allowed. Something like
I would add a clear and specific definition and consequence of every state. Some of this is in TaskStates, but I would add specific details about the data structures affected.
Why is the initial state of any task apparently RELEASED?
There is no mention anywhere of Actors, even though a lot of code is dedicated to them.
I also just realized I don't think you mention the
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
I overhauled the docstring of the flag.
Because you'd have a task in
Additionally, when such a task fails, you need to try doing what the scheduler originally asked for. This is actually the norm when a worker dies and the scheduler notices before its peers:
This use case is also documented in the sphinx documents linked above.
It's a lot more efficient to have two different pipelines for tasks with resources and tasks without, so that a task without resources is not blocked by tasks with.
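A minimal sketch of the two-pipeline idea, assuming plain FIFO queues instead of priority ordering. `next_task`, `ready`, `constrained`, and `available` are hypothetical names for this illustration, not the worker's actual data structures.

```python
from collections import deque


def next_task(ready, constrained, available):
    """Return the key of the next runnable task, or None.

    ready:       deque of keys for tasks without resource requirements
    constrained: deque of (key, {resource: amount}) for tasks with them
    available:   dict of resource name -> amount currently free (mutated)
    """
    if constrained:
        key, needs = constrained[0]
        if all(available.get(r, 0) >= n for r, n in needs.items()):
            constrained.popleft()
            for r, n in needs.items():
                available[r] -= n
            return key
    # The head of the constrained queue cannot run right now, but that
    # must not block tasks that need no resources at all.
    if ready:
        return ready.popleft()
    return None


ready = deque(["a"])
constrained = deque([("g", {"GPU": 1})])
available = {"GPU": 0}
picked = next_task(ready, constrained, available)
```

With a single shared queue, `"g"` at the head would stall `"a"` until a GPU became free; with two queues, `"a"` runs immediately and `"g"` simply waits its turn.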
reschedule causes the task to be immediately forgotten on the worker and released on the scheduler, which restarts its life-cycle.
Yes, this is on purpose, to highlight how reschedule immediately transitions to released and forgotten, while error/memory won't transition to released until the scheduler asks them to. I updated the diagrams and the "Forgetting tasks" section.
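The difference can be sketched as a toy function that returns the chain of states a task passes through on the worker. `worker_path` and its `scheduler_asked_to_free` flag are hypothetical, and the chain is deliberately simplified (it ignores dependents and in-flight messages).

```python
def worker_path(state, scheduler_asked_to_free=False):
    """Return the simplified chain of worker-side states for a task."""
    if state == "rescheduled":
        # Forgotten right away; no request from the scheduler is needed.
        return ["rescheduled", "released", "forgotten"]
    if state in ("error", "memory"):
        if scheduler_asked_to_free:
            return [state, "released", "forgotten"]
        # Stays put until the scheduler asks for it to be released.
        return [state]
    raise ValueError(f"state not covered by this sketch: {state}")


rescheduled_path = worker_path("rescheduled")
memory_path = worker_path("memory")
```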
This would make things seriously hard to read considering how many times these labels appear throughout the code.
No, when a task is in
Added clarifications.
Historical reasons. There used to be two separate states,
Actually, Actors have nothing to do whatsoever with the worker state machine - they're just a task like any other. They are handled exclusively in
All review comments have been addressed
Good job!
Closes #5413
Rendered preview:
https://distributed--6948.org.readthedocs.build/en/6948/scheduling-state.html
https://distributed--6948.org.readthedocs.build/en/6948/worker-state.html