Utility functions to debug cluster_dump #5873
In general, a function to see where scheduler state doesn't match worker state (tasks have differing states, workers have a different status, etc.) could be a helpful starting point. Basically, an overall "diff" between the scheduler's view of the world and the actual world as recorded by workers. But particularly for tasks, since that's usually what you're interested in.
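For illustration, a rough sketch of what juxtaposing the two views could look like, assuming the dump is a plain dict where `dump["scheduler"]["tasks"]` and `dump["workers"][addr]["tasks"]` map task keys to dicts carrying a "state" entry (the layout and field names are assumptions):

```python
def task_state_overview(dump: dict) -> dict:
    """Collect, per task key, the scheduler's state next to each worker's state.

    Scheduler and worker state names differ (e.g. "processing" vs "executing"),
    so this only juxtaposes them; deciding what counts as a real mismatch is
    left to the reader.
    """
    overview = {}
    for key, ts in dump["scheduler"]["tasks"].items():
        overview[key] = {"scheduler": ts.get("state"), "workers": {}}
    for addr, worker in dump["workers"].items():
        for key, wts in worker.get("tasks", {}).items():
            entry = overview.setdefault(key, {"scheduler": None, "workers": {}})
            entry["workers"][addr] = wts.get("state")
    return overview
```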
Maybe. There is always a diff. States don't map exactly 1:1 and there is a delay. I'm concerned that writing this "diff" is not worth the cost, tbh.
True. A full diff probably isn't necessary. I'm thinking more about task states specifically. Really, I'm looking for tools to give you a starting point in a large dump and highlight potential problems, so you have an idea where to look with more targeted tools.
Every dump I investigate starts with
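Something along the lines of the following rough sketch, assuming a top-level "scheduler" section in the dump whose "tasks" mapping carries per-task dicts with "state" and "processing_on" entries (these names are assumptions, not the original snippet):

```python
def processing_tasks_by_worker(dump: dict) -> dict:
    """Sketch: list tasks the scheduler thinks are processing, grouped by
    the worker they are assigned to."""
    by_worker: dict = {}
    for key, ts in dump["scheduler"]["tasks"].items():
        if ts.get("state") == "processing":
            by_worker.setdefault(ts.get("processing_on"), []).append(key)
    return by_worker
```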
I think this pattern can be sufficiently covered by the function I proposed above. From here on out, it depends on what I see:

A) Is the worker indeed trying to execute it? If so, why isn't it (e.g. a missing dependency)?

The most relevant attributes I typically look at are
In the past, every specific script I wrote to debug this became stale after I fixed a particular issue. Therefore, I kept only the most obvious, elementary methods in my scripts, like the ones mentioned above. In my experience, anything more advanced requires way too much knowledge about the internals and is highly dependent on what you're looking for. An example of why I think the "diff" is too hard and maybe not even helpful:
Of course, this set is much more restricted given that we're only looking at the one worker the task is supposed to be processing on. However, this is a very specific query, and we'll need much more flexibility.
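As a rough illustration of that kind of narrow, one-off query, here is a sketch that pulls one task's scheduler-side record together with the view of the worker it is supposedly processing on (field names such as "processing_on" and the dump layout are assumptions):

```python
def task_views(dump: dict, key: str) -> dict:
    """Sketch: return the scheduler's record of one task alongside the record
    held by the worker it is assigned to, so the two can be compared by eye."""
    sts = dump["scheduler"]["tasks"].get(key)
    addr = (sts or {}).get("processing_on")
    worker_tasks = dump.get("workers", {}).get(addr, {}).get("tasks", {})
    return {"scheduler": sts, "worker": addr, "on_worker": worker_tasks.get(key)}
```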
I'm currently taking a look at where the above functions should be implemented. I don't think they should be implemented on the
In #5863 we're about to introduce a
This issue describes utility functions for debugging cluster state stored in a disk artifact. If the purpose is generalised debugging, it occurred to me that it would be useful to encapsulate the state in a class and offer the functionality afforded by these functions as class methods. For example:

```python
class DumpInspector:
    def __init__(self, url_or_state: str | dict, context="scheduler"):
        if isinstance(url_or_state, str):
            self.dump = load_cluster_dump(url_or_state)
        elif isinstance(url_or_state, dict):
            self.dump = url_or_state
        self.context = context

    def tasks_in_state(self, state=""):
        if state:
            return {k: v for k, v in self.dump[self.context].items() if v["state"] == state}
        return self.dump[self.context]
```

Then the following is possible:

```python
inspect = DumpInspector("dump.msgpack.gz")
released = inspect.tasks_in_state("released")
memory = inspect.tasks_in_state("memory")
```
Yes, this is about debugging only. I don't even expect this to be used by a wide range of users; it's probably mostly for developers.
What would be the benefits of doing this? I think a lot of this simply comes down to API design and taste. I probably would've started with a few functions and mostly builtins, e.g.

```python
def load_cluster_dump(url_or_path: str) -> dict:
    ...

def tasks_in_state(dump: dict, state: Collection[str]) -> Collection[dict]:
    ...
```

In terms of usage, this would boil down to mostly the same as for the inspector:

```python
state = load_cluster_dump("path/to/my/dump")
tasks_in_state(state, ["processing", "memory"])
```

compared to

```python
inspector = DumpInspector("path/to/my/dump")
inspector.tasks_in_state(["processing", "memory"])
```

Re context: if you choose to go for an inspector class, it should not be bound to a worker/scheduler context. We typically want to investigate both, just with different calls, e.g.

```python
tasks_in_state_on_scheduler("processing")
tasks_in_state_on_workers(["executing", "ready"], workers=["addrA", "addrB"])
```
One can simply import the class, which provides access to all the associated methods, rather than importing all the functions one would use to inspect the state. I think this might save some typing.
I don't have a strong opinion here.
I'm interested in defining the above in more detail. One way of defining this in terms of the cluster state might be the following pseudo-code:

```python
scheduler_workers = state["scheduler"]["workers"]  # Workers known to the scheduler?
workers = state["workers"]  # Actual workers

missing = set()
for sw in scheduler_workers.keys():
    if sw not in workers or not workers[sw]["log"]:
        missing.add(sw)
```

However, it's not immediately clear to me if
More broadly speaking, in this function one could also look into other state like
`Client.dump_cluster_state` offers a way to dump the entire cluster state of all workers and the scheduler. For active clusters this typically includes hundreds of thousands of lines of state (assuming it is printed human-readable). To work with this state dump artifact, one typically needs to write custom scripts, grep the logs, etc. Many of these operations can be standardized, and we should keep a few functions around to help us analyze this.
- `get_tasks_in_state(state: str, worker: bool = False)`: return all TaskState objects that are currently in the given state on the scheduler (or on the workers if `worker` is `True`)
- `story`: get a global story for a key or stimulus ID, similar to `Client.story` (see "Support collecting cluster-wide story for a key or stimulus ID" #5872)
- `missing_workers`: get names and addresses of all workers that are known to the scheduler but that we are missing logs for

The above functions should accept the cluster dump artifact in both yaml and msgpack format.
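A minimal sketch of such a loader, assuming the artifact is either gzip-compressed msgpack or plain YAML and that the file extension indicates which (both details are assumptions to check against the actual artifact):

```python
import gzip

import msgpack
import yaml


def load_cluster_dump(path: str) -> dict:
    """Sketch: load a cluster dump artifact written to disk."""
    if path.endswith(".msgpack.gz"):
        with gzip.open(path, "rb") as f:
            return msgpack.unpack(f)
    elif path.endswith((".yaml", ".yml")):
        with open(path) as f:
            return yaml.safe_load(f)
    raise ValueError(f"Unrecognised cluster dump format: {path!r}")
```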