Utility functions to debug cluster_dump #5873

Closed
Tracked by #5736
fjetter opened this issue Feb 28, 2022 · 12 comments · Fixed by #5920
Comments

@fjetter
Member

fjetter commented Feb 28, 2022

Client.dump_cluster_state offers a way to dump the entire cluster state of all workers and the scheduler. For active clusters this typically amounts to hundreds of thousands of lines of state (assuming it is printed in human-readable form).

To work with this state dump artifact, one typically needs to write custom scripts, grep the logs, etc. Many of these operations can be standardized and we should keep a few functions around to help us analyze this.

  • get_tasks_in_state(state: str, worker: bool=False) Return all TaskState objects that are currently in a given state on the scheduler (or the Workers if worker is True)
  • story - Get a global story for a key or stimulus ID, similar to Client.story - Support collecting cluster-wide story for a key or stimulus ID #5872
  • missing_workers - Get names and addresses of all workers that are known to the scheduler but for which we are missing logs
  • ...

The above functions should accept the cluster dump artifact in both YAML and msgpack formats.
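For concreteness, a minimal sketch of what the loading helper and get_tasks_in_state could look like, assuming the artifact is either a gzipped msgpack file or a YAML file and deserializes to a dict with top-level "scheduler" and "workers" keys (the exact layout is an assumption here):

import gzip


def load_cluster_dump(path: str) -> dict:
    """Load a dump artifact written in either msgpack.gz or yaml format."""
    if path.endswith(".msgpack.gz"):
        import msgpack

        with gzip.open(path, "rb") as f:
            return msgpack.unpack(f)
    elif path.endswith((".yaml", ".yml")):
        import yaml

        with open(path) as f:
            return yaml.safe_load(f)
    raise ValueError(f"Unsupported dump format: {path}")


def get_tasks_in_state(dump: dict, state: str, worker: bool = False) -> dict:
    """Return all tasks currently in the given state on the scheduler (or the workers)."""
    if worker:
        return {
            f"{addr}::{key}": task
            for addr, wstate in dump["workers"].items()
            for key, task in wstate.get("tasks", {}).items()
            if task.get("state") == state
        }
    return {
        key: task
        for key, task in dump["scheduler"]["tasks"].items()
        if task.get("state") == state
    }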

@gjoseph92
Collaborator

In general, a function to see where scheduler state doesn't match worker state (tasks have differing states, workers have different statuses, etc.) could be a helpful starting point. Basically, an overall "diff" between the scheduler's view of the world and the actual world as recorded by the workers. But particularly for tasks, since that's usually what you're interested in.

@fjetter
Member Author

fjetter commented Mar 1, 2022

In general, a function to see where scheduler state doesn't match worker state (tasks have differing states, workers have different statuses, etc.) could be a helpful starting point. Basically, an overall "diff" between the scheduler's view of the world and the actual world as recorded by the workers. But particularly for tasks, since that's usually what you're interested in.

Maybe. There is always a diff. States don't map exactly one-to-one and there is a delay. I'm concerned that writing this "diff" is not worth the cost, tbh.

@gjoseph92
Collaborator

True. A full diff probably isn't necessary. I'm thinking more about task states specifically, and especially processing on the scheduler vs ready+executing on workers.

Really, I'm looking for tools to give you a starting point in a large dump, and highlight potential problems so you have an idea where to look with more targeted tools like story. And since a deadlock typically happens because the scheduler's "view of the world" (task states) no longer matches the workers', and what we're looking for is the root cause of why those got out of sync, seeing where the state differs seemed to me like a good place to start.
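Something along these lines is what I have in mind. A rough sketch only; the dump layout and field names (e.g. processing_on) are assumptions:

def processing_mismatches(dump: dict) -> dict:
    """For every task the scheduler thinks is processing, report cases where the
    assigned worker does not see it as ready/executing (or doesn't know it at all)."""
    mismatches = {}
    for key, task in dump["scheduler"]["tasks"].items():
        if task.get("state") != "processing":
            continue
        addr = task.get("processing_on")
        worker_tasks = dump["workers"].get(addr, {}).get("tasks", {})
        worker_state = worker_tasks.get(key, {}).get("state", "<unknown>")
        if worker_state not in ("ready", "executing"):
            mismatches[key] = (addr, worker_state)
    return mismatches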

@fjetter
Member Author

fjetter commented Mar 4, 2022

Really, I'm looking for tools to give you a starting point in a large dump

Every dump I investigate starts with

  1. "get all tasks in state processing on the scheduler"
  2. "Pick one of the tasks"
  3. "Show we on which worker this task is supposed to be executing"

I think this pattern can be sufficiently covered by the function I proposed above, get_tasks_in_state.
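In code, using a get_tasks_in_state along the lines sketched above (the processing_on field name is an assumption about the dump layout):

dump = load_cluster_dump("path/to/dump.msgpack.gz")
processing = get_tasks_in_state(dump, "processing")  # 1. all tasks processing on the scheduler
key, task = next(iter(processing.items()))           # 2. pick one of them
print(key, "should be executing on", task.get("processing_on"))  # 3. which worker?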

From here on out, it depends on what I see

A) Is the worker indeed trying to execute it? If so, why isn't it running yet (e.g. a missing dependency)?
B) The worker is not aware that it is supposed to execute it
C) Multiple workers know about this task and disagree
D) ...

The most relevant attributes I typically look at are

  • Scheduler.transition_log
  • Scheduler.tasks
  • Stealing (basically everything stored in the extension)
  • Scheduler.events
  • Worker.tasks
  • Worker.log
  • A few more attributes on the worker side (len(self.ready), len(self.data_needed), n_executing, etc.)
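Roughly where these live in a dump artifact (the keys below simply mirror the attribute names; the exact layout, in particular for the stealing extension, is an assumption):

sched = dump["scheduler"]
transition_log = sched["transition_log"]
events = sched["events"]
tasks = sched["tasks"]
# stealing state is assumed to sit somewhere under the scheduler's extensions

for addr, w in dump["workers"].items():
    print(addr, len(w["tasks"]), len(w["log"]), len(w.get("ready", [])))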

In the past, every specific script I wrote to debug this became stale after I fixed a particular issue. Therefore, I kept the most obvious, elementary methods in my scripts like the ones mentioned above. In my experience, anything more advanced requires way too much knowledge about the internals and is highly dependent on what you're looking for.


An example of why I think the "diff" is too hard and maybe not even helpful: processing on the scheduler side can mean one of many different things on the worker, all of which are valid states and all of which can pop up in a deadlocked situation, depending on many other variables:

  • released
  • forgotten
  • waiting (only temporarily)
  • ready / constrained
  • resumed
  • executing

In fact, even flight, fetch and cancelled are possible if one of the workers is drifting a bit (e.g. if the task was recently lost and recomputed). That covers every state, excluding long-running for the sake of the argument; that one is weird anyhow.

Of course, this set is much more restricted given that we're only looking at the one worker the task is supposed to be processing on. However, this is a very specific query and we'll need much more flexibility.

@sjperkins
Member

Client.dump_cluster_state offers a way to dump the entire cluster state of all workers and the scheduler. For active clusters this typically amounts to hundreds of thousands of lines of state (assuming it is printed in human-readable form).

To work with this state dump artifact, one typically needs to write custom scripts, grep the logs, etc. Many of these operations can be standardized and we should keep a few functions around to help us analyze this.

* `get_tasks_in_state(state: str, worker: bool=False)` Return all TaskState objects that are currently in a given state on the scheduler (or the Workers if worker is `True`)

* `story` - Get a global story for a key or stimulus ID, similar to [`Client.story` - Support collecting cluster-wide story for a key or stimulus ID #5872](https://github.com/dask/distributed/issues/5872)

* `missing_workers` - Get names and addresses of all workers that are known to the scheduler but for which we are missing logs

* ...

The above functions should accept the cluster dump artifact in both YAML and msgpack formats.

I'm currently taking a look at where the above functions should be implemented. I don't think they should be implemented on the Client object as they depend on the dumped state. Would a utils_client.py or utils_state.py module be appropriate?

@fjetter
Member Author

fjetter commented Mar 9, 2022

In #5863 we're about to introduce a cluster_dump.py module for this purpose and some other shared code.

sjperkins mentioned this issue Mar 9, 2022
@sjperkins
Member

sjperkins commented Mar 10, 2022

This issue describes utility functions for debugging cluster state stored in a disk artifact.

If the purpose is for generalised debugging, it occurred to me that it would be useful to encapsulate the state in a class and offer the functionality afforded by these functions as class methods. For example:

from __future__ import annotations


class DumpInspector:
    def __init__(self, url_or_state: str | dict, context: str = "scheduler"):
        if isinstance(url_or_state, str):
            # load_cluster_dump as proposed for the cluster_dump module
            self.dump = load_cluster_dump(url_or_state)
        elif isinstance(url_or_state, dict):
            self.dump = url_or_state
        else:
            raise TypeError(f"Expected str or dict, got {type(url_or_state)}")

        self.context = context

    def tasks_in_state(self, state: str = "") -> dict:
        if state:
            return {k: v for k, v in self.dump[self.context].items() if v["state"] == state}

        return self.dump[self.context]

Then the following is possible:

inspect = DumpInspector("dump.msgpack.gz")
released = inspect.tasks_in_state("released")
memory = inspect.tasks_in_state("memory")

@fjetter
Member Author

fjetter commented Mar 10, 2022

Yes, this is about debugging only. I don't even expect this to be used by a wide range of users but this is probably mostly for developers.

If the purpose is for generalised debugging, it occurred to me that it would be useful to encapsulate the state in a class and

What would be the benefits of doing this?

I think a lot of this simply comes down to API design and taste. I probably would've started with a few functions and mostly builtins, e.g.

def load_cluster_dump(url_or_path: str) -> dict:
    ...

def tasks_in_state(dump: dict, state: Collection[str]) -> Collection[dict]:
    ...

In terms of usage, this would boil down to mostly the same as for the inspector

state = load_cluster_dump("path/to/my/dump")
tasks_in_state(state, ["processing", "memory"])

compared to

inspector = DumpInspector("path/to/my/dump")
inspector.tasks_in_state(["processing", "memory"])

Re context

If you choose to go for an inspector class, it should not be bound by worker/scheduler context. We typically want to investigate both but with different calls, e.g.

tasks_in_state_on_scheduler("processing")
tasks_in_state_on_workers(["executing", "ready"], workers=["addrA", "addrB"])
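For example, as free functions taking the loaded dump (sketches only; the dump argument and layout are assumptions):

from __future__ import annotations

from collections.abc import Collection


def tasks_in_state_on_scheduler(dump: dict, states: str | Collection[str]) -> dict:
    # Accept a single state or a collection of states
    states = {states} if isinstance(states, str) else set(states)
    return {
        key: task
        for key, task in dump["scheduler"]["tasks"].items()
        if task.get("state") in states
    }


def tasks_in_state_on_workers(
    dump: dict, states: Collection[str], workers: Collection[str] | None = None
) -> dict:
    # Optionally restrict to a subset of worker addresses
    states = set(states)
    out = {}
    for addr, wstate in dump["workers"].items():
        if workers is not None and addr not in workers:
            continue
        out[addr] = {
            key: task
            for key, task in wstate.get("tasks", {}).items()
            if task.get("state") in states
        }
    return out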

@sjperkins
Member

Yes, this is about debugging only. I don't even expect this to be used by a wide range of users but this is probably mostly for developers.

If the purpose is for generalised debugging, it occurred to me that it would be useful to encapsulate the state in a class and

What would be the benefits of doing this?

One can simply import the class, which provides access to all the associated methods, rather than importing all the functions that one would use to inspect the state. I think this might save some typing.

@fjetter
Member Author

fjetter commented Mar 10, 2022

One can simply import the class, which provides access to all the associated methods, rather than importing all the functions that one would use to inspect the state. I think this might save some typing.

I don't have a strong opinion here.

@sjperkins
Member

sjperkins commented Mar 10, 2022

  • missing_workers - Get names and addresses of all workers that are known to the scheduler but for which we are missing logs

I'm interested in defining the above in more detail. One way of defining this in terms of the cluster state might be the following pseudo-code:

scheduler_workers = state["scheduler"]["workers"]  # Workers known to the scheduler?
workers = state["workers"]                         # Actual workers
missing = set()

for sw in scheduler_workers:
    if sw not in workers or not workers[sw]["log"]:
        missing.add(sw)
However, it's not immediately clear to me whether state["scheduler"]["workers"] and state["workers"] are constructed independently during a cluster dump, as state["workers"] is derived from a broadcast to all workers.

@fjetter
Member Author

fjetter commented Mar 10, 2022

However, it's not immediately clear to me whether state["scheduler"]["workers"] and state["workers"] are constructed independently during a cluster dump, as state["workers"] is derived from a broadcast to all workers.

  • state["scheduler']["workers"] is the view of the scheduler
  • state["workers"] are the workers that actually replied to the broadcast. They may timeout

More broadly speaking, in this function one could also look into other state like Scheduler.events, where we have a record of workers from the past. I'm not entirely sure how valuable this is, though.
