
Fine performance metrics: Store data on the Scheduler #7666

Closed
crusaderky opened this issue Mar 17, 2023 · 0 comments · Fixed by #7701
crusaderky commented Mar 17, 2023

There are two main options for the data stored in Worker.digests_total: either we keep it on the workers and publish it to Prometheus from there, or we aggregate it on the scheduler first and publish it from the scheduler.

Pushing the data to the scheduler opens the door to displaying it on Bokeh and in Jupyter, instead of just Prometheus. It also gets rid of unwanted per-worker cardinality, which would be overwhelming for Prometheus to store.
Finally, it allows enriching the data with scheduler-only information (e.g. #7672).

Low level design

Add an extra defaultdict, Worker.digests_total_new, which is emptied and sent to the scheduler at every heartbeat.
The scheduler collects it into Scheduler.cumulative_worker_metrics, dropping the worker information.
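
A minimal sketch of this aggregation pattern, using simplified stand-in Worker and Scheduler classes and a hypothetical (context, prefix, activity, unit) key layout; the real heartbeat plumbing in distributed is more involved:

```python
from collections import defaultdict


class Worker:
    def __init__(self):
        # Lifetime totals, kept on the worker for local introspection
        self.digests_total = defaultdict(float)
        # Delta accumulated since the last heartbeat; emptied at every heartbeat
        self.digests_total_new = defaultdict(float)

    def digest_metric(self, key, value):
        self.digests_total[key] += value
        self.digests_total_new[key] += value

    def heartbeat_payload(self):
        # Ship only the recent delta and reset it; if the worker dies, at most
        # the data accumulated since its latest heartbeat is lost.
        out = dict(self.digests_total_new)
        self.digests_total_new.clear()
        return out


class Scheduler:
    def __init__(self):
        # Aggregated across all workers; worker identity is deliberately dropped
        # to avoid per-worker cardinality.
        self.cumulative_worker_metrics = defaultdict(float)

    def handle_heartbeat(self, metrics):
        for key, value in metrics.items():
            self.cumulative_worker_metrics[key] += value


if __name__ == "__main__":
    scheduler = Scheduler()
    w1, w2 = Worker(), Worker()
    # Hypothetical metric keys in a (context, prefix, activity, unit) style
    w1.digest_metric(("execute", "inc", "thread-cpu", "seconds"), 1.5)
    w2.digest_metric(("execute", "inc", "thread-cpu", "seconds"), 0.25)
    for w in (w1, w2):
        scheduler.handle_heartbeat(w.heartbeat_payload())
    print(dict(scheduler.cumulative_worker_metrics))
    # {('execute', 'inc', 'thread-cpu', 'seconds'): 1.75}
```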

This means that:

  • At every heartbeat, workers will send a relatively small dict covering only recently finished tasks
  • If you lose a worker, you lose only the data since its latest heartbeat
  • It works just fine with adaptive clusters
  • You can trivially capture end-to-end metrics of a run from the client (Fine performance metrics: client context manager #7667)
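
For the Prometheus side, a hedged sketch of how the scheduler could expose the aggregated mapping through a custom collector. The metric name, label set, and collector class below are illustrative assumptions, not the actual distributed exporter:

```python
from prometheus_client.core import CounterMetricFamily


class WorkerMetricsCollector:
    """Expose Scheduler.cumulative_worker_metrics without per-worker labels."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def collect(self):
        # Illustrative metric name and labels; the real key layout may differ
        family = CounterMetricFamily(
            "dask_scheduler_worker_activity_seconds",
            "Fine performance metrics aggregated across all workers",
            labels=["context", "prefix", "activity"],
        )
        for key, value in self.scheduler.cumulative_worker_metrics.items():
            context, prefix, activity, unit = key
            if unit == "seconds":
                family.add_metric([context, prefix, activity], value)
        yield family


if __name__ == "__main__":
    # Tiny demo against a stub scheduler; in practice the collector would be
    # registered with prometheus_client.REGISTRY on the scheduler process.
    from collections import defaultdict

    class _StubScheduler:
        def __init__(self):
            self.cumulative_worker_metrics = defaultdict(float)

    stub = _StubScheduler()
    stub.cumulative_worker_metrics[("execute", "inc", "thread-cpu", "seconds")] = 1.75
    for fam in WorkerMetricsCollector(stub).collect():
        for sample in fam.samples:
            print(sample.name, sample.labels, sample.value)
```

Because the worker address never appears as a label, the number of time series stays bounded by the number of distinct metric keys rather than growing with cluster size.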