
Fine performance metrics: Store data on the Scheduler #7666

Closed
crusaderky opened this issue Mar 17, 2023 · 0 comments · Fixed by #7701
crusaderky commented Mar 17, 2023

There are two main options for the data stored in Worker.digests_total: either we keep it on the workers and publish it to Prometheus from there, or we aggregate it on the scheduler first and publish it from the scheduler.

Pushing the data to the scheduler opens the door to displaying it on Bokeh and in Jupyter, instead of just Prometheus. It also gets rid of unwanted per-worker cardinality, which would be overwhelming for Prometheus to store.
Finally, it allows enriching the data with scheduler-only information (e.g. #7672).

Low level design

Add an extra defaultdict, Worker.digests_total_new, which is emptied and sent to the scheduler at every heartbeat.
The scheduler collects it into Scheduler.cumulative_worker_metrics, dropping the worker information.
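
A minimal sketch of this aggregation pattern, using simplified stand-in Worker and Scheduler classes and a hypothetical (context, prefix, activity, unit) key layout; the real heartbeat plumbing in distributed is more involved:

```python
from collections import defaultdict


class Worker:
    def __init__(self):
        # Lifetime totals, kept on the worker for local introspection
        self.digests_total = defaultdict(float)
        # Delta accumulated since the last heartbeat; emptied at every heartbeat
        self.digests_total_new = defaultdict(float)

    def digest_metric(self, key, value):
        self.digests_total[key] += value
        self.digests_total_new[key] += value

    def heartbeat_payload(self):
        # Ship only the recent delta and reset it; if the worker dies, at most
        # the data accumulated since its latest heartbeat is lost.
        out = dict(self.digests_total_new)
        self.digests_total_new.clear()
        return out


class Scheduler:
    def __init__(self):
        # Aggregated across all workers; worker identity is deliberately dropped
        # to avoid per-worker cardinality.
        self.cumulative_worker_metrics = defaultdict(float)

    def handle_heartbeat(self, metrics):
        for key, value in metrics.items():
            self.cumulative_worker_metrics[key] += value


if __name__ == "__main__":
    scheduler = Scheduler()
    w1, w2 = Worker(), Worker()
    # Hypothetical metric keys in a (context, prefix, activity, unit) style
    w1.digest_metric(("execute", "inc", "thread-cpu", "seconds"), 1.5)
    w2.digest_metric(("execute", "inc", "thread-cpu", "seconds"), 0.25)
    for w in (w1, w2):
        scheduler.handle_heartbeat(w.heartbeat_payload())
    print(dict(scheduler.cumulative_worker_metrics))
    # {('execute', 'inc', 'thread-cpu', 'seconds'): 1.75}
```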

This means that:

  • At every heartbeat, workers will send a relatively small dict covering only recently finished tasks
  • If you lose a worker, you lose only the data since its latest heartbeat
  • It works just fine with adaptive clusters
  • You can trivially capture end-to-end metrics of a run from the client (Fine performance metrics: client context manager #7667)
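
For the Prometheus side, a hedged sketch of how the scheduler could expose the aggregated mapping through a custom collector. The metric name, label set, and collector class below are illustrative assumptions, not the actual distributed exporter:

```python
from prometheus_client.core import CounterMetricFamily


class WorkerMetricsCollector:
    """Expose Scheduler.cumulative_worker_metrics without per-worker labels."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def collect(self):
        # Illustrative metric name and labels; the real key layout may differ
        family = CounterMetricFamily(
            "dask_scheduler_worker_activity_seconds",
            "Fine performance metrics aggregated across all workers",
            labels=["context", "prefix", "activity"],
        )
        for key, value in self.scheduler.cumulative_worker_metrics.items():
            context, prefix, activity, unit = key
            if unit == "seconds":
                family.add_metric([context, prefix, activity], value)
        yield family


if __name__ == "__main__":
    # Tiny demo against a stub scheduler; in practice the collector would be
    # registered with prometheus_client.REGISTRY on the scheduler process.
    from collections import defaultdict

    class _StubScheduler:
        def __init__(self):
            self.cumulative_worker_metrics = defaultdict(float)

    stub = _StubScheduler()
    stub.cumulative_worker_metrics[("execute", "inc", "thread-cpu", "seconds")] = 1.75
    for fam in WorkerMetricsCollector(stub).collect():
        for sample in fam.samples:
            print(sample.name, sample.labels, sample.value)
```

Because the worker address never appears as a label, the number of time series stays bounded by the number of distinct metric keys rather than growing with cluster size.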