There are two main options for the data stored in Worker.digests_total: either we keep it there and publish it to Prometheus from the workers, or we aggregate it on the scheduler first and publish it from there.
Pushing the data to the scheduler opens the door to displaying it on Bokeh and Jupyter, instead of just Prometheus. It also gets rid of unwanted per-worker cardinality, which would be overwhelming for Prometheus to store.
Finally, it allows enriching the data with scheduler-only information (e.g. #7672).
Low level design
Add an extra defaultdict, Worker.digests_total_new, which is emptied and sent to the scheduler at every heartbeat.
It will be aggregated on the scheduler in Scheduler.cumulative_worker_metrics, without the per-worker information.
This means that:
- at every heartbeat, workers send a relatively small dict covering only recently finished tasks;
- if you lose a worker, you lose only the data since its latest heartbeat.
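The flush-and-aggregate flow above could be sketched as follows. This is a hypothetical minimal model, not the actual dask/distributed classes: the `digest_metric`, `heartbeat_payload`, and `handle_heartbeat` methods are illustrative names, and real heartbeats carry much more state.

```python
from collections import defaultdict


class Worker:
    """Sketch of a worker keeping both a lifetime total and a per-heartbeat delta."""

    def __init__(self):
        # Cumulative metrics for the lifetime of this worker
        self.digests_total = defaultdict(float)
        # Delta accumulated since the last heartbeat; flushed and reset each time
        self.digests_total_new = defaultdict(float)

    def digest_metric(self, name, value):
        # Every metric update feeds both the lifetime total and the delta
        self.digests_total[name] += value
        self.digests_total_new[name] += value

    def heartbeat_payload(self):
        # Send only the recent delta, then reset it so the next heartbeat
        # carries only metrics produced since this one
        delta, self.digests_total_new = self.digests_total_new, defaultdict(float)
        return dict(delta)


class Scheduler:
    """Sketch of the scheduler-side aggregation, with no per-worker keys."""

    def __init__(self):
        self.cumulative_worker_metrics = defaultdict(float)

    def handle_heartbeat(self, metrics):
        # Sum deltas from all workers into a single cluster-wide view;
        # losing a worker loses at most one un-flushed delta
        for name, value in metrics.items():
            self.cumulative_worker_metrics[name] += value
```

In this sketch, a worker failure between heartbeats only drops the contents of its `digests_total_new`, matching the bounded-loss property described above.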