TLDR: all of the puma gauges should use aggregation: :most_recent to be accurate.
When using the DirectFileStore to support multi-process mode (aka clustered, multiple workers), the pid labels included by default lead to incorrect and stale information. This is related to #8, yabeda-prometheus#10, and others.
As-is, each process reports info about all processes, which multiplies the data, and workers that are terminated are not pruned from the metrics (at least until a total restart cleans out all of the bin files). As an example, say you run 4 workers with 8 threads each, and three workers have crashed and restarted. With the pid labels, every process reports 8x4 = 32 threads, giving 7x32 (4 live processes and 3 expired), or 224 threads, rather than the actual 32.
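The arithmetic in that example can be checked with a few lines of plain Ruby. This is only a sketch of the counting argument; the method name and its keyword arguments are made up for illustration and are not part of prometheus-client or this plugin.

```ruby
# Hypothetical helper: models the inflated thread count described above.
# Every pid that has ever written metrics (live or expired) contributes
# a full copy of the stats for all workers.
def reported_thread_count(workers:, threads_per_worker:, expired_pids:)
  actual = workers * threads_per_worker   # what is really running
  pids_on_disk = workers + expired_pids   # live pids plus stale ones
  pids_on_disk * actual                   # each pid reports every worker
end

reported_thread_count(workers: 4, threads_per_worker: 8, expired_pids: 3)
# => 224
```

With no expired processes the same formula still gives 4x32 = 128, so the over-count is present even before any worker restarts.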
The main issue is that prometheus-client cannot predict the correct type of aggregation for gauge metrics, so it falls back to the safe default of exporting each metric with a pid label. In the clustered puma scenario this is not helpful, because each process gets all of the data, which causes cross-talk. The pid is volatile as well as irrelevant. What we want is to keep only the index label to identify each worker and to export only the most recent data.
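To make the desired behaviour concrete, here is a small plain-Ruby model of "most recent" aggregation. It is a sketch under assumptions, not the prometheus-client implementation: the `Sample` struct and `most_recent_by_index` method are hypothetical names, and the timestamps stand in for whatever the store records alongside each value.

```ruby
# Hypothetical record of one gauge write; not a prometheus-client type.
Sample = Struct.new(:pid, :index, :value, :timestamp)

# Drop the pid, group by worker index, and keep only the value with the
# newest timestamp in each group.
def most_recent_by_index(samples)
  samples
    .group_by(&:index)
    .transform_values { |group| group.max_by(&:timestamp).value }
end

samples = [
  Sample.new(101, 0, 8, 10.0), # written by a process that has since exited
  Sample.new(101, 1, 8, 10.0),
  Sample.new(205, 0, 5, 42.0), # fresher write for worker 0
  Sample.new(205, 1, 7, 42.0)  # fresher write for worker 1
]

most_recent_by_index(samples)
# => {0=>5, 1=>7}
```

Each worker index ends up with exactly one series, and the stale values from the expired pid no longer appear.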
Fortunately, the design of prometheus-client allows these aggregation settings to pass through and no-op for the Synchronized and SingleThreaded stores, and the MOST_RECENT aggregation is appropriate for all 7 of the gauges. This makes the patch simple and unconditional.
This change updates all of the gauges to use the "most recent" aggregation mode introduced in prometheus-client 2.1.0. This is safe for both single- and multi-process mode because only the DirectFileStore from prometheus-client handles the aggregation option; there is no aggregation to be done in single-process mode or with other adapters.
Fixes #15