Metrics are not aggregated when in clustered mode #15

botimer · 2021-02-05T02:41:50Z

TLDR: all of the puma gauges should use aggregation: :most_recent to be accurate.

When using the DirectFileStore to support multi-process mode (aka clustered, multiple workers), the pid labels included by default lead to incorrect and stale information. This is related to #8, yabeda-prometheus#10, and others.

As it stands, each process reports data about all processes, so values are multiplied, and terminated workers are never pruned from the metrics (at least until a full restart clears all of the bin files). For example, say you run 4 workers with 8 threads each, and three workers have crashed and restarted. With the pid labels, every pid reports 8x4 = 32 threads, giving 7x32 (4 live workers plus 3 expired), or 224 threads, rather than the actual 32.
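To make the arithmetic in that example explicit, here is a small sketch. The variable names are made up for illustration and do not come from puma or prometheus-client.

```ruby
# Numbers taken from the example above; all names are illustrative only.
workers            = 4 # live puma workers
threads_per_worker = 8 # threads configured per worker
stale_pids         = 3 # crashed/restarted workers whose files are still on disk

# What is really running right now.
actual_threads = workers * threads_per_worker

# Every pid (live or stale) shows up with the full set of per-worker series.
reporting_pids   = workers + stale_pids
reported_threads = reporting_pids * actual_threads

puts "actual=#{actual_threads} reported=#{reported_threads}"
```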

The main issue is that prometheus-client cannot predict the correct type of aggregation for gauge metrics, so it takes the safe default of exporting each metric with a pid label. In the clustered puma scenario this is not helpful: each process gets all of the data, which yields cross-talk. The pid is volatile as well as irrelevant. What we want is only the index label to identify each worker, and to export only the most recent data.
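The difference between the two export styles can be sketched with a toy model in plain Ruby. This is not prometheus-client's implementation; the `Sample` struct, the pids, and the timestamps are all hypothetical and only illustrate "keep a series per pid" versus "keep the latest value per index".

```ruby
# Toy model only: hypothetical samples as they might be written by several processes.
Sample = Struct.new(:pid, :index, :value, :timestamp)

samples = [
  Sample.new(101, 0, 8, 10.0), # stale: worker 0 before it was restarted
  Sample.new(202, 0, 5, 20.0), # worker 0 after the restart (new pid)
  Sample.new(103, 1, 8, 15.0)  # worker 1, never restarted
]

# pid-labelled export: every (pid, index) pair becomes its own series,
# so the stale sample from pid 101 is still reported.
by_pid = samples.to_h { |s| [{ pid: s.pid, index: s.index }, s.value] }

# "most recent" style: the pid is dropped and the newest sample wins per index.
most_recent = samples
  .group_by(&:index)
  .transform_values { |group| group.max_by(&:timestamp).value }

puts "series with pid labels: #{by_pid.size}"
puts "series with most recent: #{most_recent.size}"
```

In the pid-labelled view the stale worker contributes an extra series; in the most-recent view there is exactly one series per worker index.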

Fortunately, the design of prometheus-client allows these aggregation settings to pass through and no-op for the Synchronized and SingleThreaded stores, and the MOST_RECENT aggregation is appropriate for all 7 of the gauges. This makes the patch simple and unconditional.
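For reference, this is roughly what the setting looks like when a gauge is registered directly with prometheus-client (2.1.0 or later) using the DirectFileStore. It is a sketch, not the code from this gem: the metric name `puma_running_example`, the docstring, and the directory are placeholders chosen for illustration.

```ruby
require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'

# Multi-process mode: each process writes its values to files in this directory.
# The path is a placeholder.
Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: '/tmp/prometheus_direct_file_store')

registry = Prometheus::Client.registry

# Hypothetical gauge: labelled only by worker index, exported with the
# most recently written value instead of one series per pid.
running = registry.gauge(
  :puma_running_example,
  docstring: 'Number of running threads (example metric)',
  labels: [:index],
  store_settings: { aggregation: :most_recent }
)

running.set(8, labels: { index: 0 })
```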


This change updates all of the gauges to include "most recent" aggregation mode introduced in prometheus-client 2.1.0. This is safe for both single- and multi-process mode because only the DirectFileStore from prometheus-client handles the aggregation option -- there is no aggregation to be done in single-process mode or with other adapters. Fixes #15

Envek · 2021-02-05T10:04:57Z

Wow! Thank you for the very detailed explanation!

My bad, I wasn't tracking Prometheus Ruby client development closely and missed the addition of the new :most_recent aggregation mode in version 2.1.0.

I believe that it should fix #8.

Please try version 0.6.0 of this gem, which has your PR merged, and share your experience.

botimer mentioned this issue Feb 5, 2021

Aggregate all metrics with "most recent" #16

Merged

Envek closed this as completed in #16 Feb 5, 2021
