Many prometheus metrics are duplicated #11106

richvdh · 2021-10-18T14:05:46Z

A number of the reported Prometheus metrics are duplicated. This can be a problem for Prometheus instances configured to monitor a large number of Synapse instances.

Examples:

- We have two copies of each counter: one with and one without a _total suffix.
- synapse_util_caches_cache:hits is duplicated as synapse_util_caches_cache_hits.
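
For illustration, here is a minimal sketch of how the two kinds of duplication could be spotted in a scraped /metrics payload. The helper names (`canonical`, `find_duplicates`) and the sample payload are hypothetical, and the normalisation is only a heuristic: it treats names as duplicates if they differ solely by a trailing _total or by colons versus underscores.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Metric name at the start of a sample line in the Prometheus text format.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def canonical(name: str) -> str:
    """Collapse colons to underscores and drop a trailing _total (heuristic)."""
    name = name.replace(":", "_")
    if name.endswith("_total"):
        name = name[: -len("_total")]
    return name


def find_duplicates(exposition: str) -> Dict[str, Set[str]]:
    """Group metric names that map to the same canonical form."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_NAME.match(line)
        if match:
            name = match.group(1)
            groups[canonical(name)].add(name)
    # Only groups with more than one distinct name are duplicates.
    return {base: names for base, names in groups.items() if len(names) > 1}


# Illustrative payload, not real Synapse output.
SAMPLE = """\
# TYPE synapse_federation_soft_failed_events counter
synapse_federation_soft_failed_events 3.0
synapse_federation_soft_failed_events_total 3.0
synapse_util_caches_cache:hits{name="x"} 10.0
synapse_util_caches_cache_hits{name="x"} 10.0
process_cpu_seconds_total 1.5
"""

for base, names in sorted(find_duplicates(SAMPLE).items()):
    print(base, sorted(names))
```

A counter that only exists in one form, such as process_cpu_seconds_total in the sample, is not reported.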

reivilibre · 2022-07-28T15:16:29Z

Which ones would we want to remove?

We use synapse_util_caches_cache:hits on the dashboard so I'd be in favour of removing synapse_util_caches_cache_hits (N.B. no hits in the dashboard JSON).

Equally, we don't use many _total-suffixed counters, so I'd be in favour of removing those.

Though I will bring to your attention that there are a few where we do:

rei@lithium ~/work/synapse/contrib $ rg _total grafana/synapse.json 
430:          "expr": "rate(process_cpu_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
797:              "expr": "rate(process_cpu_system_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
806:              "expr": "rate(process_cpu_user_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
4668:              "expr": "sum(rate(synapse_federation_soft_failed_events_total{instance=\"$instance\"}[$bucket_size]))",
7436:              "expr": "rate(python_gc_unreachable_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])/rate(python_gc_time_count{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
8124:              "expr": "synapse_replication_tcp_resource_total_connections{job=~\"$job\",index=~\"$index\",instance=\"$instance\"}",

A good first step may be to correct our dashboard JSON in preparation for the change.
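
As a rough aid for that step, the sketch below walks a dashboard JSON document and collects metric names that genuinely end in _total. It assumes queries live under "expr" keys, as in the rg output above; the function names are hypothetical. Unlike a plain substring search, it does not flag names such as synapse_replication_tcp_resource_total_connections, where _total is not a suffix.

```python
import json
import re
from typing import Any, Iterator, Set

# A metric name ending in _total, i.e. not followed by another name character.
TOTAL_SUFFIXED = re.compile(r"\b[a-zA-Z_:][a-zA-Z0-9_:]*_total(?![a-zA-Z0-9_:])")


def load_dashboard(path: str) -> Any:
    """Load a dashboard file such as contrib/grafana/synapse.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_exprs(node: Any) -> Iterator[str]:
    """Yield every string stored under an "expr" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from iter_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_exprs(item)


def total_suffixed_metrics(dashboard: Any) -> Set[str]:
    """Return the set of _total-suffixed metric names used in any query."""
    found: Set[str] = set()
    for expr in iter_exprs(dashboard):
        found.update(TOTAL_SUFFIXED.findall(expr))
    return found


# Tiny stand-in for a real dashboard, using two expressions from the output above.
EXAMPLE = {
    "panels": [
        {"targets": [{"expr": 'rate(process_cpu_seconds_total{instance="$instance"}[$bucket_size])'}]},
        {"targets": [{"expr": 'synapse_replication_tcp_resource_total_connections{job=~"$job"}'}]},
    ]
}
print(sorted(total_suffixed_metrics(EXAMPLE)))
```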

I also imagine this would want to be communicated as a breaking change.

Suggested steps as follows:

1. Fix the dashboard JSON so that it does not use _total at all.
2. Remove *_total and synapse_util_caches_cache_* from the published metrics.

richvdh · 2022-07-28T16:38:26Z

> Equally, we don't use many _total-suffixed counters, so I'd be in favour of removing those.

This is a bit complicated. Let me try and give some history:

There is a standard called OpenMetrics, which mandates that the names of counters end in _total (see https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#counter-1). Back in prometheus-client 0.4.0, the Prometheus maintainers changed prometheus-client to automatically add _total to the names of any counters that don't already have it, for compliance with OpenMetrics (see prometheus/client_python#300).

Of course, that broke everything for us (cf prometheus/client_python#317), and we worked around it by writing a custom exposition.py which basically forks a bunch of prometheus-client to expose both the _total and non-_total variants. Our intention was that we would get rid of the non-_total variants soon after, but here we are four years on.

Sooo... where does that leave us? There's no real reason we have to use the OpenMetrics-compliant names (the Prometheus server seems happy either way), but it would still be quite nice to get rid of that custom exposition.py (and future versions of Prometheus, or other OpenMetrics server implementations, might be less tolerant). So basically: I'd suggest switching to the _total variants if possible.
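
To make the naming behaviour described in this comment concrete, here is a simplified model of it. This is not the code from prometheus-client or from Synapse's exposition.py; the function names, the `include_legacy` flag, and the example counter name are all assumptions made for illustration.

```python
from typing import List

SUFFIX = "_total"


def openmetrics_counter_name(name: str) -> str:
    """Append _total unless the counter name already ends with it."""
    return name if name.endswith(SUFFIX) else name + SUFFIX


def exposed_counter_names(name: str, include_legacy: bool = True) -> List[str]:
    """Names under which a single counter would be published.

    With include_legacy=True this models exposing both variants; with
    include_legacy=False only the OpenMetrics-compliant name remains.
    """
    compliant = openmetrics_counter_name(name)
    names = [compliant]
    if include_legacy and compliant != name:
        names.append(name)  # the original, non-_total name
    return names


# "synapse_example_events" is a made-up counter name.
print(exposed_counter_names("synapse_example_events"))
print(exposed_counter_names("synapse_example_events", include_legacy=False))
print(exposed_counter_names("process_cpu_seconds_total"))
```

Under this model, switching to the _total variants amounts to dropping the second name wherever one exists; counters already named with _total are unaffected.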

richvdh · 2022-07-28T21:06:00Z

> We use synapse_util_caches_cache:hits on the dashboard so I'd be in favour of removing synapse_util_caches_cache_hits (N.B. no hits in the dashboard JSON).

According to https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels, we're not supposed to use colons:

> Note: The colons are reserved for user defined recording rules. They should not be used by exporters or direct instrumentation.
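
A small check along these lines could flag offending names. It is a sketch only: it uses the classic Prometheus metric-name pattern, `[a-zA-Z_:][a-zA-Z0-9_:]*`, and the helper names are hypothetical. Replacing colons with underscores matches the pair of names from the issue description, but it is assumed here rather than taken from Synapse's code.

```python
import re

# Classic metric-name pattern: colons are syntactically legal,
# but reserved for user-defined recording rules.
VALID_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def is_valid_for_exporter(name: str) -> bool:
    """True if the name is well-formed and avoids the reserved colon."""
    return bool(VALID_NAME.match(name)) and ":" not in name


def colon_free(name: str) -> str:
    """Replace reserved colons with underscores."""
    return name.replace(":", "_")


legacy = "synapse_util_caches_cache:hits"
print(is_valid_for_exporter(legacy))              # False
print(colon_free(legacy))                         # synapse_util_caches_cache_hits
print(is_valid_for_exporter(colon_free(legacy)))  # True
```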

reivilibre · 2022-08-16T14:56:23Z

reivilibre · 2022-11-24T10:26:43Z

Finished up in #14358 ... as far as I'm aware!

richvdh mentioned this issue Oct 18, 2021

synapse_storage_transaction_time_bucket prometheus metric has too high cardinality due to desc label #11081

Closed

DMRobertson added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Oct 18, 2021

MadLittleMods added A-Metrics metrics, measures, stuff we put in Prometheus z-maintenance labels Nov 10, 2021

DMRobertson added this to the Server Density milestone Jul 21, 2022

reivilibre mentioned this issue Aug 16, 2022

Add experimental configuration option to allow disabling legacy Prometheus metric names. #13540

Merged

reivilibre self-assigned this Aug 17, 2022

DMRobertson added S-Minor Blocks non-critical functionality, workarounds exist. O-Uncommon Most users are unlikely to come across this or unexpected workflow and removed z-maintenance labels Aug 25, 2022

This was referenced Sep 1, 2022

Update the Grafana dashboard that is included with Synapse in the contrib directory. #13697

Merged

Update Grafana dashboard to not use legacy metric names. #13714

Merged

Fix Prometheus recording rules to not use legacy metric names. #13718

Merged

reivilibre mentioned this issue Oct 3, 2022

Announce that legacy metric names are deprecated, will be turned off by default in Synapse v1.71.0 and removed altogether in Synapse v1.73.0. #14024

Merged

reivilibre mentioned this issue Nov 2, 2022

Disable legacy Prometheus metric names by default. They can still be re-enabled for now, but they will be removed altogether in Synapse 1.73.0. #14353

Merged

reivilibre mentioned this issue Nov 10, 2022

Remove legacy metric names in v1.73.0. #14407

Closed

KitsuneRal mentioned this issue Nov 16, 2022

contrib/grafana still uses legacy metrics #14465

Closed

This was referenced Nov 17, 2022

Update forgotten references to legacy metrics in the included Grafana dashboard. #14477

Merged

Remove legacy Prometheus metrics names. They were deprecated in Synapse v1.69.0 and disabled by default in Synapse v1.71.0. #14538

Merged

reivilibre closed this as completed Nov 24, 2022
