Add the ability to expire internal metrics. #11995
Comments
@tobz, I'd like to help execute this plan step by step because our Kubernetes cluster installation suffers from the problem of growing metric counts. I have experience doing the same thing by customizing metrics storage for the Prometheus client_go, and I believe we'll cope with this issue under your guidance. What do you think? Is it OK for me to put the plan into action?
Yeah, we're happy to have someone interested in working on this, and I'll be happy to mentor/guide you where I can. Happy to chat on Vector's Discord (text or voice, alias is
Hello, does anyone know if there is any workaround to deal with this issue? I have the memory leak reported in this issue, which was closed and referenced here, and it is blocking the move of our data pipeline from Logstash to Vector. Currently I'm dealing with it by restarting the Vector process every week, but I'm not sure whether adding more pipelines to Vector will make it leak memory faster and require more frequent restarts, so this is blocking further adoption of Vector. I'm not using the internal metrics for anything; is there any way to disable them and avoid this memory leak?
Yes, this issue is really painful 😞 @jszwedko has a branch with the problematic metrics removed, which I am using now: https://github.com/vectordotdev/vector/tree/custom-build-without-high-cardinality-metrics
I'm about to open a PR, but apparently it will only be part of the future v0.24 release.
@nabokihms This issue is on our roadmap to pursue in the near future. As you have indicated, you have some changes in hand to help with this effort. Before we duplicate your efforts, could you let us know what approach you are taking on this problem?
Hi @nabokihms ! Apologies for pestering. We just wanted to check in to see if you still had plans to work on this, or had already completed some work towards it, so we can decide whether to move forward with this, as we want to include it in the upcoming 0.24.0 release if we can.
Hey @nabokihms ! No worries 😄 We are happy to pick it up, we just didn't want to step on your toes. We appreciate all of your contributions to Vector thus far.
Hello @jszwedko, I saw that a new option was implemented. Also the
What would be the impact? What are those internal metrics used for? I do not have any monitoring on the Vector process; I just use it to ship data to Kafka.
Hi @leandrojmp ! The internal metrics are exported by As you note, this issue is resolved, so I'll close it. Thanks for bringing my attention to it.
This epic represents the work required to control the cardinality of our internal metrics by expiring old ones.
Context
Currently, internal metrics are captured via the `metrics` crate, for which we use a customized `Recorder` implementation that forwards them to a bare-bones `Registry` that stores them; then, on an interval, we collect the registered metrics and forward them into the topology.

While this works well from a performance/simplicity standpoint, we currently do not expire metrics that are no longer used. This can lead to issues, such as the one described in #11821, where high-cardinality label values can generate many unique metrics in the registry over time, which adds up to a slow memory leak that requires restarting Vector itself.
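As a rough illustration of that flow, here is a minimal, self-contained sketch. It deliberately does not use the real `metrics` crate types; `SimpleRegistry` and its methods are invented for clarity, and a real key carries structured labels rather than a flattened string:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for the real Recorder/Registry pair: metrics are
// keyed by name + labels and simply accumulate until the collection interval.
#[derive(Default)]
struct SimpleRegistry {
    counters: Mutex<HashMap<String, u64>>,
}

impl SimpleRegistry {
    // Called from the hot path whenever a component emits an internal metric.
    fn increment_counter(&self, key: &str, value: u64) {
        let mut counters = self.counters.lock().unwrap();
        *counters.entry(key.to_string()).or_insert(0) += value;
    }

    // Called on an interval: observe every registered metric and forward it
    // into the topology. Nothing is ever removed, which is exactly the
    // problem described here -- stale keys live forever.
    fn observe_all(&self) -> Vec<(String, u64)> {
        let counters = self.counters.lock().unwrap();
        counters.iter().map(|(k, v)| (k.clone(), *v)).collect()
    }
}

fn main() {
    let registry = SimpleRegistry::default();
    registry.increment_counter("component_received_events_total{file=\"a.log\"}", 10);
    registry.increment_counter("component_received_events_total{file=\"b.log\"}", 3);
    for (key, value) in registry.observe_all() {
        println!("{key} = {value}");
    }
}
```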
Additionally, this can also result in ever-growing outputs from metrics sinks, where now-stale-and-no-longer-used metrics continue to be emitted needlessly. We've addressed this in the past for specific sinks, such as the `prometheus_exporter` sink in #9769, but there should be a more general solution.

Proposal
We should upgrade our usage of `metrics` in order to provide the ability to expire internal metrics, such that old/stale metrics eventually get removed from the registry entirely, and are no longer reported. Doing so involves a few steps that should be achievable in an incremental fashion:
- Upgrade to `metrics-0.18.x`/`metrics-util-0.12.x` to bring in support for metric handles and a more customizable `Registry`
- Bolt on the `Recency` module (part of `metrics-util`) so that we can begin tracking the generation of each metric, to properly understand when a metric has or hasn't been updated since a particular observation of it
- Create customized storage that augments `Recency` with the ability to consume the value of counters (gauges and histograms need no change here), such that we can switch to emitting incremental counter metrics rather than absolute ones

Upgrading `metrics`/`metrics-util`
This is required due to the introduction of "metric handles" and the more flexible `Registry`/`Storage` primitives, which will allow us to interact with the metric values such that we can mutate them rather than simply observe them.

Bolting on `Recency` support
This one is sort of self-explanatory: `Recency` provides the metric storage wrapper and the logic to detect metrics that have truly not been updated within a given time period. This is required because a gauge could technically have the same value when observed at two different times, even if it was in fact changed in between those two observations.
`Recency` deals with this by tracking a "generation" for each metric, such that any operation on the metric at all increments an internal counter -- the generation -- so that, for the two hypothetical observations mentioned above, you can tell: "oh, the generation is identical, so it hasn't been updated" or "oh, the generations are different, so it's definitely still being updated".
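A hedged sketch of the generation idea, using invented types rather than the actual `metrics-util` `Recency` API: every write bumps an atomic generation counter, so two observations can disagree on the generation even when the observed value is identical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical gauge wrapper: the generation is bumped on every write, even
// if the written value happens to equal the previous one.
struct TrackedGauge {
    value: AtomicU64,      // gauge value, stored as raw f64 bits for simplicity
    generation: AtomicU64, // incremented on every update
}

impl TrackedGauge {
    fn new() -> Self {
        Self { value: AtomicU64::new(0), generation: AtomicU64::new(0) }
    }

    fn set(&self, value: f64) {
        self.value.store(value.to_bits(), Ordering::Release);
        self.generation.fetch_add(1, Ordering::AcqRel);
    }

    // Returns (value, generation) so the scraper can compare generations
    // across observations.
    fn observe(&self) -> (f64, u64) {
        (
            f64::from_bits(self.value.load(Ordering::Acquire)),
            self.generation.load(Ordering::Acquire),
        )
    }
}

fn main() {
    let gauge = TrackedGauge::new();
    gauge.set(42.0);
    let (_, gen_a) = gauge.observe();

    // The gauge is written again with the *same* value...
    gauge.set(42.0);
    let (_, gen_b) = gauge.observe();

    // ...and the generation still tells us it was touched in between.
    assert_ne!(gen_a, gen_b, "generation changed, so the metric is still live");
}
```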
Creating customized storage for the metrics

This one feeds into being able to emit incremental counters.
Currently, we emit absolute metrics for anything we scrape from the internal metrics registry. This is fine for gauges, and even for histograms, since we end up merging that data together anyway. It's bad for counters, though.
It's bad because many metrics sinks only emit incremental metrics downstream, so in order to incrementalize an absolute metric, it must be observed at least twice before the delta between observations can be tracked. This means that if we expired a counter and it then came back to life -- maybe it's updated very infrequently -- we would potentially lose some of the increments to that counter between the first time we see it after it comes back to life and the second time we observe it.
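To make that concrete, here is a hedged sketch of sink-side incrementalization (the `Incrementalizer` helper is hypothetical, not an actual Vector type): the first observation of a (re)appearing counter only seeds the baseline, so any increments accumulated before that first observation never show up as a delta.

```rust
use std::collections::HashMap;

// Hypothetical incrementalizer: remembers the last absolute value per metric
// and emits only the difference between consecutive observations.
#[derive(Default)]
struct Incrementalizer {
    last_seen: HashMap<String, u64>,
}

impl Incrementalizer {
    // Returns Some(delta) once a baseline exists, None on the first sighting.
    fn incrementalize(&mut self, key: &str, absolute: u64) -> Option<u64> {
        match self.last_seen.insert(key.to_string(), absolute) {
            Some(previous) => Some(absolute.saturating_sub(previous)),
            None => None, // first observation only establishes the baseline
        }
    }
}

fn main() {
    let mut inc = Incrementalizer::default();
    // A counter was expired, then came back to life with an absolute value of 7:
    assert_eq!(inc.incrementalize("events_total", 7), None); // those 7 increments never become a delta
    assert_eq!(inc.incrementalize("events_total", 9), Some(2));
}
```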
By writing our own customized storage, we could provide a hook to consume a counter's value -- essentially, swap the value to 0 and take the previous value -- such that we could emit the consumed value as an incremental update. This would ensure that even if a metric is expired but comes back, we're always consuming its entire value so we never miss an update.
This would work, specifically, because we would only do expiration when iterating over the metrics and observing them, not automatically in the background, so we would always have a chance to observe the metric (and potentially consume its value if it's a counter) before removing it from the registry.
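A minimal sketch of that consume hook, assuming a plain atomic-backed counter (illustrative only, not the actual storage implementation): the scrape swaps the value to 0 and emits whatever it took as an incremental update, so even an expire-then-recreate cycle never drops increments.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counter storage with a "consume" hook: reads take the whole
// accumulated value and reset it to zero, so every increment is emitted
// exactly once as an incremental update.
struct ConsumableCounter {
    value: AtomicU64,
}

impl ConsumableCounter {
    fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    fn increment(&self, amount: u64) {
        self.value.fetch_add(amount, Ordering::AcqRel);
    }

    // Swap the value to 0 and return what was there; called while iterating
    // the registry, right before (potentially) expiring the metric.
    fn consume(&self) -> u64 {
        self.value.swap(0, Ordering::AcqRel)
    }
}

fn main() {
    let counter = ConsumableCounter::new();
    counter.increment(5);
    assert_eq!(counter.consume(), 5); // emitted as an incremental update of +5
    counter.increment(2);
    // Even if the counter had been expired and re-created in between, the
    // next scrape still picks up the 2 increments.
    assert_eq!(counter.consume(), 2);
}
```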
Caveats
One main thing we'll want to pay attention to is the change, if any, in performance. The changes in `metrics-0.18.x` were designed primarily to benefit static metric keys and users who can utilize metric handles directly, neither of which we currently do or can do. This would likely mean some potential performance regressions in areas where we emit many metrics, or emit metrics tied to the rate of events flowing through the topology or a specific component.
We've handled these types of issues before by doing local aggregation or some other change to reduce the impact, but this would be at a high level, across all metrics that we currently emit, so we'd want to pay attention to changes as surfaced by soak tests and the like.
Tasks
- Upgrade the `metrics` and `metrics-util` crates #13625

Future Work