Add the ability to expire internal metrics. #11995
Comments
@tobz, I'd like to help execute this plan step by step because our Kubernetes cluster installation suffers from the problem of growing metric counts. I have experience doing the same thing by customizing metrics storage for the Prometheus client_go, and I believe we'll cope with this issue under your guidance. What do you think? Is it OK for me to put the plan into action?
Yeah, we're happy to have someone interested in working on this, and I'll be happy to mentor/guide you where I can. Happy to chat on Vector's Discord (text or voice, alias is
Hello, does anyone know if there is any workaround to deal with this issue? I have the memory leak reported in this issue, which was closed and referenced here, and it is blocking the move of our data pipeline from Logstash to Vector. Currently I'm dealing with it by restarting the Vector process every week, but I'm not sure whether adding more pipelines to Vector will make it leak memory faster and require more frequent restarts, so this is blocking further adoption of Vector. I'm not using the internal metrics for anything; is there any way to disable them and avoid this memory leak?
Yes, this issue is really painful 😞 @jszwedko has a branch with the problematic metrics removed, which I am using now: https://github.com/vectordotdev/vector/tree/custom-build-without-high-cardinality-metrics
I'm about to open a PR, but apparently it will only be part of the future v0.24 release.
@nabokihms This issue is on our roadmap to pursue in the near future. As you have indicated, you have some changes in hand to help with this effort. Before we duplicate your efforts, could you let us know what approach you are taking on this problem?
Hi @nabokihms ! Apologies for pestering. We just wanted to check in to see if you still had plans to work on this, or had already completed some work towards it, so we can decide whether to move forward with this, as we want to include it in the upcoming 0.24.0 release if we can.
Hey @nabokihms ! No worries 😄 We are happy to pick it up, we just didn't want to step on your toes. We appreciate all of your contributions to Vector thus far.
Hello @jszwedko, I saw that a new option was implemented. Also the
What would be the impact? What are those internal metrics used for? I do not have any monitoring on the Vector process; I just use it to ship data to Kafka.
Hi @leandrojmp ! The internal metrics are exported by As you note, this issue is resolved, so I'll close it. Thanks for bringing my attention to it.
This epic represents the work required to control the cardinality of our internal metrics by expiring old ones.
Context
Currently, internal metrics are captured via the `metrics` crate, for which we use a customized `Recorder` implementation that forwards them to a bare-bones `Registry` that stores them; then, on an interval, we collect the registered metrics and forward them into the topology.

While this works well from a performance/simplicity standpoint, we currently do not expire metrics that are no longer used. This can lead to issues, such as the one described in #11821, where high-cardinality label values can generate many unique metrics in the registry over time, which adds up to a slow memory leak that requires restarting Vector itself.
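As a rough illustration of that flow, here is a minimal, self-contained sketch. It deliberately does not use the real `metrics` crate types; `SimpleRegistry` and its methods are invented for clarity, and a real key carries structured labels rather than a flattened string:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for the real Recorder/Registry pair: metrics are
// keyed by name + labels and simply accumulate until the collection interval.
#[derive(Default)]
struct SimpleRegistry {
    counters: Mutex<HashMap<String, u64>>,
}

impl SimpleRegistry {
    // Called from the hot path whenever a component emits an internal metric.
    fn increment_counter(&self, key: &str, value: u64) {
        let mut counters = self.counters.lock().unwrap();
        *counters.entry(key.to_string()).or_insert(0) += value;
    }

    // Called on an interval: observe every registered metric and forward it
    // into the topology. Nothing is ever removed, which is exactly the
    // problem described here -- stale keys live forever.
    fn observe_all(&self) -> Vec<(String, u64)> {
        let counters = self.counters.lock().unwrap();
        counters.iter().map(|(k, v)| (k.clone(), *v)).collect()
    }
}

fn main() {
    let registry = SimpleRegistry::default();
    registry.increment_counter("component_received_events_total{file=\"a.log\"}", 10);
    registry.increment_counter("component_received_events_total{file=\"b.log\"}", 3);
    for (key, value) in registry.observe_all() {
        println!("{key} = {value}");
    }
}
```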
Additionally, this can also result in ever-growing outputs from metrics sinks, where now-stale-and-no-longer-used metrics continue to be emitted needlessly. We've addressed this in the past for specific sinks, such as the `prometheus_exporter` sink in #9769, but there should be a more general solution.

Proposal
We should upgrade our usage of `metrics` in order to provide the ability to expire internal metrics, such that old/stale metrics eventually get removed from the registry entirely, and are no longer reported. Doing so involves a few steps that should be achievable in an incremental fashion:
- Upgrade to `metrics-0.18.x`/`metrics-util-0.12.x` to bring in support for metric handles and a more customizable `Registry`
- Bolt on the `Recency` module (part of `metrics-util`) so that we can begin tracking the generation of each metric, to properly understand when a metric has or hasn't been updated since a particular observation of it
- Create customized storage that augments `Recency` with the ability to consume the value of counters (gauges and histograms need no change here), such that we can switch to emitting incremental counter metrics rather than absolute ones

Upgrading `metrics`/`metrics-util`
This is required due to the introduction of "metric handles" and the more flexible `Registry`/`Storage` primitives, which will allow us to interact with the metric values such that we can mutate them rather than simply observe them.

Bolting on `Recency` support
This one is sort of self-explanatory: `Recency` provides the metric storage wrapper and the logic to detect metrics that have truly not been updated within a given time period. This is required because a gauge could technically have the same value when observed at two different times, even if it was in fact changed in between those two observations.
`Recency` deals with this by tracking a "generation" for each metric, such that any operation on the metric at all increments an internal counter -- the generation -- so that, for the two hypothetical observations mentioned above, you can tell: "oh, the generation is identical, so it hasn't been updated" or "oh, the generations are different, so it's definitely still being updated".
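A hedged sketch of the generation idea, using invented types rather than the actual `metrics-util` `Recency` API: every write bumps an atomic generation counter, so two observations can disagree on the generation even when the observed value is identical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical gauge wrapper: the generation is bumped on every write, even
// if the written value happens to equal the previous one.
struct TrackedGauge {
    value: AtomicU64,      // gauge value, stored as raw f64 bits for simplicity
    generation: AtomicU64, // incremented on every update
}

impl TrackedGauge {
    fn new() -> Self {
        Self { value: AtomicU64::new(0), generation: AtomicU64::new(0) }
    }

    fn set(&self, value: f64) {
        self.value.store(value.to_bits(), Ordering::Release);
        self.generation.fetch_add(1, Ordering::AcqRel);
    }

    // Returns (value, generation) so the scraper can compare generations
    // across observations.
    fn observe(&self) -> (f64, u64) {
        (
            f64::from_bits(self.value.load(Ordering::Acquire)),
            self.generation.load(Ordering::Acquire),
        )
    }
}

fn main() {
    let gauge = TrackedGauge::new();
    gauge.set(42.0);
    let (_, gen_a) = gauge.observe();

    // The gauge is written again with the *same* value...
    gauge.set(42.0);
    let (_, gen_b) = gauge.observe();

    // ...and the generation still tells us it was touched in between.
    assert_ne!(gen_a, gen_b, "generation changed, so the metric is still live");
}
```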
Creating customized storage for the metrics

This one feeds into being able to emit incremental counters.
Currently, we emit absolute metrics for anything we scrape from the internal metrics registry. This is fine for gauges, and even for histograms, since we end up merging that data together anyway. It's bad for counters, though.
It's bad because many metrics sinks only emit incremental metrics downstream, so in order to incrementalize an absolute metric, it must be observed at least twice before the delta between observations can be tracked. This means that if we expired a counter and it then came back to life -- maybe it's updated very infrequently -- we would potentially lose some of the increments to that counter between the first time we see it after it comes back to life and the second time we observe it.
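To make that concrete, here is a hedged sketch of sink-side incrementalization (the `Incrementalizer` helper is hypothetical, not an actual Vector type): the first observation of a (re)appearing counter only seeds the baseline, so any increments accumulated before that first observation never show up as a delta.

```rust
use std::collections::HashMap;

// Hypothetical incrementalizer: remembers the last absolute value per metric
// and emits only the difference between consecutive observations.
#[derive(Default)]
struct Incrementalizer {
    last_seen: HashMap<String, u64>,
}

impl Incrementalizer {
    // Returns Some(delta) once a baseline exists, None on the first sighting.
    fn incrementalize(&mut self, key: &str, absolute: u64) -> Option<u64> {
        match self.last_seen.insert(key.to_string(), absolute) {
            Some(previous) => Some(absolute.saturating_sub(previous)),
            None => None, // first observation only establishes the baseline
        }
    }
}

fn main() {
    let mut inc = Incrementalizer::default();
    // A counter was expired, then came back to life with an absolute value of 7:
    assert_eq!(inc.incrementalize("events_total", 7), None); // those 7 increments never become a delta
    assert_eq!(inc.incrementalize("events_total", 9), Some(2));
}
```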
By writing our own customized storage, we could provide a hook to consume a counter's value -- essentially, swap the value to 0 and take the previous value -- such that we could emit the consumed value as an incremental update. This would ensure that even if a metric is expired but comes back, we're always consuming its entire value so we never miss an update.
This would work, specifically, because we would only do expiration when iterating over the metrics and observing them, not automatically in the background, so we would always have a chance to observe the metric (and potentially consume its value if it's a counter) before removing it from the registry.
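A minimal sketch of that consume hook, assuming a plain atomic-backed counter (illustrative only, not the actual storage implementation): the scrape swaps the value to 0 and emits whatever it took as an incremental update, so even an expire-then-recreate cycle never drops increments.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counter storage with a "consume" hook: reads take the whole
// accumulated value and reset it to zero, so every increment is emitted
// exactly once as an incremental update.
struct ConsumableCounter {
    value: AtomicU64,
}

impl ConsumableCounter {
    fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    fn increment(&self, amount: u64) {
        self.value.fetch_add(amount, Ordering::AcqRel);
    }

    // Swap the value to 0 and return what was there; called while iterating
    // the registry, right before (potentially) expiring the metric.
    fn consume(&self) -> u64 {
        self.value.swap(0, Ordering::AcqRel)
    }
}

fn main() {
    let counter = ConsumableCounter::new();
    counter.increment(5);
    assert_eq!(counter.consume(), 5); // emitted as an incremental update of +5
    counter.increment(2);
    // Even if the counter had been expired and re-created in between, the
    // next scrape still picks up the 2 increments.
    assert_eq!(counter.consume(), 2);
}
```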
Caveats
One main thing we'll want to pay attention to is the change, if any, in performance. The changes in `metrics-0.18.x` were designed primarily to benefit static metric keys and users who can utilize metric handles directly, neither of which we currently do or can do. This would likely mean some potential performance regressions in areas where we emit many metrics, or emit metrics tied to the rate of events flowing through the topology or a specific component.
We've handled these types of issues before by doing local aggregation or some other change to reduce the impact, but this would be at a high level, across all metrics that we currently emit, so we'd want to pay attention to changes as surfaced by soak tests and the like.
Tasks
- Upgrade the `metrics` and `metrics-util` crates #13625

Future Work