Steady memory leak in VPA recommender #6368

Closed
DLakin01 opened this issue Dec 11, 2023 · 9 comments
Labels
area/vertical-pod-autoscaler, kind/bug, lifecycle/rotten

Comments

@DLakin01

Which component are you using?:

vertical-pod-autoscaler, recommender only

What version of the component are you using?

0.14.0

What k8s version are you using (kubectl version)?:

1.26

What environment is this in?:

AWS EKS, multiple clusters and accounts, multiple types of applications running on the cluster

What did you expect to happen?:

The VPA recommender should run at more or less the same memory level throughout the lifetime of a particular pod.

What happened instead?:

There is a steady memory leak that is especially visible over a period of days, as seen in this screen capture from our DataDog dashboard:
[image: DataDog graph of recommender memory usage climbing over several days]

The upper lines with the steeper slope are from our large multi-tenant clusters, but the smaller clusters also experience the leak, albeit more slowly. If left alone, the memory will reach 200% of requests before the pod gets kicked. The recommender in the largest cluster is tracking 3161 PodStates at the time of writing this issue.

How to reproduce it (as minimally and precisely as possible):

Not sure how reproducible the issue is outside of running VPA in a large cluster with > 3000 pods and waiting several days to see if the memory creeps up.

Anything else we need to know?:

We haven't yet created any VPA CRDs to generate recommendations, waiting until a future sprint to begin rolling those out.

DLakin01 added the kind/bug label on Dec 11, 2023
@vkhacharia

We also face the same issue. Our version is 0.11 with k8s version 1.24. Below is a Grafana snippet from the last restart.
[image: Grafana graph of recommender memory usage since the last restart]

@voelzmo
Contributor

voelzmo commented Feb 26, 2024

Hey @vkhacharia @DLakin01 thanks for bringing this up!

To some extent this behavior is expected, and from these graphs alone it is hard to tell whether the behavior is normal or not.
The recommender keeps metrics for each container, regardless of whether that container is under VPA control or not. The reasoning is presumably that you get accurate recommendations immediately if you decide to enable VPA for that container at a later point in time. You can switch off this default behavior by enabling memory-saver mode, as sketched below.
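For reference, here is a minimal sketch of what enabling memory-saver mode can look like on the recommender Deployment; the Deployment name, namespace, labels, and image tag are assumptions based on the default vpa-recommender manifests and may differ in your setup:

```yaml
# Sketch: enable memory-saver mode on the recommender.
# The Deployment/namespace/label/image values are assumptions based on the
# default vpa-recommender manifests; adjust them to match your install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-recommender
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vpa-recommender
  template:
    metadata:
      labels:
        app: vpa-recommender
    spec:
      serviceAccountName: vpa-recommender
      containers:
        - name: recommender
          image: registry.k8s.io/autoscaling/vpa-recommender:0.14.0
          args:
            - --memory-saver=true  # only track containers in Pods matched by a VPA object
```

With --memory-saver=true the recommender should only aggregate usage for containers that are matched by a VPA object, so memory usage is driven by the number of VPA-managed containers rather than by every container in the cluster.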

Even with memory-saver mode enabled, some growth in memory is expected:

So if you're rolling your workloads approximately the same number of times per week, your memory is expected to grow for ~2 weeks. If you're adding containers and don't have memory-saver mode enabled, memory will grow with every container.
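For completeness, these are the recommender flags that, as far as I can tell, control how much per-container history is kept in memory and how it is weighted. The flag names and defaults shown are my understanding of recent VPA versions, so double-check them against the version you are running:

```yaml
# Sketch: history-related recommender flags, placed in the same args list as in
# the Deployment above. Flag names and defaults are assumptions; verify them
# against the VPA version you run.
args:
  - --memory-saver=true                      # only track containers matched by a VPA
  - --memory-aggregation-interval=24h        # length of one usage aggregation bucket
  - --memory-aggregation-interval-count=8    # buckets kept per container (~8 days of history)
  - --memory-histogram-decay-half-life=24h   # how quickly older memory samples lose weight
```

If that reading is right, the aggregation interval multiplied by the interval count bounds how far back usage buckets are kept per container, which is roughly the window over which memory should plateau once the set of tracked containers stops growing.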

If all of those parameters are controlled and you still see memory growth, I guess this really is a memory leak that shouldn't happen.

@vkhacharia

@voelzmo Thanks for the quick response. I wanted to try it now, but noticed that I am on k8s version 1.24, which is compatible with 0.11 of the VPA recommender. I don't see the memory-saver parameter in the code on the branch for version 0.11.

@voelzmo
Contributor

voelzmo commented Mar 5, 2024

Hey @vkhacharia, thanks for your efforts! VPA 0.11.0 also has memory-saver mode, but the parameter was defined in a different place; it was moved to its current location in the code by a later refactoring.

So you can still turn on --memory-saver=true and see what this does for you. Hope that helps!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jun 3, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 3, 2024
@adrianmoisey
Contributor

/area vertical-pod-autoscaler

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot closed this as not planned on Aug 7, 2024