VPA Recommender fails to record OOM and increase resources due to KeyError
#6705
Comments
Hey @nikimanoledaki, thanks for filing this and for the great investigation! We've been observing similar behavior where the recommender drops the recommendation down. I tried fixing some part of this a while ago with #5326, but that doesn't help in every case.
We've been using 1.1.1 for a while, and while it isn't explicitly causing us trouble, we still get a KeyError on OOM kill events pretty regularly, maybe 5 times daily.
Thank you @voelzmo for your feedback!
The VPA Recommender had existed for at least 24h before the update occurred, so I'm not sure that it was due to this, unfortunately. A few other pieces of info that we found in the meantime:
Thank you @akloss-cibo! Do you also see a log saying that these are quick OOM kill events, e.g. "quick OOM detected in pod <ns>/<pod>"?
/area vertical-pod-autoscaler
No, we don't see that log.
Right now, these don't appear to be creating a problem for us. Most of the time, it is logged (unhelpfully) for a pod that doesn't have a VPA covering it at all, but we do see it occasionally for pods that do have a VPA.
Interesting. Have you considered using VPA in memory-saver mode?
I'm not deeply familiar with the VPA code, but my general-purpose programming instinct isn't to ignore a KeyError like this; something isn't populating the state it should. I may try memory-saver mode anyway, though; we do a fair amount of batch processing in Kubernetes which creates a lot of pods without a VPA on occasion. (I have a VPA for vpa-recommender itself to handle this.) This leads me to wonder: why does the VPA track pods that don't have a VPA (i.e. why isn't memory-saver the only mode for VPA)?
VPA's standard implementation for historic storage is VPACheckpoints, which means that if you decide to start tracking a certain workload with VPA, it takes ~8 days to get the most accurate recommendation. If you don't run in memory-saver mode, the VPA already gathers data about the containers in memory, so if you decided to enable VPA for a new container and the recommender had already been running for 8 days, you would get accurate recommendations straight away. I absolutely think that's a fair question, though.
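For anyone who wants to try this: memory-saver mode is a flag on the recommender rather than a setting on the VPA objects. Below is a minimal sketch of adding the recommender's --memory-saver flag to its Deployment; the Deployment name, namespace, labels, and image tag are illustrative placeholders, not values from this thread, and your actual manifest (or Helm values) will differ:

```yaml
# Illustrative sketch only: run the VPA recommender with --memory-saver so it
# only tracks pods that have a matching VPA object. Names, labels, and the
# image tag are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-recommender
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vpa-recommender
  template:
    metadata:
      labels:
        app: vpa-recommender
    spec:
      serviceAccountName: vpa-recommender
      containers:
        - name: recommender
          image: registry.k8s.io/autoscaling/vpa-recommender:1.1.1  # placeholder tag
          args:
            - --memory-saver=true  # skip tracking pods without an associated VPA
```

As noted in the comment above, the trade-off is that workloads which get a VPA later start without the in-memory history the recommender would otherwise already have gathered.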
This is great feedback, @akloss-cibo & @voelzmo! My gut feeling was also not to ignore the KeyError. See this feature request for HA VPA and my comment here: #6846 (comment)
This is true: enabling
Hi folks! I have some updates and questions regarding the original issue. We've seen this issue occur when VPA caps the memory request/limit to a very low value. However, some workloads have a memory utilization that can vary widely. How does the VPA Updater decide to cap the recommendation like this? Secondly, for some workloads, this capping decision seems incorrect based on their previous utilization data. I'm not sure how to investigate this in the codebase. Do you think this could be due to the wide range of memory utilization for this type of workload?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
Which component are you using?: vertical-pod-autoscaler
What version of the component are you using?:
Component version: 1.0.0 (Helm chart: https://artifacthub.io/packages/helm/fairwinds-stable/vpa/4.0.1)
What k8s version are you using (kubectl version)?: 1.27
What environment is this in?: AKS
What did you expect to happen?: Expected VPA to recover from an OOM issue by raising the resource limits.
What happened instead?:
We had a rather disruptive OOM loop that lasted for an hour (until the VPA CR's auto updates were disabled). It involved the VPA Updater and Recommender trying and failing to react to a VPA target Pod being OOMKilled. VPA recommended a very low memory request and limit for a Pod, which was immediately OOM'ed, evicted, OOM'ed, evicted again, and so on. VPA should have been able to end this loop by raising the resource limit, but it wasn't able to do that since the Pod was non-existent.
Here is the sequence of events:
Updater accepts Pod for update.
update_priority_calculator.go:143] pod accepted for update <ns>/<pod> with priority 0.2350253824929428
Updater evicts Pod to apply resource recommendation.
updater.go:215] evicting pod <pod>-69856fc5f7-m848k
Recommender deletes Pod.
cluster_feeder.go:401] Deleting Pod {<ns> <pod>-69856fc5f7-m848k}
Recommender detects OOM in Pod.
cluster_feeder.go:445] OOM detected {Timestamp:<date> 11:08:54 +0000 UTC Memory:104857600 ContainerID:{PodID:{Namespace:<ns> PodName:<pod>} ContainerName:<container>}}
Updater detects quick OOM in Pod.
update_priority_calculator.go:114] quick OOM detected in pod <ns>/<pod>, container <container>
Recommender deletes Pod (again).
cluster_feeder.go:401] Deleting Pod {<ns> <pod>-69856fc5f7-d8hvv}
Recommender detects OOM in Pod (again).
cluster_feeder.go:445] OOM detected {Timestamp:<date> 11:08:56 +0000 UTC Memory:104857600 ContainerID:{PodID:{Namespace:<ns> PodName:<pod>} ContainerName:<container>}}
Recommender immediately fails to record OOM - Reason: KeyError.
cluster_feeder.go:447] Failed to record OOM {Timestamp:<date> 11:08:56 +0000 UTC Memory:104857600 ContainerID:{PodID:{Namespace:<ns> PodName:<pod>} ContainerName:<container>}}. Reason: KeyError: {<ns> <pod>-69856fc5f7-d8hvv}
repeat loop
From the codebase, it looks like the KeyError is returned when processing a non-existent pod (here). The pod didn't exist by the time the VPA Recommender tried to record the OOM. For this reason, the backup mechanism with the Custom memory bump-up after OOMKill wasn't able to increase the resources, since VPA returned earlier with a KeyError (here) and couldn't reach that point.

We took the following steps to stop the bleeding:
- changed updatePolicy.updateMode from Auto to "Off"
- deleted the VPA targets manually
- set controlledResources.minAllowed.memory to a more suitable number to avoid the initial OOM error (see the sketch below)
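For illustration, here is a minimal sketch of what those mitigation settings can look like on a raw VPA object (if you manage VPAs through a Helm chart, the field paths will differ). The name, namespace, target, and the 512Mi floor are placeholders rather than the values from the incident:

```yaml
# Illustrative sketch only: auto-updates switched off and a higher memory floor,
# mirroring the mitigation steps above. All names and values are placeholders.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: my-namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"        # stop the Updater from evicting pods with new recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 512Mi      # placeholder: high enough to avoid the initial OOM
```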
It would also help if the Recommender did not return with the KeyError earlier (here), since that is what prevents the OOM from being recorded.

How to reproduce it (as minimally and precisely as possible):
The error could be reproduced by having a memory-intensive workload that OOMs very quickly.
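A rough sketch of such a setup follows, assuming that a container which allocates memory quickly, combined with a deliberately tiny memory limit and a VPA in Auto mode with a very low minAllowed, gets close to the behavior described in the sequence of events above. The image, names, and sizes are placeholders, and whether the exact KeyError shows up depends on the timing of the eviction and the OOM:

```yaml
# Illustrative reproduction sketch only: a memory-hungry container with a limit far
# below what it allocates, plus a VPA in Auto mode with a very low memory floor.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-repro
  template:
    metadata:
      labels:
        app: oom-repro
    spec:
      containers:
        - name: hog
          image: polinux/stress                # placeholder memory-stress image
          command: ["stress", "--vm", "1", "--vm-bytes", "256M", "--vm-hang", "0"]
          resources:
            requests:
              memory: 32Mi
            limits:
              memory: 64Mi                     # far below the ~256M the container allocates
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: oom-repro-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: oom-repro
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 32Mi                         # low floor so recommendations can stay tiny
```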