Re-work internal health check between vpa-updater and vpa-admission-controller #6884
Comments
/area vertical-pod-autoscaler |
(I'll create a new issue for the problem below if needed, but thought I'd add it here, as mentioned in SIG Autoscaling today, because I think the solution is likely similar.) One other vulnerable area where bad things can happen due to the health check + API limits is the main loop in the updater, here:
autoscaler/vertical-pod-autoscaler/pkg/updater/main.go Lines 125 to 130 in e08681b
If the …
A few thoughts:
|
Client-side throttling caused us issues in the past. The defaults were set to a very low value in the VPA, and when a large code deployment happened (spanning multiple Deployments), the admission-controller was very slow to respond, causing failed deploys in our CD system. I watched the sig-autoscaling recording discussing this, and I do agree that there are other potential failure cases to consider; removing client-side throttling doesn't solve all of them. But overall, I'm +1 to removing the client-side throttling. |
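For context on where these defaults live: client-go's client-side limiter is driven by the QPS and Burst fields on rest.Config (5 and 10 when left unset). Below is a minimal sketch of raising them; the concrete values are illustrative only, and whether/how the VPA exposes them as flags should be checked against the component's own flag definitions.

```go
// Sketch: how client-side throttling is configured in client-go.
// The values below are illustrative, not the VPA's actual settings.
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func buildClient() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// With the client-go defaults (QPS=5, Burst=10), a burst of admission
	// requests plus /scale lookups can queue API calls for far longer than
	// the webhook timeout configured in the kube-apiserver.
	config.QPS = 50
	config.Burst = 100
	return kubernetes.NewForConfig(config)
}
```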
Client-side throttling may be one cause, and removing or curtailing it would certainly help. But it is just one cause; tomorrow something else may fail. The health check that didn't/doesn't do its job right is more important because it covers more cases, yet even that is probably not a truly safe solution as long as it is only an indirect proxy for whether the right recommendations are actually applied. What I mean: what we experienced was a catastrophic failure. Not only did the Pods come up with super low initial requests and fail to do their job, they were also evicted continuously (an infinite loop) without any backoff or detection in VPA. VPA was running amok until human operators intervened. That is pretty concerning for a component that is used so widely and is practically the de-facto standard for vertical pod scaling. |
/assign |
Today, I confirmed that the underlying rateLimiter used by the scaleClient and by the statusUpdater is the same. In a local setup I deployed VPA with the following diff under the vendor dir:

```diff
diff --git a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
index 850e57dae..7c2740a99 100644
--- a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
+++ b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
@@ -622,7 +622,7 @@ func (r *Request) tryThrottleWithInfo(ctx context.Context, retryInfo string) err
 	case len(retryInfo) > 0:
 		message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
 	default:
-		message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
+		message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s, rateLimiter = %p", latency, r.verb, r.URL().String(), r.rateLimiter)
 	}
 	if latency > longThrottleLatency {
```

I reproduced the issue again and the logs reveal the same rateLimiter (the same pointer address is logged for both clients).
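If the shared limiter is indeed the bottleneck, one possible mitigation (hypothetical, not how the VPA is currently wired) would be to give the status/lease client a dedicated rate limiter, so that lease renewals are never queued behind throttled /scale calls:

```go
// Sketch (assumption, not existing VPA code): build the client used for
// lease/status updates from a copy of the rest.Config with its own token
// bucket, isolating it from throttling on the main client.
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

func buildStatusClient(base *rest.Config) (*kubernetes.Clientset, error) {
	statusConfig := rest.CopyConfig(base)
	// A small, dedicated bucket is plenty for one lease update every 10s.
	statusConfig.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(5, 10)
	return kubernetes.NewForConfig(statusConfig)
}
```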
I again looked into this area because during the incident on our side we had only 12. Another thing that I tried today was to see whether the underlying Informers would be out of sync when client-side throttling occurs. If that was the case, my idea was to couple the Informers' has-synced state with the component's health. Long story short, this idea was based on:
It is also a common pattern in K8s controllers/webhooks to couple the has-synced property of the informers with the health endpoint of the component: for example, see https://github.com/gardener/gardener/blob/10d75787a132bf39c408b74372f5d8c045fa8f4b/cmd/gardener-admission-controller/app/app.go#L108-L110
In an internal discussion with @voelzmo and @plkokanov we discussed the option to introduce a ratio of failed vs. successful requests in the vpa-admission-controller. The vpa-admission-controller knows when a request succeeds or fails and can count both. When the failure rate crosses a configured/configurable threshold, the vpa-admission-controller can stop updating its lease, as sketched below. |
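A rough sketch of that failure-ratio idea; all names and the threshold are made up for illustration and are not existing VPA code:

```go
// Hypothetical sketch of the failure-ratio approach: the admission-controller
// counts request outcomes and stops renewing its lease once the failure rate
// crosses a threshold, which in turn makes the vpa-updater stop evicting.
package health

import "sync"

type AdmissionStats struct {
	mu        sync.Mutex
	successes int
	failures  int
}

func (s *AdmissionStats) Record(ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ok {
		s.successes++
	} else {
		s.failures++
	}
}

// ShouldRenewLease returns false once the failure rate crosses the threshold,
// so the StatusUpdater skips the lease update and the updater backs off.
func (s *AdmissionStats) ShouldRenewLease(threshold float64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	total := s.successes + s.failures
	if total == 0 {
		return true
	}
	return float64(s.failures)/float64(total) < threshold
}
```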
Another way of fixing/mitigating this issue would be to contextify the statusUpdater runs: autoscaler/vertical-pod-autoscaler/pkg/utils/status/status_updater.go Lines 48 to 61 in 8bc327e
Right now, there is no timeout/context passed to the statusClient. It runs at every update interval (10s) but the run itself can take forever until it succeeds or fails. The potential fix I am thinking of is to limit the statusUpdater's Run to 10s, so that a single slow or throttled run cannot block the loop; see the sketch below. WDYT? |
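A minimal sketch of what such a contextified run could look like, assuming the status updater is driven by a ticker; the names are placeholders and do not necessarily match the linked file:

```go
// Sketch (hypothetical names): bound each lease/status update by the update
// interval so a throttled or hanging API call cannot block the loop forever.
package status

import (
	"context"
	"log"
	"time"
)

type Client interface {
	UpdateStatus(ctx context.Context) error
}

func Run(ctx context.Context, client Client, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Give each attempt at most one interval; on timeout the lease is
			// simply not renewed, and the vpa-updater stops evicting once the
			// lease becomes stale.
			attemptCtx, cancel := context.WithTimeout(ctx, interval)
			if err := client.UpdateStatus(attemptCtx); err != nil {
				log.Printf("status update failed: %v", err)
			}
			cancel()
		}
	}
}
```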
I created #7036, let me know what you think. |
This seems sane |
/triage accepted |
/reopen
I could be wrong, but this issue seems to describe a few solutions to the client-side rate limiting issue. |
@adrianmoisey: Reopened this issue. In response to this: |
/retitle Re-work internal health check between vpa-updater and vpa-admission-controller |
I think the original issue has been mitigated, but the conceptual question still remains: what would be a better way for the vpa-updater to check, before evicting Pods, that the admission-controller can actually do its job and update the resources? |
Which component are you using?:
vertical-pod-autoscaler
What version of the component are you using?:
Component version:
v1.1.2
What k8s version are you using (kubectl version)?:
v1.29
What happened?:
There is a health checking mechanism in the vpa-updater which prevents evictions when the admission-controller is no longer healthy. The admission-controller runs a StatusUpdater and is expected to renew a lease. Lease renewal requests are done every 10s. The updater checks if the lease is recent enough (not older than 60s) and only then evicts Pods.
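For illustration, the freshness check on the updater side boils down to comparing the lease's renewTime against a staleness threshold; a simplified sketch with assumed names, not the actual VPA code:

```go
// Simplified sketch of the lease-freshness check described above;
// names and constants are illustrative.
package updater

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

const maxLeaseAge = 60 * time.Second // evict only if the lease is fresher than this

func admissionControllerHealthy(lease *coordinationv1.Lease, now time.Time) bool {
	if lease == nil || lease.Spec.RenewTime == nil {
		return false
	}
	// The admission-controller renews the lease roughly every 10s; if the
	// last renewal is older than 60s, evictions should stop.
	return now.Sub(lease.Spec.RenewTime.Time) < maxLeaseAge
}
```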
We just saw an incident where the admission-controller was configured with default values for client-side rate limiting and ended up being rate limited on retrieving the /scale subresource, with client-side throttling delays consistently far exceeding the configured webhook timeout in the kube-apiserver. During that time, the lease was still updated often enough that the vpa-updater continued to evict Pods. Pods were created with their default resource settings, so we ended up with an endless eviction loop. The fix was to increase the client-side rate limiting configuration and keep a keen eye on the metrics for Pod admission times.
What did you expect to happen instead?:
We should have a health check mechanism in place which prevents vpa-updater from evicting Pods if the admission-controller cannot do its job correctly, such that we avoid endless eviction loops that require manual intervention.
More context
Potentially related is this discussion: Do we even want/need client-side throttling in the VPA? kubernetes/kubernetes#111880