
Re-work internal health check between vpa-updater and vpa-admission-controller #6884

Open
voelzmo opened this issue Jun 3, 2024 · 14 comments · Fixed by #7036
Labels: area/vertical-pod-autoscaler, kind/bug, triage/accepted

voelzmo (Contributor) commented Jun 3, 2024

Which component are you using?:
vertical-pod-autoscaler

What version of the component are you using?:

Component version:
v1.1.2

What k8s version are you using (kubectl version)?:

v1.29

What happened?:

The vpa-updater has a health-check mechanism that prevents evictions when the admission-controller is no longer healthy: the admission-controller runs a StatusUpdater that is expected to renew a lease, with renewal requests sent every 10s. The vpa-updater checks that the lease is recent enough (not older than 60s) and only then evicts Pods.
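
For illustration only, a minimal Go sketch of such a freshness check; the function name and the way the lease is obtained are placeholders, not the actual vpa-updater code:

import (
    "time"

    coordinationv1 "k8s.io/api/coordination/v1"
)

// leaseFresh reports whether the admission-controller's lease was renewed
// within maxAge (e.g. 60s); the vpa-updater would only evict Pods if this is true.
func leaseFresh(lease *coordinationv1.Lease, maxAge time.Duration) bool {
    if lease == nil || lease.Spec.RenewTime == nil {
        return false
    }
    return time.Since(lease.Spec.RenewTime.Time) <= maxAge
}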

We just saw an incident where the admission-controller was configured with the default values for client-side rate limiting and ended up being throttled when retrieving the /scale subresource, with client-side throttling delays consistently much longer than the webhook timeout configured in the kube-apiserver. During that time the lease was still renewed often enough, so the vpa-updater continued to evict Pods. The evicted Pods were re-created with their default resource settings, and we ended up in an endless eviction loop.

The fix was to increase the client-side rate limits and keep a close eye on the metrics for Pod admission times.
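
For reference, client-side rate limits in a client-go based component are controlled by the QPS and Burst fields of rest.Config; a hedged sketch of raising them (the concrete values are only examples, not the configuration used during this incident):

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func newClient() (*kubernetes.Clientset, error) {
    config, err := rest.InClusterConfig()
    if err != nil {
        return nil, err
    }
    // Raise the limits well above the client-go defaults (QPS 5, Burst 10) so
    // bursts of /scale lookups are not delayed beyond the webhook timeout.
    config.QPS = 50
    config.Burst = 100
    return kubernetes.NewForConfig(config)
}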

What did you expect to happen instead?:

We should have a health-check mechanism in place that prevents the vpa-updater from evicting Pods when the admission-controller cannot do its job correctly, so that we avoid endless eviction loops that require manual intervention.

More context
Potentially related is this discussion: Do we even want/need client-side throttling in the VPA? kubernetes/kubernetes#111880

voelzmo added the kind/bug label on Jun 3, 2024
voelzmo (Contributor, Author) commented Jun 3, 2024

/area vertical-pod-autoscaler

raywainman (Contributor) commented Jun 3, 2024

(I'll create a new issue for the problem below if needed, but thought I'd add it here, as mentioned in SIG Autoscaling today, because I think the solution is likely similar.)

One other area vulnerable to bad interactions between the health check and API rate limits is the updater's main loop:

for range ticker {
    ctx, cancel := context.WithTimeout(context.Background(), *updaterInterval)
    updater.RunOnce(ctx)
    healthCheck.UpdateLastActivity()
    cancel()
}

If the RunOnce() function takes too long fetching VPAs, Pods to be evicted, etc., and is unable to make progress with actually evicting any Pods, the health check will happily kill the updater, and the cluster will get stuck without any Pods actually being evicted.

A few thoughts:

  • Raising the API limit (or removing it) also fixes this issue, since the RunOnce() function can move faster and avoid getting throttled.
  • The health check here is a bit coarse: it only updates once a full RunOnce() loop completes. Perhaps the health check should be passed into the loop so that it can be updated every time incremental progress is made (see the sketch after this list)? Otherwise you need to be careful to tune the health check so that it does not kill the updater while it is still working.
  • The context that is passed here is a bit misleading because it isn't actually evaluated until we evict a pod.
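
To illustrate the second bullet, a rough sketch of pushing the health-check update into the loop body; RunOnce accepting the health check and reporting incremental progress is an assumption about a possible change, not the current updater API:

for range ticker {
    ctx, cancel := context.WithTimeout(context.Background(), *updaterInterval)
    // Hypothetical: RunOnce bumps the health check itself after each unit of
    // progress (listing VPAs, matching Pods, each eviction), instead of the
    // health check only being updated once per full loop iteration.
    updater.RunOnce(ctx, healthCheck)
    cancel()
}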

adrianmoisey (Member) commented:

Do we even want/need client-side throttling in the VPA?

Client-side throttling caused us issues in the past. The defaults were set to a very low value in the VPA, and when a large code deployment happened (spanning multiple Deployments), the admission-controller was very slow to respond, causing failed deploys in our CD system.

I watched the sig-autoscaling recording discussing this, and I do agree that there are other potential failure cases to consider, and removing client-side throttling doesn't solve all of them.

But overall, I'm a +1 to removing the client-side throttling.

vlerenc commented Jun 5, 2024

Client-side throttling may be one cause, and removing or curtailing it may certainly help. But it's just one cause; tomorrow something else may fail.

The health check that didn't/doesn't do its job right is more important, because it covers more cases, but even that is probably not a truly "safe" solution if it's only an "indirect proxy" for applying the right recommendations. What I mean:
Ideally, the eviction part should have some form of melt-down protection/circuit breaker that terminates the eviction process, because it relies on Kubernetes and webhooks afterwards and isn't in full control. Sure, Pods do not have a trackable identity, but evictions should only continue if Pods come up with their recommended values. If either listing/watching the new Pods fails or they come up with their initial/default requests instead of the recommended values, evictions should cease immediately, because obviously something is off with the webhook. Relying on "indirect" indicators like the current health check (which is only a proxy and not the real deal here; the real deal is whether the new Pods get their recommended values or not) is problematic/dangerous.
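
Purely as a sketch of the circuit-breaker idea (all names and the threshold are made up, not an existing VPA API), the vpa-updater could track whether re-created Pods actually come up with their recommended requests and stop evicting once too many don't:

// evictionBreaker trips when too many re-created Pods are observed without the
// recommended resource requests, i.e. when the webhook apparently isn't applying them.
type evictionBreaker struct {
    misses    int
    threshold int
}

// observe records whether a newly created Pod received its recommended requests.
func (b *evictionBreaker) observe(gotRecommendation bool) {
    if gotRecommendation {
        b.misses = 0 // reset on success
        return
    }
    b.misses++
}

// allowEviction reports whether evictions may continue.
func (b *evictionBreaker) allowEviction() bool {
    return b.misses < b.threshold
}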

What we experienced was a catastrophic failure: not only did the Pods come up with very low initial requests and fail to do their job, they were also evicted continuously (an infinite loop) without any backoff or detection in VPA. VPA was running amok until human operators intervened. That's pretty concerning for a component that is so widely used and practically the de-facto standard for vertical Pod scaling.

ialidzhikov (Contributor) commented:

/assign

ialidzhikov (Contributor) commented:

Today, I confirmed that the underlying rateLimiter used by the scaleClient and by the statusUpdater is the same. In a local setup I deployed VPA with the following diff under the vendor dir:

diff --git a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
index 850e57dae..7c2740a99 100644
--- a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
+++ b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
@@ -622,7 +622,7 @@ func (r *Request) tryThrottleWithInfo(ctx context.Context, retryInfo string) err
        case len(retryInfo) > 0:
                message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
        default:
-               message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
+               message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s, rateLimiter = %p", latency, r.verb, r.URL().String(), r.rateLimiter)
        }

        if latency > longThrottleLatency {

I reproduced the issue again and the logs reveal the same rateLimiter (0x400011b830):

I0711 12:59:39.180582       1 request.go:629] Waited for 998.800458ms due to client-side throttling, not priority and fairness, request: PUT:https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/vpa-admission-controller, rateLimiter = 0x400011b830
I0711 12:59:53.623347       1 request.go:697] Waited for 6.98511392s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/monitoring.coreos.com/v1/namespaces/shoot--foo--bar/prometheuses/shoot/scale, rateLimiter = 0x400011b830

I looked into this area again because, during the incident on our side, we had only 12 client-side throttling log messages for leases over a 12-hour period, while there were 1000 client-side throttling log messages for the scale subresource within a 15-minute interval.
It is still not clear to me why the difference in the amount of client-side throttling is that big when both clients use the same underlying rateLimiter.


Another thing that I tried today is to see whether the underlying Informers get out of sync when client-side throttling occurs. If that were the case, my idea was to couple the Informers' HasSynced property with the health endpoint (i.e. the liveness and readiness probes) of the vpa-admission-controller. I reproduced client-side throttling for 4m+, but the Informers' HasSynced property was still true. It would be interesting to see what happens with client-side throttling that waits 10m+, because right now the informers' resync period is 10min. I suspect that then we could have the informers out of sync.

Long story short, this idea was based on:

  1. Informer gets out of sync (eventually)
  2. Probes for the vpa-admission-controller fail
  3. vpa-admission-controller goes into CrashLoopBackOff
  4. => the vpa-admission-controller's lease is not updated
  5. => vpa-updater stops evicting

It is also a common pattern in K8s controllers/webhooks to couple the has-synced property of the informers with the health endpoint of the component: for example, see https://github.com/gardener/gardener/blob/10d75787a132bf39c408b74372f5d8c045fa8f4b/cmd/gardener-admission-controller/app/app.go#L108-L110
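
A minimal sketch of that pattern, assuming a plain net/http mux and the informers' HasSynced functions; the endpoint path and wiring are illustrative, not the current vpa-admission-controller code:

import (
    "net/http"

    "k8s.io/client-go/tools/cache"
)

// registerHealthz couples the informers' HasSynced with the health endpoint, so
// the liveness/readiness probes fail when the caches fall out of sync.
func registerHealthz(mux *http.ServeMux, synced ...cache.InformerSynced) {
    mux.HandleFunc("/health-check", func(w http.ResponseWriter, r *http.Request) {
        for _, s := range synced {
            if !s() {
                http.Error(w, "informer caches not synced", http.StatusInternalServerError)
                return
            }
        }
        w.WriteHeader(http.StatusOK)
    })
}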


In an internal discussion with @voelzmo and @plkokanov we discussed the option to introduce a ratio of failed vs. successful requests in the vpa-admission-controller. The vpa-admission-controller knows and can count when a request succeeds or fails. When the failure rate crosses a configurable threshold, the vpa-admission-controller can stop updating its lease.
The doubt I have about this option is what amount of requests to consider as a minimum baseline: with only 2 requests (1 failure, 1 success), it wouldn't be OK to stop updating the lease.
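
A sketch of how such a failure-ratio check could look, including a minimum baseline to address that doubt; the names, thresholds, and the absence of a sliding window are all simplifications:

// admissionStats counts admission request outcomes and decides whether the
// vpa-admission-controller should keep renewing its lease.
type admissionStats struct {
    successes  int
    failures   int
    minSamples int     // minimum baseline before the ratio is trusted, e.g. 20
    maxFailure float64 // e.g. 0.5: stop renewing the lease above 50% failures
}

func (s *admissionStats) record(success bool) {
    if success {
        s.successes++
    } else {
        s.failures++
    }
}

// shouldRenewLease returns false only once enough requests were observed and
// the failure rate crossed the configured threshold.
func (s *admissionStats) shouldRenewLease() bool {
    total := s.successes + s.failures
    if total < s.minSamples {
        return true // e.g. 1 failure out of 2 requests is not enough signal
    }
    return float64(s.failures)/float64(total) <= s.maxFailure
}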

ialidzhikov (Contributor) commented Jul 12, 2024

Another way of fixing/mitigating this issue would be to contextify the statusUpdater runs:

func (su *Updater) Run(stopCh <-chan struct{}) {
    go func() {
        for {
            select {
            case <-stopCh:
                return
            case <-time.After(su.updateInterval):
                if err := su.client.UpdateStatus(); err != nil {
                    klog.Errorf("Status update by %s failed: %v", su.client.holderIdentity, err)
                }
            }
        }
    }()
}

Right now, there is no timeout/context passed to the statusClient. It runs at every update interval (10s), but a single run can take arbitrarily long until it succeeds or fails. With client-side throttling it only succeeds after minutes, so the lease still gets renewed in the end, just with a huge delay.

The potential fix I am thinking of is to limit the statusUpdater's Run to 10s. With this, during such client-side throttling it won't be able to renew its lease within 10s => the vpa-updater will stop evicting.
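
A rough sketch of what that could look like; passing a context to UpdateStatus is an assumed change to the statusClient's signature, not the current API:

func (su *Updater) Run(stopCh <-chan struct{}) {
    go func() {
        for {
            select {
            case <-stopCh:
                return
            case <-time.After(su.updateInterval):
                // Bound each lease renewal by the update interval, so that a
                // throttled renewal fails fast instead of succeeding minutes later.
                ctx, cancel := context.WithTimeout(context.Background(), su.updateInterval)
                if err := su.client.UpdateStatus(ctx); err != nil {
                    klog.Errorf("Status update by %s failed: %v", su.client.holderIdentity, err)
                }
                cancel()
            }
        }
    }()
}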

WDYT?

ialidzhikov (Contributor) commented:

I created #7036, let me know what you think.

adrianmoisey (Member) commented:

The potential fix I am thinking of is to limit the statusUpdater's Run to 10s. With this, during such client-side throttling it won't be able to renew its lease within 10s => the vpa-updater will stop evicting.

This seems sane

voelzmo (Contributor, Author) commented Jul 15, 2024

/triage accepted

k8s-ci-robot added the triage/accepted label on Jul 15, 2024
adrianmoisey (Member) commented:

/reopen

I could be wrong, but this issue seems to describe a few solutions to the client-side rate limiting issue.
I assume we're not done exploring them all, so I'll reopen.

k8s-ci-robot reopened this on Sep 16, 2024
k8s-ci-robot (Contributor) commented:

@adrianmoisey: Reopened this issue.

In response to this:

/reopen

I could be wrong, but this issue seems to describe a few solutions to the client-side rate limiting issue.
I assume we're not done exploring them all, so I'll reopen.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

voelzmo (Contributor, Author) commented Sep 25, 2024

/retitle Re-work internal health check between vpa-updater and vpa-admission-controller

k8s-ci-robot changed the title from "Client-side rate limiting in admission-controller can lead to endless eviction loops" to "Re-work internal health check between vpa-updater and vpa-admission-controller" on Sep 25, 2024
voelzmo (Contributor, Author) commented Sep 25, 2024

I think the original issue has been mitigated, but the conceptual question still remains: what would be a better way for the vpa-updater to check, before evicting Pods, that the admission-controller can actually do its job of updating the resources?
Thanks @adrianmoisey for re-opening this.
