
Re-work internal health check between vpa-updater and vpa-admission-controller #6884

Open
voelzmo opened this issue Jun 3, 2024 · 14 comments · Fixed by #7036
Labels: area/vertical-pod-autoscaler, kind/bug, triage/accepted

voelzmo (Contributor) commented Jun 3, 2024

Which component are you using?:
vertical-pod-autoscaler

What version of the component are you using?:

Component version:
v1.1.2

What k8s version are you using (kubectl version)?:

v1.29

What happened?:

The vpa-updater has a health-check mechanism that prevents evictions when the admission-controller is no longer healthy: the admission-controller runs a StatusUpdater that is expected to renew a lease, with renewal requests sent every 10s. The vpa-updater checks that the lease is recent enough (not older than 60s) and only then evicts Pods.
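
For illustration only, a minimal Go sketch of such a freshness check; the function name and the way the lease is obtained are placeholders, not the actual vpa-updater code:

import (
    "time"

    coordinationv1 "k8s.io/api/coordination/v1"
)

// leaseFresh reports whether the admission-controller's lease was renewed
// within maxAge (e.g. 60s); the vpa-updater would only evict Pods if this is true.
func leaseFresh(lease *coordinationv1.Lease, maxAge time.Duration) bool {
    if lease == nil || lease.Spec.RenewTime == nil {
        return false
    }
    return time.Since(lease.Spec.RenewTime.Time) <= maxAge
}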

We just saw an incident where the admission-controller was configured with the default values for client-side rate limiting and ended up being throttled when retrieving the /scale subresource, with client-side throttling delays consistently much longer than the webhook timeout configured in the kube-apiserver. During that time the lease was still renewed often enough, so the vpa-updater continued to evict Pods. The evicted Pods were re-created with their default resource settings, and we ended up in an endless eviction loop.

The fix was to increase the client-side rate limits and keep a close eye on the metrics for Pod admission times.
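
For reference, client-side rate limits in a client-go based component are controlled by the QPS and Burst fields of rest.Config; a hedged sketch of raising them (the concrete values are only examples, not the configuration used during this incident):

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func newClient() (*kubernetes.Clientset, error) {
    config, err := rest.InClusterConfig()
    if err != nil {
        return nil, err
    }
    // Raise the limits well above the client-go defaults (QPS 5, Burst 10) so
    // bursts of /scale lookups are not delayed beyond the webhook timeout.
    config.QPS = 50
    config.Burst = 100
    return kubernetes.NewForConfig(config)
}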

What did you expect to happen instead?:

We should have a health-check mechanism in place that prevents the vpa-updater from evicting Pods when the admission-controller cannot do its job correctly, so that we avoid endless eviction loops that require manual intervention.

More context
Potentially related is this discussion: Do we even want/need client-side throttling in the VPA? kubernetes/kubernetes#111880

voelzmo added the kind/bug label on Jun 3, 2024
voelzmo (Contributor, Author) commented Jun 3, 2024

/area vertical-pod-autoscaler

raywainman (Contributor) commented Jun 3, 2024

(I'll create a new issue for the problem below if needed, but thought I'd add it here, as mentioned in SIG Autoscaling today, because I think the solution is likely similar.)

One other area vulnerable to bad interactions between the health check and API rate limits is the updater's main loop:

for range ticker {
    ctx, cancel := context.WithTimeout(context.Background(), *updaterInterval)
    updater.RunOnce(ctx)
    healthCheck.UpdateLastActivity()
    cancel()
}

If the RunOnce() function takes too long fetching VPAs, Pods to be evicted, etc., and is unable to make progress with actually evicting any Pods, the health check will happily kill the updater, and the cluster will get stuck without any Pods actually being evicted.

A few thoughts:

  • Raising the API limit (or removing it) also fixes this issue, since the RunOnce() function can move faster and avoid getting throttled.
  • The health check here is a bit coarse: it only updates once a full RunOnce() loop completes. Perhaps the health check should be passed into the loop so that it can be updated every time incremental progress is made (see the sketch after this list)? Otherwise you need to be careful to tune the health check so that it does not kill the updater while it is still working.
  • The context that is passed here is a bit misleading because it isn't actually evaluated until we evict a pod.
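
To illustrate the second bullet, a rough sketch of pushing the health-check update into the loop body; RunOnce accepting the health check and reporting incremental progress is an assumption about a possible change, not the current updater API:

for range ticker {
    ctx, cancel := context.WithTimeout(context.Background(), *updaterInterval)
    // Hypothetical: RunOnce bumps the health check itself after each unit of
    // progress (listing VPAs, matching Pods, each eviction), instead of the
    // health check only being updated once per full loop iteration.
    updater.RunOnce(ctx, healthCheck)
    cancel()
}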

adrianmoisey (Member) commented:

Do we even want/need client-side throttling in the VPA?

Client-side throttling caused us issues in the past. The defaults were set to a very low value in the VPA, and when a large code deployment happened (spanning multiple Deployments), the admission-controller was very slow to respond, causing failed deploys in our CD system.

I watched the sig-autoscaling recording discussing this, and I do agree that there are other potential failure cases to consider, and removing client-side throttling doesn't solve all of them.

But overall, I'm a +1 to removing the client-side throttling.

vlerenc commented Jun 5, 2024

Client-side throttling may be one cause, and removing or curtailing it may certainly help. But it's just one cause; tomorrow something else may fail.

The health check that didn't/doesn't do its job right is more important, because it covers more cases, but even that is probably not a truly "safe" solution if it's only an "indirect proxy" for applying the right recommendations. What I mean:
Ideally, the eviction part should have some form of melt-down protection/circuit breaker that terminates the eviction process, because it relies on Kubernetes and webhooks afterwards and isn't in full control. Sure, Pods do not have a trackable identity, but evictions should only continue if Pods come up with their recommended values. If either listing/watching the new Pods fails or they come up with their initial/default requests instead of the recommended values, evictions should cease immediately, because obviously something is off with the webhook. Relying on "indirect" indicators like the current health check (which is only a proxy and not the real deal here; the real deal is whether the new Pods get their recommended values or not) is problematic/dangerous.
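
Purely as a sketch of the circuit-breaker idea (all names and the threshold are made up, not an existing VPA API), the vpa-updater could track whether re-created Pods actually come up with their recommended requests and stop evicting once too many don't:

// evictionBreaker trips when too many re-created Pods are observed without the
// recommended resource requests, i.e. when the webhook apparently isn't applying them.
type evictionBreaker struct {
    misses    int
    threshold int
}

// observe records whether a newly created Pod received its recommended requests.
func (b *evictionBreaker) observe(gotRecommendation bool) {
    if gotRecommendation {
        b.misses = 0 // reset on success
        return
    }
    b.misses++
}

// allowEviction reports whether evictions may continue.
func (b *evictionBreaker) allowEviction() bool {
    return b.misses < b.threshold
}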

What we experienced was a catastrophic failure: not only did the Pods come up with very low initial requests and fail to do their job, they were also evicted continuously (an infinite loop) without any backoff or detection in VPA. VPA was running amok until human operators intervened. That's pretty concerning for a component that is so widely used and practically the de-facto standard for vertical Pod scaling.

ialidzhikov (Contributor) commented:

/assign

ialidzhikov (Contributor) commented:

Today, I confirmed that the underlying rateLimiter used by the scaleClient and by the statusUpdater is the same. In a local setup I deployed VPA with the following diff under the vendor dir:

diff --git a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
index 850e57dae..7c2740a99 100644
--- a/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
+++ b/vertical-pod-autoscaler/vendor/k8s.io/client-go/rest/request.go
@@ -622,7 +622,7 @@ func (r *Request) tryThrottleWithInfo(ctx context.Context, retryInfo string) err
        case len(retryInfo) > 0:
                message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
        default:
-               message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
+               message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s, rateLimiter = %p", latency, r.verb, r.URL().String(), r.rateLimiter)
        }

        if latency > longThrottleLatency {

I reproduced the issue again and the logs reveal the same rateLimiter (0x400011b830):

I0711 12:59:39.180582       1 request.go:629] Waited for 998.800458ms due to client-side throttling, not priority and fairness, request: PUT:https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/vpa-admission-controller, rateLimiter = 0x400011b830
I0711 12:59:53.623347       1 request.go:697] Waited for 6.98511392s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/monitoring.coreos.com/v1/namespaces/shoot--foo--bar/prometheuses/shoot/scale, rateLimiter = 0x400011b830

I looked into this area again because, during the incident on our side, we had only 12 client-side throttling log messages for leases over a 12-hour period, while there were 1000 client-side throttling log messages for the scale subresource within a 15-minute interval.
It is still not clear to me why the difference in the amount of client-side throttling is that big when both clients use the same underlying rateLimiter.


Another thing that I tried today is to see whether the underlying Informers get out of sync when client-side throttling occurs. If that were the case, my idea was to couple the Informers' HasSynced property with the health endpoint (i.e. the liveness and readiness probes) of the vpa-admission-controller. I reproduced client-side throttling for 4m+, but the Informers' HasSynced property was still true. It would be interesting to see what happens with client-side throttling that waits 10m+, because right now the informers' resync period is 10min. I suspect that then we could have the informers out of sync.

Long story short, this idea was based on:

  1. Informer gets out of sync (eventually)
  2. Probes for the vpa-admission-controller fail
  3. vpa-admission-controller goes into CrashLoopBackOff
  4. => the vpa-admission-controller's lease is not updated
  5. => vpa-updater stops evicting

It is also a common pattern in K8s controllers/webhooks to couple the has-synced property of the informers with the health endpoint of the component: for example, see https://github.com/gardener/gardener/blob/10d75787a132bf39c408b74372f5d8c045fa8f4b/cmd/gardener-admission-controller/app/app.go#L108-L110
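
A minimal sketch of that pattern, assuming a plain net/http mux and the informers' HasSynced functions; the endpoint path and wiring are illustrative, not the current vpa-admission-controller code:

import (
    "net/http"

    "k8s.io/client-go/tools/cache"
)

// registerHealthz couples the informers' HasSynced with the health endpoint, so
// the liveness/readiness probes fail when the caches fall out of sync.
func registerHealthz(mux *http.ServeMux, synced ...cache.InformerSynced) {
    mux.HandleFunc("/health-check", func(w http.ResponseWriter, r *http.Request) {
        for _, s := range synced {
            if !s() {
                http.Error(w, "informer caches not synced", http.StatusInternalServerError)
                return
            }
        }
        w.WriteHeader(http.StatusOK)
    })
}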


In an internal discussion with @voelzmo and @plkokanov we discussed the option to introduce a ratio of failed vs. successful requests in the vpa-admission-controller. The vpa-admission-controller knows and can count when a request succeeds or fails. When the failure rate crosses a configurable threshold, the vpa-admission-controller can stop updating its lease.
The doubt I have about this option is what amount of requests to consider as a minimum baseline: with only 2 requests (1 failure, 1 success), it wouldn't be OK to stop updating the lease.
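
A sketch of how such a failure-ratio check could look, including a minimum baseline to address that doubt; the names, thresholds, and the absence of a sliding window are all simplifications:

// admissionStats counts admission request outcomes and decides whether the
// vpa-admission-controller should keep renewing its lease.
type admissionStats struct {
    successes  int
    failures   int
    minSamples int     // minimum baseline before the ratio is trusted, e.g. 20
    maxFailure float64 // e.g. 0.5: stop renewing the lease above 50% failures
}

func (s *admissionStats) record(success bool) {
    if success {
        s.successes++
    } else {
        s.failures++
    }
}

// shouldRenewLease returns false only once enough requests were observed and
// the failure rate crossed the configured threshold.
func (s *admissionStats) shouldRenewLease() bool {
    total := s.successes + s.failures
    if total < s.minSamples {
        return true // e.g. 1 failure out of 2 requests is not enough signal
    }
    return float64(s.failures)/float64(total) <= s.maxFailure
}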

ialidzhikov (Contributor) commented Jul 12, 2024

Another way of fixing/mitigating this issue would be to contextify the statusUpdater runs:

func (su *Updater) Run(stopCh <-chan struct{}) {
    go func() {
        for {
            select {
            case <-stopCh:
                return
            case <-time.After(su.updateInterval):
                if err := su.client.UpdateStatus(); err != nil {
                    klog.Errorf("Status update by %s failed: %v", su.client.holderIdentity, err)
                }
            }
        }
    }()
}

Right now, there is no timeout/context passed to the statusClient. It runs at every update interval (10s), but a single run can take arbitrarily long until it succeeds or fails. With client-side throttling it only succeeds after minutes, so the lease still gets renewed in the end, just with a huge delay.

The potential fix I am thinking of is to limit the statusUpdater's Run to 10s. With this, during such client-side throttling it won't be able to renew its lease within 10s => the vpa-updater will stop evicting.
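
A rough sketch of what that could look like; passing a context to UpdateStatus is an assumed change to the statusClient's signature, not the current API:

func (su *Updater) Run(stopCh <-chan struct{}) {
    go func() {
        for {
            select {
            case <-stopCh:
                return
            case <-time.After(su.updateInterval):
                // Bound each lease renewal by the update interval, so that a
                // throttled renewal fails fast instead of succeeding minutes later.
                ctx, cancel := context.WithTimeout(context.Background(), su.updateInterval)
                if err := su.client.UpdateStatus(ctx); err != nil {
                    klog.Errorf("Status update by %s failed: %v", su.client.holderIdentity, err)
                }
                cancel()
            }
        }
    }()
}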

WDYT?

ialidzhikov (Contributor) commented:

I created #7036, let me know what you think.

adrianmoisey (Member) commented:

The potential fix I am thinking of is to limit the statusUpdater's Run to 10s. With this, during such client-side throttling it won't be able to renew its lease within 10s => the vpa-updater will stop evicting.

This seems sane

voelzmo (Contributor, Author) commented Jul 15, 2024

/triage accepted

k8s-ci-robot added the triage/accepted label on Jul 15, 2024
adrianmoisey (Member) commented:

/reopen

I could be wrong, but this issue seems to describe a few solutions to the client-side rate limiting issue.
I assume we're not done exploring them all, so I'll reopen.

k8s-ci-robot reopened this on Sep 16, 2024
k8s-ci-robot (Contributor) commented:

@adrianmoisey: Reopened this issue.

In response to this:

/reopen

I could be wrong, but this issue seems to describe a few solutions to the client-side rate limiting issue.
I assume we're not done exploring them all, so I'll reopen.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

voelzmo (Contributor, Author) commented Sep 25, 2024

/retitle Re-work internal health check between vpa-updater and vpa-admission-controller

k8s-ci-robot changed the title from "Client-side rate limiting in admission-controller can lead to endless eviction loops" to "Re-work internal health check between vpa-updater and vpa-admission-controller" on Sep 25, 2024
voelzmo (Contributor, Author) commented Sep 25, 2024

I think the original issue has been mitigated, but the conceptual question still remains: what would be a better way for the vpa-updater to check, before evicting Pods, that the admission-controller can actually do its job of updating the resources?
Thanks @adrianmoisey for re-opening this.
