VPA: Why using targetRef instead of Pod selector? #6925

Open
ialidzhikov opened this issue Jun 14, 2024 · 5 comments
Labels
area/vertical-pod-autoscaler, kind/feature

Comments

@ialidzhikov
Contributor

ialidzhikov commented Jun 14, 2024

How to categorize this issue?
/area vertical-pod-autoscaler
/kind feature

Problem description:
Right now, it is expensive/inefficient in VPA to find the matching VPA for a given Pod (GetMatchingVPA).

For example, the vpa-admission-controller does the following on every Pod mutation (a simplified sketch of the whole flow follows the list):

  1. It fetches all VPAs from the Pod's namespace:
    configs, err := m.vpaLister.VerticalPodAutoscalers(pod.Namespace).List(labels.Everything())
  2. It fetches the selector for every VPA:
    selector, err := m.selectorFetcher.Fetch(vpaConfig)
  3. For a kind that is not well-known, fetching the selector issues a GET request against the /scale subresource to retrieve the selector from there:
    selector, err := f.getLabelSelectorFromResource(groupKind, vpa.Namespace, vpa.Spec.TargetRef.Name)
    and
    scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), groupResource, name, metav1.GetOptions{})
  4. Once it has all VPAs and their selectors, it tries to find the top-most well-known or scalable controller (FindTopMostWellKnownOrScalable) by tracing back the Pod's ownerReferences.
  5. In FindTopMostWellKnownOrScalable, it also GETs the /scale subresource if it encounters a controller that is not well-known. This GET request is cached and the cache is refreshed every 10 minutes; negative results are cached as well.
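
To make the cost concrete, here is a minimal sketch of the whole flow, for illustration only: the lister and fetcher shapes follow the calls quoted above, isOwnedByTarget is a hypothetical stand-in for the ownerReference walk done by FindTopMostWellKnownOrScalable, and the import paths are approximations of the upstream packages, not the real admission-controller code.

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"

    vpa_types "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
    vpa_lister "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/client/listers/autoscaling.k8s.io/v1"
)

// selectorFetcher mirrors the Fetch call from step 2; resolving the targetRef may
// issue a GET against the /scale subresource of the target (step 3).
type selectorFetcher interface {
    Fetch(vpa *vpa_types.VerticalPodAutoscaler) (labels.Selector, error)
}

// getMatchingVPA ties steps 1-5 together for a single Pod mutation.
func getMatchingVPA(
    lister vpa_lister.VerticalPodAutoscalerLister,
    fetcher selectorFetcher,
    isOwnedByTarget func(*corev1.Pod, *vpa_types.VerticalPodAutoscaler) bool, // hypothetical stand-in
    pod *corev1.Pod,
) *vpa_types.VerticalPodAutoscaler {
    // Step 1: list all VPAs in the Pod's namespace (served from the informer cache).
    configs, err := lister.VerticalPodAutoscalers(pod.Namespace).List(labels.Everything())
    if err != nil {
        return nil
    }
    for _, vpaConfig := range configs {
        // Steps 2-3: resolve targetRef -> selector; for a kind that is not
        // well-known this is a live GET against the /scale subresource.
        selector, err := fetcher.Fetch(vpaConfig)
        if err != nil || !selector.Matches(labels.Set(pod.Labels)) {
            continue
        }
        // Steps 4-5: trace the Pod's ownerReferences up to the top-most well-known
        // or scalable controller (possibly another, cached, GET to /scale) and
        // check that it is the controller named in the targetRef.
        if isOwnedByTarget(pod, vpaConfig) {
            return vpaConfig
        }
    }
    return nil
}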

Currently, VPA requires a targetRef and internally resolves the targetRef to a selector; it then verifies that the Pod's labels match the targetRef's selector and that the Pod's top-most well-known scalable controller is the one from the targetRef:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx

It would be much more efficient if VPA instead supported a Pod selector in its spec (as Deployment, Service, NetworkPolicy and other K8s resources do) and only verified that the VPA's selector matches the Pod's labels:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
spec:
  selector:
    app: nginx
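
With a plain selector like this, matching a Pod to a VPA would reduce to a local label comparison over the informer cache, with no calls to the API server on the mutation path. A minimal sketch, assuming the flat selector shown above (the matchesPod helper is hypothetical):

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// matchesPod checks a hypothetical flat spec.selector (as in the example above)
// against a Pod's labels. No /scale lookups and no ownerReference walking needed.
func matchesPod(vpaSelector map[string]string, pod *corev1.Pod) bool {
    return labels.SelectorFromSet(labels.Set(vpaSelector)).Matches(labels.Set(pod.Labels))
}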

What is the benefit of using targetRef over a Pod selector? Why does VPA have to know about the targetRef? Isn't it enough for VPA to be able to find the Pods for a given VPA resource and vice versa?

ialidzhikov added the kind/feature label on Jun 14, 2024
@ejholmes

We ran into this recently as well. We use Argo Rollouts and have many hundreds of Rollout objects. If we enable VPA against these, it quickly gets into a state where it starts getting throttled on API calls to the Rollouts' /scale endpoint, which makes it unusable for us. We see hundreds of:

Waited for 27.39281646s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/argoproj.io/v1alpha1/namespaces/<namespace>/rollouts/<rollout>/scale

@voelzmo
Contributor

voelzmo commented Jun 16, 2024

The question why VPA requires a controller targetRef comes up quite often, but this was a deliberate design decision: I tried to summarize this at some point in #6493 (comment). Let me know if that makes sense.

@ejholmes I guess the easiest solution here would be to raise the client-side rate limits in VPA. The defaults are quite low, so if you really put some usage on VPA, it is absolutely important to raise those limits on all 3 components, or else you will get into trouble.
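
For reference, a sketch of what raising those limits could look like on one of the three components; the --kube-api-qps and --kube-api-burst flags are the upstream knobs, the container/Deployment names follow the default VPA manifests, and the values below are arbitrary examples, not recommendations:

# Container args for the vpa-admission-controller Deployment; the recommender
# and updater Deployments take the same two flags.
containers:
  - name: admission-controller
    args:
      - --kube-api-qps=50    # defaults are quite low, as mentioned above
      - --kube-api-burst=100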

@ialidzhikov
Contributor Author

@ejholmes, with regard to #6925 (comment): we also ran into this recently. @voelzmo also created #6884, describing that the current safeguard mechanism between the vpa-updater and vpa-admission-controller, based on the lease resource, does not cover each and every case that can lead to endless Pod evictions by the vpa-updater.

@ialidzhikov
Contributor Author

The question why VPA requires a controller targetRef comes up quite often, but this was a deliberate design decision: I tried to summarize this at some point in #6493 (comment). Let me know if that makes sense.

I am not sure the UX/Safety and Convenience reasons are justified. The Service resource uses a selector to select the Pods it routes traffic to. In the end, a core component like VPA has to be efficient and scalable. An operation like "find the matching VPA for a given Pod" currently consists of up to 2 GET requests to the /scale subresource (one of which caches its response) and tracing the parent controllers back over ownerReferences. VPA also relies on an implicit "contract": tracing back ownerReferences from the Pod to a higher-level controller to find out whether the Pod is associated with the targetRef of the VPA.

I noticed recently that earlier versions of the API were based on a selector, and a newer version changed it to targetRef.

I didn't dig into the details of the exact motivation. Maybe there was an intention to align the HPA and VPA APIs on targetRef. But HPA's architecture is suited to such a targetRef approach: it is a single control loop in KCM and does not need to find a matching HPA for a given Pod. Given the specifics of VPA's architecture and components (the resource update mechanism through eviction and webhook mutation), the targetRef approach is not very suitable or scalable.

@ejholmes

👍 to the above. I've also opened #7024 to add better support for non-standard controllers. Tweaking the kube client QPS settings helped us, but unfortunately wasn't enough for us to be comfortable putting VPA in production.

I understand why targetRef was chosen, but it would still be nice if users could drop down to a raw selector when needed.
