
Add VPA for admission-gcp deployment #141

Closed
ialidzhikov wants to merge 1 commit

Conversation

ialidzhikov (Member)

How to categorize this PR?

/area auto-scaling
/kind task
/priority normal
/platform gcp

What this PR does / why we need it:
With the rollout of provider-gcp@v1.8.2, on some of the large landscapes we observed that the admission-gcp Pod was OOMKilled several times (with the memory limit set to 200Mi). Probably admission-gcp now requires more memory after the introduction of #112 (new informers for Secrets and SecretBindings were added together with a new webhook endpoint). I guess we could use a VerticalPodAutoscaler to minimize manual request/limit adjustments in the future.
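
For illustration, a minimal sketch of what such a VPA object could look like — the object/Deployment names, namespace, and API version here are assumptions, not the actual chart template from this PR:

```yaml
# Sketch only: names, namespace, and API version are assumptions,
# not taken from the actual admission-gcp chart template.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gardener-extension-admission-gcp-vpa
  namespace: garden
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gardener-extension-admission-gcp
  updatePolicy:
    updateMode: Auto   # let VPA apply its recommendations by evicting and resizing the pod
```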

Release note:

The `admission-gcp` chart now includes a VerticalPodAutoscaler for the webhook deployment.

Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>
ialidzhikov requested review from a team as code owners on July 25, 2020 18:30
gardener-robot added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related, kind/task General task, platform/gcp Google cloud platform/infrastructure and priority/normal labels on Jul 25, 2020
gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label on Jul 25, 2020
gardener-robot-ci-2 added the needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label and removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label on Jul 25, 2020
timuthy (Member) commented Jul 27, 2020

probably admission-gcp now requires more memory after the introduction of #112

Independent of the changes introduced by this PR, can you please double-check the impact of #112? I'm a bit disturbed that a "simple" webhook requires that much memory. WDYT?

rfranzke (Member) left a comment

/lgtm
/assign @timuthy

gardener-robot added the reviewed/lgtm Has approval for merging label on Jul 27, 2020
timuthy (Member) commented Jul 27, 2020

Tbh, I'd prefer to know the root cause of the higher memory consumption before we merge this PR. If the memory allocation was a spike because many requests hit the API server, would the VPA even help here? In our setup we run the recommender with recommender-interval=1m0s.
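
For context, recommender-interval is a command-line flag of the upstream VPA recommender; a hedged fragment of how it is typically set (only the flag value is taken from the comment above, the surrounding container layout is purely illustrative):

```yaml
# Illustrative fragment of a vpa-recommender container spec; layout assumed.
containers:
- name: recommender
  args:
  - --recommender-interval=1m0s   # how often the recommender's main loop runs
```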

rfranzke (Member)

Agreed that it would be helpful to know what causes the memory consumption, but IMO it's not a prerequisite for merging this PR. Putting our components under auto-scaling makes sense in general. Do you agree?

timebertt (Member)

probably admission-gcp now requires more memory after the introduction of #112

Independent of the changes introduced by this PR, can you please double-check the impact of #112? I'm a bit disturbed that a "simple" webhook requires that much memory. WDYT?

I think with #112, admission-gcp watches Shoots, Secrets and SecretBindings, which it wasn't doing before.
And because controller-runtime watches are not filtered, it will watch all instances of those resources, which can obviously be quite a lot in large Gardener installations.
It would be nice to be able to filter for a specific spec.provider.type, though (ref kubernetes-sigs/controller-runtime#244).

FMPOV this explains why the memory usage then sometimes exceeds 200Mi.
WDYT?

timuthy (Member) commented Jul 27, 2020

Putting our components under auto-scaling makes sense in general. Do you agree?

Not if we do it by default and just hope that the OOM issue above goes away.

I'm mainly worried about downscaling actions performed by the VPA. In operations, we observed cases in which proper upscaling didn't happen after a downscale, and in order to recover we had to delete the VPA object. The admission component directly affects the availability of the Shoot API, so I'm a bit critical here. In addition, restarting the admission component isn't cheap anymore since the introduction of #112 and the cache syncs that come with it. On our busiest landscapes I observed start-up times of ~2 minutes.

What I'm trying to say is that adding VPA comes at a price:

  • Only add the VPA if it actually helps to improve our situation. I double-checked @tim-ebert's statement and can partly confirm it, but I think in our case it's more related to getting Secrets via the controller-runtime client, which is backed by a cache, i.e. all Secrets are synced into the cache.
  • We should think about the issue mentioned above: either set minAllowed to 200Mi (see the sketch below) or increase the replica count as a safety buffer.
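
A hedged illustration of the first option, a minAllowed floor in the VPA's resource policy so the recommendation never drops below the previous static limit (the wildcard container policy is an assumption about how one would apply it here):

```yaml
# Fragment of a VerticalPodAutoscaler spec; illustration only.
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: '*'    # apply to all containers of the admission deployment
      minAllowed:
        memory: 200Mi       # never recommend/scale below the previous static limit
```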

timuthy (Member) commented Jul 28, 2020

@ialidzhikov let's increase the default replica count to 3 and decrease the memory footprint in another PR via #143.
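
A hypothetical values override illustrating that interim mitigation; the key name is an assumption about the chart's values file, not something confirmed in this thread:

```yaml
# Hypothetical admission-gcp chart values; actual key names may differ.
replicaCount: 3   # safety buffer so a restarting/OOMKilled pod does not take the webhook down
```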

ialidzhikov (Member, Author)

/close

ialidzhikov deleted the enh/webhook-vpa branch on August 25, 2020 20:05
gardener-robot added the priority/3 Priority (lower number equals higher priority) label on Mar 8, 2021