
Add VPA for admission-gcp deployment #141

Closed
ialidzhikov wants to merge 1 commit

Conversation

ialidzhikov (Member)

How to categorize this PR?

/area auto-scaling
/kind task
/priority normal
/platform gcp

What this PR does / why we need it:
With the rollout of provider-gcp@v1.8.2, on some of the large landscapes we observed that the admission-gcp Pod was OOMKilled several times (with the memory limit set to 200Mi). Probably admission-gcp now requires more memory after the introduction of #112 (new informers for Secrets and SecretBindings were added together with a new webhook endpoint). I guess we could use a VerticalPodAutoscaler to minimize manual request/limit adjustments in the future.
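
For illustration, a minimal sketch of what such a VPA object could look like — the object/Deployment names, namespace, and API version here are assumptions, not the actual chart template from this PR:

```yaml
# Sketch only: names, namespace, and API version are assumptions,
# not taken from the actual admission-gcp chart template.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gardener-extension-admission-gcp-vpa
  namespace: garden
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gardener-extension-admission-gcp
  updatePolicy:
    updateMode: Auto   # let VPA apply its recommendations by evicting and resizing the pod
```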

Release note:

The `admission-gcp` chart now includes a VerticalPodAutoscaler for the webhook deployment.

Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>
ialidzhikov requested review from a team as code owners on July 25, 2020 18:30
gardener-robot added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related, kind/task General task, platform/gcp Google cloud platform/infrastructure and priority/normal labels on Jul 25, 2020
gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label on Jul 25, 2020
gardener-robot-ci-2 added the needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label and removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label on Jul 25, 2020
timuthy (Member) commented Jul 27, 2020

probably admission-gcp now requires more memory after the introduction of #112

Independent of the changes introduced by this PR, can you please double-check the impact of #112? I'm a bit disturbed that a "simple" webhook requires that much memory. WDYT?

rfranzke (Member) left a comment

/lgtm
/assign @timuthy

gardener-robot added the reviewed/lgtm Has approval for merging label on Jul 27, 2020
timuthy (Member) commented Jul 27, 2020

Tbh, I'd prefer to know the root cause of the higher memory consumption before we merge this PR. If the memory allocation was a spike because many requests hit the API server, would the VPA even help here? In our setup we run the recommender with recommender-interval=1m0s.
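
For context, recommender-interval is a command-line flag of the upstream VPA recommender; a hedged fragment of how it is typically set (only the flag value is taken from the comment above, the surrounding container layout is purely illustrative):

```yaml
# Illustrative fragment of a vpa-recommender container spec; layout assumed.
containers:
- name: recommender
  args:
  - --recommender-interval=1m0s   # how often the recommender's main loop runs
```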

rfranzke (Member)

Agreed that it would be helpful to know what causes the memory consumption, but IMO it's not a prerequisite for merging this PR. Putting our components under auto-scaling makes sense in general. Do you agree?

timebertt (Member)

probably admission-gcp now requires more memory after the introduction of #112

Independent of the changes introduced by this PR, can you please double-check the impact of #112? I'm a bit disturbed that a "simple" webhook requires that much memory. WDYT?

I think with #112, admission-gcp watches Shoots, Secrets and SecretBindings, which it wasn't doing before.
And because controller-runtime watches are not filtered, it will watch all instances of those resources, which can obviously be quite a lot in large Gardener installations.
It would be nice to be able to filter for a specific spec.provider.type, though (ref kubernetes-sigs/controller-runtime#244).

FMPOV this explains why the memory usage then sometimes exceeds 200Mi.
WDYT?

timuthy (Member) commented Jul 27, 2020

Putting our components under auto-scaling makes sense in general. Do you agree?

Not if we do it by default and just hope that the OOM issue above goes away.

I'm mainly worried about downscaling actions performed by the VPA. In operations, we observed cases in which proper upscaling didn't happen after a downscale, and in order to recover we had to delete the VPA object. The admission component directly affects the availability of the Shoot API, so I'm a bit critical here. In addition, restarting the admission component isn't cheap anymore since the introduction of #112 and the cache syncs that come with it. On our busiest landscapes I observed start-up times of ~2 minutes.

What I'm trying to say is that adding VPA comes at a price:

  • Only add the VPA if it actually helps to improve our situation. I double-checked @tim-ebert's statement and can partly confirm it, but I think in our case it's more related to getting Secrets via the controller-runtime client, which is backed by a cache, i.e. all Secrets are synced into the cache.
  • We should think about the issue mentioned above: either set minAllowed to 200Mi (see the sketch below) or increase the replica count as a safety buffer.
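
A hedged illustration of the first option, a minAllowed floor in the VPA's resource policy so the recommendation never drops below the previous static limit (the wildcard container policy is an assumption about how one would apply it here):

```yaml
# Fragment of a VerticalPodAutoscaler spec; illustration only.
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: '*'    # apply to all containers of the admission deployment
      minAllowed:
        memory: 200Mi       # never recommend/scale below the previous static limit
```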

timuthy (Member) commented Jul 28, 2020

@ialidzhikov let's increase the default replica count to 3 and decrease the memory footprint in another PR via #143.
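
A hypothetical values override illustrating that interim mitigation; the key name is an assumption about the chart's values file, not something confirmed in this thread:

```yaml
# Hypothetical admission-gcp chart values; actual key names may differ.
replicaCount: 3   # safety buffer so a restarting/OOMKilled pod does not take the webhook down
```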

ialidzhikov (Member, Author)

/close

ialidzhikov deleted the enh/webhook-vpa branch on August 25, 2020 20:05
gardener-robot added the priority/3 Priority (lower number equals higher priority) label on Mar 8, 2021