Add VPA for admission-gcp deployment #141
Conversation
Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>
/lgtm
/assign @timuthy
Tbh, I prefer to know the root cause of the higher memory consumption before we merge this PR. If the memory allocation was a spike because many requests hit the API server, would the VPA even help here? In our setup we have …
Agreed that it would be helpful to know what causes the memory consumption, but IMO it's not a prerequisite for merging this PR. Putting our components under auto-scaling makes sense in general. Do you agree?
I think, with #112, FMPOV this explains why the memory usage is then sometimes exceeding …
Not if we do it by default and hope that the OOM issue above resolves itself. I'm mainly worried about downscaling actions performed by VPA. In operations, we observed cases in which a proper upscaling didn't happen after a downscaling, and in order to recover we needed to delete the VPA object. The admission component directly affects the availability of the Shoot API, so I'm a bit critical here. In addition, restarting the admission component isn't cheap any more since the introduction of #112 and the cache syncs. On our busiest landscapes I observed start-up times of ~2 mins. What I'm trying to say is that adding VPA comes at a price: …
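One way to soften the downscaling risk (just a sketch of the relevant part of a VPA spec, not necessarily what this PR does) would be to pin a lower bound and, if needed, start in recommendation-only mode; the values below are illustrative assumptions:

```yaml
# Fragment of a VerticalPodAutoscaler spec; values are illustrative assumptions
spec:
  updatePolicy:
    updateMode: "Off"          # recommendations only, no evictions (could later be switched to Auto)
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        memory: 200Mi          # never recommend less than the current 200Mi limit
```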
@ialidzhikov let's increase the default replica count to …
/close
How to categorize this PR?
/area auto-scaling
/kind task
/priority normal
/platform gcp
What this PR does / why we need it:
With the rollout of provider-gcp@v1.8.2, in some of the large landscapes we observed that the admission-gcp Pod was OOMKilled several times (with memory limit set to 200Mi) - probably admission-gcp now requires more memory after the introduction of #112 (new informers for Secrets and SecretBindings are added with a new webhook endpoint). I guess we could use a VerticalPodAutoscaler to minimize the manual request/limit adjustments in future.

Release note:
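For reference, a VerticalPodAutoscaler targeting the admission-gcp deployment could look roughly like the sketch below. The object name, namespace, and bounds are assumptions for illustration and may differ from the manifest this PR actually adds:

```yaml
# Sketch only; names, namespace, and bounds are assumptions, not this PR's actual manifest
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gardener-extension-admission-gcp-vpa   # hypothetical name
  namespace: garden                            # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gardener-extension-admission-gcp     # hypothetical deployment name
  updatePolicy:
    updateMode: Auto        # let VPA adjust requests and evict pods to apply them
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 50m
        memory: 100Mi
      maxAllowed:
        memory: 1Gi
```

Note that with updateMode: Auto the updater may evict the pod to apply new recommendations, which is exactly the restart cost discussed in the conversation above.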