[Design] Dynamic resource usage for GMP operator #793

Open
pintohutch opened this issue Jan 29, 2024 · 6 comments

@pintohutch
Collaborator

Can we find ways to avoid OOM crashes in the gmp-operator? Maybe using a VPA?

Acceptance criteria:

  • Proposal with design and trade-offs
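For reference, a minimal VPA sketch along these lines might look like the following (the target name/namespace assume the default gmp-system install, the resource bounds are illustrative, and the cluster needs the VPA components installed):

```yaml
# Hedged sketch: a VPA for the gmp-operator Deployment. Target name and
# namespace assume the default gmp-system install; adjust to your cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gmp-operator
  updatePolicy:
    updateMode: "Auto"            # recreate pods with recommended requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        minAllowed:
          memory: 32Mi            # illustrative floor
        maxAllowed:
          memory: 1Gi             # illustrative ceiling
```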
@clearclaw

This weekend gmp-operator started crashlooping with "no endpoints available for service gmp-operator". No OOMKill events were surfaced, but an OOMKill turned out to be the cause. Putting in a VPA addressed it.

Previous issue with rules-evaluator

  • ALL deploys were broken due to gmp-operator's webhook being dead.
  • Really bad visibility on the OOMKill. (I don't know why scheduler events weren't propagating, but they weren't.)
  • As with rules-evaluator: VPA FTW.

I see that the rules-evaluator VPA hasn't moved either. I can live without VPAs, but having the webhook fail closed for gmp-operator was excruciating.

@pintohutch
Collaborator Author

Thanks for reporting @clearclaw - I apologize for the frustration.

With the operator being such a mission-critical binary in managed-collection, we can also explore failing open for our webhooks. Would that have helped in this situation over a VPA specifically?
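Failing open would roughly mean the operator's admission webhooks carry failurePolicy: Ignore instead of Fail, so unrelated deploys aren't blocked while the operator is down. An illustrative sketch (not the operator's actual configuration):

```yaml
# Illustrative only: the names, service, and rules below are assumptions,
# not the exact configuration gmp-operator installs. The key line is
# failurePolicy: Ignore, which admits requests when the webhook backend is
# unreachable (the default, Fail, fails closed).
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gmp-operator-example                     # assumed name
webhooks:
  - name: validate.podmonitorings.example.com    # assumed name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore                        # fail open instead of fail closed
    clientConfig:
      service:
        name: gmp-operator                       # assumed service name
        namespace: gmp-system
        path: /validate/monitoring.googleapis.com/v1/podmonitorings  # assumed path
        port: 443
    rules:
      - apiGroups: ["monitoring.googleapis.com"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["podmonitorings"]
```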

@clearclaw

clearclaw commented Aug 19, 2024 via email

@clearclaw

clearclaw commented Aug 19, 2024

Supplemental in case useful -- heading into and during the failures:
[screenshot of resource metrics attached]

@bwplotka
Collaborator

+1, some way of handling this would be important, let's prioritize it.

VPA/limits is one thing, but perhaps we should additionally prioritize a code-optimization pass to check for low-hanging fruit and start queueing work (trading latency for memory).

Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator here, rules-evaluator previously, but there are more components). If the observability stack fails, all the alarms should be going off. It would be really nice to have default Rules objects for all of GMP. (Why are there no PodMonitoring objects for the GMP processes?)

For managed collection we don't do this, because we have our own alerting pipeline (though it's oriented toward fleet-wide and unknown situations). On top of that, we historically didn't deploy those self-observability resources by default, since everyone on GKE would have to pay for the extra metrics. We do ship some example configuration anyone can deploy (on GKE Standard), though it's missing the operator.

To sum up, we have a bit to improve here; thanks for the report and the ideas. Perhaps it's time to add that self-monitoring feature/option to OperatorConfig, though likely opt-in.
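For illustration, such an opt-in alert could be expressed as a Rules object roughly like this (a hedged sketch: the job label value depends on how the operator ends up being scraped, and the expression/thresholds are placeholders):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gmp-operator-self-monitoring   # illustrative name
  namespace: gmp-system
spec:
  groups:
    - name: gmp-operator
      interval: 30s
      rules:
        - alert: GmpOperatorDown
          # Assumes the operator is scraped via a PodMonitoring named
          # "gmp-operator"; adjust the job matcher to the actual scrape config.
          expr: up{job="gmp-operator"} == 0 or absent(up{job="gmp-operator"})
          for: 5m
          labels:
            severity: critical
          annotations:
            description: gmp-operator has not been scraped successfully for 5 minutes (possible OOM crashloop).
```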

@clearclaw

clearclaw commented Aug 20, 2024 via email

pintohutch added a commit that referenced this issue Aug 22, 2024
The gmp-operator exposes Prometheus metrics that can be helpful to debug
issues with managed-collection. This is also the case with the managed
alertmanager.

We have examples of how to scrape metrics from other components, but
not the operator or the alertmanager.

So here we provide examples to supplement the self-monitoring exporter
documented in
https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/prometheus?hl=en.

Partly addresses #793.

Signed-off-by: Danny Clark <danielclark@google.com>
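A sketch of what such an operator scrape config could look like (the pod label and port below are assumptions rather than what the committed example necessarily uses; check the gmp-operator Deployment in gmp-system for the actual values):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  selector:
    matchLabels:
      app: managed-prometheus-operator   # assumed pod label
  endpoints:
    - port: metrics                      # assumed metrics port name
      interval: 30s
```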
bwplotka added a commit that referenced this issue Aug 22, 2024