[Design] Dynamic resource usage for GMP operator #793

Open
pintohutch opened this issue Jan 29, 2024 · 6 comments

@pintohutch
Collaborator

Can we find ways to avoid OOM crashes in the gmp-operator? Maybe using a VPA?

Acceptance criteria:

  • Proposal with design and trade-offs
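For reference, a minimal VPA sketch along these lines might look like the following (the target name/namespace assume the default gmp-system install, the resource bounds are illustrative, and the cluster needs the VPA components installed):

```yaml
# Hedged sketch: a VPA for the gmp-operator Deployment. Target name and
# namespace assume the default gmp-system install; adjust to your cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gmp-operator
  updatePolicy:
    updateMode: "Auto"            # recreate pods with recommended requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        minAllowed:
          memory: 32Mi            # illustrative floor
        maxAllowed:
          memory: 1Gi             # illustrative ceiling
```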
@clearclaw

This weekend gmp-operator started crashlooping with "no endpoints available for service gmp-operator". No OOMKill events were surfaced, but an OOMKill turned out to be the cause. Putting in a VPA addressed it.

Previous issue with rules-evaluator

  • ALL deploys were broken due to gmp-operator's webhook being dead.
  • Really bad visibility on the OOMKill. (I don't know why scheduler events weren't propagating, but they weren't.)
  • As with rules-evaluator: VPA FTW.

I see that the rules-evaluator VPA hasn't moved either. I can live without VPAs, but having the webhook fail closed for gmp-operator was excruciating.

@pintohutch
Collaborator Author

Thanks for reporting @clearclaw - I apologize for the frustration.

With the operator being such a mission-critical binary in managed-collection, we can also explore failing open for our webhooks. Would that have helped in this situation over a VPA specifically?
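Failing open would roughly mean the operator's admission webhooks carry failurePolicy: Ignore instead of Fail, so unrelated deploys aren't blocked while the operator is down. An illustrative sketch (not the operator's actual configuration):

```yaml
# Illustrative only: the names, service, and rules below are assumptions,
# not the exact configuration gmp-operator installs. The key line is
# failurePolicy: Ignore, which admits requests when the webhook backend is
# unreachable (the default, Fail, fails closed).
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gmp-operator-example                     # assumed name
webhooks:
  - name: validate.podmonitorings.example.com    # assumed name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore                        # fail open instead of fail closed
    clientConfig:
      service:
        name: gmp-operator                       # assumed service name
        namespace: gmp-system
        path: /validate/monitoring.googleapis.com/v1/podmonitorings  # assumed path
        port: 443
    rules:
      - apiGroups: ["monitoring.googleapis.com"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["podmonitorings"]
```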

@clearclaw

clearclaw commented Aug 19, 2024 via email

@clearclaw

clearclaw commented Aug 19, 2024

Supplemental in case useful -- heading into and during the failures:
[screenshot of resource metrics attached]

@bwplotka
Collaborator

+1, some way of handling this would be important, let's prioritize it.

VPA/limits is one thing, but perhaps we should additionally prioritize a code-optimization pass to check for low-hanging fruit and start queueing work (trading latency for memory).

Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator here, rules-evaluator previously, but there are more components). If the observability stack fails, all the alarms should be going off. It would be really nice to have default Rules objects for all of GMP. (Why are there no PodMonitoring objects for the GMP processes?)

For managed collection we don't do this, because we have our own alerting pipeline (though it's oriented toward fleet-wide and unknown situations). On top of that, we historically didn't deploy those self-observability resources by default, since everyone on GKE would have to pay for the extra metrics. We do ship some example configuration anyone can deploy (on GKE Standard), though it's missing the operator.

To sum up, we have a bit to improve here; thanks for the report and the ideas. Perhaps it's time to add that self-monitoring feature/option to OperatorConfig, though likely opt-in.
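For illustration, such an opt-in alert could be expressed as a Rules object roughly like this (a hedged sketch: the job label value depends on how the operator ends up being scraped, and the expression/thresholds are placeholders):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gmp-operator-self-monitoring   # illustrative name
  namespace: gmp-system
spec:
  groups:
    - name: gmp-operator
      interval: 30s
      rules:
        - alert: GmpOperatorDown
          # Assumes the operator is scraped via a PodMonitoring named
          # "gmp-operator"; adjust the job matcher to the actual scrape config.
          expr: up{job="gmp-operator"} == 0 or absent(up{job="gmp-operator"})
          for: 5m
          labels:
            severity: critical
          annotations:
            description: gmp-operator has not been scraped successfully for 5 minutes (possible OOM crashloop).
```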

@clearclaw

clearclaw commented Aug 20, 2024 via email

pintohutch added a commit that referenced this issue Aug 22, 2024
The gmp-operator exposes Prometheus metrics that can be helpful to debug
issues with managed-collection. This is also the case with the managed
alertmanager.

We have examples of how to scrape metrics from other components, but
not the operator or the alertmanager.

So here we provide examples to supplement the self-monitoring exporter
documented in
https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/prometheus?hl=en.

Partly addresses #793.

Signed-off-by: Danny Clark <danielclark@google.com>
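A sketch of what such an operator scrape config could look like (the pod label and port below are assumptions rather than what the committed example necessarily uses; check the gmp-operator Deployment in gmp-system for the actual values):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  selector:
    matchLabels:
      app: managed-prometheus-operator   # assumed pod label
  endpoints:
    - port: metrics                      # assumed metrics port name
      interval: 30s
```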
bwplotka added a commit that referenced this issue Aug 22, 2024