fix: Increase otel-agent CPU request to 20m #783

Closed


karlkfi
Contributor

@karlkfi karlkfi commented Jul 31, 2023

On Autopilot, the CPU limits are being set equal to the request values, which has been causing the otel-agent to fail liveness probes and get restarted. Increasing the otel-agent CPU request to 20m should help prevent that.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi karlkfi requested review from sdowell, janetkuo and nan-yu and removed request for victorpras and haiyanmeng July 31, 2023 17:01
@janetkuo
Contributor

The cpu request hasn't changed for a few releases. Did we introduce something new in this release that causes the need to increase cpu request/limit on Autopilot?

@karlkfi
Contributor Author

karlkfi commented Jul 31, 2023

> The cpu request hasn't changed for a few releases. Did we introduce something new in this release that causes the need to increase cpu request/limit on Autopilot?

I don't know. All I know is that it's failing on autopilot-rapid, because Autopilot is setting CPULimit = CPURequest:
https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#resource-limits
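
For illustration, here is a minimal Go sketch of the behavior described above, where the CPU limit ends up equal to the CPU request. The `Resources` type and `applyAutopilotCPULimit` function are hypothetical names made up for this example; they are not Kubernetes or GKE APIs, and the 1000m starting limit is an assumed value, not taken from the otel-agent manifest.

```go
package main

import "fmt"

// Resources is a simplified, hypothetical stand-in for a container's CPU
// settings, expressed in millicores (20 means "20m").
// It is not a Kubernetes API type.
type Resources struct {
	RequestMilliCPU int
	LimitMilliCPU   int
}

// applyAutopilotCPULimit models the effect reported in this thread:
// whatever limit the manifest asked for, the limit becomes the request.
// This is a sketch of the observed outcome, not GKE's implementation.
func applyAutopilotCPULimit(r Resources) Resources {
	r.LimitMilliCPU = r.RequestMilliCPU
	return r
}

func main() {
	// Assumed input: a 20m request with a larger, made-up limit.
	in := Resources{RequestMilliCPU: 20, LimitMilliCPU: 1000}
	out := applyAutopilotCPULimit(in)
	fmt.Printf("request=%dm limit=%dm\n", out.RequestMilliCPU, out.LimitMilliCPU)
}
```

Under this model the request is the only lever for CPU headroom, which is why the PR raises the request rather than the limit.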

@karlkfi
Contributor Author

karlkfi commented Jul 31, 2023

PR #763 may have increased the resource usage of the otel-agent somewhat.

The generation metrics label isn't going to Cloud Monitoring, but it does go through the otel-agent and otel-collector.

@janetkuo
Contributor

Given that this number will multiply by the number of RSyncs in the cluster, I'd like to avoid increasing it if possible (or keep it to a minimum). Here's our current number: https://cloud.google.com/anthos-config-management/docs/how-to/installing-config-sync#detailed_resource_requests
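
As a rough sketch of the scaling concern above, the aggregate CPU request grows linearly with the number of RSyncs. This assumes one otel-agent container per RSync with an identical request; the function name and the RSync counts are hypothetical, and only the 20m figure comes from this PR's title.

```go
package main

import "fmt"

// totalOtelAgentMilliCPU returns the aggregate CPU request in millicores,
// assuming every RSync runs one otel-agent with the same request.
func totalOtelAgentMilliCPU(perAgentMilliCPU, rsyncCount int) int {
	return perAgentMilliCPU * rsyncCount
}

func main() {
	const proposed = 20 // millicores, the value proposed in this PR

	// Hypothetical cluster sizes, chosen only to show the linear growth.
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("%d RSyncs -> %dm total\n", n, totalOtelAgentMilliCPU(proposed, n))
	}
}
```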

@karlkfi
Contributor Author

karlkfi commented Aug 30, 2023

I think risking breakage to keep requirements low is a bad trade-off, but now that the liveness probes are removed, it at least won't cause endless restarts.

@karlkfi karlkfi closed this Aug 30, 2023