Calico : Increase CPU limit to prevent throttling #8076
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @olevitt!
Hi @olevitt. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/check-cla
Why not just remove the CPU limit?
On production clusters this is not really a good idea. We run OPA and prevent pods without limits and requests from running, so I would strongly advise just documenting the situations where increasing the limit makes sense and leaving the current default in place. A deployer can always set the ansible variable in their local inventory to whatever makes sense for their environment.
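For illustration, a minimal sketch of such an inventory override, assuming kubespray exposes variables along the lines of `calico_node_cpu_limit` and `calico_node_cpu_requests` (check the defaults of your kubespray version for the exact names and file layout):

```yaml
# inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml
# Illustrative values only; variable names are assumed from the calico role
# defaults and should be verified against your kubespray release.
calico_node_cpu_requests: 150m   # what the scheduler reserves for calico-node
calico_node_cpu_limit: 1000m     # the throttling ceiling, raised per environment
calico_node_memory_requests: 64M
calico_node_memory_limit: 500M
```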
I totally agree, running without limits is not secure at all. Higher limits are better.
I'm just a sample of one, but I've never been saved by limits, and I have wasted hours troubleshooting weird issues where it was just the process being throttled for tens of seconds. For most multi-threaded software it's just impossible to estimate what the limit should be: maybe your workload is stable under normal conditions but needs 20x the CPU when the network is unstable. If the worker has cycles available, I don't see why we should throttle.
A good high-CPU-throttling alerting system is exactly the reason why #8056 was opened; I didn't have to troubleshoot for hours. Good CPU limits can be set with a reasonable analysis of CPU usage and the corresponding throttling; after all, these CPU limits have been in place for years without issues.
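As an aside, a minimal sketch of such a throttling alert, assuming cAdvisor's `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` metrics are scraped by Prometheus; the alert name, threshold, and labels are illustrative, not taken from this PR or from #8056:

```yaml
# Hypothetical Prometheus rule: fire when calico-node spends more than 25%
# of its CFS periods throttled over 15 minutes.
groups:
  - name: cpu-throttling
    rules:
      - alert: CalicoNodeCPUThrottlingHigh
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_cfs_throttled_periods_total{container="calico-node"}[5m])
          )
          /
          sum by (namespace, pod) (
            rate(container_cpu_cfs_periods_total{container="calico-node"}[5m])
          ) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "calico-node is heavily CPU throttled; consider raising its CPU limit."
```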
With good alerting you might be able to fix the issue quickly, but low limits can still create a self-inflicted outage where the process starts to be throttled, needs to do more work to catch up, and so gets throttled even more...
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: floryut, olevitt. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Thanks for accepting the PR. |
My use case is a bit different: we deploy tens of clusters at all our customers to run our software on top. Of course the hardware is never the same, so limits that worked for MetalLB at every other customer ended up being too small on one particular system. We started by investigating the network, and at first we missed that the ARP responses were delayed, so it took us some time to find the issue.
Hi,
As reported in #8056, Calico pods seem to struggle under low CPU limits. We are experiencing about 25% throttling under normal usage.
This PR may serve as a discussion on whether we should increase the default CPU limit defined in kubespray (currently 100m) and what the new default should be. I personally have no idea what a reasonable default should be, as I don't know much about the average kubespray user. In our clusters we randomly chose a limit of 3 and throttling disappeared, but I don't think a number that high should be necessary (even if it's only a limit, not a request).
Sidenote: for anyone wandering here and being as clueless as I was about CPU limits and throttling, this Medium post explains a few basics: https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718
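For context, a purely illustrative sketch of what this looks like on the rendered calico-node container spec, where 300m is a hypothetical example value rather than the default this PR settles on:

```yaml
# Illustrative only: resources on the calico-node container as rendered by
# kubespray's templates. The limit is the CFS quota ceiling; CPU usage above
# it is throttled, which is what #8056 observed with the 100m default.
resources:
  requests:
    cpu: 150m        # reserved by the scheduler when placing the pod
    memory: 64M
  limits:
    cpu: 300m        # hypothetical raised ceiling to avoid throttling
    memory: 500M
```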
Does this PR introduce a user-facing change?: