
Calico : Increase CPU limit to prevent throttling #8076

Merged

Conversation

@olevitt (Contributor) commented Oct 13, 2021

Hi,

As reported in #8056, Calico pods seem to struggle under low CPU limits. We are experiencing about 25% throttling under normal usage.
This PR may serve as a discussion on whether we should increase the default CPU limit defined in kubespray (currently 100m) and what the new default should be. I personally have no idea what a reasonable default should be, as I don't know the average kubespray user. In our clusters we arbitrarily chose a limit of 3 CPUs and the throttling disappeared, but I don't think a number that high should be necessary (even if it's only a limit, not a request).

Sidenote: for anyone wandering in here as clueless about CPU limits and throttling as I was, this Medium post explains the basics: https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718
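
For readers who want to put a number on the throttling figure mentioned above: it can be derived from the CFS counters that cAdvisor exposes through the kubelet. Below is a minimal Prometheus recording-rule sketch, assuming cAdvisor metrics are scraped and that calico-node runs in the kube-system namespace; the rule name and label selectors are illustrative, not part of this PR.

```yaml
groups:
  - name: calico-cpu-throttling
    rules:
      # Fraction of CFS scheduling periods in which calico-node was throttled, per pod.
      - record: calico_node:cfs_throttled_ratio
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{namespace="kube-system", container="calico-node"}[5m])
            /
          rate(container_cpu_cfs_periods_total{namespace="kube-system", container="calico-node"}[5m])
```

A sustained value around 0.25 corresponds to the roughly 25% throttling described in this PR.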

Does this PR introduce a user-facing change?:

[Calico] Increase CPU limit to prevent throttling

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the "cncf-cla: no" label (indicates the PR's author has not signed the CNCF CLA) on Oct 13, 2021
@k8s-ci-robot (Contributor)

Welcome @olevitt!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @olevitt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the "needs-ok-to-test" (requires an org member to verify the PR is safe to test) and "size/XS" (changes 0-9 lines, ignoring generated files) labels on Oct 13, 2021
@k8s-ci-robot requested review from bozzo and EppO on October 13, 2021 09:42
@olevitt (Contributor, Author) commented Oct 13, 2021

/check-cla

@k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) and removed the "cncf-cla: no" label on Oct 13, 2021
@champtar (Contributor)

Why not just remove the CPU limit?

@cristicalin (Contributor)

> Why not just remove the CPU limit?

On production clusters this is not really a good idea. We run OPA and prevent pods without limits and requests from running, so I would strongly advise just documenting the situations where increasing the limit makes sense and leaving the current default in place.

A deployer can always set the Ansible variable in their local inventory to whatever makes sense for their environment.
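
As an illustration of that kind of override, here is a hedged group_vars sketch. The calico_node_* variable names are assumed from the kubespray Calico role defaults, and the file path and values are examples only; check roles/network_plugin/calico/defaults for your kubespray version before copying.

```yaml
# inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml
# Example values only -- tune them from observed CPU usage and throttling in your clusters.
calico_node_cpu_requests: 150m
calico_node_memory_requests: 64M
calico_node_cpu_limit: 1000m
calico_node_memory_limit: 500M
```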

@irizzant (Contributor)

> On production clusters this is not really a good idea

I totally agree; running without limits is not safe at all. Better to have higher limits.

@champtar (Contributor)

I'm just a sample of one, but I've never been saved by limits, and I have wasted hours troubleshooting weird issues where the cause was simply the process being throttled for tens of seconds. For most multi-threaded software it's just impossible to estimate what the limit should be: maybe your workload is stable under normal conditions but needs 20x the CPU when the network is unstable. If the worker has cycles available, I don't see why we should throttle.
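
To put rough numbers on that: assuming the default 100 ms CFS period, a 100m limit translates into a quota of 10 ms of CPU time per period, shared across every thread in the container. A brief burst in which four threads run concurrently burns that quota in about 2.5 ms of wall time, after which the whole process sits throttled for the remaining ~97.5 ms of the period.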

@irizzant (Contributor)

A good alerting system for high CPU throttling is exactly why #8056 was opened; I didn't have to troubleshoot for hours.
The very same troubleshooting hours could instead be spent on a Pod running without limits that eagerly consumes most of your CPUs, leaving no room for other apps to run (maliciously or unintentionally).

Good CPU limits can be set with a reasonable analysis of CPU usage and the corresponding throttling; after all, these CPU limits have been in place for years without issues.
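
As an illustration of that kind of alerting, here is a hedged Prometheus alert sketch built on the same cAdvisor counters as the recording rule above; the 25% threshold, duration, and labels are arbitrary placeholders rather than a recommendation.

```yaml
groups:
  - name: cpu-throttling-alerts
    rules:
      - alert: ContainerCPUThrottledHigh
        # Fires when a container spends more than 25% of its CFS periods throttled for 15 minutes.
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
            /
          rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} is heavily CPU throttled"
```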

@champtar (Contributor)

With good alerting you might be able to fix the issue quickly, but low limits can still create a self-inflicted outage where the process starts to be throttled, has to do more work to catch up, and so gets throttled even more...
Here is some good discussion on the subject: https://news.ycombinator.com/item?id=28352071; you will see that Tim Hockin recommends just turning off CPU limits.
We can merge this as it's still a net improvement.
/lgtm
/ok-to-test
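
For anyone unfamiliar with what "just turn off CPU limits" looks like in a pod spec, here is a minimal illustrative snippet (not the calico-node manifest kubespray ships): the CPU request is kept so scheduling and CPU shares still work, while the CPU limit is omitted so no CFS quota is applied.

```yaml
# Illustrative only -- not the manifest generated by kubespray.
apiVersion: v1
kind: Pod
metadata:
  name: request-only-example
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: 150m        # used for scheduling and relative CPU shares
          memory: 64Mi
        limits:
          memory: 64Mi     # memory limit kept; CPU limit deliberately omitted, so no CFS throttling
```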

@k8s-ci-robot added the "ok-to-test" label (a non-member PR verified by an org member as safe to test) and removed the "needs-ok-to-test" label on Oct 14, 2021
@k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) on Oct 14, 2021
@oomichi (Contributor) commented Oct 14, 2021

/lgtm

@floryut added the "kind/network" label (Network section in the release note) on Oct 14, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: floryut, olevitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the "approved" label (the PR has been approved by an approver from all required OWNERS files) on Oct 14, 2021
@k8s-ci-robot merged commit 7019c26 into kubernetes-sigs:master on Oct 14, 2021
@irizzant (Contributor)

Thanks for accepting the PR.
I usually don't start by running Pods with a 1m CPU limit; I check their behavior while they are still in test, and then we decide on reasonable CPU limits before moving to production. There is no self-inflicted outage with this approach; Pods do not get evicted or restarted anyway.
I have read other articles like the one you suggested. We used to run Pods without CPU limits. Then one day kubelets randomly stopped registering their nodes with the API server, and weird timeout errors appeared when contacting the API servers. I'll let you imagine the amount of time and work it took before we realized that some CI/CD jobs eating CPU (under specific conditions) were the cause of all this.
Not to mention that many security compliance checks nowadays mandate setting these limits.
So I prefer having configurable limits rather than not having them at all: if I misconfigure the limits I get warned in time, before moving to production, and I don't risk surprises.

@champtar (Contributor)

My use case is a bit different: we deploy tens of clusters at our customers' sites to run our software on top. Of course the hardware is never the same, so MetalLB limits that worked at every other customer ended up being too small on one particular system. We started by investigating the network, and at first we missed that the ARP responses were delayed, so it took us some time to find the issue.
