
Calico : Increase CPU limit to prevent throttling #8076

Merged

Conversation

@olevitt (Contributor) commented Oct 13, 2021

Hi,

As reported in #8056, Calico pods seem to struggle under low CPU limits. We are experiencing about 25% throttling under normal usage.
This PR may serve as a discussion on whether we should increase the default CPU limit defined in kubespray (currently 100m) and what the new default should be. I personally have no idea what a reasonable default should be, as I don't know the average kubespray user. In our clusters we arbitrarily chose a limit of 3 CPUs and the throttling disappeared, but I don't think a number that high should be necessary (even if it's only a limit, not a request).

Sidenote: for anyone wandering in here as clueless about CPU limits and throttling as I was, this Medium post explains the basics: https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718
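
For readers who want to put a number on the throttling figure mentioned above: it can be derived from the CFS counters that cAdvisor exposes through the kubelet. Below is a minimal Prometheus recording-rule sketch, assuming cAdvisor metrics are scraped and that calico-node runs in the kube-system namespace; the rule name and label selectors are illustrative, not part of this PR.

```yaml
groups:
  - name: calico-cpu-throttling
    rules:
      # Fraction of CFS scheduling periods in which calico-node was throttled, per pod.
      - record: calico_node:cfs_throttled_ratio
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{namespace="kube-system", container="calico-node"}[5m])
            /
          rate(container_cpu_cfs_periods_total{namespace="kube-system", container="calico-node"}[5m])
```

A sustained value around 0.25 corresponds to the roughly 25% throttling described in this PR.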

Does this PR introduce a user-facing change?:

[Calico] Increase CPU limit to prevent throttling

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the "cncf-cla: no" label (indicates the PR's author has not signed the CNCF CLA) on Oct 13, 2021
@k8s-ci-robot (Contributor)

Welcome @olevitt!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @olevitt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the "needs-ok-to-test" (requires an org member to verify the PR is safe to test) and "size/XS" (changes 0-9 lines, ignoring generated files) labels on Oct 13, 2021
@k8s-ci-robot requested review from bozzo and EppO on October 13, 2021 09:42
@olevitt (Contributor, Author) commented Oct 13, 2021

/check-cla

@k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) and removed the "cncf-cla: no" label on Oct 13, 2021
@champtar (Contributor)

Why not just remove the CPU limit?

@cristicalin (Contributor)

> Why not just remove the CPU limit?

On production clusters this is not really a good idea. We run OPA and prevent pods without limits and requests from running, so I would strongly advise just documenting the situations where increasing the limit makes sense and leaving the current default in place.

A deployer can always set the Ansible variable in their local inventory to whatever makes sense for their environment.
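
As an illustration of that kind of override, here is a hedged group_vars sketch. The calico_node_* variable names are assumed from the kubespray Calico role defaults, and the file path and values are examples only; check roles/network_plugin/calico/defaults for your kubespray version before copying.

```yaml
# inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml
# Example values only -- tune them from observed CPU usage and throttling in your clusters.
calico_node_cpu_requests: 150m
calico_node_memory_requests: 64M
calico_node_cpu_limit: 1000m
calico_node_memory_limit: 500M
```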

@irizzant (Contributor)

> On production clusters this is not really a good idea

I totally agree; running without limits is not safe at all. Better to have higher limits.

@champtar (Contributor)

I'm just a sample of one, but I've never been saved by limits, and I have wasted hours troubleshooting weird issues where the cause was simply the process being throttled for tens of seconds. For most multi-threaded software it's just impossible to estimate what the limit should be: maybe your workload is stable under normal conditions but needs 20x the CPU when the network is unstable. If the worker has cycles available, I don't see why we should throttle.
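
To put rough numbers on that: assuming the default 100 ms CFS period, a 100m limit translates into a quota of 10 ms of CPU time per period, shared across every thread in the container. A brief burst in which four threads run concurrently burns that quota in about 2.5 ms of wall time, after which the whole process sits throttled for the remaining ~97.5 ms of the period.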

@irizzant (Contributor)

A good alerting system for high CPU throttling is exactly why #8056 was opened; I didn't have to troubleshoot for hours.
The very same troubleshooting hours could instead be spent on a Pod running without limits that eagerly consumes most of your CPUs, leaving no room for other apps to run (maliciously or unintentionally).

Good CPU limits can be set with a reasonable analysis of CPU usage and the corresponding throttling; after all, these CPU limits have been in place for years without issues.
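
As an illustration of that kind of alerting, here is a hedged Prometheus alert sketch built on the same cAdvisor counters as the recording rule above; the 25% threshold, duration, and labels are arbitrary placeholders rather than a recommendation.

```yaml
groups:
  - name: cpu-throttling-alerts
    rules:
      - alert: ContainerCPUThrottledHigh
        # Fires when a container spends more than 25% of its CFS periods throttled for 15 minutes.
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
            /
          rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} is heavily CPU throttled"
```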

@champtar (Contributor)

With good alerting you might be able to fix the issue quickly, but low limits can still create a self-inflicted outage where the process starts to be throttled, has to do more work to catch up, and so gets throttled even more...
Here is some good discussion on the subject: https://news.ycombinator.com/item?id=28352071; you will see that Tim Hockin recommends just turning off CPU limits.
We can merge this as it's still a net improvement.
/lgtm
/ok-to-test
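
For anyone unfamiliar with what "just turn off CPU limits" looks like in a pod spec, here is a minimal illustrative snippet (not the calico-node manifest kubespray ships): the CPU request is kept so scheduling and CPU shares still work, while the CPU limit is omitted so no CFS quota is applied.

```yaml
# Illustrative only -- not the manifest generated by kubespray.
apiVersion: v1
kind: Pod
metadata:
  name: request-only-example
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: 150m        # used for scheduling and relative CPU shares
          memory: 64Mi
        limits:
          memory: 64Mi     # memory limit kept; CPU limit deliberately omitted, so no CFS throttling
```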

@k8s-ci-robot added the "ok-to-test" label (a non-member PR verified by an org member as safe to test) and removed the "needs-ok-to-test" label on Oct 14, 2021
@k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) on Oct 14, 2021
@oomichi (Contributor) commented Oct 14, 2021

/lgtm

@floryut added the "kind/network" label (Network section in the release note) on Oct 14, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: floryut, olevitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the "approved" label (the PR has been approved by an approver from all required OWNERS files) on Oct 14, 2021
@k8s-ci-robot merged commit 7019c26 into kubernetes-sigs:master on Oct 14, 2021
@irizzant (Contributor)

Thanks for accepting the PR.
I usually don't start by running Pods with a 1m CPU limit; I check their behavior while they are still in test, and then we decide on reasonable CPU limits before moving to production. There is no self-inflicted outage with this approach; Pods do not get evicted or restarted anyway.
I have read other articles like the one you suggested. We used to run Pods without CPU limits. Then one day kubelets randomly stopped registering their nodes with the API server, and weird timeout errors appeared when contacting the API servers. I'll let you imagine the amount of time and work it took before we realized that some CI/CD jobs eating CPU (under specific conditions) were the cause of all this.
Not to mention that many security compliance checks nowadays mandate setting these limits.
So I prefer having configurable limits rather than not having them at all: if I misconfigure the limits I get warned in time, before moving to production, and I don't risk surprises.

@champtar (Contributor)

My use case is a bit different: we deploy tens of clusters at our customers' sites to run our software on top. Of course the hardware is never the same, so MetalLB limits that worked at every other customer ended up being too small on one particular system. We started by investigating the network, and at first we missed that the ARP responses were delayed, so it took us some time to find the issue.
