Feat: use memory limit from cgroup when set and lower #2699
Conversation
When you run the kubelet in a container and set a memory limit, cAdvisor still uses the system memory. But memory is limited by the cgroup, and the process may be killed if it uses more memory than the limit. The strategy is:
- by default, use the system memory limit
- if there is a cgroup memory limit, read it
- compare the system memory limit and the cgroup memory limit, and take the lower
- return the lower memory limit found
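The comparison described in the strategy above could be sketched roughly as follows. This is a minimal illustration, not cAdvisor's actual code: the function names are made up for this sketch, and it assumes the cgroup v1 path for the memory limit file.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupMemoryLimit reads a cgroup v1 memory limit file, if present.
// It returns (0, false) when no usable limit is configured.
func readCgroupMemoryLimit(path string) (uint64, bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, false // no readable cgroup limit
	}
	limit, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return 0, false
	}
	// The kernel reports a huge sentinel value when no limit is set.
	const noLimitThreshold = uint64(1) << 60
	if limit >= noLimitThreshold {
		return 0, false
	}
	return limit, true
}

// effectiveMemory implements the strategy: start from the system total
// and use the cgroup limit instead only when it is lower.
func effectiveMemory(systemTotal, cgroupLimit uint64, hasCgroupLimit bool) uint64 {
	if hasCgroupLimit && cgroupLimit < systemTotal {
		return cgroupLimit
	}
	return systemTotal
}

func main() {
	system := uint64(32) << 30 // e.g. 32 GiB as reported by /proc/meminfo
	limit, ok := readCgroupMemoryLimit("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	fmt.Println(effectiveMemory(system, limit, ok))
}
```

On a host without that cgroup file the sketch simply falls back to the system total, which matches the "by default use the system memory limit" step.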
Hi @louiznk. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
@louiznk I don't think it's a good idea to modify
See: Let me know if I misunderstand your intention, but for the time being I think that the PR should be closed.
Hello @iwankgb, tell me if I understood your point correctly: if the kubelet is started with some cgroup isolation and a cgroup memory limit, but the rest of the node is not, then the memory reported to the api-server for this node is wrong. My intention is that a cluster running in Docker (kind or k3d, for example) knows the real memory available. If it doesn't, your nodes running in containers could be OOM-killed. I don't know yet how to deal with both of these points; do you have an idea of how to achieve this? (Even if it is not for production, I think it would be great to run Kubernetes in Docker without risk of OOM kill, and it would be a pleasure to help with this.) Out of curiosity, are there many Kubernetes distributions that run the kubelet in a container?
@louiznk at the moment the kubelet sends correct information. With your patch applied it would send wrong information, and the node would effectively be unschedulable. AFAIK there were some attempts at running the kubelet inside a container in the past, but they were abandoned.
@iwankgb perhaps if I create another function called
To be clear, those efforts were running the kubelet within a container relative to the host, i.e. alongside the workload containers. That was indeed abandoned, in part because it makes everything with the filesystem messy. With kind, k3d, etc. the "host" (node, as far as Kubernetes is concerned) is fully "within" a single container as if it were a VM, and itself runs a "nested" container runtime. The kubelet is "not containerized" relative to the pods. It's not something to use in production, but we use it to test Kubernetes itself cheaply, amongst other things. EDIT: Kubernetes was already using kind when that mode of operation was made unsupported and the flag deprecated; that's fine, because that's not what we're doing. We do not use "containerized kubelet" mode. However, everything is in a container.
@louiznk I don't think so, it looks hacky to me, especially as it is supposed to support development environments only. I would rather opt for mounting a carefully crafted /proc/meminfo when necessary.
Thanks @BenTheElder. Finally, it's a little confusing to me. As I understand it, the case explained by @iwankgb will never happen? Thank you.
@louiznk any kubelet running in a container (e.g. one that
Ok, thanks.
I will try to see if it is possible.
@louiznk I like the idea of getting memory or CPU capacity from the Docker container so that we can make more use of
As I understood @iwankgb's point, the risk is when you have a "classic" k8s cluster with the kubelet running in a container (even if this is not common usage): then your cluster will have the wrong node capacities.
@huoqifeng can you help me understand production use cases for such a scenario?
@iwankgb we're seeking to use
@huoqifeng, do you think this idea could solve your problem? #2699 (comment)
@iwankgb as @BenTheElder mentioned,
Isn't this file full of dynamic values? What would this look like? EDIT: Within reason I'd be happy to implement faking this in kind and recommend users rely on that instead, but I'm somewhat skeptical of this approach.
Note that kubelet in a container (relative to the host it is managing) is not a supported configuration in Kubernetes. SIG Node chose to drop this. With kind we're talking about (somewhat abusively, perhaps) pretending a container is the host, with everything about Kubernetes "inside" it. Just containerizing your kubelet is not supported upstream as-is. EDIT: and therefore risk to just the kubelet in a container seems not to be a big concern. [I still don't know what the correct answer is here, just attempting to clarify this aspect.]
Yes, these are dynamic values. I have tried mounting a fake /proc/meminfo and it works; the difficulty is if you need to update this file.
@louiznk @BenTheElder do you think that using
Hello @iwankgb
I think you can use it to affect the amount of memory that is available to pods. As far as I understand, it should help you solve your problem.
Sorry, I just understood that you are talking about the kubelet option. This is an interesting workaround and I think it would work. I don't find it very elegant, because this option was intended for reserving resources for the operating system, not for resizing the resources available to kube, but that is my point of view.
Could we instead have a more direct mechanism to instruct cadvisor what we'd like the resource limits reported as? (So e.g. monitoring tools behave as expected.) I think it would be more generally useful to be able to bypass the detection mechanisms and provide the "real limits". E.g. imagine some bug is encountered: today you have to get all the binaries patched or do something very hacky, because there's no override mechanism.
I still don't think it is a good idea. The only use case that we have is one related to testing, and I don't believe that introducing code aimed at testing project X into project Y is beneficial to project Y, as it increases the project's complexity and badly affects maintainability.
Hello, I think this PR could be closed if you think this wasn't the right implementation, but I think the issue is still there for projects like kind and k3d; that is a point of view, though, not a certainty.
The production use case is bypassing broken detection on the host, which I have experienced before (e.g. the hyperthread snafu).
@BenTheElder ultimately kubelet would have to inform cAdvisor that it should return memory capacity as specified in cgroup rather than
It would require us to:
Instead of that we could alter the behavior of the kubelet (in https://github.com/kubernetes/kubernetes/blob/3321f00ed14e07f774b84d3198ede545c1dee697/pkg/kubelet/kubelet.go#L550) and force it to cache altered
Kubernetes is a large project and an important user of cAdvisor, but I still don't think that leaking kubelet testing logic into cAdvisor is a good idea. We should keep in mind that there are companies and people who use cAdvisor to monitor their infrastructure. For them, cAdvisor reporting its own memory limit as node memory capacity sounds like a nightmare, I think. By the way, can you elaborate more on the hyperthread snafu? Sounds interesting.
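The kubelet-side alternative being discussed here, caching an altered machine info, might look roughly like this. Purely illustrative: MachineInfo below is a stand-in struct showing only one field, not the real cAdvisor type, and capToCgroupLimit is a made-up name.

```go
package main

import "fmt"

// MachineInfo is a stand-in for cAdvisor's machine info struct;
// only the field relevant to this discussion is shown.
type MachineInfo struct {
	MemoryCapacity uint64
}

// capToCgroupLimit returns a copy of the cached MachineInfo with
// MemoryCapacity lowered to the kubelet's own cgroup limit when that
// limit is set and smaller. This would keep the override inside the
// kubelet instead of changing cAdvisor's host detection.
func capToCgroupLimit(mi MachineInfo, cgroupLimit uint64, hasLimit bool) MachineInfo {
	if hasLimit && cgroupLimit < mi.MemoryCapacity {
		mi.MemoryCapacity = cgroupLimit
	}
	return mi
}

func main() {
	mi := MachineInfo{MemoryCapacity: 32 << 30} // detected: 32 GiB
	capped := capToCgroupLimit(mi, 2<<30, true) // kubelet cgroup: 2 GiB
	fmt.Println(capped.MemoryCapacity)          // prints 2147483648
}
```

Because the struct is passed by value, the cached original stays untouched for monitoring consumers that want the real host capacity.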
Regarding mounting fake
No that's not what I'm asking for. I'm asking to be able to tell cadvisor directly "this is how much memory, CPU, etc. the system has" via config or similar, which would then be helpful whenever trying to work around bugs in production. Right now if I encounter a bug like the disabled hyperthread CPU count issue previously, I have to get cadvisor patched THEN kubernetes patched and then upgrade kubernetes. If I could instead tell cadvisor "no actually this is what the system resource limits are" this might also be useful in prod, not just to artificially limit them. We could change something in kubernetes for this similar to system reserved but then the metrics will be incorrect. #2579, you actually sent the final patch :-) At one point, cadvisor incorrectly reported offline CPUs for the CPU resource limit (and also didn't work on arm platforms), forcing people to upgrade around it. It took a while to get a patch in. I would have instead specified the correct CPU count externally if that was possible.
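For illustration, the kind of override mechanism being asked for could be a small config applied on top of detection, where unset fields keep the detected values. This is entirely hypothetical; cAdvisor has no such flag or file format today, and all names below are invented for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceOverrides models a hypothetical operator-supplied override file:
// zero-valued fields mean "keep the detected value".
type resourceOverrides struct {
	MemoryCapacity uint64 `json:"memoryCapacity"`
	NumCores       int    `json:"numCores"`
}

// applyOverrides replaces detected values with any non-zero overrides,
// letting an operator correct broken detection without patching binaries.
func applyOverrides(detectedMem uint64, detectedCores int, raw []byte) (uint64, int, error) {
	var o resourceOverrides
	if err := json.Unmarshal(raw, &o); err != nil {
		return detectedMem, detectedCores, err
	}
	if o.MemoryCapacity != 0 {
		detectedMem = o.MemoryCapacity
	}
	if o.NumCores != 0 {
		detectedCores = o.NumCores
	}
	return detectedMem, detectedCores, nil
}

func main() {
	// e.g. hyperthread detection is broken: correct only the core count.
	mem, cores, _ := applyOverrides(32<<30, 64, []byte(`{"numCores": 32}`))
	fmt.Println(mem, cores) // detected memory kept, core count corrected
}
```

This matches the scenario described above: the disabled-hyperthread CPU-count bug could have been worked around by supplying the correct core count externally, without touching the memory value.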
Sure, but what do we break if all these other values become fixed and inaccurate?
OK, I understand your reasoning and what you want to achieve, but what you are describing is using another tool (a shell script, perhaps) to obtain topology/memory information and feed it to cAdvisor. It looks as if cAdvisor was not needed in the first place. Regarding arm: as far as I understand, the architecture has never been supported in cAdvisor, and there are at least two open PRs (#2751, #2744) that can help at least a bit. @bobbypage, @dashpole - could you weigh in? I think that a maintainer decision is needed here.
I don't think we should use cgroup values as part of MachineInfo. We have APIs for host information, and others for cgroups, and the data they contain is quite different. If the kubelet wants, it is free to use the values from its own cgroup instead of the values from the host. It already knows which cgroup it is in, so it wouldn't be too hard to implement. If you want to pursue that, I would discuss the idea at SIG Node, and (if they like the idea) would probably write a KEP. The idea of overriding host capacity has been brought up before, but I'm afraid of it being abused. IIRC, the use case last time was to overcommit pods on the node (e.g. if they request 1 core, give them 0.7 cores instead).
I mean, if users really want to abuse this they already can via the VFS route; isn't that their choice?
As a kubelet user I don't get to choose if cAdvisor is used ...?
It's supported in Kubernetes ... but my point is not specific to arm, only that in those cases, in order to fix their usage, they must also upgrade to a new Kubernetes version, because you must use what cAdvisor reports and no layer of the stack provides a way to correct this detected data, even though it would have been trivial for these users to supply the correct topology information.
@BenTheElder how about applying the mechanism that you have just described in kubelet?
That may be a potential avenue, though I'm concerned about mismatch with the reported stats, which are again exposed through Kubernetes to users directly from cAdvisor, IIRC. In either case I think this PR should probably have long been closed; it seems this approach is not accepted, and there's ongoing discussion in Slack instead of here (and a PR is likely not the right place to discuss approaches to begin with).
I'll close it.
Motivation
When you run the kubelet in a container and set a memory limit, cAdvisor still uses the system memory. But memory is limited by the cgroup, and the process may be killed if it uses more memory than the limit.
This is described in issue #2698.
The strategy to resolve this is:
- by default, use the system memory limit
- if there is a cgroup memory limit, read it
- compare the system memory limit and the cgroup memory limit, and take the lower
- return the lower memory limit found
Tests
To test the idea, I directly changed the code of cAdvisor in K3s (for testing only), and I built a container image of this custom K3s.
This is available on Docker Hub and GitHub.
Start k3s in docker
✅ the container is started
Check the memory limit of the container
✅ the container uses 2 GiB
Check the memory available for kubelet
Open a shell in the container
❗ the system memory is still 32 GiB
✅ and the memory is still limited to 2 GiB
✅ 🎉 and for the kubelet the memory available is 2 GiB: the kubelet uses the memory limit set by the Docker container 🎉
Useful documentation
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
https://docs.docker.com/config/containers/resource_constraints/