possibly leaking memory cgroups #421

Closed · swachter opened this issue Mar 29, 2019 · 19 comments
Labels: lifecycle/active, priority/important-soon

swachter commented Mar 29, 2019

I repeatedly created and deleted kind clusters and watched the number of cgroups reported in /proc/cgroups (see the attached file). Here is a short summary; the columns show the initial number of cgroups and the counts after each successive cluster creation and deletion:

subsys_name   initial  created  deleted  created  deleted  created  deleted
cpuset        1        32       2        32       2        26       2
cpu           35       83       36       83       36       77       36
cpuacct       35       83       36       83       36       77       36
blkio         35       83       36       83       36       77       36
memory        62       126      109      168      124      175      138
devices       35       83       36       83       36       77       36
freezer       1        32       2        32       2        26       2
net_cls       1        32       2        32       2        26       2
perf_event    1        32       2        32       2        26       2
net_prio      3        32       2        32       2        26       2
hugetlb       1        32       2        32       2        26       2
pids          40       88       41       88       41       82       41
rdma          1        1        1        1        1        1        1

After each creation/deletion cycle the number of memory cgroups increases, whereas the counts for the other cgroup subsystems stay the same. I am not sure whether this is a kind-specific problem or whether it is related to the underlying Docker (client and server are version 18.06.1-ce).

cgroups.txt
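For anyone who wants to reproduce the measurement, a loop along these lines works (a rough sketch only: it assumes the kind CLI is on PATH, uses the default cluster name, and the three cycles are arbitrary):

```bash
#!/usr/bin/env bash
# Snapshot the per-subsystem cgroup counts from /proc/cgroups
# around repeated kind cluster create/delete cycles.
set -euo pipefail

snapshot() {
  # Columns in /proc/cgroups: subsys_name hierarchy num_cgroups enabled.
  # Print only the subsystem name and its cgroup count, skipping the header line.
  awk 'NR > 1 { print $1, $3 }' /proc/cgroups
}

echo "== initial =="; snapshot
for i in 1 2 3; do
  kind create cluster
  echo "== after create #$i =="; snapshot
  kind delete cluster
  echo "== after delete #$i =="; snapshot
done
```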

@BenTheElder (Member)

Actually leaking them is not ideal but not overly surprising; this is definitely the most likely resource to leak. The limited upside is that they shouldn't persist across a reboot.

Can we get more system info? Kernel version?

@BenTheElder (Member)

There have been kernel issues around this in the past: https://bugzilla.kernel.org/show_bug.cgi?id=12464

BenTheElder self-assigned this Mar 29, 2019
BenTheElder added the triage/needs-information and priority/important-soon labels Mar 29, 2019

swachter commented Apr 1, 2019

The tests were done in a VirtualBox VM set up by Vagrant (config.vm.box = "ubuntu/bionic64"). uname -a outputs:

Linux ubuntu-bionic 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

On our CI environment we also seem to have suffered from leaking cgroups. On that environment uname -a yields:

Linux gke-usu-manage-saas-cont-runner-pool2-647369da-0qlw 4.14.91+ #1 SMP Wed Jan 23 21:34:58 PST 2019 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux

aojea commented Apr 1, 2019

This is suspiciously similar to moby/moby#29638

swachter commented Apr 1, 2019

Another similar issue is google/cadvisor#1581. I used a script mentioned in that ticket (inotify_watchers.sh) to output the installed watchers. It seems that after kind delete cluster all watchers are removed. Therefore I think the increasing number of cgroups is the root cause, because cAdvisor tries to install a proportional number of watchers.

The attached file watchers.txt shows the installed watchers after repeated cluster creations and deletions.
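For reference, a script in the spirit of that inotify_watchers.sh (not a verbatim copy, just an approximation) walks /proc/*/fdinfo and counts the inotify watch entries per process; it generally needs to run as root to see every process:

```bash
#!/usr/bin/env bash
# Approximate count of inotify watches per process: each "inotify wd:..."
# line in a /proc/<pid>/fdinfo/<fd> file corresponds to one watch.
for f in /proc/[0-9]*/fdinfo/*; do
  n=$(grep -c '^inotify' "$f" 2>/dev/null)
  [ "${n:-0}" -gt 0 ] || continue
  pid=${f#/proc/}; pid=${pid%%/*}
  comm=$(cat "/proc/$pid/comm" 2>/dev/null)
  echo "$n $pid $comm"
done |
  awk '{ watches[$2 " " $3] += $1 } END { for (k in watches) print watches[k], k }' |
  sort -rn
```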

@neolit123 (Member)

this also happens using kubeadm directly:
kubeadm init ... && kubeadm reset

docker: 18.06.3 (cg driver = systemd)
kernel: 4.13.0-41-generic

but the amount of leakage is fairly low.
apart from testing a different version of docker and the linux kernel, i don't think there is much we can do here.

swachter commented Apr 1, 2019

> but the amount of leakage is fairly low.

The amount of leakage increases if some components are installed in the cluster on each create/delete cycle. The original measurements were taken on newly created, empty clusters. With a few components being installed on each cycle, the number of leaked memory cgroups increased to ~30 per cycle.

aojea commented Apr 1, 2019

> The amount of leakage increases if some components are installed in the cluster on each create/delete cycle

If the new components are new containers, and assuming we are hitting one of the linked Docker cgroup-leak bugs, that makes sense: the more containers, the more leakage.

@BenTheElder (Member)

has anyone checked moby/moby#29638 (comment) yet?

I will look into this soon but haven't gotten to it yet.

@neolit123 (Member)

> The amount of leakage increases if some components are installed in the cluster on each create/delete cycle. The original measurements were taken on newly created, empty clusters. With a few components being installed on each cycle, the number of leaked memory cgroups increased to ~30 per cycle.

with the kernel's cap of 65535 (USHRT_MAX) memory cgroups it will take a while to hit the limit.
but i can see this being a problem in a persistently running setup without reboots.

BenTheElder added the lifecycle/active label Apr 1, 2019

BenTheElder commented Apr 1, 2019

On my machine, watching lscgroup | grep -c memory while creating/deleting clusters so far suggests that we're not leaking any; perhaps moby/moby#29638 (comment) is correct?

@mlaventure writes:

> @BenHall Every directory under the mount point (including the mount point) is considered to be a cgroup. So to get the actual number of cgroups from the FS you would have to run: find /sys/fs/cgroup/memory -type d | wc -l and that should match the number found in /proc/cgroups

It turns out that this is not always the case. I corresponded with a Linux cgroups maintainer (Michal Hocko) recently, who said:

> Please note that memcgs are completely removed after the last memory accounted to them disappears. And that happens lazily on the memory pressure. So it is quite possible that this happens much later than the actual rmdir on the memcg.

So, it's not uncommon for the num_cgroups value in /proc/cgroups to differ from what you might see in lscgroup.
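A quick way to put the two numbers side by side (assuming a cgroup v1 host with the memory controller mounted at /sys/fs/cgroup/memory):

```bash
# Filesystem view: every directory under the memory controller mount is a cgroup.
find /sys/fs/cgroup/memory -type d | wc -l
# Same view via libcgroup, if lscgroup happens to be installed.
lscgroup | grep -c memory
# Kernel counter; this also includes memcgs that were rmdir'ed but not yet reclaimed.
awk '$1 == "memory" { print $3 }' /proc/cgroups
```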

Will investigate more later, off to a meeting 😅

Edit: I am on a newer 4.19.20<snip> kernel though FWIW (with Docker 18.09.3)

@BenTheElder (Member)

looking back at #412, is it possible we're hitting inotify limits instead?

I've yet to find anything in your logs that wasn't watch related

Mar 27 18:22:20 kind-control-plane kubelet[952]: E0327 18:22:20.645229     952 raw.go:146] Failed to watch directory "/sys/fs/cgroup/blkio/docker/ebd0b4c8f8840ef15d77d256089b3c79bdfe85ab8152559f5abd5ee5b67c4463/system.slice": inotify_add_watch /sys/fs/cgroup/blkio/docker/ebd0b4c8f8840ef15d77d256089b3c79bdfe85ab8152559f5abd5ee5b67c4463/system.slice: no space left on device

BenTheElder changed the title from "leaking memory cgroups" to "possibly leaking memory cgroups" Apr 1, 2019

BenTheElder commented Apr 1, 2019

The inotify scripts are not working quite right for me.

It's also entirely possible that we don't leak watches or groups at all, and that something else is just using up the limit when you have many clusters, etc. The default inotify limits are rather low on many setups (it looks like Ubuntu's default is 8192).

EDIT: it's possible to check with cat /proc/sys/fs/inotify/max_user_watches

Every failure in that log appears to be related to setting up an inotify watch.

Based on some testing with lscgroup and on moby/moby#29638 (comment) I don't think we're seeing real cgroup "leaks".
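If the watch limit does turn out to be the bottleneck, the usual host-side workaround is to raise it with sysctl (the value and the drop-in file name below are just examples):

```bash
# Current per-user limit (8192 is a common Ubuntu default).
cat /proc/sys/fs/inotify/max_user_watches

# Raise it for the running system.
sudo sysctl fs.inotify.max_user_watches=524288

# Persist across reboots via a sysctl drop-in, then reload.
echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system
```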

BenTheElder commented Apr 1, 2019

I'm also seeing that the memory cgroup count in /proc/cgroups does seem to stay higher for some time (unlike lscgroup), but it drops later, and I can force that early with sync; sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches'
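A before/after check of that lazy reclaim looks roughly like this (root required; writing 1 to drop_caches only drops clean page cache, which is what lets the dead memcgs finally be destroyed):

```bash
# Memory cgroup count as the kernel sees it, before forcing reclaim.
awk '$1 == "memory" { print "before:", $3 }' /proc/cgroups

# Flush dirty data, then drop clean page cache so lazily-freed memcgs can go away.
sync
sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'

# ...and the count afterwards.
awk '$1 == "memory" { print "after: ", $3 }' /proc/cgroups
```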

swachter commented Apr 2, 2019

> EDIT: it's possible to check with cat /proc/sys/fs/inotify/max_user_watches

On the GKE node where the problem occurred, max_user_watches is 8192. (On that node the GitLab CI runner places jobs/pods that in turn use kind in integration tests.)

I remember that on this node cat /proc/cgroups showed some numbers above 1000. Unfortunately, I did not save the output. Restarting that node helped. Now the output is:

#subsys_name    hierarchy       num_cgroups     enabled
cpuset  3       17      1
cpu     5       125     1
cpuacct 5       125     1
blkio   4       125     1
memory  8       229     1
devices 7       117     1
freezer 10      17      1
net_cls 2       17      1
perf_event      6       17      1
net_prio        2       17      1
hugetlb 11      17      1
pids    12      125     1
rdma    9       1       1

It seems to me that these numbers increase over time. I will check them regularly.

Could it be that the cAdvisor that comes with kind tries to install watches on too many cgroups, in particular on cgroups that do not "belong" to the kind cluster?

swachter commented Apr 2, 2019

> Based on some testing with lscgroup and on moby/moby#29638 (comment) I don't think we're seeing real cgroup "leaks".

I can confirm that repeated cluster creation / deletion does NOT leak cgroups. The output of lscgroup | grep -c memory is the same after each creation / deletion cycle.

Thank you for looking into this. I think the issue has nothing to do with kind and this issue can be closed.

However, I wonder how the issue can be tracked down further. Unfortunately the lscgroup utility is not installed on GKE nodes and /proc/cgroups is unreliable. Is there a way to monitor cgroups on GKE nodes?

@BenTheElder (Member)

presumably GKE nodes -> COS? (they could be ubuntu, which is a different story)

On COS I believe the expectation is that everything runs as containers; you can probably docker run an image with the right tools in it and fiddle with the mounts.
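Something along these lines should work (a sketch only; the image and the read-only cgroup mount are illustrative, and the exact paths depend on the node's cgroup layout):

```bash
# Run a throwaway container with the host's cgroup filesystem mounted read-only
# and count the memory cgroups from inside it.
docker run --rm \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  ubuntu:18.04 \
  bash -c 'find /sys/fs/cgroup/memory -type d | wc -l'
```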

@BenTheElder (Member)

going to close this for now since it seems to not be cgroups leaking.

we might need to figure out a good pattern to increase inotify watches (possibly a daemonset?), but that's a separate issue.

stg-0 added a commit to stg-0/kind that referenced this issue Feb 22, 2024