Prow cluster resource leak #1988

Closed
2 of 3 tasks
howardjohn opened this issue Oct 24, 2019 · 22 comments · Fixed by istio/tools#624

@howardjohn
Member

howardjohn commented Oct 24, 2019

Background: kubernetes-sigs/kind#759

When we started using kind, all of a sudden our tests started flaking often. It turned out this was due to a resource leak caused by not cleaning up kind clusters at all. I am not 100% sure what was leaking, but it was something related to the hostPath mounts of /lib/modules or /sys/fs/cgroup. Once we added cleanup configuration, things went back to normal.

The best reference we had was the "resting CPU" rate of the nodes. In the original case, after a 2-week period it was between 30% and 90% with no tests running.

Now, months later, the same problem has returned 🙁

[screenshot 2019-24-10_09-39-14: node CPU usage]

This time it is much more subtle though: only 20% max usage after 2 weeks.

My suspicion is that we still have some leaks from kind.

Action items:

  • Investigate stronger cleanup, as recommended by @BenTheElder (see the sketch after this list)

The docker in docker runner / wrapper script we use in test-infra / prow.k8s.io also terminates all containers in an exit handler, amongst other things, redundantly to the cluster deletion we do in the kind specific scripts.

  • Problem seemed to start on October 1st. Were there any changes around then?

  • Can we identify that we have a leak and clean up proactively as a periodic job?
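
A minimal sketch of the exit-handler style of cleanup referenced in the first item, assuming a bash wrapper around the job command (the real prow.k8s.io wrapper does more; the function name here is illustrative):

# Terminate and remove anything docker-in-docker left behind, redundantly to
# whatever cluster deletion the job's own scripts do.
cleanup_dind() {
  docker ps -q | xargs -r docker kill || true
  docker ps -aq | xargs -r docker rm -f || true
}
trap cleanup_dind EXIT

# Run the job command itself ("$@" rather than exec, so the EXIT trap still fires).
"$@"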

@howardjohn howardjohn self-assigned this Oct 24, 2019
@BenTheElder

You should be able to check the cgroups on a host node to see if there's actually something leaking; there shouldn't be. I've done fairly extensive testing of that with kind itself locally, but it's possible something else is up in CI.

xref: kubernetes-sigs/kind#421 (note that the "memory cgroups leak" is actually just a kernel bug in flushing them after deletion; we've also seen this just running containers without any nesting etc.)

@howardjohn
Member Author

Likely root cause: 01e1bf6, which adds kind to a new repo. This repo does have the trap cleanup stuff, but for some reason it's not running. When I run locally it cleans up, but in CI I don't see the cleanup occurring: https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_installer/457/base_installer/96

@BenTheElder

is this trap in the image or in the repo? where are the image sources? 👀

@howardjohn
Member Author

@BenTheElder

BenTheElder commented Oct 24, 2019

yes, that should just be "$@", exec will replace the current process.
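
As a generic illustration (not the actual entrypoint): with exec the shell process is replaced, so an EXIT trap set in that shell never runs and cleanup is skipped.

# Broken: exec replaces the shell, so the EXIT trap below never fires.
trap cleanup EXIT
exec "$@"

# Fixed: run the command as a child process; the trap runs when the script exits.
trap cleanup EXIT
"$@"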

@BenTheElder

https://github.com/istio/installer/blob/fe4079d536f2d1b94eb4a7d377527a102734189b/bin/with-kind.sh#L26 should also be kind export logs --name istio-testing "${ARTIFACTS}/kind" || true to ensure that we continue to cleanup if export fails for some reason.
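
In context, the relevant piece of the cleanup would look something like this (a sketch based on the linked with-kind.sh, not a verbatim copy):

cleanup() {
  # Export logs first, but don't let a failed export abort the rest of cleanup.
  kind export logs --name istio-testing "${ARTIFACTS}/kind" || true
  # Always delete the cluster, even if the log export failed.
  kind delete cluster --name istio-testing || true
}
trap cleanup EXIT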

it's separately concerning that we see leaks without this, but that should mitigate it.

@howardjohn
Member Author

On a node that looks good:

$ cat /proc/cgroups | tr '\t' ',' | column -t -s,
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        9          209          1
cpu           6          375          1
cpuacct       6          375          1
blkio         7          375          1
memory        12         1195         1
devices       4          368          1
freezer       11         209          1
net_cls       10         209          1
perf_event    3          209          1
net_prio      10         209          1
hugetlb       8          209          1
pids          5          375          1
rdma          2          1            1

On a node that looks bad:

$ cat /proc/cgroups | tr '\t' ',' | column -t -s,
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        8          1026         1
cpu           2          1529         1
cpuacct       2          1529         1
blkio         5          1529         1
memory        11         3398         1
devices       3          1522         1
freezer       10         1026         1
net_cls       6          1026         1
perf_event    4          1026         1
net_prio      6          1026         1
hugetlb       9          1026         1
pids          12         1529         1
rdma          7          1            1

@howardjohn
Member Author

On my local machine, there is one folder in /sys/fs/cgroup/pids/docker per kind instance. On the bad node there are something like 8, and on the good node there are also 2. So it seems like we have stuff staying around there by not cleaning up docker properly.

(note: I really have no clue what I am doing here, not familiar with docker much so just poking around)
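
For reference, a quick way to count those per-container cgroup directories on a node (assuming cgroup v1 with the docker cgroup parent, as on these nodes):

# Count per-container cgroup directories under the docker hierarchy.
$ ls -d /sys/fs/cgroup/pids/docker/*/ 2>/dev/null | wc -l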

@BenTheElder

On my local machine, there is one folder in /sys/fs/cgroup/pids/docker per kind instance.

yes -- there should be one per docker-in-docker container.

On the bad node there are something like 8, and on the good node there are also 2. So it seems like we have stuff staying around there by not cleaning up docker properly.

So the "good node" is not expected to have any running?

So it seems like we have stuff staying around there by not cleaning up docker properly.

Admittedly I've focused on kind under normal circumstances, in prow.k8s.io we only support docker in docker when used with the common wrapper scripts that have always aggressively cleaned up in exit handlers.

It's possible we're hitting this too, though; previously I looked for signs of this and we weren't (probably because we do actually clean up).

I'm going to see about simulating this on a clean GKE cluster and monitor the node...

We should also specifically check the running processes; from the host's side we should be able to see the process tree fine. Leaking cgroups shouldn't actually cost us anything significant; that part in and of itself is relatively fine. Leaking processes running in those cgroups does, though.
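
One way to check from the host (commands are illustrative; the container ID in the path is a placeholder):

# Show the full process tree on the node; leaked test processes, if any, should
# be visible here, since a cgroup alone is cheap but live processes are not.
$ ps -ef --forest

# For a leftover cgroup, list the PIDs still attached to it; an empty file means
# only the cgroup itself leaked, not any processes.
$ cat /sys/fs/cgroup/pids/docker/<container-id>/cgroup.procs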

istio-testing pushed a commit to istio/tools that referenced this issue Oct 25, 2019
* Clean up docker after job shutdown

For istio/test-infra#1988

* Use other docker clean method
@howardjohn
Member Author

@clarketm rebooted all the nodes, and things look back to a healthy state for now at least:
[screenshot 2019-25-10_09-10-37: node CPU usage]

What I am looking at here is not the spikes, since those just mean tests are running constantly, but the flat lines when no tests are running; there we should see minimal CPU usage, and we do now.

So the "good node" is not expected to have any running?

It could have a couple if there was an active test running. "Good" here meant it was using low CPU, so it could have leaked just once or something possibly.

@BenTheElder

ACK.
Will look into options to further mitigate. I think for now the best thing is to ensure that for any dind we delete all containers; AFAIK that has worked fine on the Kubernetes Prow, and it seems like it's hopefully working here.
Will note this in the kind issue as well.

@istio-policy-bot istio-policy-bot added the lifecycle/needs-triage Indicates a new PR or issue needs to be triaged label Oct 30, 2019
@howardjohn
Member Author

This was fixed and we have some more safeguards here. The one thing we don't have is proactive monitoring, but hopefully the safeguards make that unimportant.

@istio-policy-bot istio-policy-bot removed the lifecycle/needs-triage Indicates a new PR or issue needs to be triaged label Nov 24, 2019
@howardjohn
Member Author

This is back 🙁
[screenshot 2019-04-12_16-06-22: node CPU usage]

Started on Dec 2nd. There were no commits around that time.

@howardjohn
Member Author

[screenshot 2019-04-12_17-37-38: node CPU usage]

New guess: it didn't start Dec 2nd; there was just no activity until then because of Thanksgiving. It starting to leak on Monday is just a result of more jobs starting Monday.

A potential suspect, from 11/26: istio/istio#19156. Based on the graph above, it reasonably could have started on 11/26. It would be nice to run ls /sys/fs/cgroup on every node and look at how much stuff is there and from when, to get a better timeline, but I am not sure of a feasible way to do this.
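
One rough way to approximate that timeline, assuming the leftover cgroup directories keep a meaningful mtime (this would have to be run on each node, e.g. over SSH or from a privileged pod):

# List leftover per-container cgroup directories, newest first, with timestamps.
$ ls -lt --time-style=long-iso /sys/fs/cgroup/pids/docker | head -n 20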

@howardjohn howardjohn reopened this Dec 5, 2019
@istio-policy-bot istio-policy-bot added the lifecycle/needs-triage Indicates a new PR or issue needs to be triaged label Dec 5, 2019
howardjohn added a commit to howardjohn/istio that referenced this issue Dec 5, 2019
I suspect this is the root cause of
istio/test-infra#1988

For now, we will revert this and see if things improve
@BenTheElder

BenTheElder commented Dec 5, 2019

Not sure how buildx works under the hood yet, but I highly recommend having a single top-level entrypoint wrapper that cleans up everything when doing docker in docker.

Skimming buildx now: It looks like buildx does create containers.
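
A quick hedged check: with the docker-container driver, buildx runs BuildKit in a long-lived container that shows up in docker ps (the name filter below assumes the conventional buildx_buildkit_* naming):

# Any docker-container builders should be visible as running containers.
$ docker ps --filter name=buildx_buildkit
# The default "docker" driver instead builds inside the daemon, with no extra container.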

@howardjohn
Member Author

We do have a wrapper script (inspired by the first time this happened), but maybe it is too primitive: https://github.com/istio/tools/pull/471/files.

My understanding is that buildx has two modes: one where you have a docker container that runs all the builds, and one where you just build them normally(?), not in a container. When I run buildx locally and watch docker ps, nothing ever shows up. I'll try to look more into what's actually going on here.

Why do we actually need to mount the cgroup hostPath? I just did it because k8s did it 🙂. It seems like if we could remove that we could prevent this problem completely, but I am not sure if that is feasible.

istio-testing pushed a commit to istio/istio that referenced this issue Dec 5, 2019
I suspect this is the root cause of
istio/test-infra#1988

For now, we will revert this and see if things improve
@BenTheElder

So the cgroups mount actually predates kind and comes from the previous prototype. IIRC there were stability issues without this but it's been a while.

Might be worth a shot, but in general privileged workloads need to clean up after themselves.

We've not had any issues as far as I can tell by just enforcing all docker in docker go through the single wrapper.

As for builds, AIUI any docker build with a RUN step will need intermediate containers.

@howardjohn
Member Author

My attempt to fix this did not work:
[screenshot 2019-10-12_09-00-49: node CPU usage]

@howardjohn
Member Author

Didn't mean to close this. We are now doing a full docker system prune -a and it doesn't seem to fully fix it. It's pretty hard to debug this without a way to correlate the additional cgroups left around to a pod (or more specifically, to what job was running).

@istio-policy-bot istio-policy-bot removed the lifecycle/needs-triage Indicates a new PR or issue needs to be triaged label Dec 17, 2019
@BenTheElder

BenTheElder commented Dec 18, 2019 via email

howardjohn added a commit to howardjohn/tools that referenced this issue Dec 18, 2019
For istio/test-infra#1988

We should kill running docker containers at the end of the job. Pruning
doesn't stop running containers.
@howardjohn howardjohn reopened this Dec 18, 2019
@istio-policy-bot istio-policy-bot added the lifecycle/needs-triage Indicates a new PR or issue needs to be triaged label Dec 18, 2019
istio-testing pushed a commit to istio/tools that referenced this issue Dec 18, 2019
For istio/test-infra#1988

We should kill running docker containers at the end of the job. Pruning
doesn't stop running containers.
@howardjohn
Member Author

Good point. I added another step to stop all running containers as well; will see how that works. Interestingly, we also turned on autoscaling on our cluster, which "fixes" this issue by killing nodes somewhat frequently.
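
Concretely, the end-of-job cleanup now amounts to something like this (a sketch of the ordering, not the exact istio/tools change):

# Stop anything still running first: prune alone skips running containers.
docker ps -q | xargs -r docker kill || true
# Then remove stopped containers, networks, unused images, and build cache.
docker system prune -a -f || true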

I made a change to make the cgroup mount read-only, and tests seemed to work fine... if we make it read-only, it seems like we cannot possibly have this issue, since we can no longer write to any hostPath, so everything should be cleaned up properly? Does this seem like a horrible idea for any reason?
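
If the mount does become read-only, one sanity check from inside the test container is to confirm the cgroup mounts actually carry the ro flag (a hedged check, not something the CI currently does):

# Inside the test container: the cgroup mounts should list "ro" rather than "rw".
$ grep cgroup /proc/mounts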

@howardjohn
Member Author

All good since #2243

[screenshot 2019-26-12_11-26-40: node CPU usage]

@istio-policy-bot istio-policy-bot removed the lifecycle/needs-triage Indicates a new PR or issue needs to be triaged label Dec 30, 2019