
Potential resource leak #759

Closed
howardjohn opened this issue Aug 7, 2019 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@howardjohn
Contributor

What happened:
On July 10th, we started using kind in our prow cluster for Istio (see istio/test-infra#1455). We have about 6 tests that use kind and run maybe 40 test rounds a day, so I'd estimate we are creating roughly 250 kind clusters a day across 60 nodes.

Ever since July 10th, we have run into problems on our cluster. Almost immediately, we ran out of inotify watches, which we fixed by just increasing the limit -- see #717.
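
For reference, the inotify fix was just a node-level sysctl bump, roughly like the following (values here are illustrative, not the exact ones from #717, and how to persist them depends on the node image):

    # raise inotify limits on each node (illustrative values)
    sysctl -w fs.inotify.max_user_watches=524288
    sysctl -w fs.inotify.max_user_instances=512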

Since then, the CPU usage of our nodes has slowly increased, like a memory leak but for CPU. This can be attributed to the kubelet process; a profile indicates most of its time is spent in cAdvisor.
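
For the record, the profile was grabbed roughly like this (a sketch; it assumes the kubelet's pprof handlers are reachable through the API server's nodes/proxy subresource, and NODE_NAME is a placeholder):

    kubectl proxy &
    # pull a 30s CPU profile from the kubelet on the affected node
    go tool pprof "http://127.0.0.1:8001/api/v1/nodes/NODE_NAME/proxy/debug/pprof/profile?seconds=30"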

Here is a graph of the minimum CPU of our nodes over 4-hour intervals. During any given 4-hour window, it is almost certain we will have no test pods scheduled on a node, so this essentially shows the base overhead CPU of the nodes.

[Graph: minimum node CPU over 4-hour intervals]

The big drops are times when we got rid of nodes. You can see the problems seem to start right around when we started using kind (note: it is possible this is a coincidence; my only evidence that it is related to kind is the timing).

Within two weeks we see some nodes using 90% of CPU just idling.

What you expected to happen:
kind does not cause cluster-wide degradation.

How to reproduce it (as minimally and precisely as possible):
I am not sure, but we consistently run into this problem so I can run some diagnostics on the node.

Environment:

  • kind version: (use kind version): v0.3, v0.4
  • Kubernetes version: (use kubectl version): GKE versions 1.11, 1.12, 1.13 (tried upgrading twice to attempt to resolve this)
  • OS (e.g. from /etc/os-release): COS
@howardjohn howardjohn added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2019
@aojea
Contributor

aojea commented Aug 7, 2019

This issue reminds me of this other one: #421

@howardjohn
Contributor Author

I did run into that while investigating, but it seems the conclusion was

I can confirm that repeated cluster creation / deletion does NOT leak cgroups. The output of lscgroup | grep -c memory is the same after each creation / deletion cycle.

By the way, if it is relevant, we never do kind cluster delete, we just do kind create cluster then run some test and exit.
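
To check whether we are hitting the same thing as #421, I can compare the cgroup count on an affected node across a create/delete cycle, roughly like this (requires cgroup-tools for lscgroup, as in the quote above):

    # count memory cgroups before and after a create/delete cycle
    lscgroup | grep -c memory
    kind create cluster && kind delete cluster
    lscgroup | grep -c memory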

@aojea
Contributor

aojea commented Aug 7, 2019

By the way, if it is relevant, we never do kind cluster delete, we just do kind create cluster then run some test and exit.

But ... then the old cluster containers keep running forever. Is it possible to add kind delete cluster to your workflow once the tests finish, and check whether this solves the problem?

@howardjohn
Contributor Author

But we run it in a pod; once the pod is removed, shouldn't everything be cleaned up? Or maybe, because we have

        volumeMounts:
        - mountPath: /lib/modules
          name: modules
          readOnly: true
        - mountPath: /sys/fs/cgroup
          name: cgroup
      volumes:
      - hostPath:
          path: /lib/modules
          type: Directory
        name: modules
      - hostPath:
          path: /sys/fs/cgroup
          type: Directory
        name: cgroup

in our pod spec, it never gets properly cleaned up.

I'll try adding the delete cluster to the end.

Does Kubernetes prow do this in their tests using kind? I am worried that if the test crashes partway through, we won't properly clean up.

@aojea
Contributor

aojea commented Aug 7, 2019

Does Kubernetes prow do this in their tests using kind? I am worried that if the test crashes partway through, we won't properly clean up.

@BenTheElder is the authority on this, but the tests that run in CI execute hack/ci/e2e.sh, and it does have a cleanup function that deletes the cluster on EXIT:

kind/hack/ci/e2e.sh

Lines 26 to 41 in 991e45e

# our exit handler (trap)
cleanup() {
  # always attempt to dump logs
  kind "export" logs "${ARTIFACTS}/logs" || true
  # KIND_IS_UP is true once we: kind create
  if [[ "${KIND_IS_UP:-}" = true ]]; then
    kind delete cluster || true
  fi
  # clean up e2e.test symlink
  rm -f _output/bin/e2e.test
  # remove our tempdir
  # NOTE: this needs to be last, or it will prevent kind delete
  if [[ -n "${TMP_DIR:-}" ]]; then
    rm -rf "${TMP_DIR}"
  fi
}
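
Note the snippet above is just the handler; in the script it is registered as an exit trap so it also runs if the test crashes partway through, along the lines of (a sketch, not the exact line from e2e.sh):

    # run cleanup on any exit, including failures mid-test
    trap cleanup EXIT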

@howardjohn
Contributor Author

Thank you! We will try that out.

howardjohn added a commit to howardjohn/istio that referenced this issue Aug 7, 2019
Because we mount host paths in the pods, resources will not actually be
fully freed once a test is complete. This causes a resource leak that
eventually leads to a complete degradation of the entire node.

See kubernetes-sigs/kind#759 for details.
istio-testing pushed a commit to istio/istio that referenced this issue Aug 7, 2019
Because we mount host paths in the pods, resources will not actually be
fully freed once a test is complete. This causes a resource leak that
eventually leads to a complete degradation of the entire node.

See kubernetes-sigs/kind#759 for details.
@howardjohn
Contributor Author

Well, it seems resolved - hard to be 100% sure since it's only been 8 hrs, but it looks good. I feel pretty dumb for trying to figure this out for a month or so when it was such a simple fix -- thanks for the help!!

@BenTheElder
Member

The docker-in-docker runner / wrapper script we use in test-infra / prow.k8s.io also terminates all containers in an exit handler, among other things, redundantly with the cluster deletion we do in the kind-specific scripts.
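
A minimal sketch of what such a wrapper-level exit handler can look like (illustrative only, not the actual test-infra script):

    # remove any containers left behind, regardless of how the test exited
    cleanup_containers() {
      docker ps -aq | xargs -r docker rm -f || true
    }
    trap cleanup_containers EXIT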

ormergi added a commit to ormergi/project-infra that referenced this issue Nov 18, 2020
Sriov lane uses KIND infrastructure.
In order to prevent memory leaks it is recommended to
use the 'kind' binary to tear down the cluster [1],
which is what 'make cluster-down' does.

[1] kubernetes-sigs/kind#759

Signed-off-by: Or Mergi <ormergi@redhat.com>
ormergi added a commit to ormergi/project-infra that referenced this issue Nov 18, 2020
Sriov lane uses KIND infrastructure.
In order to prevent resource leaks it is recommended
to use the 'kind' binary to tear down the cluster [1],
which is what 'make cluster-down' does.

[1] kubernetes-sigs/kind#759

Signed-off-by: Or Mergi <ormergi@redhat.com>
kubevirt-bot pushed a commit to kubevirt/project-infra that referenced this issue Nov 19, 2020
* sriov lanes, ensure cluster teardown

Sriov lane uses KIND infrastructure.
In order to prevent resource leaks it is recommended
to use the 'kind' binary to tear down the cluster [1],
which is what 'make cluster-down' does.

[1] kubernetes-sigs/kind#759

Signed-off-by: Or Mergi <ormergi@redhat.com>

* enable rehearsal

Signed-off-by: Or Mergi <ormergi@redhat.com>
ormergi added a commit to ormergi/project-infra that referenced this issue Dec 14, 2020
Sriov lane uses KIND infrastructure.
In order to prevent resource leaks it is recommended
to use the 'kind' binary to tear down the cluster [1],
which is what 'make cluster-down' does.

[1] kubernetes-sigs/kind#759

Signed-off-by: Or Mergi <ormergi@redhat.com>
ormergi added a commit to ormergi/project-infra that referenced this issue Dec 14, 2020
Sriov lane uses KIND infrastructure.
In order to prevent resource leaks it is recommended
to use the 'kind' binary to tear down the cluster [1],
which is what 'make cluster-down' does.

[1] kubernetes-sigs/kind#759

Signed-off-by: Or Mergi <ormergi@redhat.com>