Potential resource leak #759
This issue reminds me of this other one: #421
I did run into that while investigating, but it seems the conclusion was
By the way, if it is relevant, we never run `kind delete cluster`.
but ... then the old cluster containers keep running forever ... is it possible to add the `kind delete cluster` step?
But we run it in a pod; once the pod is removed, shouldn't everything be cleaned up? Or maybe it is because we have host path mounts in our pod spec, so it never gets properly cleaned up. I'll try adding the delete cluster to the end. Does Kubernetes prow do this in their tests using kind? I am worried that if the test crashes part way through we won't properly clean up.
@BenTheElder is the authority on this, but the tests that are running in the CI execute the cleanup shown in lines 26 to 41 of 991e45e.
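A minimal sketch of that pattern (illustrative only, not the exact script; the `cleanup` function name, log path, and `--wait` timeout are assumptions):

```bash
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Always tear down the kind cluster, even if the tests fail or the
# script is interrupted part way through.
cleanup() {
  # Export logs first so failed runs can still be debugged (path is illustrative).
  kind export logs /tmp/kind-logs || true
  kind delete cluster || true
}
trap cleanup EXIT

kind create cluster --wait 5m
# ... run the tests against the cluster here ...
```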
Thank you! We will try that out.
Because we mount host paths in the pods, resources will not actually be fully freed once a test is complete. This causes a resource leak that eventually leads to a complete degradation of the entire node. See kubernetes-sigs/kind#759 for details.
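For anyone hitting something similar, a rough way to spot this kind of leak on an affected node (the filters and paths are illustrative and may vary by kind version and OS image):

```bash
# Leftover kind node containers that outlived their test pods
# (kind node containers are typically named *-control-plane / *-worker).
docker ps -a | grep -E 'control-plane|worker' || true

# Mounts and cgroups accumulating over time make cAdvisor/kubelet
# progressively more expensive; a steadily growing count is a red flag.
mount | wc -l
find /sys/fs/cgroup/memory -type d | wc -l
```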
Well, it seems resolved - hard to be 100% sure since it's only been 8 hrs, but it seems good. I feel pretty dumb for trying to figure this out for a month or so when it was such a simple fix -- thanks for the help!!
The docker-in-docker runner / wrapper script we use in test-infra / prow.k8s.io also terminates all containers in an exit handler, among other things, redundantly with the cluster deletion we do in the kind-specific scripts.
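A rough sketch of that kind of redundant exit handler (illustrative only; the real wrapper does more and its exact commands may differ):

```bash
#!/usr/bin/env bash

# Belt-and-braces cleanup for a docker-in-docker CI wrapper: on exit,
# delete any kind cluster and force-remove whatever containers remain.
cleanup_dind() {
  kind delete cluster || true
  # Remove any containers still known to the inner docker daemon, if any.
  docker ps -aq | xargs -r docker rm -f || true
}
trap cleanup_dind EXIT

# ... start the inner docker daemon and run the job here ...
```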
* sriov lanes, ensure cluster teardown: The Sriov lane uses KIND infrastructure. In order to prevent resource leaks it is recommended to use the 'kind' binary to tear down the cluster [1], which is what 'make cluster-down' does. [1] kubernetes-sigs/kind#759 Signed-off-by: Or Mergi <ormergi@redhat.com>
* enable rehearsal Signed-off-by: Or Mergi <ormergi@redhat.com>
What happened:
On July 10th, we started using kind in our prow cluster for Istio (see istio/test-infra#1455). We have about 6 tests or so that use kind, each run maybe 40 times a day, so I'd estimate we are creating roughly 250 kind clusters a day across 60 nodes.
Ever since July 10th, we have run into problems on our cluster. Almost immediately, we ran out of inotify watches, which we fixed by just increasing the limit -- see #717.
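For reference, the inotify limits can be raised on the nodes with sysctl; something like the following (values are illustrative):

```bash
# Raise the limits immediately (values are illustrative).
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

# Persist them across reboots.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system
```

On managed nodes (e.g. GKE/COS) this is typically applied via a privileged DaemonSet rather than by hand.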
Since then, the CPU usage of our nodes has slowly increased, like a memory leak but for CPU. This can be attributed to the kubelet process; a profile indicates most of the time is spent in cAdvisor.
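One way such a kubelet CPU profile can be captured, assuming the kubelet's debug/pprof handlers are enabled (the node name is a placeholder):

```bash
# Grab a 30s CPU profile from the kubelet via the API server's node proxy.
NODE="<node-name>"  # placeholder: substitute an actual node name
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/profile?seconds=30" > kubelet.pprof

# Summarize where the CPU time goes.
go tool pprof -top kubelet.pprof
```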
Here is a graph of the minimum CPU of our nodes over 4-hour intervals. During a 4-hour window, it is almost certain a node will have had no test pods scheduled on it, so this essentially shows the base overhead CPU of the nodes.
The big drops are times when we got rid of nodes. You can see the problems seem to start right around when we started using kind (note - it is possible it is a coincidence. My only evidence it is related to kind is the timing).
Within two weeks we see some nodes using 90% of CPU just idling.
What you expected to happen:
kind does not cause cluster-wide degradation.
How to reproduce it (as minimally and precisely as possible):
I am not sure, but we consistently run into this problem so I can run some diagnostics on the node.
Environment:
- kind version (use `kind version`): v0.3, v0.4
- Kubernetes version (use `kubectl version`): GKE versions 1.11, 1.12, 1.13 (tried upgrading twice to attempt to resolve this)
- OS (e.g. from `/etc/os-release`): COS