
[BUG] koordlet CgroupReconcile panics on mergePodResourceQoSForMemoryQoS #1670

Closed
BlackPigHe opened this issue Sep 19, 2023 · 9 comments
Labels: kind/bug

@BlackPigHe

What happened:
The koordlet container keeps crashing and restarting:

I0919 17:59:08.568730 2319034 cpu_suppress.go:186] nodeSuppressBE[CPU(Core)]:6 = node.Total:8 * SLOPercent:65% - systemUsage:1 - podLSUsed:1
I0919 17:59:08.568746 2319034 predict_server.go:309] wait for the state to be synchronized, skipping the step of model GC
E0919 17:59:08.568778 2319034 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 332 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x20b9260?, 0x3ab9df0})
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0?})
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/runtime/runtime.go:49 +0x75
panic({0x20b9260, 0x3ab9df0})
/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).mergePodResourceQoSForMemoryQoS(0x0?, 0xc0015b6800, 0x0)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:371 +0x39
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).getMergedPodResourceQoS(0x21c9d00?, 0xc0015b6800, 0xc00159ed60?)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:361 +0x90
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).calculateResources(0x1dbd291?, 0xc0014780c0, 0xc000244160?, {0xc000560000, 0x12, 0xc000244160?})
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:161 +0x4dd
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).calculateAndUpdateResources(0xc00090a540, 0xc000244160)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:131 +0xb5
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).reconcile(0xc00090a540)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:109 +0x52
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10000000001?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x28e48a0, 0xc001478090}, 0x1, 0xc000111740)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x1?, 0x44e665?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0xc000755da0?, 0x0?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:92 +0x25
created by github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).Run
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:92 +0xea
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1e0be19]

goroutine 332 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0?})
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/runtime/runtime.go:56 +0xd8
panic({0x20b9260, 0x3ab9df0})
/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).mergePodResourceQoSForMemoryQoS(0x0?, 0xc0015b6800, 0x0)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:371 +0x39
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).getMergedPodResourceQoS(0x21c9d00?, 0xc0015b6800, 0xc00159ed60?)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:361 +0x90
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).calculateResources(0x1dbd291?, 0xc0014780c0, 0xc000244160?, {0xc000560000, 0x12, 0xc000244160?})
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:161 +0x4dd
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).calculateAndUpdateResources(0xc00090a540, 0xc000244160)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:131 +0xb5
github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).reconcile(0xc00090a540)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:109 +0x52
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10000000001?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x28e48a0, 0xc001478090}, 0x1, 0xc000111740)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x1?, 0x44e665?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0xc000755da0?, 0x0?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.15/pkg/util/wait/wait.go:92 +0x25
created by github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile.(*cgroupResourcesReconcile).Run
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/qosmanager/plugins/cgreconcile/cgroup_reconcile.go:92 +0xea
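
Reading the trace: the last frame before the panic is mergePodResourceQoSForMemoryQoS(0x0?, 0xc0015b6800, 0x0), whose trailing 0x0 is a nil pointer that the function then dereferences. Below is a minimal Go sketch of that failure mode; the types and names are illustrative only, not the actual koordinator definitions.

```go
package main

import "fmt"

// Illustrative stand-ins for a pod-level QoS config; these are NOT the
// real koordinator types.
type MemoryQoSCfg struct {
	MinLimitPercent *int64
}

type ResourceQoS struct {
	MemoryQoS *MemoryQoSCfg
}

// mergeForMemoryQoS sketches the crash: when a pod (e.g. one in the LSE or
// SYSTEM QoS class) has no matching QoS config, podCfg arrives as nil and
// the field access dereferences a nil pointer.
func mergeForMemoryQoS(podCfg *ResourceQoS) {
	// A defensive guard (hypothetical, the kind of check a fix would need):
	// if podCfg == nil || podCfg.MemoryQoS == nil { return }
	_ = podCfg.MemoryQoS // panics: invalid memory address or nil pointer dereference
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	mergeForMemoryQoS(nil) // the trailing 0x0 in the trace is this nil argument
}
```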

What you expected to happen:
The container should not keep restarting.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • App version:
    koordinator 1.3
  • Kubernetes version (use kubectl version):
    1.18
  • Install details (e.g. helm install args):
    Default parameters
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
      Docker version:
      Client: Docker Engine - Community
      Version: 19.03.12
      API version: 1.40
      Go version: go1.13.10
      Git commit: 48a66213fe
      Built: Mon Jun 22 15:42:53 2020
      OS/Arch: linux/amd64
      Experimental: false
    • OS version:
      Linux k8s-master0 3.10.0-1160.90.1.el7.x86_64 #1 SMP Thu May 4 15:21:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
    • Kernel version: 3.10.0-1160.90.1.el7.x86_64
    • Cgroup driver: cgroupfs/systemd
  • Others:
@BlackPigHe added the kind/bug label Sep 19, 2023
@saintube changed the title from "[BUG]" to "[BUG] koordlet CgroupReconcile panics on mergePodResourceQoSForMemoryQoS" Sep 19, 2023
@saintube (Member)

@BlackPigHe Hi, did you have any LSE or SYSTEM QoS pods on the node? This is probably a bug that was fixed in #1556 and #1663.

@BlackPigHe (Author)

[screenshot attached]
You guessed it right.

@BlackPigHe (Author)

So what should I do? Build an image from the latest branch? I installed with Helm and I'm already on the latest release.

@BlackPigHe (Author) commented Sep 19, 2023

> @BlackPigHe Hi, did you have any LSE or SYSTEM QoS pods on the node? This is probably a bug that was fixed in #1556 and #1663.

So what should I do? Build an image from the latest branch? I installed with Helm and I'm already on the latest release.
@saintube

@saintube (Member)

> So what should I do? Build an image from the latest branch? I installed with Helm and I'm already on the latest release. @saintube

@BlackPigHe The bugfixes have not been released yet, so using the fixed version may require building an image from the latest branch. If you are not using the MemoryQoS feature at the moment, you can also work around the problem temporarily by setting CgroupReconcile=false in the koordlet feature gates.
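
For reference, a sketch of applying that workaround; the DaemonSet name and namespace below assume a default helm install and may differ in your cluster:

```sh
# Temporary workaround: disable the CgroupReconcile feature gate on koordlet.
# Assumes the default helm install (DaemonSet "koordlet" in namespace
# "koordinator-system"); verify the names in your cluster first.
kubectl -n koordinator-system edit daemonset koordlet

# Then add (or extend) the feature-gates flag in the koordlet container args:
#   - --feature-gates=CgroupReconcile=false
```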

@BlackPigHe (Author)

> @BlackPigHe The bugfixes have not been released yet, so using the fixed version may require building an image from the latest branch. If you are not using the MemoryQoS feature at the moment, you can also work around the problem temporarily by setting CgroupReconcile=false in the koordlet feature gates.

Nice, thanks for the very prompt reply, much appreciated. I'll give it a try.

@saintube (Member)

@BlackPigHe Hi, how did the verification of the fix go?

@BlackPigHe (Author)

> @BlackPigHe Hi, how did the verification of the fix go?

@saintube It works, koordlet no longer reports the error.

@saintube (Member) commented Sep 25, 2023

/close
Bugfixes in #1556 and #1663 will be released in v1.4.
