
Scheduler panics with errors like "resource is not sufficient to do operation:" #3301

Open
bysph opened this issue Jan 16, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bysph

bysph commented Jan 16, 2024

What happened:
The scheduler panicked and restarted several times over roughly half an hour, then recovered without any manual intervention; during that period some normal tasks kept running and completing automatically. The logs are as follows:

igdata2-k8s-bgp25g-music75.gy.ntes fit failed: plugin NodeAffinity predicates failed node(s) didn't match Pod's node affinity/selector
I0115 10:12:11.618831       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music94.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music94.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618829       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music84.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music84.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618838       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music90.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music90.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618846       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata2-k8s-bgp25g-music30.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata2-k8s-bgp25g-music30.gy.ntes fit failed: plugin NodeAffinity predicates failed node(s) didn't match Pod's node affinity/selector
I0115 10:12:11.618853       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata2-k8s-bgp25g-music57.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata2-k8s-bgp25g-music57.gy.ntes fit failed: Insufficient cpu
I0115 10:12:11.618875       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music58.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music58.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618903       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music29.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music29.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618902       1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata2-k8s-bgp25g-music66.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata2-k8s-bgp25g-music66.gy.ntes fit failed: plugin NodeAffinity predicates failed node(s) didn't match Pod's node affinity/selector
E0115 10:12:11.618926       1 runtime.go:77] Observed a panic: resource is not sufficient to do operation: <cpu -2600.00, memory 96698806272.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00> sub <cpu 0.00, memory 0.00>
goroutine 21922 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1dbd8c0?, 0xc0885a0570})
        /go/pkg/mod/k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x200000003?})
        /go/pkg/mod/k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1dbd8c0, 0xc0885a0570})
        /usr/local/go/src/runtime/panic.go:884 +0x213
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xaf?, {0xc0c15be000?, 0xc0a308dca8?})
        /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x174
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x222bdaf?, 0x40dc08?}, {0xc0a308dca8?, 0x7ff56a9ed4c0?, 0xd0?})
        /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0c15a7980, 0xc0c00af520)
        /go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:246 +0x9c
volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).FutureIdle(0xc0c008c480)
        /go/src/volcano.sh/volcano/pkg/scheduler/api/node_info.go:96 +0x246
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc01e879600, 0xc0c008c480)
        /go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:101 +0x36
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x6?)
        /go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:74 +0x3ab
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
        /go/pkg/mod/k8s.io/client-go@v0.27.2/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
        /go/pkg/mod/k8s.io/client-go@v0.27.2/util/workqueue/parallelizer.go:76 +0x1ce
E0115 10:12:11.618952       1 runtime.go:77] Observed a panic: resource is not sufficient to do operation: <cpu -8510.00, memory 81269567488.00, ephemeral-storage 1924896325632000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> sub <cpu 0.00, memory 0.00>
goroutine 21928 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1dbd8c0?, 0xc0880da720})
        /go/pkg/mod/k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x450fd6?})
        /go/pkg/mod/k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1dbd8c0, 0xc0880da720})
        /usr/local/go/src/runtime/panic.go:884 +0x213
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xaf?, {0xc0c11b4000?, 0xc0c1b35ca8?})
        /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x174
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x222bdaf?, 0x40dc08?}, {0xc0c1b35ca8?, 0x7ff56a9e7250?, 0xd0?})
        /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0c118df40, 0xc0c00bc8a0)
        /go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:246 +0x9c
volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).FutureIdle(0xc0c008c540)
        /go/src/volcano.sh/volcano/pkg/scheduler/api/node_info.go:96 +0x246
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc01e879600, 0xc0c008c540)
        /go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:101 +0x36
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x9?)
        /go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:74 +0x3ab
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
        /go/pkg/mod/k8s.io/client-go@v0.27.2/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
        /go/pkg/mod/k8s.io/client-go@v0.27.2/util/workqueue/parallelizer.go:76 +0x1ce

earlier logs that may help:

I0115 10:12:07.316612       1 allocate.go:245] Binding Task <spark/wizard-308744-exec-7> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.316629       1 statement.go:266] After allocated Task <spark/wizard-308744-exec-7> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 36400.00, memory 227695308800.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 90600.00, memory 304705064960.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.331873       1 allocate.go:245] Binding Task <spark/wizard-308765-exec-5> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.331889       1 statement.go:266] After allocated Task <spark/wizard-308765-exec-5> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 33400.00, memory 218031632384.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 93600.00, memory 314368741376.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.335107       1 allocate.go:245] Binding Task <spark/wizard-308773-exec-2> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.335127       1 statement.go:266] After allocated Task <spark/wizard-308773-exec-2> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 30400.00, memory 208367955968.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 96600.00, memory 324032417792.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.335293       1 cache.go:307] Revert assumed volumes for task spark/wizard-308773-exec-2 on node bigdata2-k8s-bgp25g-music58.gy.ntes
E0115 10:12:07.335299       1 statement.go:393] Failed to allocate task: for node bigdata2-k8s-bgp25g-music58.gy.ntes resources [cpu] are not enough to put task <spark/wizard-308773-exec-2>, idle: cpu 2400.00, memory 113610240000.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, req: cpu 3000.00, memory 9663676416.00
E0115 10:12:07.352554       1 node_info.go:393] "Idle resources turn into negative after allocated" nodeName="bigdata2-k8s-bgp25g-music58.gy.ntes" task="spark/wizard-308703-exec-287" resources=[cpu] idle="cpu -1600.00, memory 100322684928.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00" req="cpu 3000.00, memory 9663676416.00"
E0115 10:12:07.357392       1 node_info.go:393] "Idle resources turn into negative after allocated" nodeName="bigdata2-k8s-bgp25g-music58.gy.ntes" task="spark/wizard-308703-exec-299" resources=[cpu] idle="cpu -4600.00, memory 90659008512.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00" req="cpu 3000.00, memory 9663676416.00"
I0115 10:12:07.361560       1 allocate.go:245] Binding Task <spark/wizard-308824-exec-10> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.361578       1 statement.go:266] After allocated Task <spark/wizard-308824-exec-10> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 27400.00, memory 198704279552.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 99600.00, memory 333696094208.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.361698       1 cache.go:307] Revert assumed volumes for task spark/wizard-308824-exec-10 on node bigdata2-k8s-bgp25g-music58.gy.ntes
E0115 10:12:07.361703       1 statement.go:393] Failed to allocate task: for node bigdata2-k8s-bgp25g-music58.gy.ntes resources [cpu] are not enough to put task <spark/wizard-308824-exec-10>, idle: cpu -4600.00, memory 90659008512.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, req: cpu 3000.00, memory 9663676416.00
E0115 10:12:07.363133       1 node_info.go:393] "Idle resources turn into negative after allocated" nodeName="bigdata2-k8s-bgp25g-music58.gy.ntes" task="spark/kyuubi-pandora-by-m-bi-soc-eventpage-view-stat-1d-by-foll-c31991fd-82a8-39e2-b0c1-91e516c0f535-driver" resources=[cpu] idle="cpu -5600.00, memory 87035129856.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00" req="cpu 1000.00, memory 3623878656.00"

What you expected to happen:
The scheduler should remain stable without panicking.

How to reproduce it (as minimally and precisely as possible):
I have not found the reproduction conditions yet.

Anything else we need to know?:
I found that this PR modified the code to allow "used" to become negative: https://github.com/volcano-sh/volcano/pull/3006/files. I'm not sure whether it's related, but maybe we can discuss it here to clarify the issue.
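To make the failure mode concrete, here is a minimal, self-contained sketch of why a negative idle value makes a checked `Sub` panic even when subtracting zero, matching the `sub <cpu 0.00, memory 0.00>` in the traces above. The `Resource` type and its methods here are simplified stand-ins, not the real Volcano API:

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource type.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// LessEqual reports whether every dimension of r is <= the same dimension of rr.
func (r *Resource) LessEqual(rr *Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.Memory <= rr.Memory
}

// Sub asserts that the receiver can cover rr before subtracting, mimicking
// the assertion that produces "resource is not sufficient to do operation".
func (r *Resource) Sub(rr *Resource) *Resource {
	if !rr.LessEqual(r) {
		panic(fmt.Sprintf("resource is not sufficient to do operation: <cpu %.2f, memory %.2f> sub <cpu %.2f, memory %.2f>",
			r.MilliCPU, r.Memory, rr.MilliCPU, rr.Memory))
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	return r
}

func main() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("panicked:", err)
		}
	}()
	// Idle already driven negative by over-allocation, as in the logs above.
	idle := &Resource{MilliCPU: -2600, Memory: 96698806272}
	// Even subtracting zero panics once any dimension is negative,
	// because 0 <= -2600 is false.
	idle.Sub(&Resource{MilliCPU: 0, Memory: 0})
}
```

Once `Idle` has gone negative (which PR #3006 permits in the allocation path), any later checked subtraction on it, such as the one inside `FutureIdle`, trips the assertion.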

Environment:

  • Volcano Version: 1.8.1
  • Kubernetes version (use kubectl version): 1.19.3
@bysph bysph added the kind/bug Categorizes issue or PR as related to a bug. label Jan 16, 2024
@bysph
Author

bysph commented Jan 16, 2024

@Monokaix Hello, I've run into this problem, but after reviewing the code I haven't been able to identify the root cause. Can you offer any assistance? thx~

@bysph bysph changed the title Scheduler panics for with errors like "resource is not sufficient to do operation:" Scheduler panics with errors like "resource is not sufficient to do operation:" Jan 16, 2024
@Monokaix
Member

Monokaix commented Jan 17, 2024

Yeah, the PR #3006 you mentioned modified the code to allow node idle resources to become negative. Can you search for "Node out of sync" in the Volcano scheduler log? This case might happen when a task has a specified nodeName or a node's allocatable CPU decreases, so you can check for that.

@Monokaix
Member

You should also find which node's idle CPU resource became negative, and check what changes happened on that node and which tasks were assigned to or removed from it.

@lowang-bh
Member

func (ni *NodeInfo) allocateIdleResource(ti *TaskInfo) {
	ok, resources := ti.Resreq.LessEqualWithResourcesName(ni.Idle, Zero)
	if ok {
		ni.Idle.sub(ti.Resreq)
		return
	}

	ni.Idle.sub(ti.Resreq)
	klog.ErrorS(nil, "Idle resources turn into negative after allocated",
		"nodeName", ni.Name, "task", klog.KObj(ti.Pod), "resources", resources, "idle", ni.Idle.String(), "req", ti.Resreq.String())
}

How about adding a recover from the panic and setting the node's idle resources to zero?
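For discussion, a rough sketch of that idea, again with simplified stand-in types rather than the real Volcano `Resource`/`NodeInfo` API: recover from the underflow panic and clamp any negative dimension of idle back to zero:

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource type.
type Resource struct{ MilliCPU, Memory float64 }

// Sub subtracts rr and panics on underflow, mimicking the real assertion.
func (r *Resource) Sub(rr *Resource) {
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	if r.MilliCPU < 0 || r.Memory < 0 {
		panic("resource is not sufficient to do operation")
	}
}

// clampToZero resets any negative dimension to zero.
func (r *Resource) clampToZero() {
	if r.MilliCPU < 0 {
		r.MilliCPU = 0
	}
	if r.Memory < 0 {
		r.Memory = 0
	}
}

// subRecover runs Sub, recovers from the underflow panic, and clamps
// the idle resource so the scheduler process itself survives.
func subRecover(idle, req *Resource) {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("recovered:", err)
			idle.clampToZero()
		}
	}()
	idle.Sub(req)
}

func main() {
	idle := &Resource{MilliCPU: 2400, Memory: 1 << 30}
	// Request exceeds idle CPU, as in the "Failed to allocate task" logs.
	subRecover(idle, &Resource{MilliCPU: 3000, Memory: 0})
	fmt.Printf("idle after clamp: cpu=%.0f mem=%.0f\n", idle.MilliCPU, idle.Memory)
}
```

The trade-off, discussed below, is that clamping hides the underflow: the in-memory idle value no longer matches what has actually been allocated on the node.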

@bysph
Author

bysph commented Jan 17, 2024

Yeah, the PR #3006 you mentioned modified the code to allow node idle resources to become negative. Can you search for "Node out of sync" in the Volcano scheduler log? This case might happen when a task has a specified nodeName or a node's allocatable CPU decreases, so you can check for that.

Yes, I can find the following logs, but they don't show that the node (bigdata2-k8s-bgp25g-music58.gy.ntes) is the one that caused the panic. Besides these two possibilities, could the issue be caused by running Volcano and kube-scheduler simultaneously?
[screenshot of scheduler logs]

@Monokaix
Member

Yeah, the PR #3006 you mentioned modified the code to allow node idle resources to become negative. Can you search for "Node out of sync" in the Volcano scheduler log? This case might happen when a task has a specified nodeName or a node's allocatable CPU decreases, so you can check for that.

Yes, I can find the following logs, but they don't show that the node (bigdata2-k8s-bgp25g-music58.gy.ntes) is the one that caused the panic. Besides these two possibilities, could the issue be caused by running Volcano and kube-scheduler simultaneously?

Yeah, you can check the pods on that node, find after which pod was scheduled the scheduler panicked, and analyze the whole scheduling process of all the pods on that node.

@Monokaix
Member

func (ni *NodeInfo) allocateIdleResource(ti *TaskInfo) {
	ok, resources := ti.Resreq.LessEqualWithResourcesName(ni.Idle, Zero)
	if ok {
		ni.Idle.sub(ti.Resreq)
		return
	}

	ni.Idle.sub(ti.Resreq)
	klog.ErrorS(nil, "Idle resources turn into negative after allocated",
		"nodeName", ni.Name, "task", klog.KObj(ti.Pod), "resources", resources, "idle", ni.Idle.String(), "req", ti.Resreq.String())
}

How about adding a recover from the panic and setting the node's idle resources to zero?

That would leave the node cache inconsistent, so it's not applicable.

@lowang-bh
Member

But the scheduler should not panic. The original code marked the node as out of sync and would not schedule tasks to it.

@Monokaix
Member

Monokaix commented Jan 23, 2024

But the scheduler should not panic. The original code marked the node as out of sync and would not schedule tasks to it.

The outofsync state was removed by #2998 to handle some scenarios; we should not count on it, and the node or device plugin should recover resource reporting themselves.

@rt849122

But the scheduler should not panic. The original code marked the node as out of sync and would not schedule tasks to it.

The outofsync state was removed by #2998 to handle some scenarios; we should not count on it, and the node or device plugin should recover resource reporting themselves.

In our setting, an out-of-sync node (i.e., allocated > idle) is quite common, especially when the cluster is crowded, because GPU/CPU failures are common events in a large cluster. Given that "outofsync" is a common and normal state, the scheduler should never panic on such a normal path.

As the most obvious fix, the scheduler should skip the node instead of panicking.
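A sketch of that approach, with hypothetical signatures and simplified stand-in types (the real `FutureIdle` does not return an error): surface underflow as an error instead of asserting, and let the predicate loop skip the node:

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a simplified stand-in for the scheduler's resource type.
type Resource struct{ MilliCPU, Memory float64 }

func (r *Resource) Clone() *Resource { c := *r; return &c }

func (r *Resource) Add(rr *Resource) *Resource {
	r.MilliCPU += rr.MilliCPU
	r.Memory += rr.Memory
	return r
}

var errUnderflow = errors.New("resource is not sufficient to do operation")

// subChecked subtracts rr and reports an error instead of panicking.
func (r *Resource) subChecked(rr *Resource) error {
	if rr.MilliCPU > r.MilliCPU || rr.Memory > r.Memory {
		return errUnderflow
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	return nil
}

// futureIdle computes idle + releasing - pipelined, surfacing underflow
// as an error so the caller can skip the node rather than crash.
func futureIdle(idle, releasing, pipelined *Resource) (*Resource, error) {
	f := idle.Clone().Add(releasing)
	if err := f.subChecked(pipelined); err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	nodes := map[string]*Resource{
		"node-ok":  {MilliCPU: 4000, Memory: 8 << 30},
		"node-neg": {MilliCPU: -2600, Memory: 8 << 30}, // bookkeeping went negative
	}
	zero := &Resource{}
	for name, idle := range nodes {
		if _, err := futureIdle(idle, zero, zero); err != nil {
			// Skip the out-of-sync node instead of panicking.
			fmt.Printf("skipping node %s: %v\n", name, err)
			continue
		}
		fmt.Printf("node %s is a candidate\n", name)
	}
}
```

The node with negative idle is simply excluded for this scheduling cycle; once its resource report catches up, it becomes schedulable again.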
