v0.4 runtime panic #861

kinglion811 · 2019-05-28T02:52:59Z

Kubernetes version: 1.11
kube-batch version: v0.5
When I start kube-batch and schedule a TF job, kube-batch panics after running for a while.
The panic message is:

Resource is not sufficient to do operation: <cpu 52000.00, memory 261334462464.00, GPU 0.00> sub <cpu 12000.00, memory 60000000000.00, GPU 2000.00> [recovered]
	panic: Resource is not sufficient to do operation: <cpu 52000.00, memory 261334462464.00, GPU 0.00> sub <cpu 12000.00, memory 60000000000.00, GPU 2000.00>

The cause of the panic is:


kinglion811 · 2019-05-28T02:59:43Z

@k82cn

k82cn · 2019-05-28T06:51:18Z

We merged #860 a few days ago, which may be helpful :)

kinglion811 · 2019-05-29T14:45:45Z

@k82cn That does not solve my problem. This issue mainly concerns the node's Idle resource, and the main cause is inconsistent GPU resources. I merged the code and the problem reappeared.

k82cn · 2019-05-30T05:03:29Z

Thanks for your confirmation :)
We also hit a similar issue this morning; in our case, the device plugin did not report GPU info in time when kubelet restarted. We're working on a PR. Is that similar to your scenario: a panic when a kubelet with a device plugin restarts?
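The restart scenario can be sketched in a few lines of Go. This is a minimal illustration, not kube-batch's actual code: `Resource`, `Node`, `Refresh`, and the `OutOfSync` flag are hypothetical names, and the 8000 milli-GPU value in the second report is an assumed capacity. It also assumes that the left-hand side of the panic message is what the node currently offers and the right-hand side is what pods on it have requested. The idea is to check before subtracting and mark the node as out of sync rather than panic.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified resource vector
// (milli-CPU, memory in bytes, milli-GPU).
type Resource struct {
	MilliCPU float64
	Memory   float64
	MilliGPU float64
}

// LessEqual reports whether every dimension of r fits within other.
func (r Resource) LessEqual(other Resource) bool {
	return r.MilliCPU <= other.MilliCPU &&
		r.Memory <= other.Memory &&
		r.MilliGPU <= other.MilliGPU
}

// Sub returns r minus other. It returns an error instead of panicking
// when other does not fit into r.
func (r Resource) Sub(other Resource) (Resource, error) {
	if !other.LessEqual(r) {
		return r, fmt.Errorf("resource is not sufficient: %+v sub %+v", r, other)
	}
	return Resource{
		MilliCPU: r.MilliCPU - other.MilliCPU,
		Memory:   r.Memory - other.Memory,
		MilliGPU: r.MilliGPU - other.MilliGPU,
	}, nil
}

// Node is a hypothetical cached view of a node.
type Node struct {
	Allocatable Resource // what kubelet currently reports
	Used        Resource // requests of pods already bound to the node
	Idle        Resource
	OutOfSync   bool // true when Used no longer fits into Allocatable
}

// Refresh recomputes Idle from a new Allocatable report. If the report
// cannot cover Used, the node is flagged so a scheduler could skip it.
func (n *Node) Refresh(allocatable Resource) {
	n.Allocatable = allocatable
	idle, err := allocatable.Sub(n.Used)
	if err != nil {
		n.OutOfSync = true
		n.Idle = Resource{}
		return
	}
	n.OutOfSync = false
	n.Idle = idle
}

func main() {
	// Used and the first report are taken from the first panic message.
	n := &Node{Used: Resource{MilliCPU: 12000, Memory: 60000000000, MilliGPU: 2000}}

	// Right after a kubelet restart: GPUs are not reported yet.
	n.Refresh(Resource{MilliCPU: 52000, Memory: 261334462464, MilliGPU: 0})
	fmt.Println("out of sync:", n.OutOfSync) // out of sync: true

	// Later the device plugin re-registers (8000 is an assumed value).
	n.Refresh(Resource{MilliCPU: 52000, Memory: 261334462464, MilliGPU: 8000})
	fmt.Println("out of sync:", n.OutOfSync, "idle GPU:", n.Idle.MilliGPU)
	// out of sync: false idle GPU: 6000
}
```

With a guard like this, a temporarily incomplete report degrades to "node skipped" and recovers once the device plugin reports again.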

k82cn · 2019-05-30T08:21:48Z

/kind bug
/priority important-soon
/sig scheduling

kinglion811 · 2019-05-30T09:03:58Z

@k82cn Maybe. I will check and confirm.

kinglion811 · 2019-06-05T02:34:56Z

@k82cn
What is the progress on this issue?

k82cn · 2019-06-12T03:09:33Z

@asifdxtreme, would you help cherry-pick volcano-retired#26 into kube-batch? :)

kinglion811 · 2019-07-02T14:11:46Z

Observed a panic: &errors.errorString{s:"Resource is not sufficient to do operation: <cpu 56000.00, memory 270086234112.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 5000.00> sub <cpu 54000.00, memory 268435456000.00, nvidia.com/gpu 8000.00>"} (Resource is not sufficient to do operation: <cpu 56000.00, memory 270086234112.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 5000.00> sub <cpu 54000.00, memory 268435456000.00, nvidia.com/gpu 8000.00>)
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/resource_info.go:158
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/node_info.go:182
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/event_handlers.go:82
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/event_handlers.go:93
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/event_handlers.go:192
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/cache.go:262
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/controller.go:195
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/controller.go:227
:0
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:554
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71

This is mainly caused by lost GPUs.
If the GPU device plugin reports that a GPU is lost, the resource view becomes inconsistent.
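To make the inconsistency concrete, here is a small Go sketch that compares the two vectors from the panic message dimension by dimension. `Resources` and `Insufficient` are hypothetical helpers written for this illustration and are not part of kube-batch; the numbers are copied from the message above.

```go
package main

import (
	"fmt"
	"sort"
)

// Resources maps a resource name to a quantity in the same units
// as the panic message (milli-units for cpu and gpu, bytes for memory).
type Resources map[string]float64

// Insufficient returns the sorted names of resources for which the
// request exceeds what is available. A name missing from available
// counts as 0.
func Insufficient(available, request Resources) []string {
	var short []string
	for name, want := range request {
		if want > available[name] {
			short = append(short, name)
		}
	}
	sort.Strings(short)
	return short
}

func main() {
	// Left-hand side of the "sub" in the panic message.
	available := Resources{
		"cpu":            56000,
		"memory":         270086234112,
		"hugepages-1Gi":  0,
		"hugepages-2Mi":  0,
		"nvidia.com/gpu": 5000,
	}
	// Right-hand side of the "sub" in the panic message.
	request := Resources{
		"cpu":            54000,
		"memory":         268435456000,
		"nvidia.com/gpu": 8000,
	}
	fmt.Println(Insufficient(available, request)) // prints [nvidia.com/gpu]
}
```

CPU and memory fit; only `nvidia.com/gpu` falls short (5000 available against 8000 requested), which matches a node that reports fewer GPUs than its pods already hold.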

kinglion811 · 2019-07-02T14:11:59Z

@k82cn

kinglion811 changed the title ~~v0.5 runtime panic~~ v0.4 runtime panic May 29, 2019

k8s-ci-robot added the kind/bug, priority/important-soon, and sig/scheduling labels May 30, 2019

k82cn mentioned this issue May 31, 2019

Observed a panic: "invalid memory address or nil pointer dereference" #853

Closed

k82cn mentioned this issue Jun 10, 2019

Ignore nodes if out of syc. volcano-retired/scheduler#26

Merged

asifdxtreme mentioned this issue Jun 20, 2019

[Cherry-Pick #26] Ignore nodes if out of syc. #863

Merged

k8s-ci-robot closed this as completed in #863 Jun 20, 2019
