Scheduler panics with errors like "resource is not sufficient to do operation:" #3301
Comments
@Monokaix Hello, I've come across this problem, but after reviewing the code, I haven't been able to identify the root cause. Can you offer any assistance? Thanks!
Yeah, the PR #3006 you mentioned made a modification that allows a node's idle resources to become negative. Can you search for "Node out of sync" in the volcano scheduler log? When a task has specified nodeName, or a node's CPU allocatable decreases, this case might happen, so you can check that.
You should also find which node's idle CPU resource became negative and check what changes have happened to that node.
How about adding a recover from the panic and setting the node's idle resource to zero?
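The suggestion above can be sketched as a deferred recover around the resource subtraction. This is a hypothetical, simplified illustration, not volcano's actual code: the `Resource` type and `allocate` function are stand-ins for the scheduler's real per-node bookkeeping.

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's per-node
// resource bookkeeping (hypothetical; the real type is richer).
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// clampToZero resets any negative dimension to zero so later
// arithmetic cannot underflow further.
func (r *Resource) clampToZero() {
	if r.MilliCPU < 0 {
		r.MilliCPU = 0
	}
	if r.Memory < 0 {
		r.Memory = 0
	}
}

// allocate subtracts req from idle. Instead of letting an invariant
// check crash the whole scheduler process, it recovers from the
// panic, clamps idle to zero, and reports an error.
func allocate(idle, req *Resource) (err error) {
	defer func() {
		if p := recover(); p != nil {
			idle.clampToZero()
			err = fmt.Errorf("recovered from panic: %v", p)
		}
	}()
	idle.MilliCPU -= req.MilliCPU
	idle.Memory -= req.Memory
	if idle.MilliCPU < 0 || idle.Memory < 0 {
		panic("resource is not sufficient to do operation")
	}
	return nil
}

func main() {
	idle := &Resource{MilliCPU: 1000, Memory: 1 << 30}
	req := &Resource{MilliCPU: 2000, Memory: 1 << 20}
	if err := allocate(idle, req); err != nil {
		fmt.Println(err)
	}
	fmt.Println(idle.MilliCPU) // clamped to 0 rather than going negative
}
```

The trade-off, as noted below, is that recovering and clamping can leave the node cache inconsistent with reality, which is why simply recovering may not be acceptable.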
Yes, I can find the following logs, but they do not indicate that the node (bigdata2-k8s-bgp25g-music58.gy.ntes) caused a panic. Other than these two possibilities, is it possible the issue was caused by running Volcano and kube-scheduler simultaneously?
Yeah, you can check the pods on that node, find which pod's scheduling triggered the panic, and then analyze the whole scheduling process of all pods on that node.
This would leave an inconsistent node cache, so it is not applicable.
But we should not panic in the scheduler. The original code let the node be
The |
In our setting, out-of-sync (i.e. allocated > idle) nodes are quite common, especially when the cluster is crowded, because GPU/CPU failures are common events in a large cluster. Given that "out of sync" is a common and normal event, the scheduler should never panic on such a normal path. The most obvious solution is for the scheduler to skip the node instead of panicking.
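The skip-instead-of-panic idea above can be sketched as a filter in node selection. This is a minimal illustration under assumed names (`NodeInfo`, `schedulable` are hypothetical, not volcano's real API): nodes whose idle resources have gone negative are logged and skipped, so one out-of-sync node cannot crash the scheduling cycle.

```go
package main

import "fmt"

// NodeInfo is a hypothetical, simplified node record; the real
// scheduler keeps much richer per-node state.
type NodeInfo struct {
	Name         string
	IdleMilliCPU float64
}

// schedulable returns the nodes that can host a task requesting
// reqMilliCPU, skipping (rather than panicking on) any node whose
// idle CPU has gone negative, i.e. is out of sync.
func schedulable(nodes []NodeInfo, reqMilliCPU float64) []NodeInfo {
	var out []NodeInfo
	for _, n := range nodes {
		if n.IdleMilliCPU < 0 {
			fmt.Printf("skip out-of-sync node %s: idle cpu %v < 0\n",
				n.Name, n.IdleMilliCPU)
			continue
		}
		if n.IdleMilliCPU >= reqMilliCPU {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []NodeInfo{
		{Name: "node-a", IdleMilliCPU: 4000},
		{Name: "node-b", IdleMilliCPU: -500}, // CPUs went offline after allocation
		{Name: "node-c", IdleMilliCPU: 100},
	}
	for _, n := range schedulable(nodes, 1000) {
		fmt.Println(n.Name) // only node-a qualifies
	}
}
```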
What happened:
The scheduler panicked and restarted several times over about half an hour, then recovered without any manual intervention, aside from some normal tasks running automatically and completing. The logs are as follows:
earlier logs that may help:
What you expected to happen:
The scheduler should remain stable without panicking.
How to reproduce it (as minimally and precisely as possible):
I have not found the reproduction conditions yet.
Anything else we need to know?:
I found that this PR makes a modification to allow "used" to become negative: https://github.com/volcano-sh/volcano/pull/3006/files. But I'm not sure whether it's related; maybe we can discuss it here to make this issue clearer.
Environment:
Kubernetes version (kubectl version): 1.19.3