vcjob status stays Pending when job retry count exceeds maxRetries (=15) #1029
Comments
@Thor-wl, please take a look.
OK.
It seems #1062 can fix this issue.
Yeah, the requirement is reasonable to some degree, but it may bring a risk of deadlock, as I mentioned in the PR discussion. Please take a look at that discussion. The risk is only potential in private usage scenarios but highly likely in high-throughput scenarios.
Sorry, I may not get the point. Is #1062 (comment) the deadlock scenario you mentioned? The modification in #1062 terminates the failed job and releases its resources instead of retrying forever.
I'm not sure whether all the information the user provided is accurate. As far as I can see, if the gang plugin is configured, the scheduler can schedule all the tasks only when the resource request meets the job's demand.
What I want to say is that the actual modification in https://github.com/volcano-sh/volcano/pull/1062/files can fix this issue. Maybe it's better to file a new PR?
Yeah, a new PR is better. The current modification is usable.
/kind bug
What happened:
Apparently, only 2 worker pods will be created.
After 4 minutes, the vcjob stays in Pending status, and there is no warning event when describing the job.
We can only tell what happened from the volcano controller manager log.
After reading the volcano controller manager code, I found the reason the job failed:
I think it may be better to set the vcjob status to Failed when the retry count exceeds maxRetries.
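To make the proposal concrete, here is a minimal Go sketch of the suggested behavior. It is not Volcano's actual controller code: `Job`, `Phase`, `syncWithRetries`, and `errNoResources` are hypothetical names introduced only for illustration, and only the `maxRetries` value of 15 comes from this report. The idea is that once the retry budget is exhausted, the job is marked Failed with a message instead of being silently dropped while it still shows Pending.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for the job status phase.
type Phase string

const (
	Pending Phase = "Pending"
	Running Phase = "Running"
	Failed  Phase = "Failed"
)

// maxRetries mirrors the limit of 15 mentioned in this issue.
const maxRetries = 15

// errNoResources is a hypothetical error used to simulate a failing sync.
var errNoResources = errors.New("insufficient resources")

// Job is a simplified, hypothetical model of a vcjob.
type Job struct {
	Name    string
	Phase   Phase
	Message string
}

// syncWithRetries calls sync until it succeeds or maxRetries retries
// have been used. On exhaustion it marks the job Failed and records
// the last error, so the failure is visible on the job itself.
// It returns the number of retries performed.
func syncWithRetries(job *Job, sync func(*Job) error) int {
	for retries := 0; ; retries++ {
		err := sync(job)
		if err == nil {
			return retries
		}
		if retries >= maxRetries {
			job.Phase = Failed
			job.Message = fmt.Sprintf("sync failed after %d retries: %v", retries, err)
			return retries
		}
	}
}
```

With a sync function that always returns an error, the job ends in the Failed phase after the initial attempt plus 15 retries, and `Message` carries the last error. A real controller would also back off between attempts and release the job's resources, which this sketch leaves out.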
Environment: