
vcjob status stays Pending when job retry count exceeds maxRetries (=15) #1029

Closed

merryzhou opened this issue Sep 1, 2020 · 8 comments

@merryzhou (Contributor)
/kind bug

What happened:

  1. Create a ResourceQuota with a GPU resource limit of 4.
  2. Create a vcjob that needs 8 GPUs (1 master * 0 GPUs, 4 workers * 2 GPUs).

Apparently, only 2 worker pods will be created.

After 4 minutes the vcjob is still in Pending status, and there is no warning event when describing the job. We can only tell what happened from the volcano-controller-manager log:

W0826 08:59:45.255193       1 job_controller.go:348] Dropping job<test/vcjob-1> out of the queue: failed to create 2 pods of 4 because max retries has reached

After reading the volcano-controller-manager code, I found the reason the job failed:

const (
	// maxRetries is the number of times a volcano job will be retried before it is dropped out of the queue.
	// With the current rate-limiter in use (5ms*2^(maxRetries-1)) the following numbers represent the times
	// a volcano job is going to be requeued:
	//
	// 5ms, 10ms, 20ms, 40ms, 80ms, 160ms, 320ms, 640ms, 1.3s, 2.6s, 5.1s, 10.2s, 20.4s, 41s, 82s
	maxRetries = 15
)

But I think maybe it's better to set the vcjob status to Failed when the retry count exceeds maxRetries.

Environment:

  • Volcano Version: v0.4.2
@william-wang (Member)

@Thor-wl , please take a look.

@Thor-wl (Contributor) commented Sep 3, 2020

> @Thor-wl, please take a look.

OK.

@merryzhou (Contributor, Author)

It seems #1062 can fix this issue.

@Thor-wl (Contributor) commented Oct 14, 2020

> It seems #1062 can fix this issue.

Yeah, the requirement is reasonable to some degree, but it may bring a risk of deadlock, as I mentioned in the PR discussion. Please take a look at that discussion. The risk is only potential in private-usage scenarios but highly possible in high-throughput scenarios.

@merryzhou (Contributor, Author) commented Oct 14, 2020

> Yeah, the requirement is reasonable to some degree, but it may bring a risk of deadlock, as I mentioned in the PR discussion. Please take a look at that discussion. The risk is only potential in private-usage scenarios but highly possible in high-throughput scenarios.

Sorry, I may not be getting the point.

Is #1062 (comment) the deadlock scenario you mentioned?

But the modification in #1062 terminates the failed job and releases its resources, instead of retrying forever.

@Thor-wl (Contributor) commented Oct 14, 2020

> Sorry, I may not be getting the point.
>
> Is #1062 (comment) the deadlock scenario you mentioned?
>
> But the modification in #1062 terminates the failed job and releases its resources, instead of retrying forever.

I'm not sure whether the information the user provided is all in accordance with the facts. As far as I can see, if the gang plugin is configured, the scheduler will schedule all of the tasks only when the resource request meets the job's demand.

@merryzhou (Contributor, Author) commented Oct 14, 2020

> I'm not sure whether the information the user provided is all in accordance with the facts. As far as I can see, if the gang plugin is configured, the scheduler will schedule all of the tasks only when the resource request meets the job's demand.

What I want to say is that the actual modification in https://github.com/volcano-sh/volcano/pull/1062/files can fix this issue.

Maybe it's better to file a new PR?

@Thor-wl (Contributor) commented Oct 14, 2020

> What I want to say is that the actual modification in https://github.com/volcano-sh/volcano/pull/1062/files can fix this issue.
>
> Maybe it's better to file a new PR?

Yeah, a new PR is better. The current modification is workable.
