-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Google Batch hang when internal error during scheduling #5567
Fix Google Batch hang when internal error during scheduling #5567
Conversation
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
plugins/nf-google/src/test/nextflow/cloud/google/batch/GoogleBatchTaskHandlerTest.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/test/nextflow/cloud/google/batch/GoogleBatchTaskHandlerTest.groovy
Outdated
Show resolved
Hide resolved
…atchTaskHandlerTest.groovy Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
…atchTaskHandlerTest.groovy Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
plugins/nf-google/src/test/nextflow/cloud/google/batch/GoogleBatchTaskHandlerTest.groovy
Outdated
Show resolved
Hide resolved
…atchTaskHandlerTest.groovy [ci skip] Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
…atchTaskHandler.groovy Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
…atchTaskHandler.groovy Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
…atchTaskHandler.groovy [ci skip] Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
Yes, initially the job might not have any tasks, but we need to check the task state now to support job arrays, so when there are no tasks we just wait for tasks to be created. Checking the job status for failure makes sense |
@bentsherman approve if you think it's good |
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/client/BatchClient.groovy
Outdated
Show resolved
Hide resolved
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
plugins/nf-google/src/test/nextflow/cloud/google/batch/GoogleBatchTaskHandlerTest.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/client/BatchClient.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Outdated
Show resolved
Hide resolved
plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Show resolved
Hide resolved
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
It would be nice to test against a real error case, but not sure if it can be easily replicated. Aside from that, the changes look good to me |
Good point, Esha reported the issue she has a test case |
Tested internally |
Signed-off-by: jorgee <jorge.ejarque@seqera.io> Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com> Signed-off-by: Ben Sherman <bentshermann@gmail.com> Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Co-authored-by: Ben Sherman <bentshermann@gmail.com>
When there is an internal error in Google Batch the job during scheduling. Jobs go from queued to failed. This situation is happening when request a VM type which is not available in the selected region. In this situation, there are no tasks in the batch job and NF is blocked in an infinite loop caused by the following part of code.
nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
In this PR, this part is modified including a job status check to avoid to continue checking when job is failed.
I still have pending to decide what to do if job do not have tasks and job status is null. I think we should abort but @bentsherman do you remember why you returned PENDING when there are no tasks in the job? Is the some initial state where GCP is creating the tasks and we should consider PENDING?