[Google Batch] errorStrategy 'retry' fails when multiple tasks attempt to retry #3607

Closed
schuyler-smith opened this issue Feb 3, 2023 · 18 comments


@schuyler-smith

Bug report

Expected behavior and actual behavior

I am using a dynamic retry to increase the resources requested by my process. The process completes as expected when run in parallel with samples that do not need to restart with more resources. When I submit a single sample that needs to retry, the retry works as expected: the task fails with exit code 137, then a new job is resubmitted with the larger resource request. When I do this with multiple samples that trigger this behavior, it handles the first sample and reports that it is submitting the retry, but when the second sample fails, it errors and kills the entire workflow.

Steps to reproduce the problem

I don't have a great MRE, unfortunately, but I am happy to work with anyone to show them what I am experiencing.

These are the relevant parts of my configuration:

plugins {
    id 'nf-google'
}

google {
    location        = 'us-central1'
    project         = 'ssmith'
    batch.spot      = true
}

process {
    executor        = 'google-batch'
    withName: BAR {
        container     = 'my-container'
        errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
        disk          = { task.exitStatus in 137..140 ? '150 GB' : '100 GB' }
        memory        = { task.exitStatus in 137..140 ? '448 GB' : '240 GB' }
        cpus          = { task.exitStatus in 137..140 ? 56 : 60 }
        machineType   = { task.exitStatus in 137..140 ? 'c2d-highmem-56' : 'c2-standard-60' }
    }
}

docker.enabled = true

Program output

I think this is the relevant part of the log file:

Feb-03 12:42:04.180 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 2 -- submitted tasks are shown below
~> TaskHandler[id: 11; name: FOO:BAR (SAMPLE2); status: RUNNING; exit: -; error: -; workDir: work/bd/6f0516832d537ee0b60ea9aa9d7ed8]
~> TaskHandler[id: 12; name: FOO:BAR (SAMPLE1); status: RUNNING; exit: -; error: -; workDir: work/65/a3792ccdea05b2b2ac7798bbe85759]
Feb-03 12:42:34.340 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-65a3792c-1675449204309; state=FAILED
Feb-03 12:42:36.020 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 12; name: FOO:BAR (SAMPLE1); status: COMPLETED; exit: 137; error: -; workDir: work/65/a3792ccdea05b2b2ac7798bbe85759]
Feb-03 12:42:36.434 [Task monitor] INFO  nextflow.processor.TaskProcessor - [65/a3792c] NOTE: Process `FOO:BAR (SAMPLE1)` terminated with an error exit status (137) -- Execution is retried (1)
Feb-03 12:42:37.809 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] submitted > job=nf-21f3eddc-1675449756658; uid=j-885f9b0c-b997-4aa0-8bdc-16e0485c06ff; work-dir=work/21/f3eddca9ab267fb32e9b7d32dcce08
Feb-03 12:42:37.809 [Task submitter] INFO  nextflow.Session - [21/f3eddc] Re-submitted process > FOO:BAR (SAMPLE1)
Feb-03 12:43:34.271 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-bd6f0516-1675449202577; state=FAILED
Feb-03 12:43:34.566 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'FOO:BAR (SAMPLE2)'

Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@2cb64ca9[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2a341a00[Wrapped task = TrustedListenableFutureTask@53cd9cc9[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@f67aeee]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@26d8a725[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

Environment

  • Nextflow version: 22.10.0
  • Java version: openjdk 18.0.1.1 2022-04-22
    OpenJDK Runtime Environment Homebrew (build 18.0.1.1+0)
    OpenJDK 64-Bit Server VM Homebrew (build 18.0.1.1+0, mixed mode, sharing)
  • Operating system: macOS
@bentsherman
Member

Looks like a duplicate of #3166

@bentsherman
Member

By the way, does this work?

        disk          = { task.exitStatus in 137..140 ? '150 GB' : '100 GB' }
        memory        = { task.exitStatus in 137..140 ? '448 GB' : '240 GB' }
        cpus          = { task.exitStatus in 137..140 ? 56 : 60 }
        machineType   = { task.exitStatus in 137..140 ? 'c2d-highmem-56' : 'c2-standard-60' }

Since these settings are determined before the task is run, the exit status wouldn't exist yet. Usually we use the task attempt to compute the dynamic resources:

        memory        = { task.attempt > 1 ? '448 GB' : '240 GB' }
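
A related idiom, if you want the request to keep growing with each retry rather than jump once, is to scale the resource by the attempt number (a sketch; the numbers are illustrative, not a recommendation, and maxRetries caps how many retries are attempted):

        memory        = { 240.GB * task.attempt }
        maxRetries    = 2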

@schuyler-smith
Author

schuyler-smith commented Feb 6, 2023

Looks like a duplicate of #3166

OH! Sorry, I did try to find if this had been brought up before. I guess I was not diligent enough. Reading that thread, was this solved? Is there something that I need to add to allow Google Batch to handle the multiple retries?

By the way, does this work?

It does, or at least it does when it's just a single sample and the retry executes. I took that conditional from the user manual here.

@schuyler-smith
Author

Oh, I think I get what you mean now: that conditional was used just for errorStrategy in the manual. Still, it seems to work as I expect the way I have implemented it. When I have a workflow with no samples that need to retry, it successfully submits them all with the default values, and when one sample needs to retry, it successfully resubmits with the alternative values.

@bentsherman
Member

Yes, that condition is used for errorStrategy, but not for resource directives like memory. For the resource directives you should use the task attempt, because the exit status is not known before the task starts. I would change the other four directives to use task.attempt > 1 as the condition and try again, just in case.
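
For example, your withName: BAR block might become something like this (a sketch with the condition swapped; errorStrategy keeps the exit-status check, which is valid there because it is evaluated after the task has failed):

    withName: BAR {
        container     = 'my-container'
        errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
        disk          = { task.attempt > 1 ? '150 GB' : '100 GB' }
        memory        = { task.attempt > 1 ? '448 GB' : '240 GB' }
        cpus          = { task.attempt > 1 ? 56 : 60 }
        machineType   = { task.attempt > 1 ? 'c2d-highmem-56' : 'c2-standard-60' }
    }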

Unfortunately, we haven't found a solution to this error. We're in contact with the Google Batch team to help us find the root cause, but so far it has eluded us. Don't worry about the duplicate; in fact, your test case might help us pinpoint common factors.

How did you install your version of Java?

@schuyler-smith
Author

Homebrew on macOS

java --version
openjdk 18.0.1.1 2022-04-22
OpenJDK Runtime Environment Homebrew (build 18.0.1.1+0)
OpenJDK 64-Bit Server VM Homebrew (build 18.0.1.1+0, mixed mode, sharing)

I'm running the workflow again; it will take a bit for it to fail and attempt the restart. I'll post the results.

@schuyler-smith
Author

The same error occurred after correcting the dynamic resource conditionals.

@bentsherman
Member

Thank you. I'm wondering if the issue is caused by your Java distribution. See #3110 for context. Many users have had issues with Java versions installed via conda, so I suspect the Homebrew version might have problems as well.

We recommend that you install Java through SDKMAN, because it's super easy and reliable. Based on the recommendations of this website, you should install Temurin or Corretto 17:

sdk install java 17.0.6-amzn
# or
sdk install java 17.0.6-tem

Can you uninstall your current Java, install one of these, and try again?

@mribeirodantas
Member

Just to add to Ben's comment: if you have installed Java with Conda and also Nextflow with Conda, the Java version/distribution that Nextflow will use by default is the one installed with Conda, regardless of what you do with SDKMAN. I had this issue yesterday 😅.

In the end, I like to install Nextflow with curl -s https://get.nextflow.io | bash and Java with SDKMAN, as Ben described above.
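
If you ever need to force the launcher onto a specific JDK, Nextflow honours the NXF_JAVA_HOME environment variable; a minimal sketch, assuming SDKMAN's default install path (adjust if yours differs):

    # point Nextflow at the SDKMAN-managed JDK
    export NXF_JAVA_HOME="$HOME/.sdkman/candidates/java/current"
    nextflow -version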

@schuyler-smith
Author

schuyler-smith commented Feb 7, 2023

N E X T F L O W  ~  version 22.10.6
java --version                                               
openjdk 17.0.6 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)

Still getting the same error:

Feb-07 10:17:18.231 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-8978d683-1675786080990; state=FAILED
Feb-07 10:17:20.114 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 6; name: FOO:BAR (Sample1); status: COMPLETED; exit: 137; error: -; workDir: gs://x/work/89/78d683aa8573ddbd158b475becab38]
Feb-07 10:17:20.124 [Task monitor] INFO  nextflow.processor.TaskProcessor - [89/78d683] NOTE: Process `FOO:BAR (Sample1)` terminated with an error exit status (137) -- Execution is retried (1)
Feb-07 10:17:21.617 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] submitted > job=nf-b156c443-1675786640349; uid=j-3175f728-e01e-4789-a0c3-563aa4af1144; work-dir=gs://x/work/b1/56c4436ebb7f0eda80be3d3afa5f12
Feb-07 10:17:21.617 [Task submitter] INFO  nextflow.Session - [b1/56c443] Re-submitted process > FOO:BAR (Sample1)
Feb-07 10:17:28.272 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-9498c2d5-1675786088994; state=FAILED
Feb-07 10:17:28.642 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'FOO:BAR (Sample2)'

Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7d6f5167[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@693385f4[Wrapped task = TrustedListenableFutureTask@41d7205b[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@3d16a585]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@806771b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

It successfully retries the first job that fails with exit code 137, then errors once the second one fails.

@bentsherman
Member

Okay, thanks again for testing.

@hnawar @aaronegolden This issue appears to be the same as #3166 , in case it helps you narrow down the root cause

@BeyondTheProof

BeyondTheProof commented Feb 18, 2023

@schuyler-smith Have you had any updates on this? I just ran into the same issue, with a slightly earlier version of Java.

$ java --version
openjdk 17.0.5 2022-10-18
OpenJDK Runtime Environment (build 17.0.5+8-Ubuntu-2ubuntu120.04)
OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu120.04, mixed mode, sharing)

And current nextflow version

$ nextflow -version

      N E X T F L O W
      version 22.10.6 build 5843
      created 23-01-2023 23:20 UTC
      cite doi:10.1038/nbt.3820
      http://nextflow.io

And the traceback:

Error executing process > 'TRAIN_MODEL (9)'

Caused by:

  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@49d600b7[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@20d7c5a7[Wrapped task = TrustedListenableFutureTask@29c9a71a[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@7297027]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@57320ed7[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

FWIW, I am retrying jobs that fail, but without changing resources. I get this failure after most of the jobs have completed. Any help would be greatly appreciated! In the meantime, I'll go check on what's happening in #3166.

@schuyler-smith
Author

@BeyondTheProof No, I have not had any updates yet, and I haven't found anything I am able to do to avoid it either, other than just running the samples that fail individually. I've tried a bunch of different settings to no avail. Hopefully the Google Batch people and @bentsherman and team are able to find the solution sometime in the near future!

@pditommaso
Member

I'd suggest using the latest edge release, 23.01.0-edge. See https://www.nextflow.io/docs/latest/getstarted.html#stable-edge-releases for the difference between edge and stable releases and how to install them.
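
For reference, you can try an edge release without reinstalling by setting NXF_VER, or switch the installed launcher to the edge channel with self-update; a sketch (main.nf stands in for your pipeline script):

    # one-off run with a specific version
    NXF_VER=23.01.0-edge nextflow run main.nf
    # or switch the installed launcher to the edge channel
    NXF_EDGE=1 nextflow self-update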

@BeyondTheProof

Thanks for the response! I believe they're working on it, just gotta keep pressing them on it. When you say "run them individually", do you try to -resume, or literally run them separately outside of Nextflow?

@schuyler-smith
Author

@BeyondTheProof I haven't had the chance to try the edge release that Paolo suggested yet; I assume a fix is implemented there already if he says to use it.

By "run them individually" I mean that instead of giving, say, sample1 sample2 sample3 to run in parallel, I run the workflow in Nextflow with just sample1. The retry with Google Batch seems to work if it's just one sample; it fails if multiple samples try to retry in the same workflow submission.

@BeyondTheProof

@pditommaso @schuyler-smith

Update on the edge version 23.01.0-edge build 5834: it seems to have improved the functionality. I was able to run the whole pipeline, BUT it still did crash once. Interestingly, the jobs that were already running at the time of the crash continued to run: when the crash happened, it looked like the jobs were about 50% done, but when I resumed, it was up to 73%.

Moving forward, I thought that maybe the errors were due to too many resubmissions at the same time, which Nextflow might be unable to handle. I therefore decreased maxForks to 50 and resumed, and the pipeline finished!
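
In case it helps others, the change amounts to something like this in the config (a sketch; maxForks can also be set per-process with withName):

    process {
        // cap concurrent task executions per process
        maxForks = 50
    }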

@pditommaso
Member

Yes, in the latest edge version this error is reported as a warning; it's still not clear what causes the thread pool exception. Closing in favour of #3772
