[Google Batch] errorStrategy 'retry' fails when multiple tasks attempt to retry #3607

Closed
schuyler-smith opened this issue Feb 3, 2023 · 18 comments


@schuyler-smith

Bug report

Expected behavior and actual behavior

I am using a dynamic retry to increase the resources requested by my process. The process completes as expected when run in parallel with samples that do not need to restart with more resources. When I submit a single sample that needs to retry, the retry works as expected: the task fails with exit code 137, then a new job is resubmitted with the larger resource request. When I do this with multiple samples that trigger this behavior, it handles the first sample and reports that it is submitting the retry, but when the second sample fails, it errors and kills the entire workflow.

Steps to reproduce the problem

I don't have a great MRE, unfortunately, but I am happy to work with anyone to show them what I am experiencing.

These are the relevant parts of my configuration:

plugins {
    id 'nf-google'
}

google {
    location        = 'us-central1'
    project         = 'ssmith'
    batch.spot      = true
}

process {
    executor        = 'google-batch'
    withName: BAR {
        container     = 'my-container'
        errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
        disk          = { task.exitStatus in 137..140 ? '150 GB' : '100 GB' }
        memory        = { task.exitStatus in 137..140 ? '448 GB' : '240 GB' }
        cpus          = { task.exitStatus in 137..140 ? 56 : 60 }
        machineType   = { task.exitStatus in 137..140 ? 'c2d-highmem-56' : 'c2-standard-60' }
    }
}

docker.enabled = true

Program output

I think this is the relevant part of the log file:

Feb-03 12:42:04.180 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 2 -- submitted tasks are shown below
~> TaskHandler[id: 11; name: FOO:BAR (SAMPLE2); status: RUNNING; exit: -; error: -; workDir: work/bd/6f0516832d537ee0b60ea9aa9d7ed8]
~> TaskHandler[id: 12; name: FOO:BAR (SAMPLE1); status: RUNNING; exit: -; error: -; workDir: work/65/a3792ccdea05b2b2ac7798bbe85759]
Feb-03 12:42:34.340 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-65a3792c-1675449204309; state=FAILED
Feb-03 12:42:36.020 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 12; name: FOO:BAR (SAMPLE1); status: COMPLETED; exit: 137; error: -; workDir: work/65/a3792ccdea05b2b2ac7798bbe85759]
Feb-03 12:42:36.434 [Task monitor] INFO  nextflow.processor.TaskProcessor - [65/a3792c] NOTE: Process `FOO:BAR (SAMPLE1)` terminated with an error exit status (137) -- Execution is retried (1)
Feb-03 12:42:37.809 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] submitted > job=nf-21f3eddc-1675449756658; uid=j-885f9b0c-b997-4aa0-8bdc-16e0485c06ff; work-dir=work/21/f3eddca9ab267fb32e9b7d32dcce08
Feb-03 12:42:37.809 [Task submitter] INFO  nextflow.Session - [21/f3eddc] Re-submitted process > FOO:BAR (SAMPLE1)
Feb-03 12:43:34.271 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-bd6f0516-1675449202577; state=FAILED
Feb-03 12:43:34.566 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'FOO:BAR (SAMPLE2)'

Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@2cb64ca9[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2a341a00[Wrapped task = TrustedListenableFutureTask@53cd9cc9[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@f67aeee]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@26d8a725[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

Environment

  • Nextflow version: 22.10.0
  • Java version: openjdk 18.0.1.1 2022-04-22
    OpenJDK Runtime Environment Homebrew (build 18.0.1.1+0)
    OpenJDK 64-Bit Server VM Homebrew (build 18.0.1.1+0, mixed mode, sharing)
  • Operating system: macOS
@bentsherman
Member

Looks like a duplicate of #3166

@bentsherman
Member

By the way, does this work?

        disk          = { task.exitStatus in 137..140 ? '150 GB' : '100 GB' }
        memory        = { task.exitStatus in 137..140 ? '448 GB' : '240 GB' }
        cpus          = { task.exitStatus in 137..140 ? 56 : 60 }
        machineType   = { task.exitStatus in 137..140 ? 'c2d-highmem-56' : 'c2-standard-60' }

Since these settings are determined before the task is run, the exit status wouldn't exist yet. Usually we use the task attempt to compute the dynamic resources:

        memory        = { task.attempt > 1 ? '448 GB' : '240 GB' }
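
A related idiom, if you want the request to keep growing with each retry rather than jump once, is to scale the resource by the attempt number (a sketch; the numbers are illustrative, not a recommendation, and maxRetries caps how many retries are attempted):

        memory        = { 240.GB * task.attempt }
        maxRetries    = 2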

@schuyler-smith
Author

schuyler-smith commented Feb 6, 2023

Looks like a duplicate of #3166

OH! Sorry, I did try to find if this had been brought up before. I guess I was not diligent enough. Reading that thread, was this solved? Is there something that I need to add to allow Google Batch to handle the multiple retries?

By the way, does this work?

It does, or at least it does when it's just a single sample and the retry executes. I took that conditional from the user manual here.

@schuyler-smith
Author

Oh, I think I get what you mean now: that conditional was used just for errorStrategy in the manual. Still, it seems to work as I expect the way I have implemented it. When I have a workflow with no samples that need to retry, it successfully submits them all with the default values, and when one sample needs to retry, it successfully resubmits with the alternative values.

@bentsherman
Member

Yes, that condition is used for errorStrategy, but not for resource directives like memory. For the resource directives you should use the task attempt, because the exit status is not known before the task starts. I would change the other four directives to use task.attempt > 1 as the condition and try again, just in case.
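
For example, your withName: BAR block might become something like this (a sketch with the condition swapped; errorStrategy keeps the exit-status check, which is valid there because it is evaluated after the task has failed):

    withName: BAR {
        container     = 'my-container'
        errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
        disk          = { task.attempt > 1 ? '150 GB' : '100 GB' }
        memory        = { task.attempt > 1 ? '448 GB' : '240 GB' }
        cpus          = { task.attempt > 1 ? 56 : 60 }
        machineType   = { task.attempt > 1 ? 'c2d-highmem-56' : 'c2-standard-60' }
    }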

Unfortunately, we haven't found a solution to this error. We're in contact with the Google Batch team to help us find the root cause, but so far it has eluded us. Don't worry about the duplicate; in fact, your test case might help us pinpoint common factors.

How did you install your version of Java?

@schuyler-smith
Author

Homebrew on macOS

java --version
openjdk 18.0.1.1 2022-04-22
OpenJDK Runtime Environment Homebrew (build 18.0.1.1+0)
OpenJDK 64-Bit Server VM Homebrew (build 18.0.1.1+0, mixed mode, sharing)

I'm running the workflow again; it will take a bit for it to fail and attempt the restart. I'll post the results.

@schuyler-smith
Author

The same error occurred after correcting the dynamic resource conditionals.

@bentsherman
Member

Thank you. I'm wondering if the issue is caused by your Java distribution. See #3110 for context. Many users have had issues with Java versions installed via conda, so I suspect the Homebrew version might have problems as well.

We recommend that you install Java through SDKMAN, because it's super easy and reliable. Based on the recommendations of this website, you should install Temurin or Corretto 17:

sdk install java 17.0.6-amzn
# or
sdk install java 17.0.6-tem

Can you uninstall your current Java, install one of these, and try again?

@mribeirodantas
Member

Just to add to Ben's comment: if you have installed Java with Conda and also Nextflow with Conda, the Java version/distribution that Nextflow will use by default is the one installed with Conda, regardless of what you do with SDKMAN. I had this issue yesterday 😅.

In the end, I like to install Nextflow with curl -s https://get.nextflow.io | bash and Java with SDKMAN, as Ben described above.
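
If you ever need to force the launcher onto a specific JDK, Nextflow honours the NXF_JAVA_HOME environment variable; a minimal sketch, assuming SDKMAN's default install path (adjust if yours differs):

    # point Nextflow at the SDKMAN-managed JDK
    export NXF_JAVA_HOME="$HOME/.sdkman/candidates/java/current"
    nextflow -version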

@schuyler-smith
Author

schuyler-smith commented Feb 7, 2023

N E X T F L O W  ~  version 22.10.6
java --version                                               
openjdk 17.0.6 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)

Still getting the same error:

Feb-07 10:17:18.231 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-8978d683-1675786080990; state=FAILED
Feb-07 10:17:20.114 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 6; name: FOO:BAR (Sample1); status: COMPLETED; exit: 137; error: -; workDir: gs://x/work/89/78d683aa8573ddbd158b475becab38]
Feb-07 10:17:20.124 [Task monitor] INFO  nextflow.processor.TaskProcessor - [89/78d683] NOTE: Process `FOO:BAR (Sample1)` terminated with an error exit status (137) -- Execution is retried (1)
Feb-07 10:17:21.617 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] submitted > job=nf-b156c443-1675786640349; uid=j-3175f728-e01e-4789-a0c3-563aa4af1144; work-dir=gs://x/work/b1/56c4436ebb7f0eda80be3d3afa5f12
Feb-07 10:17:21.617 [Task submitter] INFO  nextflow.Session - [b1/56c443] Re-submitted process > FOO:BAR (Sample1)
Feb-07 10:17:28.272 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Terminated job=nf-9498c2d5-1675786088994; state=FAILED
Feb-07 10:17:28.642 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'FOO:BAR (Sample2)'

Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7d6f5167[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@693385f4[Wrapped task = TrustedListenableFutureTask@41d7205b[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@3d16a585]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@806771b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

It successfully retries the first job that fails with exit code 137, then errors once the second one fails.

@bentsherman
Member

Okay, thanks again for testing.

@hnawar @aaronegolden This issue appears to be the same as #3166 , in case it helps you narrow down the root cause

@BeyondTheProof

BeyondTheProof commented Feb 18, 2023

@schuyler-smith Have you had any updates on this? I just ran into the same issue, with a slightly earlier version of Java.

$ java --version
openjdk 17.0.5 2022-10-18
OpenJDK Runtime Environment (build 17.0.5+8-Ubuntu-2ubuntu120.04)
OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu120.04, mixed mode, sharing)

And current nextflow version

$ nextflow -version

      N E X T F L O W
      version 22.10.6 build 5843
      created 23-01-2023 23:20 UTC
      cite doi:10.1038/nbt.3820
      http://nextflow.io

And the traceback:

Error executing process > 'TRAIN_MODEL (9)'

Caused by:

  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@49d600b7[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@20d7c5a7[Wrapped task = TrustedListenableFutureTask@29c9a71a[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@7297027]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@57320ed7[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

FWIW, I am retrying jobs that fail, but without changing resources. I get this failure after most of the jobs have completed. Any help would be greatly appreciated! In the meantime, I'll go check on what's happening in #3166.

@schuyler-smith
Author

@BeyondTheProof No, I have not had any updates yet, and I haven't found anything I am able to do to avoid it either, other than just running the samples that fail individually. I've tried a bunch of different settings to no avail. Hopefully the Google Batch people and @bentsherman and team are able to find the solution sometime in the near future!

@pditommaso
Member

I'd suggest using the latest edge release, 23.01.0-edge. See https://www.nextflow.io/docs/latest/getstarted.html#stable-edge-releases for the difference between edge and stable releases and how to install them.
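
For reference, you can try an edge release without reinstalling by setting NXF_VER, or switch the installed launcher to the edge channel with self-update; a sketch (main.nf stands in for your pipeline script):

    # one-off run with a specific version
    NXF_VER=23.01.0-edge nextflow run main.nf
    # or switch the installed launcher to the edge channel
    NXF_EDGE=1 nextflow self-update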

@BeyondTheProof

Thanks for the response! I believe they're working on it, just gotta keep pressing them on it. When you say "run them individually", do you try to -resume, or literally run them separately outside of Nextflow?

@schuyler-smith
Author

@BeyondTheProof I haven't had the chance to try the edge release that Paolo suggested yet; I assume a fix is implemented there already if he says to use it.

By "run them individually" I mean that instead of giving, say, sample1 sample2 sample3 to run in parallel, I run the workflow in Nextflow with just sample1. The retry with Google Batch seems to work if it's just one sample; it fails if multiple samples try to retry in the same workflow submission.

@BeyondTheProof

@pditommaso @schuyler-smith

Update on the edge version 23.01.0-edge build 5834: it seems to have improved the functionality. I was able to run the whole pipeline, BUT it still did crash once. Interestingly, the jobs that were already running at the time of the crash continued to run: when the crash happened, it looked like the jobs were about 50% done, but when I resumed, it was up to 73%.

Moving forward, I thought that maybe the errors were due to too many resubmissions at the same time, which Nextflow might be unable to handle. I therefore decreased maxForks to 50 and resumed, and the pipeline finished!
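
In case it helps others, the change amounts to something like this in the config (a sketch; maxForks can also be set per-process with withName):

    process {
        // cap concurrent task executions per process
        maxForks = 50
    }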

@pditommaso
Member

Yes, in the latest edge version this error is reported as a warning; it's still not clear what causes the thread pool exception. Closing in favour of #3772
