[Google Batch] errorStrategy 'retry' fails when multiple tasks attempt to retry #3607
Comments
Looks like a duplicate of #3166 |
By the way, does this work?

disk = { task.exitStatus in 137..140 ? '150 GB' : '100 GB' }
memory = { task.exitStatus in 137..140 ? '448 GB' : '240 GB' }
cpus = { task.exitStatus in 137..140 ? 56 : 60 }
machineType = { task.exitStatus in 137..140 ? 'c2d-highmem-56' : 'c2-standard-60' }

Since these settings are determined before the task is run, the exit status wouldn't exist yet. Usually we use the task attempt to compute the dynamic resources:

memory = { task.attempt > 1 ? '448 GB' : '240 GB' }
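For readers following along, the suggestion above could look roughly like the config sketch below. It is only an illustration: the process name `BIG_TASK` and the `maxRetries` value are assumptions, not taken from the reporter's setup. The exit-status check moves into `errorStrategy`, which is evaluated after a task fails, while the resource directives key off `task.attempt`.

```groovy
// nextflow.config (sketch; 'BIG_TASK' is a hypothetical process name)
process {
    withName: 'BIG_TASK' {
        // Evaluated after a failure, so task.exitStatus is available here
        errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
        maxRetries    = 2   // assumed value, adjust as needed

        // Evaluated before each attempt, so use task.attempt rather than task.exitStatus
        disk        = { task.attempt > 1 ? '150 GB' : '100 GB' }
        memory      = { task.attempt > 1 ? '448 GB' : '240 GB' }
        cpus        = { task.attempt > 1 ? 56 : 60 }
        machineType = { task.attempt > 1 ? 'c2d-highmem-56' : 'c2-standard-60' }
    }
}
```

This only addresses the resource conditionals; it is not a fix for the retry error reported in this issue.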
OH! Sorry, I did try to find if this had been brought up before. I guess I was not diligent enough. Reading that thread, was this solved? Is there something that I need to add to allow Google Batch to handle the multiple retries?
It does, or, at least, it does when it's just a single sample and the retry executes. I took that conditional from the user manual here. |
Oh.. I think I get what you mean now, that conditional was used just for the |
Yes, that condition is used for Unfortunately we haven't found a solution to this error. We're in contact with the Google Batch team to help us find the root cause, but so far it has eluded us. Don't worry about the duplicate, in fact your test case might help us pinpoint any common factors. How did you install your version of Java? |
Homebrew on MacOS
Running the workflow again, will take a bit for it to fail and attempt to restart, will post the results. |
Same error occurred after correcting the dynamic resource conditionals. |
Thank you. I'm wondering if the issue is caused by your Java distribution. See #3110 for context. Many users have had issues with Java versions installed via conda, so I suspect the Homebrew version might have problems as well. We recommend that you install Java through SDKMAN, because it's super easy and reliable. Based on the recommendations of this website, you should install Java Temurin or Corretto 17:

sdk install java 17.0.6-amzn
# or
sdk install java 17.0.6-tem

Can you uninstall your current Java, install one of these, and try again?
Just to add to Ben's comment, if you have installed Java with Conda and also Nextflow with Conda, the Java version/distribution that Nextflow will use by default is the one installed with Conda, regardless of what you do with sdkman. I had this issue yesterday 😅. In the end, I like to install Nextflow with |
Still getting the same error
It successfully retries the first job that fails with exit code 137, then errors once the second one fails. |
Okay, thanks again for testing. @hnawar @aaronegolden This issue appears to be the same as #3166 , in case it helps you narrow down the root cause |
@schuyler-smith Have you had any updates on this? I just ran into the same issue. Slightly earlier version of Java.
And current nextflow version
And the traceback:
FWIW, I am retrying jobs that fail, but without changing resources. I get this failure after most of the jobs have completed. Any help would be greatly appreciated! In the meantime, I'll go check on what's happening on 3166. |
@BeyondTheProof No, I have not had any updates yet, and I haven't found anything that I am able to do to avoid it either, other than just run the samples that fail individually. I've tried a bunch of different settings to no avail. Hopefully the Google Batch people and @bentsherman and team are able to find the solution sometime in the near future! |
I'd suggest using latest release |
Thanks for the response! I believe they're working on it, just gotta keep pressing them on it. When you say "run them individually", do you try to |
@BeyondTheProof I haven't had the chance to try the
By "run them individually" I mean that, instead of giving, say, sample1 sample2 sample3 to run in parallel, I run the workflow in Nextflow with just sample1. It seems that the retry with Google Batch works if it's just one sample, but fails if multiple samples try to retry in the same workflow submission.
Update on the edge Moving forward, I thought that maybe the errors are due to too many resubmits at the same time, and Nextflow is maybe unable to handle it. Therefore, I decreased maxForks to 50 and resumed, and the pipeline finished! |
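A minimal sketch of the workaround described above, assuming the limit is applied to all processes in `nextflow.config` (the commenter did not say where they set it). `maxForks` caps how many instances of a process run in parallel, which in turn limits how many retries can be submitted at once.

```groovy
// nextflow.config (sketch; the scope of the setting is an assumption)
process {
    // Run at most 50 parallel instances of each process
    maxForks = 50
}
```

After changing the setting, the run can be continued with the `-resume` option so that completed tasks are not repeated. Whether concurrent resubmissions are the actual cause is the commenter's guess, not a confirmed diagnosis.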
Yes, in the latest edge version this error is reported as a warning. It is still not clear what causes the thread pool exception. Closing in favour of #3772
Bug report
Expected behavior and actual behavior
I am using a dynamic retry to increase resource availability for my process. The process completes as expected when run in parallel with samples that do not need to restart with more resources. When I submit a single sample that needs to retry, the retry works as expected: the task fails with exit code 137, then a new job is resubmitted with the larger resource request. When I do this with multiple samples that trigger this behavior, Nextflow handles the first sample and says it's submitting the retry, but when the second sample fails, it errors and kills the entire workflow.
Steps to reproduce the problem
I don't have a great MRE, unfortunately. But I am happy to work with anyone to show them what I am experiencing.
these are some of my configurations:
Program output
I think this is the relevant part of the log file:
Environment
OpenJDK Runtime Environment Homebrew (build 18.0.1.1+0)
OpenJDK 64-Bit Server VM Homebrew (build 18.0.1.1+0, mixed mode, sharing)