Exit status 250 and SIGBUS (0x7) errors on GATK processes #1030
Comments
I pulled Sarek's git repo and tried GATK 4.4.0; the same issue is occurring for me. |
nf-sarek was mounting /tmp on the executing host into the container's /tmp. When multiple singularity containers are running on the same host it looks like multiple containers are trying to map the same area in the host's /tmp - similar to the issue reported for Bazel here. Modifying all java-options passed to GATK in nf-sarek to use -XX:-UsePerfData seems to fix the problem with no further issues. I don't think hsperfdata is ever being used in the pipeline, and this may also slightly improve performance. Tagging @pontus and @FriederikeHanssen in case there's something obvious I'm missing here in turning this off? |
I don't see any problem with turning it off, but I also consider the linked issue as not really similar (that seems to be running with docker, which by default creates a pid namespace (but by contrast doesn't bring the host So, my understanding is that the crash in that issue comes from a mapped file being truncated, and other processes that have it mapped then get sad. For the singularity case, the pid collisions that are almost guaranteed with docker will be very, very unusual. So, no objection to a PR to add that option by default, but my guess would be that it's not this change that lets your jobs pass (at least not because of the reason in the linked issue, and if memory serves, disabling this made no difference with the memory-related issues we spent a lot of time troubleshooting, but that was quite long ago now). |
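For anyone who wants to look at the perf-data files directly, here is a rough sketch of a check. It assumes HotSpot's usual Linux layout of /tmp/hsperfdata_<user>/<pid> and a mounted /proc; list_perfdata is a made-up helper name for illustration, not part of GATK, sarek or singularity.

```shell
# List JVM perf-data files under a temp root and report whether the PID each
# file is named after exists in the current PID namespace.
# Assumption: files live at <root>/hsperfdata_<user>/<pid> (HotSpot on Linux).
list_perfdata() {
    root=${1:-/tmp}
    for f in "$root"/hsperfdata_*/*; do
        [ -e "$f" ] || continue          # glob matched nothing
        pid=${f##*/}                     # file name is the PID of the JVM
        case $pid in
            ''|*[!0-9]*) continue ;;     # skip names that are not plain PIDs
        esac
        if [ -d "/proc/$pid" ]; then
            state=alive
        else
            state=no-such-pid
        fi
        printf '%s %s\n' "$f" "$state"
    done
}

list_perfdata /tmp
```

Run on a compute node while several GATK tasks are active. A file whose PID does not exist from where you are looking was either left behind by a finished JVM or written from another PID namespace, which is the situation where two JVMs could end up with the same file name.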
@pontus are you referring to the issue that once upon a time all intermediate files were written to I don't know enough about the JVM to judge if this option is good or not. Either way though, I would suggest not updating in sarek directly but in nf-core/modules, as all GATK modules are shared and it would benefit many other pipelines & developers. Can you open an issue/PR here: https://github.com/nf-core/modules and we can discuss with more people? :) |
Sorry if that was unclear, I was trying to communicate that I didn't see any problem with disabling perfdata collection, but also didn't think it likely that you are seeing the crash for the same reason as Bazel did in the linked issue (the collisions should be /very/ rare with singularity defaulting to shared pid space and docker defaulting to not share /tmp). I agree that if this should be brought in, it should be done in the modules repo. |
No problem! I'll open an issue at the modules repo once I verify that
running sarek with PerfData disabled is stable - I'll put an assortment of
samples through and see if I can make it fault.
|
After running into this issue on our cluster as well (on just about every run, from a test set to production data), implementing the fix suggested by @lfearnley indeed seems to resolve it. Everything has been running stably so far (knocks on wood). To confirm: adding ' -XX:-UsePerfData' to the --java-options in the GATK modules has fixed the SIGBUS GATK crashes for me. |
@ffmmulder Can you add a comment to the issue at nf-core/modules#3455? |
Thank you all for looking into this. I've had some success overcoming a sigbus error by changing the gatk processes' parameter to: --java_options "-Xmx${avail_mem}M -XX:-UsePerfData" |
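In case it helps anyone patching their own scripts, here is a minimal sketch of appending the flag to an existing JVM options string without adding it twice. add_no_perfdata is a hypothetical helper name, not something that exists in sarek or nf-core/modules, and the -Xmx4g value is only the heap size already used elsewhere in this thread.

```shell
# Append -XX:-UsePerfData to a JVM options string unless it is already present.
# add_no_perfdata is a hypothetical helper, shown only for illustration.
add_no_perfdata() {
    case " $1 " in
        *" -XX:-UsePerfData "*) printf '%s\n' "$1" ;;                 # already set
        *) printf '%s\n' "${1:+$1 }-XX:-UsePerfData" ;;               # append flag
    esac
}

java_opts=$(add_no_perfdata "-Xmx4g")
echo "$java_opts"
```

The resulting string is what would then be handed to GATK, i.e. gatk --java-options "$java_opts" followed by the tool name and its arguments.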
This is an ongoing issue which applies to other nextflow/nf-core pipelines when running Java. It's also impacting running nf-raredisease; I'll be updating nf-core/modules#3455. |
Hopefully fixed by #1240, closing. Probably best to collect in nf-core/modules#3455 if it didn't help as expected. |
Description of the bug
I'm encountering the error described in #1024.
Briefly, I'm running nf-sarek using standard parameters on an HPC using singularity. I encounter this error on GATK components intermittently - some steps succeed on resubmission.
I've had a look at the following during debugging:
gatk --java-options "-Djava.io.tmpdir=. -Xmx4g"
to set the tmp dir still results in the error. Any thoughts on how best to debug?
Command used and terminal output
Relevant files
No response
System information
nextflow version 23.04.1.5866
HPC
slurm
Singularity
CentOS
Sarek 3.1.2