Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nf core dev / variant calling bugs / ideas #489

Merged
merged 16 commits into from
Feb 24, 2022

Conversation

nickhsmith
Copy link
Contributor

PR checklist

The following changes were made after trying to run the variant_calling pipeline (specifically --tools haplotypecaller). I have readded the gatk4/genotypegvcfs module from nf-core (which was updated to include interval index files)

An overview of the changes file by file:

concatenateVCFs.sh : The default for bcftools sorting is on the /tmp directory (or /scratch) however it makes sense to me to maintain the temp.sorting files in the location of your work directory (.)

conf/base.config: GENOTYPEGVCFS is listed as a medium_priority tool, but for WGS it was constantly failing the first 1 or 2 attempts. Giving it the default high_priority mem/cpus

conf/modules.config: added the GENOTYPEGVCFS module to publish the final vcf file. Edited the HAPLOTYPECALLER module to not publish the gvcfs by default (could be set to true, but they are huge for a WGS sample) and to name them "{meta.id}.g" to differentiate them from the final vcf file.

lib/WorkflowMain.groovy: the nextflow pipeline requires an input file. But this is in direct contrast to the automatic restart based on the results/../.csv used in sarek.nf to determine the step and input files. Not elegant, but works to restart if appropriate step is named

subworkflows/local/germline_variant_calling.nf: include GENOTYPEGVCFS and account for both with intervals genotyping (work on the gvcfs after concat) or no intervals (work on the gvfs without concat) and create a new channel for the GENOTYPEGVCFS input. Should eventually be linked with logic for joint variant calling. The current behaviour is ALWAYS do single sample variant calling, and if joint variant calling is called, do that also. (not ideal)

@maxulysse
Copy link
Member

concatenateVCFs.sh : The default for bcftools sorting is on the /tmp directory (or /scratch) however it makes sense to me to maintain the temp.sorting files in the location of your work directory (.)

I'd rather keep the default for bcftools or change the default.

conf/base.config: GENOTYPEGVCFS is listed as a medium_priority tool, but for WGS it was constantly failing the first 1 or 2 attempts. Giving it the default high_priority mem/cpus

Why not changing the modules to a high_priority tool?

conf/modules.config: added the GENOTYPEGVCFS module to publish the final vcf file. Edited the HAPLOTYPECALLER module to not publish the gvcfs by default (could be set to true, but they are huge for a WGS sample) and to name them "{meta.id}.g" to differentiate them from the final vcf file.

Sounds like a good idea

lib/WorkflowMain.groovy: the nextflow pipeline requires an input file. But this is in direct contrast to the automatic restart based on the results/../.csv used in sarek.nf to determine the step and input files. Not elegant, but works to restart if appropriate step is named

Not seeing any other solution there for now

subworkflows/local/germline_variant_calling.nf: include GENOTYPEGVCFS and account for both with intervals genotyping (work on the gvcfs after concat) or no intervals (work on the gvfs without concat) and create a new channel for the GENOTYPEGVCFS input. Should eventually be linked with logic for joint variant calling. The current behaviour is ALWAYS do single sample variant calling, and if joint variant calling is called, do that also. (not ideal)

I see what you mean there

@FriederikeHanssen
Copy link
Contributor

concatenateVCFs.sh : The default for bcftools sorting is on the /tmp directory (or /scratch) however it makes sense to me to maintain the temp.sorting files in the location of your work directory (.)

I don't understand the advantages of this. can you elaborate?

conf/base.config: GENOTYPEGVCFS is listed as a medium_priority tool, but for WGS it was constantly failing the first 1 or 2 attempts. Giving it the default high_priority mem/cpus

I agree that this should be changed upstream

lib/WorkflowMain.groovy: the nextflow pipeline requires an input file. But this is in direct contrast to the automatic restart based on the results/../.csv used in sarek.nf to determine the step and input files. Not elegant, but works to restart if appropriate step is named

not sure about a solution, but also haven't given it any thought. Honestly, not very happy about changing the temaplte code. Can we maybe take this disucssion to #tools before making any changes here? Looks like a release there will happen soon anyways, so it would maybe be good to get it in there now

subworkflows/local/germline_variant_calling.nf: include GENOTYPEGVCFS and account for both with intervals genotyping (work on the gvcfs after concat) or no intervals (work on the gvfs without concat) and create a new channel for the GENOTYPEGVCFS input. Should eventually be linked with logic for joint variant calling. The current behaviour is ALWAYS do single sample variant calling, and if joint variant calling is called, do that also. (not ideal)

Joint variant calling subworkflow needs a bit of work still anyways. So fine with merging it here and then work on the changes in a separate PR.

@nickhsmith
Copy link
Contributor Author

the bcftools sort in the concatenateVCFs.sh file is used because the concatenation of all of the intervals (by chromosome) builds a vcf that is unsorted. bcftools by default uses the /tmp/ dir to create temporary /tmp/bcftools-sort.XXXXXX files that are then merged to create the sorted vcf file.

If a user is on a shared file system that doesn't have write access to /tmp/ or if this is not very large (for WGS these temp gvcf files can get quite big) then you get errors. By changing the default temp space to the local directory . which is actually in the work/hash1/hash2/ dir we make sure that the user has write access to this location, and that presumably the work_dir is large enough to handle all of the the intermediate files this negates the problems mentioned above (although at the cost of potential I/O depending on how /tmp/ and /PATH/TO/WORK/ storage is built)

@nickhsmith
Copy link
Contributor Author

I agree that we can change the upstream priority for nf-core/modules/genotypegvfs but that makes assumptions for other users that may not be ideal. I will make a PR for that though

@FriederikeHanssen
Copy link
Contributor

FriederikeHanssen commented Feb 22, 2022

ah now I see. Yes we had the same issue with GATK tools in the past and had to add --TMP-DIR or so to also allow to write to defined scratch space with the nextflow parameter scratch = true or scratch = /some/place/where/intermediate/file/are/stored . Did you try the process with setting the scratch directory as well to make sure that would work to? It would be preferabe to not enforce the work dir to be used for intermediate files, since it can slow down execution. I think your solution should work, but just be sure maybe a quick test would be nice. Scratch directory for any process can be set in the custom.config too

@FriederikeHanssen
Copy link
Contributor

I agree that we can change the upstream priority for nf-core/modules/genotypegvfs but that makes assumptions for other users that may not be ideal. I will make a PR for that though

oretty sure we are the only ones using it at the moment :D If the raredisease pipeline needs it they would run into the same problem with WGS data.

@maxulysse
Copy link
Member

I agree that we can change the upstream priority for nf-core/modules/genotypegvfs but that makes assumptions for other users that may not be ideal. I will make a PR for that though

oretty sure we are the only ones using it at the moment :D If the raredisease pipeline needs it they would run into the same problem with WGS data.

I think default should be more, so that it's easier to restrict when needed.
If default is less, then we really need to be sure to add more resources

@nickhsmith
Copy link
Contributor Author

nickhsmith commented Feb 22, 2022

ah now I see. Yes we had the same issue with GATK tools in the past and had to add --TMP-DIR or so to also allow to write to defined scratch space with the nextflow parameter scratch = true or scratch = /some/place/where/intermediate/file/are/stored . Did you try the process with setting the scratch directory as well to make sure that would work to? It would be preferabe to not enforce the work dir to be used for intermediate files, since it can slow down execution. I think your solution should work, but just be sure maybe a quick test would be nice. Scratch directory for any process can be set in the custom.config too

What about a params.scratchdir so that we can set any scratch directive with process.scratch = "${params.scratchdir}"

@FriederikeHanssen
Copy link
Contributor

this already exists to provide in a custom config/institutional config etc.. not sure, if process configurations should be command line parameters

@nickhsmith
Copy link
Contributor Author

this already exists to provide in a custom config/institutional config etc.. not sure, if process configurations should be command line parameters

Like (or something):

withName CONCAT_VCF{
    scratch = 'path/to/scratch'
}

@FriederikeHanssen
Copy link
Contributor

FriederikeHanssen commented Feb 22, 2022

yep it can also be set globally for all pipelines like here: https://github.com/nf-core/configs/blob/91b5875b44b01edfd2992e19adec0d1b552feef8/conf/cfc.config#L13 or just for your pipeline run for all processes:

process{
   scratch = false (writes to work/ then)
   
   or
   
   scratch = /some/directory/where/there/is/space
 }

Docu: https://www.nextflow.io/docs/latest/process.html#scratch

@FriederikeHanssen
Copy link
Contributor

FriederikeHanssen commented Feb 23, 2022

I am confused. What do you mean :D Lasse Folkersen is implementing the module right now, so I can add it to the subworkflow soon, but nothing going on in this PR as far as I see.... it's all still in nf-core/modules

@maxulysse
Copy link
Member

I am confused. What do you mean :D Lasse Folkersen is implementing the module right now, so I can add it to the subworkflow soon.

Sorry, replied in the wrong PR...

.gitignore Outdated Show resolved Hide resolved
lib/WorkflowMain.groovy Outdated Show resolved Hide resolved
Co-authored-by: FriederikeHanssen <Friederike.hanssen@qbic.uni-tuebingen.de>
conf/base.config Outdated Show resolved Hide resolved
@maxulysse maxulysse merged commit 9bc4815 into nf-core:dev Feb 24, 2022
@github-actions
Copy link

Markdown linting is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

  • Install markdownlint-cli
  • Fix the markdown errors
    • Automatically: markdownlint . --fix
    • Manually resolve anything left from markdownlint .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up markdownlint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants