Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concatenating germline vcfs #792

Merged
merged 36 commits into from
Dec 7, 2022
Merged

Conversation

asp8200
Copy link
Contributor

@asp8200 asp8200 commented Oct 11, 2022

Issue #738.

Adding option --concatenate_vcfs for concatenating the vcf-files from all the germline variant-callers, except cnvkit which doesn't produce a vcf-file.

I've set it up so that Sarek puts the concatenated .vcf.gz-file here:

results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz

The output-vcf-files are sorted and have index-files. Added an INFO-field named SOURCE in the output-vcf-fils giving the name of the (source) vcf-file from whence the variant came. (The filename also indicate which variant-caller was used.)

I updated nextflow_schema.json - perhaps docs/usage.md and docs/output.md should also be updated?

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs- [ ] If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions
Copy link

github-actions bot commented Oct 11, 2022

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit b32b4cf

+| ✅ 151 tests passed       |+
#| ❔   8 tests were ignored |#
!| ❗   2 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline
  • schema_description - No description provided in schema for parameter: cnvkit_reference

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.6
  • Run at 2022-12-06 19:06:23

@maxulysse
Copy link
Member

So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf <patient>.germline.vcf.gz in the results-folder results/variant_calling/concat/<patient>.

I think it's best to start small, and just do germline snps/indels for now.

@maxime doesn't want the concatenation to be optional.

I do think it's better to have that optional, people can have different usage downstream.

I've set it up so that Sarek puts the concatenated .vcf.gz-file here:

results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz

Should there also be a .tbi-file for the vcf-file?

Yes, in my opinion, as long as we produce a vcf.gz, we should have it tabix indexed.
Can we create a results/variant_calling/concat/<patient>/<patient>.germline.txt to list all vcf that were concatenated to produce this file, or do we have that info in the final vcf?

@asp8200
Copy link
Contributor Author

asp8200 commented Nov 11, 2022

So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf <patient>.germline.vcf.gz in the results-folder results/variant_calling/concat/<patient>.

I think it's best to start small, and just do germline snps/indels for now.

@maxime doesn't want the concatenation to be optional.

I do think it's better to have that optional, people can have different usage downstream.

I've set it up so that Sarek puts the concatenated .vcf.gz-file here:

results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz

Should there also be a .tbi-file for the vcf-file?

Yes, in my opinion, as long as we produce a vcf.gz, we should have it tabix indexed. Can we create a results/variant_calling/concat/<patient>/<patient>.germline.txt to list all vcf that were concatenated to produce this file, or do we have that info in the final vcf?

Thanks for the feedback, @maxulysse. Much appreciated. I'll make the concatenation optional somehow :-)

Concerning your idea about the text-file - the vcf-file produced by bcftools concat already contains information about which vcf-files where concatenated:

##bcftools_concatCommand=concat --output test1.germline.vcf.gz --threads 1 test1.strelka.variants.vcf.gz test1.manta.diploid_sv.vcf.gz test1.haplotypecaller.filtered.vcf.gz; Date=Thu Nov 10 21:40:33 2022

I'd say that makes the text-file redundant, right?

@maxulysse
Copy link
Member

I'd say that makes the text-file redundant, right?
yes, that's enough for me indeed

@asp8200
Copy link
Contributor Author

asp8200 commented Nov 29, 2022

@FriederikeHanssen @maxulysse : Can I get you guys to do a preliminary review of this PR?

If this PR looks okay, then I'll update the corresponding modules in github.com/nf-core/modules.

I've tested this PR with the following cmd:

nextflow run main.nf -profile test,singularity  --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools cnvkit,deepvariant,freebayes,haplotypecaller,manta,mpileup,strelka,tiddit

and it gives me a concatenated germline-vcf-file which was made by this bcftools concat - command:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.bcftools.vcf.gz testN.tiddit.vcf.gz testN.deepvariant.vcf.gz testN.freebayes.vcf.gz testN.manta.diploid_sv.vcf.gz testN.strelka.variants.vcf.gz testN.haplotypecaller.filtered.vcf.gz; Date=Tue Nov 29 10:47:08 2022

(N.B. The cnvkit doesn't produce a vcf-file, so no variants from cnvkit in the concatenated vcf-file.)

In fact, two concatenated vcf-files were produced, since the input-samplesheet contains to bam-files:

results/variant_calling/concat/testN/testN.germline.vcf.gz
results/variant_calling/concat/testT/testT.germline.vcf.gz

testN.germline.vcf.gz

The vcf-files are sorted and have corresponding tbi-files.

Warning: This PR contains some real clumsy code:
https://github.com/asp8200/sarek/blob/f8edc0034b9f01e3644ae75d7eaf57449581659c/workflows/sarek.nf#L1048-L1060

Copy link
Contributor

@lassefolkersen lassefolkersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the overall looks good. I had one comment on sorting --- is that something that has been discussed? I mean post-concatenate sorting?

Also -- I think interpretation wise it does make sense to talk about 'an intersection' of results with long variants, cnvs, e.g. as when including Manta etc. Few long variant callers will agree on exactly where the break ends are of a variant, so it's not really going to be an intersection more like an union with slightly different border repeats of the same CNV.

modules/nf-core/bcftools/concat/main.nf Show resolved Hide resolved
@FriederikeHanssen
Copy link
Contributor

Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to sarek.nf seems to be redundant, as I'll have to compute new index files after adding the INFO-field SET SOURCE to the vcf-files.

🙈 oh no

@asp8200
Copy link
Contributor Author

asp8200 commented Dec 1, 2022

Ok, so I introduced a local module for adding the INFO-field SOURCE=<name-of-input-vcf-file>. Here is the concatenated vcf-file

testN.germline.vcf.gz

With the CLI-options --concatenate_vcfs germline-vcf-files from the following variant-callers will be concatenated:

deepvariant
freebayes
haplotypecaller
manta
mpileup
strelka
tiddit

In the attached concatenated vcf-files, there are no variant from manta or tiddit.

What do you guys think about this solution? I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf? 🤔

@asp8200
Copy link
Contributor Author

asp8200 commented Dec 6, 2022

I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf?

The fastest and easiest solution would be just to get rid of the (new) code which is passing the index-files back to sarek.nf, since then I don't have to update anything in nf-core/modules 😁

@asp8200 asp8200 marked this pull request as ready for review December 6, 2022 20:37
@asp8200 asp8200 changed the title DRAFT: Concatenating vcfs Concatenating germline vcfs Dec 6, 2022
@asp8200
Copy link
Contributor Author

asp8200 commented Dec 6, 2022

I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf?

The fastest and easiest solution would be just to get rid of the (new) code which is passing the index-files back to sarek.nf, since then I don't have to update anything in nf-core/modules 😁

@maxulysse asked me to get rid of the redundant code, and so I did.

I now - finally - have all CI-tests passing. Let's merge this thing!

@FriederikeHanssen
Copy link
Contributor

uh nice 🥳 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants