Would it be possible to add decoys to GRCh37 reference? #2489
Comments
Sergey; If you absolutely need to use build 37 with decoys, your best approach would be to add it as a custom genome to your install: https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes Thanks again for this discussion.
Hi Brad, thanks for the explanation. Unfortunately, moving to hg38 is not an option for now for the particular project I have. My indirect and rough (maximum) estimate is that using decoys reduces the FDR for SNPs in WES by 0.4 percentage points, from 0.6% to 0.2%.
Sergey
Sergey; https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#alignment This will extract only the standard contigs and avoid any potential issues with GATK or Picard derived tools being unhappy with the reference contig differences. Thanks for coming up with a creative solution and hope this works for you.
Thanks, Brad! I've created a custom reference with decoy and tested it.
it is 737M. 1% of reads are gone, and this is reproducible with other samples. Can you think of any steps where they could be filtered out? When I do the alignment (bam>fastq>bam) with grch37 rather than grch37d5, the number of reads remains the same. Read counting procedure: Thanks!
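For anyone trying to reproduce this kind of comparison, here is a rough sketch of one way to count reads at each stage. It is an illustration only, not the procedure used above: the function names are hypothetical, and it assumes uncompressed 4-line FASTQ records and SAM text (for example from `samtools view -h`). It counts only primary records, since secondary (0x100) and supplementary (0x800) alignments would otherwise inflate the BAM-side number.

```python
from typing import Iterable


def count_fastq_records(lines: Iterable[str]) -> int:
    """Count FASTQ records, assuming exactly 4 lines per record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4:
        raise ValueError("truncated FASTQ: line count is not a multiple of 4")
    return n_lines // 4


def count_primary_sam_reads(lines: Iterable[str]) -> int:
    """Count SAM records that are neither secondary nor supplementary."""
    count = 0
    for line in lines:
        if line.startswith("@"):  # header line, not an alignment record
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x900 == 0:  # 0x100 = secondary, 0x800 = supplementary
            count += 1
    return count
```

On a real BAM, `samtools view -c -F 0x900 sample.bam` gives the same primary-record count; if that number matches the FASTQ record count before alignment but not after a later step, the later step is where reads are being dropped.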
Sergey;
Hi Brad! I'm still testing where reads disappear when aligning to grch37d5. I also found another problem: when running variant calling and annotation against grch37 using a bam aligned to grch37d5 as input, I get errors related to the reference name (grch37d5), and neither bam_clean: remove_extracontigs nor bam_clean: picard solves it. With bam_clean: remove_extracontigs, the error occurs when running manta:
With bam_clean: picard, the error occurs in gatk PrintReads:
Do you have any suggestions on how to debug 2-step variant calling in grch37 with decoy? Thanks!
Sergey; Practically, the best approach at this point might be to have your decoy as a separate custom genome build rather than trying to remove the extra contigs as part of bcbio. Is that an option? If you want to remove these additional contigs beforehand, you could try using VariantBam (https://github.com/walaj/VariantBam) as it has mate linking:
It would be useful to know if excluding the problem reads like that and then feeding into bcbio resolves the issue. Sorry again about all the issues, and hope this helps.
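To make the mate-linking idea concrete, here is a minimal sketch on SAM text. It is not VariantBam's implementation or bcbio's `remove_extracontigs` logic. The function name is hypothetical, and the contig set assumes 1000 Genomes style GRCh37 naming (1-22, X, Y, MT). It drops a record when either the read itself (RNAME) or its mate (RNEXT) sits on a non-standard contig, so no surviving record points at a contig that was removed from the header.

```python
from typing import Iterable, Iterator, Set

# Assumption: GRCh37 standard contigs with 1000 Genomes naming; adjust as needed.
STANDARD_CONTIGS: Set[str] = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def drop_extra_contig_pairs(
    sam_lines: Iterable[str], keep: Set[str] = STANDARD_CONTIGS
) -> Iterator[str]:
    """Yield SAM lines, removing extra contigs and records that touch them."""
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if fields[0] == "@SQ":
                name = next(f[3:] for f in fields[1:] if f.startswith("SN:"))
                if name not in keep:
                    continue  # drop the header entry for a decoy/extra contig
            yield line
            continue
        rname, rnext = fields[2], fields[6]
        mate_contig = rname if rnext == "=" else rnext
        # "*" means unplaced; such records reference no contig and are kept.
        if rname != "*" and rname not in keep:
            continue
        if mate_contig != "*" and mate_contig not in keep:
            continue  # mate is on an extra contig: drop this record as well
        yield line
```

Because both records of a pair carry each other's contig, a single pass removes the pair together. The sketch does not inspect `SA` tags, so supplementary alignments that reference a decoy contig would need separate handling, and dropping reads this way changes total read counts.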
Hi Brad! I wanted to share some validation results, as they might be useful for those thinking of using bcbio in a clinical setting with GRCh37, gatk4, and WGS. Of course, they will need way more validations to make this happen. gatk4 validation in bcbio, WGS NA12878 with grch37
Some conclusions:
SV calling after alignment to decoy is not working for now: Manta breaks. Hopefully a small VariantBam adjustment will help: walaj/VariantBam#16 Sergey
Sergey; More generally from a clinical perspective, is there any hope of moving to 38, or is 37 a permanent fixture? I keep hoping for 38, which solves these issues and many more, to be a viable solution. Thank you again for all this work.
Hi Brad! In bcbio 1.1.2 VQSR is on by default (experiment 3), I turned it off with tools_off: vqsr (experiment 4).
Some conclusions:
If anybody wanted to repeat this:
Regarding the transition to grch38 in clinical variant calling: if there are clinical bioinformaticians in the community (I'm not one), please correct me. I don't think it will happen soon for many labs. Overhead costs will be huge: validation, documentation, transferring internal databases, teaching genome analysts, surviving the error-prone transition period with 2 genome references, etc. Benefits? Will it add much to the solve rate? Probably not. Will it cause some misinterpretation of variants just because people are used to thinking in terms of grch37? Probably yes. Prioritizing WGS (SV calling) and RNA-seq over grch38 promises more in terms of solve rate. If somebody publishes a variant frequency database of 100K WGS called in grch38, that would probably make people think about the transition. I will try to find out which variants are FP when aligning without decoy. SN
Hello!
Thanks for the great pipeline!
I found a closed issue about decoys in the reference here #1234
It is 2018, but many projects still use the grch37 reference, and switching them to hg38 is problematic.
The current reference sequence in bcbio is from the 1kg project and does not have decoys or EBV:
http://www.internationalgenome.org/category/grch37/
Would it be possible to add decoy sequences to the GRCh37 reference?
File sizes would not increase much: hg38 with decoy is 3112M, GRCh37 without decoy is 3007M.
Adding decoys would remove some FP rare variants, and increase trust in bcbio variant calls for people doing variant interpretation.
I could put in some effort to make it happen, if needed.
Thanks!
Sergey