Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Would it possible to add decoys to GRCh37 reference? #2489

Closed
naumenko-sa opened this issue Aug 17, 2018 · 10 comments
Closed

Would it possible to add decoys to GRCh37 reference? #2489

naumenko-sa opened this issue Aug 17, 2018 · 10 comments
Labels
enhancement validation to dig out valuable validation discussions

Comments

@naumenko-sa
Copy link
Contributor

naumenko-sa commented Aug 17, 2018

Hello!

Thanks for the great pipeline!
I found a closed issue about decoys in the reference here #1234

It is 2018, but still many projects are using grch37 reference and it is problematic to switch them to hg38.
The current reference sequence in bcbio is from 1kg project and does not have decoys and EBV:
http://www.internationalgenome.org/category/grch37/

Would it be possible to add decoy sequences to GRCh37 reference?
File sizes would not increase much: hg38 with decoy is 3112M, GRCh37 without decoy is 3007M.

Adding decoys would remove some FP rare variants, and increase trust in bcbio variant calls for people doing variant interpretation.

I could put some efforts to make it happen, if needed.

Thanks!
Sergey

@naumenko-sa naumenko-sa changed the title Could you please add decoys to GRCh37 reference? Would it possible to add decoys to GRCh37 reference? Aug 17, 2018
@chapmanb
Copy link
Member

Sergey;
Thanks for starting this thread. Is using build 38 a solution for your needs? We can't modify the existing hg19 and GRCh37 installs at this point as that would end up breaking runs using the previous approaches since the sequencing dictionaries would be incompatible. We've held off adding decoys in build 37 to avoid this churn and also because I've seen validations where they seem to help and also seem to hurt so the improvements are not unambiguously better. The alt approach in hg38 universally improves resolution of FPs in all our validations, so we've focused instead of supporting build 38 and making it equivalent to build 37 runs.

If you absolutely need to use build 37 with decoys your best approach would be to add as a custom genome to your install:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes

Thanks again for this discussion.

@naumenko-sa
Copy link
Contributor Author

naumenko-sa commented Aug 21, 2018

Hi Brad, thanks for the explanation.

Unfortunately, moving to hg38 is not an option for now for the particular project I have. My indirect and rough (max) estimate is that decoy allows to reduce FDR in SNPs for WES by 0.4%, from 0.6% to 0.2%.
The benefit is not that dramatic to put a lot of effort into decoy. Do you think that the following 2step approach will work?

  1. create custom reference grch37decoy with bwa index and nothing else.
    do alignment with custom genome
  2. do variant calling with regular grch37 reference using noalt_calling starting with bam produced with step1.

Sergey

@chapmanb
Copy link
Member

Sergey;
That approach should work, although I'd also suggest using bam_clean: remove_extracontigs when restarting with the decoy aligned BAM:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#alignment

This will extract only the standard contigs and avoid any potential issues with GATK or Picard derived tools being unhappy with the reference contig differences. Thanks for coming up with a creative solution and hope this works for you.

@naumenko-sa
Copy link
Contributor Author

Thanks, Brad!

I've created a custom reference with decoy, and tested it.
My input bam had 744M records, and after

details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    realign: true
    recalibrate: true
    save_diskspace: true
    tools_off:
    - gatk4
    variantcaller: false
  analysis: variant2
  description: 181_121141J
  files:
  - /hpf/largeprojects/ccm_dccforge/dccdipg/c4r_wgs/bcbio_buffer/181/input/181_121141J.bam
  genome_build: GRCh37d5
  metadata:
    batch: 181
fc_name: '181'
resources:
  default:
    cores: 7
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 7G
upload:
  dir: ../final

it is 737M. 1% of reads are gone. And it is reproducible with other samples.
1% of reads is not much, but I'd prefer to keep them all.

Could you imagine any steps where they could be filtered out? When I'm doing the alignment (bam>fastq>bam) with grch37 rather than with grch37d5, the amount of reads remains the same.

Read counting procedure:
samtools idxstats $1 | awk '{sum+=$3+$4}END{print sum}' > $1.read_count

Thanks!
Sergey

@chapmanb
Copy link
Member

Sergey;
Apologies about the issues, we certainly don't want to lose reads here. Do you have a sense of what reads are not present from the initial input file? Do you have the same issue if you turn off realignment (realign: false). We're not recommending this step and I wonder if it's doing something different with the decoy sequence. If that doesn't change the outputs, you could also try without recalibration to help determine if we can isolate the step that is dropping reads. Hope this helps with debugging and please let us know what you find.

@naumenko-sa
Copy link
Contributor Author

Hi Brad!

I'm still testing where reads disappear when aligning go grch37d5.

Also found another problem: when running variant calling and annotation against grch37 using bam aligned to grch37d5 as input, I'm getting errors related to the reference name (grch37d5), neither bam_clean: remove_extracontigs, nor bam_clean: picard helps to solve it.

When using bam_clean: remove_extracontigs, error is when running manta:

[2018-10-19T15:14:06.169761Z] [r1a-25.cm.cluster] [204417_1] [WorkflowRunner] [ERROR] [2018-10-19T15:13:58.793153Z] [r1a-25.cm.cluster] [204417_1] [makeLocusGraph_chromId_007_8_0006] std::exception::what: Split alignment segment maps to an unknown chromosome: 'hs37d5'
[2018-10-19T15:14:06.169859Z] [r1a-25.cm.cluster] [204417_1] [WorkflowRunner] [ERROR] [2018-10-19T15:13:58.901687Z] [r1a-25.cm.cluster] [204417_1] [makeLocusGraph_chromId_017_18_0001] 	local_bam_record: E00386:198:HLNVCCCXY:5:2214:27559:64175/1 tid:pos:strand 17:11463289:+ cigar: 5S47M99S templSize: 0 sa: hs37d5,30281334,+,99S43M9S,9,1; mate_tid:pos:strand -1:35254438:-

When using bam_clean: picard, error with gatk PrintReads:

October 22, 2018 9:30:29 PM EDT] org.broadinstitute.hellbender.tools.PrintReads done. Elapsed time: 52.40 minutes.
Runtime.totalMemory()=506986496
htsjdk.samtools.SAMException: Exception when processing alignment for BAM index E00386:198:HLNVCCCXY:5:1101:10003:11013 1/2 151b aligned read.
	at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:140)
	at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:199)
	at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:36)
	at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:16)
	at htsjdk.samtools.util.AbstractAsyncWriter$WriterRunnable.run(AbstractAsyncWriter.java:123)
	at java.lang.Thread.run(Thread.java:748)
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record E00386:198:HLNVCCCXY:5:1101:10003:11013 1/2 151b aligned read.
	at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:119)
	at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:137)
	... 5 more
Caused by: htsjdk.samtools.SAMException: Unexpected reference -1 when constructing index for 83 for record E00386:198:HLNVCCCXY:5:1101:10003:11013 1/2 151b aligned read.
	at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:218)
	at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:117)
	... 6 more
' returned non-zero exit status 3

Do you have any suggestions how to debug 2-step variant calling in grch37 with decoy?

Thanks!
Sergey

@chapmanb
Copy link
Member

Sergey;
Sorry about the issues here with cleaning your decoy BAM file. It looks like the issue is that reads get pulled along from the decoy region hs37d5 despite our attempts to remove those with samtools view. These look like secondary and split reads that map to both a standard contig and the decoy. Apologies, it's quite difficult to get these removals correct and it appears as if manta is especially sensitive to detecting this and failing out. Apparently the variant callers that you used were okay since it got to this point, but it's getting all the downstream tool support correct that's the issue.

Practically, the best approach at this point might be to have your decoy as a separate custom genome build rather than trying to remove the extra contigs as part of bcbio. Is that an option?

If you want to remove these additional contigs before, you could try using VariantBam (https://github.com/walaj/VariantBam) as it has mate linking:

variant in.bam -L decoy_contigs.bed -o out.bam -b

It would be useful to know if excluding the problem reads like that and then feeding into bcbio resolves the issue.

Sorry again about all the issue and hope this helps.

@naumenko-sa
Copy link
Contributor Author

Hi Brad!

I wanted to share some validation results, as they might be useful for those thinking of using bcbio in a clinical setting with GRCh37, gatk4, and WGS. Of course, they will need way more validations to make this happen.

gatk4 validation in bcbio, WGS NA12878 with grch37

  • d1 = dataset1, 30X WGS of NA12878
  • d2 = dataset2, 30X WGS of NA12878
  • decoy - custom GRCh37 reference with decoy
  • RR - realignment and recalibration
  • HF - hard filters
  • VQSR - variant quality scores recalibration
Variation Metric d1,gatk4.0.7.0, HF, decoy, RR d1,gatk4.0.7.0, HF,no decoy,no RR d2, gatk4.0.1.2, VQSR,no decoy, no RR d2, gatk 4.0.7.0, HF, no decoy, no RR
SNPs FDR 0.10% 0.40% 0.73% 0.39%
SNPs FNR 0.12% 0.16% 0.08% 0.22%
Indels FDR 0.99% 1.15% 1.26% 1.58%
Indels FNR 1.29% 1.25% 5.85% 2.24%

Some conclusions:

  • adding decoy to grch37 as a custom reference in bcbio works for small variants: alignment step with grch37d5, removal of decoy reads with VariantBam, then variant calling and annotation.
  • decoy+RR improves precision for SNPvs and Indels, FNR for Indels is slightly reduced. The amount of improvement is important in clinical setting, might be ignored for research (where also switch to grch38 is easier)
  • RR adds 25% to running time, VQSR adds another 25%.
  • HF in bcbio improved indel precision/sensitivity balance compared to VQSR.
  • Recommendation for clinical setting: decoy, RR, HF
  • Recommendation for research setting: decoy, noRR, HF

SV calling after alignment to decoy is not working for now, Manta breaks, hopefully, a little VariantBam adjustment will help: walaj/VariantBam#16

Sergey

@chapmanb
Copy link
Member

chapmanb commented Jan 7, 2019

Sergey;
Thanks so much for following up on this with these numbers. This is really helpful for folks wanting to investigate decoy sequences on build 37. Do you have a sense for specific regions where the decoy improves calling? Since the comparison includes both decoy and recal/realignment, have you tried separating these to help determine the contribution of each component? It always starts to be a big time sink to include all these stratifications plus drilling into individual differences, but might be worthwhile if you're thinking about this switch.

More generally from a clinical perspective, is there any hope of moving to 38, or is 37 a permanent fixture? I keep hoping for 38, which solves these issues and many more, to be a viable solution.

Thank you again for all this work.

@naumenko-sa
Copy link
Contributor Author

naumenko-sa commented Jan 10, 2019

Hi Brad!

In bcbio 1.1.2 VQSR is on by default (experiment 3), I turned it off with tools_off: vqsr (experiment 4).

gatk 4.0.7.0 4.0.12.0
HF/VQSR HF HF VQSR HF
RR RR noRR noRR noRR
decoy decoy no decoy decoy decoy
experiment1234
SNPs
FDR0.10%0.40%0.31%0.10%
FNR0.12%0.16%0.06%0.12%
Indels
FDR0.99%1.15%0.62%0.95%
FNR1.29%1.25%5.82%1.19%

Some conclusions:

  • RR is useless, just waste of CPU time;
  • decoy improves precision;
  • I have tools_off: vqsr, would it be the better default for users who just get PASS variants?

If anybody wanted to repeat this:

  1. Step 1. Create custom reference with decoy: grch37d5
  2. Step 2. Align with decoy:
details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    realign: false
    recalibrate: false
    remove_lcr: false
    save_diskspace: true
    variantcaller: false
  analysis: variant2
  description: wgs_validation
  files:
  - /somewhere/project_1.fq.gz
  - /somewhere/project_2.fq.gz
  genome_build: GRCh37d5
  1. Step 3. Remove decoy reads
    https://github.com/naumenko-sa/cre/blob/master/cre.bam.remove_decoy_reads.sh

  2. Call variants and validate calls

details:
- algorithm:
    aligner: false
    effects: false
    mark_duplicates: false
    realign: false
    recalibrate: false
    remove_lcr: false
    save_diskspace: true
    tools_on:
    - noalt_calling
    tools_off:
    - vqsr
    variantcaller:
    - gatk-haplotype
    validate: giab-NA12878/truth_small_variants.vcf.gz
    validate_regions: giab-NA12878/truth_regions.bed
  analysis: variant2
  description: wgs_validation
  files:
  - /somewhere/NA12878.no_decoy_reads.bam
  genome_build: GRCh37
  metadata:
    batch: wgs
fc_name: wgs
resources:
  default:
    cores: 7
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 7G
upload:
  dir: ../final

Regarding the transition to grch38 in clinical variant calling. Please correct me if here are some clinical bioinformaticians in the community (I'm not). I don't think it will happen soon for many labs. Overhead costs will be huge: validation, documentation, transferring internal databases, teaching genome analysts, surviving the error-prone transition period with 2 genome references, etc. Benefits? Will it add much to solve rate? Probably, not. Will it cause some mis-interpretation of variants just because people are used to think along grch37? Probably, yes. Prioritizing WGS (SV calling) and RNA-seq over grch38 promises more in terms of solve rate. If somebody will publish a variant frequency database of 100K WGS called in grch38, that probably would make people think about the transition.

I will try to find, which variants are FP when aligning without decoy.

SN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement validation to dig out valuable validation discussions
Projects
None yet
Development

No branches or pull requests

3 participants