Would it possible to add decoys to GRCh37 reference? #2489

naumenko-sa · 2018-08-17T18:53:25Z

Hello!

Thanks for the great pipeline!
I found a closed issue about decoys in the reference here #1234

It is 2018, but still many projects are using grch37 reference and it is problematic to switch them to hg38.
The current reference sequence in bcbio is from 1kg project and does not have decoys and EBV:
http://www.internationalgenome.org/category/grch37/

Would it be possible to add decoy sequences to GRCh37 reference?
File sizes would not increase much: hg38 with decoy is 3112M, GRCh37 without decoy is 3007M.

Adding decoys would remove some FP rare variants, and increase trust in bcbio variant calls for people doing variant interpretation.

I could put some efforts to make it happen, if needed.

Thanks!
Sergey

chapmanb · 2018-08-21T09:51:54Z

Sergey;
Thanks for starting this thread. Is using build 38 a solution for your needs? We can't modify the existing hg19 and GRCh37 installs at this point as that would end up breaking runs using the previous approaches since the sequencing dictionaries would be incompatible. We've held off adding decoys in build 37 to avoid this churn and also because I've seen validations where they seem to help and also seem to hurt so the improvements are not unambiguously better. The alt approach in hg38 universally improves resolution of FPs in all our validations, so we've focused instead of supporting build 38 and making it equivalent to build 37 runs.

If you absolutely need to use build 37 with decoys your best approach would be to add as a custom genome to your install:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes

Thanks again for this discussion.

naumenko-sa · 2018-08-21T18:00:57Z

Hi Brad, thanks for the explanation.

Unfortunately, moving to hg38 is not an option for now for the particular project I have. My indirect and rough (max) estimate is that decoy allows to reduce FDR in SNPs for WES by 0.4%, from 0.6% to 0.2%.
The benefit is not that dramatic to put a lot of effort into decoy. Do you think that the following 2step approach will work?

create custom reference grch37decoy with bwa index and nothing else.
do alignment with custom genome
do variant calling with regular grch37 reference using noalt_calling starting with bam produced with step1.

Sergey

chapmanb · 2018-08-23T10:18:35Z

Sergey;
That approach should work, although I'd also suggest using bam_clean: remove_extracontigs when restarting with the decoy aligned BAM:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#alignment

This will extract only the standard contigs and avoid any potential issues with GATK or Picard derived tools being unhappy with the reference contig differences. Thanks for coming up with a creative solution and hope this works for you.

naumenko-sa · 2018-09-17T19:22:19Z

Thanks, Brad!

I've created a custom reference with decoy, and tested it.
My input bam had 744M records, and after

details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    realign: true
    recalibrate: true
    save_diskspace: true
    tools_off:
    - gatk4
    variantcaller: false
  analysis: variant2
  description: 181_121141J
  files:
  - /hpf/largeprojects/ccm_dccforge/dccdipg/c4r_wgs/bcbio_buffer/181/input/181_121141J.bam
  genome_build: GRCh37d5
  metadata:
    batch: 181
fc_name: '181'
resources:
  default:
    cores: 7
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 7G
upload:
  dir: ../final

it is 737M. 1% of reads are gone. And it is reproducible with other samples.
1% of reads is not much, but I'd prefer to keep them all.

Could you imagine any steps where they could be filtered out? When I'm doing the alignment (bam>fastq>bam) with grch37 rather than with grch37d5, the amount of reads remains the same.

Read counting procedure:
samtools idxstats $1 | awk '{sum+=$3+$4}END{print sum}' > $1.read_count

Thanks!
Sergey

chapmanb · 2018-09-18T14:32:31Z

Sergey;
Apologies about the issues, we certainly don't want to lose reads here. Do you have a sense of what reads are not present from the initial input file? Do you have the same issue if you turn off realignment (realign: false). We're not recommending this step and I wonder if it's doing something different with the decoy sequence. If that doesn't change the outputs, you could also try without recalibration to help determine if we can isolate the step that is dropping reads. Hope this helps with debugging and please let us know what you find.

naumenko-sa · 2018-10-23T14:14:51Z

Hi Brad!

I'm still testing where reads disappear when aligning go grch37d5.

Also found another problem: when running variant calling and annotation against grch37 using bam aligned to grch37d5 as input, I'm getting errors related to the reference name (grch37d5), neither bam_clean: remove_extracontigs, nor bam_clean: picard helps to solve it.

When using bam_clean: remove_extracontigs, error is when running manta:

[2018-10-19T15:14:06.169761Z] [r1a-25.cm.cluster] [204417_1] [WorkflowRunner] [ERROR] [2018-10-19T15:13:58.793153Z] [r1a-25.cm.cluster] [204417_1] [makeLocusGraph_chromId_007_8_0006] std::exception::what: Split alignment segment maps to an unknown chromosome: 'hs37d5'
[2018-10-19T15:14:06.169859Z] [r1a-25.cm.cluster] [204417_1] [WorkflowRunner] [ERROR] [2018-10-19T15:13:58.901687Z] [r1a-25.cm.cluster] [204417_1] [makeLocusGraph_chromId_017_18_0001] 	local_bam_record: E00386:198:HLNVCCCXY:5:2214:27559:64175/1 tid:pos:strand 17:11463289:+ cigar: 5S47M99S templSize: 0 sa: hs37d5,30281334,+,99S43M9S,9,1; mate_tid:pos:strand -1:35254438:-

When using bam_clean: picard, error with gatk PrintReads:

October 22, 2018 9:30:29 PM EDT] org.broadinstitute.hellbender.tools.PrintReads done. Elapsed time: 52.40 minutes.
Runtime.totalMemory()=506986496
htsjdk.samtools.SAMException: Exception when processing alignment for BAM index E00386:198:HLNVCCCXY:5:1101:10003:11013 1/2 151b aligned read.
	at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:140)
	at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:199)
	at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:36)
	at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:16)
	at htsjdk.samtools.util.AbstractAsyncWriter$WriterRunnable.run(AbstractAsyncWriter.java:123)
	at java.lang.Thread.run(Thread.java:748)
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record E00386:198:HLNVCCCXY:5:1101:10003:11013 1/2 151b aligned read.
	at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:119)
	at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:137)
	... 5 more
Caused by: htsjdk.samtools.SAMException: Unexpected reference -1 when constructing index for 83 for record E00386:198:HLNVCCCXY:5:1101:10003:11013 1/2 151b aligned read.
	at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:218)
	at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:117)
	... 6 more
' returned non-zero exit status 3

Do you have any suggestions how to debug 2-step variant calling in grch37 with decoy?

Thanks!
Sergey

chapmanb · 2018-10-23T18:51:34Z

Sergey;
Sorry about the issues here with cleaning your decoy BAM file. It looks like the issue is that reads get pulled along from the decoy region hs37d5 despite our attempts to remove those with samtools view. These look like secondary and split reads that map to both a standard contig and the decoy. Apologies, it's quite difficult to get these removals correct and it appears as if manta is especially sensitive to detecting this and failing out. Apparently the variant callers that you used were okay since it got to this point, but it's getting all the downstream tool support correct that's the issue.

Practically, the best approach at this point might be to have your decoy as a separate custom genome build rather than trying to remove the extra contigs as part of bcbio. Is that an option?

If you want to remove these additional contigs before, you could try using VariantBam (https://github.com/walaj/VariantBam) as it has mate linking:

variant in.bam -L decoy_contigs.bed -o out.bam -b

It would be useful to know if excluding the problem reads like that and then feeding into bcbio resolves the issue.

Sorry again about all the issue and hope this helps.

naumenko-sa · 2019-01-04T21:25:33Z

Hi Brad!

I wanted to share some validation results, as they might be useful for those thinking of using bcbio in a clinical setting with GRCh37, gatk4, and WGS. Of course, they will need way more validations to make this happen.

gatk4 validation in bcbio, WGS NA12878 with grch37

d1 = dataset1, 30X WGS of NA12878
d2 = dataset2, 30X WGS of NA12878
decoy - custom GRCh37 reference with decoy
RR - realignment and recalibration
HF - hard filters
VQSR - variant quality scores recalibration

Variation	Metric	d1,gatk4.0.7.0, HF, decoy, RR	d1,gatk4.0.7.0, HF,no decoy,no RR	d2, gatk4.0.1.2, VQSR,no decoy, no RR	d2, gatk 4.0.7.0, HF, no decoy, no RR
SNPs	FDR	0.10%	0.40%	0.73%	0.39%
SNPs	FNR	0.12%	0.16%	0.08%	0.22%
Indels	FDR	0.99%	1.15%	1.26%	1.58%
Indels	FNR	1.29%	1.25%	5.85%	2.24%

Some conclusions:

adding decoy to grch37 as a custom reference in bcbio works for small variants: alignment step with grch37d5, removal of decoy reads with VariantBam, then variant calling and annotation.
decoy+RR improves precision for SNPvs and Indels, FNR for Indels is slightly reduced. The amount of improvement is important in clinical setting, might be ignored for research (where also switch to grch38 is easier)
RR adds 25% to running time, VQSR adds another 25%.
HF in bcbio improved indel precision/sensitivity balance compared to VQSR.
Recommendation for clinical setting: decoy, RR, HF
Recommendation for research setting: decoy, noRR, HF

SV calling after alignment to decoy is not working for now, Manta breaks, hopefully, a little VariantBam adjustment will help: walaj/VariantBam#16

Sergey

chapmanb · 2019-01-07T02:19:25Z

Sergey;
Thanks so much for following up on this with these numbers. This is really helpful for folks wanting to investigate decoy sequences on build 37. Do you have a sense for specific regions where the decoy improves calling? Since the comparison includes both decoy and recal/realignment, have you tried separating these to help determine the contribution of each component? It always starts to be a big time sink to include all these stratifications plus drilling into individual differences, but might be worthwhile if you're thinking about this switch.

More generally from a clinical perspective, is there any hope of moving to 38, or is 37 a permanent fixture? I keep hoping for 38, which solves these issues and many more, to be a viable solution.

Thank you again for all this work.

naumenko-sa · 2019-01-10T06:10:21Z

Hi Brad!

In bcbio 1.1.2 VQSR is on by default (experiment 3), I turned it off with tools_off: vqsr (experiment 4).

gatk	4.0.7.0		4.0.12.0
HF/VQSR	HF	HF	VQSR	HF
RR	RR	noRR	noRR	noRR
decoy	decoy	no decoy	decoy	decoy
experiment	1	2	3	4
SNPs
FDR	0.10%	0.40%	0.31%	0.10%
FNR	0.12%	0.16%	0.06%	0.12%
Indels
FDR	0.99%	1.15%	0.62%	0.95%
FNR	1.29%	1.25%	5.82%	1.19%

Some conclusions:

RR is useless, just waste of CPU time;
decoy improves precision;
I have tools_off: vqsr, would it be the better default for users who just get PASS variants?

If anybody wanted to repeat this:

Step 1. Create custom reference with decoy: grch37d5
Step 2. Align with decoy:

details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    realign: false
    recalibrate: false
    remove_lcr: false
    save_diskspace: true
    variantcaller: false
  analysis: variant2
  description: wgs_validation
  files:
  - /somewhere/project_1.fq.gz
  - /somewhere/project_2.fq.gz
  genome_build: GRCh37d5

Step 3. Remove decoy reads
https://github.com/naumenko-sa/cre/blob/master/cre.bam.remove_decoy_reads.sh
Call variants and validate calls

details:
- algorithm:
    aligner: false
    effects: false
    mark_duplicates: false
    realign: false
    recalibrate: false
    remove_lcr: false
    save_diskspace: true
    tools_on:
    - noalt_calling
    tools_off:
    - vqsr
    variantcaller:
    - gatk-haplotype
    validate: giab-NA12878/truth_small_variants.vcf.gz
    validate_regions: giab-NA12878/truth_regions.bed
  analysis: variant2
  description: wgs_validation
  files:
  - /somewhere/NA12878.no_decoy_reads.bam
  genome_build: GRCh37
  metadata:
    batch: wgs
fc_name: wgs
resources:
  default:
    cores: 7
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 7G
upload:
  dir: ../final

Regarding the transition to grch38 in clinical variant calling. Please correct me if here are some clinical bioinformaticians in the community (I'm not). I don't think it will happen soon for many labs. Overhead costs will be huge: validation, documentation, transferring internal databases, teaching genome analysts, surviving the error-prone transition period with 2 genome references, etc. Benefits? Will it add much to solve rate? Probably, not. Will it cause some mis-interpretation of variants just because people are used to think along grch37? Probably, yes. Prioritizing WGS (SV calling) and RNA-seq over grch38 promises more in terms of solve rate. If somebody will publish a variant frequency database of 100K WGS called in grch38, that probably would make people think about the transition.

I will try to find, which variants are FP when aligning without decoy.

SN

naumenko-sa changed the title ~~Could you please add decoys to GRCh37 reference?~~ Would it possible to add decoys to GRCh37 reference? Aug 17, 2018

chapmanb mentioned this issue Jan 15, 2019

GATK4 performance at higher depth/longer reads / Low freq allele filter #2623

Closed

roryk added the enhancement label Aug 10, 2019

naumenko-sa added the validation to dig out valuable validation discussions label Mar 9, 2020

naumenko-sa mentioned this issue Mar 9, 2020

Skipping vcf2db.py? #3091

Closed

naumenko-sa mentioned this issue May 29, 2020

bcbio priorities #3242

Open

90 tasks

naumenko-sa closed this as completed May 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Would it possible to add decoys to GRCh37 reference? #2489

Would it possible to add decoys to GRCh37 reference? #2489

naumenko-sa commented Aug 17, 2018 •

edited

Loading

chapmanb commented Aug 21, 2018

naumenko-sa commented Aug 21, 2018 •

edited

Loading

chapmanb commented Aug 23, 2018

naumenko-sa commented Sep 17, 2018

chapmanb commented Sep 18, 2018

naumenko-sa commented Oct 23, 2018

chapmanb commented Oct 23, 2018

naumenko-sa commented Jan 4, 2019

chapmanb commented Jan 7, 2019

naumenko-sa commented Jan 10, 2019 •

edited

Loading

Would it possible to add decoys to GRCh37 reference? #2489

Would it possible to add decoys to GRCh37 reference? #2489

Comments

naumenko-sa commented Aug 17, 2018 • edited Loading

chapmanb commented Aug 21, 2018

naumenko-sa commented Aug 21, 2018 • edited Loading

chapmanb commented Aug 23, 2018

naumenko-sa commented Sep 17, 2018

chapmanb commented Sep 18, 2018

naumenko-sa commented Oct 23, 2018

chapmanb commented Oct 23, 2018

naumenko-sa commented Jan 4, 2019

chapmanb commented Jan 7, 2019

naumenko-sa commented Jan 10, 2019 • edited Loading

naumenko-sa commented Aug 17, 2018 •

edited

Loading

naumenko-sa commented Aug 21, 2018 •

edited

Loading

naumenko-sa commented Jan 10, 2019 •

edited

Loading