Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate variants in VCF outputs causing bcftools consensus error #53

Closed
DarianHole opened this issue Jul 21, 2020 · 7 comments
Closed

Comments

@DarianHole
Copy link

DarianHole commented Jul 21, 2020

Hi,

I've run into what seems to be a rare issue (first time I've seen it in 100's of samples) where there are duplicate variants in my *.pass.vcf.gz and *.merged.vcf file which is leading to bcftools consensus to generate an error with an output of:

The site MN908947.3:27967 overlaps with another variant, skipping...
The fasta sequence does not match the REF allele at MN908947.3:28250:
     .vcf: [T]
     .vcf: [TCTG] <- (ALT)
     .fa:  [G]TTCATCTAAACGAACAAACTAAAATGTCTGATAATGGACCCCAAAATCAGCGAAATGCA...

I also get the output from vcf_merge of:

Found primer binding site mismatch: {}
Found primer binding site mismatch: {}

Issue is similar to #21 from the looks of it although I checked and I am on the latest version so it must be something else.

For reference I'm currently using the following:

Sorry I don't know too much more, I'll keep looking at it in the meantime. Thanks for any help you can offer and sorry if I missed a solution somewhere,

Darian

@nickloman
Copy link

Thanks for the report!

So this can happen occasionally and it's not clear what can be easily done about it. Typically this will be if you have two competing edits in your VCF file. Can you check the pass and fail VCFs and look around 27967 and see what variants are reported?

@DarianHole
Copy link
Author

Thanks for the quick response!

Fails vcf has nothing but the one variant:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample
MN908947.3	24981	.	CT	C	564.5	PASS	TotalReads=990;SupportFraction=0.536596;SupportFractionByStrand=0.683247,0.391125;BaseCalledReadsWithVariant=304;BaseCalledFraction=0.304;AlleleCount=1;StrandSupport=156,303,337,194;StrandFisherTest=196;SOR=1.181;RefContext=TGATCCTTTGCA;Pool=nCoV-2019_1	GT	1

Pass vcf has a duplicate at that spot along with all variants after it. These duplicates are also seen in merged.vcf:

...
MN908947.3	25563	.	G	T	11645.8	PASS	TotalReads=990;SupportFraction=0.989544;SupportFractionByStrand=0.989015,0.990068;BaseCalledReadsWithVariant=883;BaseCalledFraction=0.883;AlleleCount=1;StrandSupport=5,5,488,492;StrandFisherTest=1;SOR=0.628354;RefContext=TTTCAGAGCGC;Pool=nCoV-2019_1	GT	1
MN908947.3	27967	.	G	T	10312.8	PASS	TotalReads=992;SupportFraction=0.963149;SupportFractionByStrand=0.978876,0.947486;BaseCalledReadsWithVariant=864;BaseCalledFraction=0.864;AlleleCount=1;StrandSupport=10,26,485,471;StrandFisherTest=20;SOR=0.213;RefContext=GTCATGTACTC;Pool=nCoV-2019_1	GT	1
MN908947.3	27967	.	G	T	10284.5	PASS	TotalReads=985;SupportFraction=0.966758;SupportFractionByStrand=0.967941,0.965573;BaseCalledReadsWithVariant=826;BaseCalledFraction=0.827655;AlleleCount=1;StrandSupport=16,17,477,475;StrandFisherTest=1;SOR=0.634787;RefContext=GTCATGTACTC;Pool=nCoV-2019_2	GT	1
MN908947.3	28250	.	T	TCTG	17682.8	PASS	TotalReads=991;SupportFraction=0.964904;SupportFractionByStrand=0.956657,0.973135;BaseCalledReadsWithVariant=804;BaseCalledFraction=0.804;AlleleCount=1;StrandSupport=21,13,474,483;StrandFisherTest=8;SOR=0.368682;RefContext=TTAGATTTCAT;Pool=nCoV-2019_1	GT	1
MN908947.3	28250	.	T	TCTG	17052.5	PASS	TotalReads=937;SupportFraction=0.96103;SupportFractionByStrand=0.960555,0.961519;BaseCalledReadsWithVariant=742;BaseCalledFraction=0.781053;AlleleCount=1;StrandSupport=19,18,456,444;StrandFisherTest=0;SOR=0.670323;RefContext=TTAGATTTCAT;Pool=nCoV-2019_2	GT	1
MN908947.3	28253	.	CA	C	17682.8	PASS	TotalReads=991;SupportFraction=0.984457;SupportFractionByStrand=0.973536,0.995355;BaseCalledReadsWithVariant=834;BaseCalledFraction=0.834;AlleleCount=1;StrandSupport=13,2,482,494;StrandFisherTest=24;SOR=0.0992646;RefContext=GATTTCATCTAA;Pool=nCoV-2019_1	GT	1
MN908947.3	28253	.	CA	C	17052.5	PASS	TotalReads=937;SupportFraction=0.985751;SupportFractionByStrand=0.980644,0.991002;BaseCalledReadsWithVariant=783;BaseCalledFraction=0.824211;AlleleCount=1;StrandSupport=9,4,466,458;StrandFisherTest=6;SOR=0.234992;RefContext=GATTTCATCTAA;Pool=nCoV-2019_2	GT	1
MN908947.3	28310	.	C	T	11299.3	PASS	TotalReads=990;SupportFraction=0.948394;SupportFractionByStrand=0.957024,0.939764;BaseCalledReadsWithVariant=749;BaseCalledFraction=0.749;AlleleCount=1;StrandSupport=21,30,474,465;StrandFisherTest=5;SOR=0.444288;RefContext=ATGCACCCCGC;Pool=nCoV-2019_1	GT	1
MN908947.3	28310	.	C	T	11198.6	PASS	TotalReads=925;SupportFraction=0.953539;SupportFractionByStrand=0.962119,0.944599;BaseCalledReadsWithVariant=696;BaseCalledFraction=0.741214;AlleleCount=1;StrandSupport=18,25,454,428;StrandFisherTest=7;SOR=0.500405;RefContext=ATGCACCCCGC;Pool=nCoV-2019_2	GT	1

@nickloman
Copy link

OK, this probably isn't the issue I was thinking about, but more relates to the duplicate indel at 28250. I'll check this is the case and it should be easy to fix. In the meantime if you remove one of those duplicate lines at 28250 in the VCF you might find that bcftools consensus can be run manually to generate a consensus genome.

@DarianHole
Copy link
Author

You are correct removing one of the spots at 28250 allowed me to generate a consensus genome! Thanks!

[W::hts_idx_load3] The index file is older than the data file: nml_barcode21.pass.vcf.gz.tbi
Note: the --sample option not given, applying all records regardless of the genotype
The site MN908947.3:27967 overlaps with another variant, skipping...
The site MN908947.3:28253 overlaps with another variant, skipping...
The site MN908947.3:28310 overlaps with another variant, skipping...
Applied 16 variants

@nickloman
Copy link

@will-rowe I think we need an enhancement which is to systematically remove duplicate variants present in both pools before calling bcftools consensus.

@DarianHole
Copy link
Author

I've been messing around a bit more with bcftools consensus and even if I do not remove any of the duplicates, I can still create a consensus file with the normal command:

bcftools consensus -f barcode21.preconsensus.fasta barcode21.pass.vcf.gz -m barcode21.coverage_mask.txt -o barcode21.consensus.fasta

So something before must be causing the issue leading up to this one

@KatSteinke
Copy link

Is there any chance it could be related to this issue? It got resolved in bcftools 1.11 from the release notes, but ARTIC is using a lower version...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants