mpiluep: wrong results with very high coverage #254

spabinger · 2016-10-06T13:59:14Z

Hi,

I just stumbled across an issue with mpileup:
sambamba (0.6.4) mpileup is generating a wrong mpileup file when used with a very high coverage BAM file.

Script to reproduce:
samtools mpileup -A -B -Q 0 -d 1000000000 mpileup_problem.bam > samtools.mpileup
sambamba mpileup mpileup_problem.bam --samtools -A -B -Q 0 -d 1000000000 > sambamba.mpileup

Testfile
https://dl.dropboxusercontent.com/u/14071835/mpileup_problem.bam

Relevant different (subset)

Sambamba output:
chr7 55249052 N 251897
chr7 55249053 N 254582
chr7 55249054 N 255802
chr7 55249055 N 257090
chr7 55249056 N 133237
chr7 55249057 N 134006
chr7 55249058 N 134696
chr7 55249059 N 134969
chr7 55249060 N 135233

Samtools output:
chr7 55249052 N 272056
chr7 55249053 N 274741
chr7 55249054 N 275961
chr7 55249055 N 277249
chr7 55249056 N 279120
chr7 55249057 N 279889
chr7 55249058 N 280579
chr7 55249059 N 280852
chr7 55249060 N 281116

Thanks in advance,
Stephan

The text was updated successfully, but these errors were encountered:

lomereiter · 2016-10-09T13:21:35Z

Hi Stephan,

Thanks for the reproducible test case.

The issue here is that default memory usage of a single chunk is too small to fit all 270k-280k reads at the same position, adding --buffer-size=200000000 resolves it. I don't know any easy fix for that, but at least I added a warning if such situation is encountered.

tadKeys · 2016-10-10T11:44:03Z

Hi,

thanks for the quick reply. Are there any negative effects when --buffer-size=200000000 is always used?

Thanks,
Stephan

lomereiter · 2016-10-10T12:33:12Z

Shouldn't be any except for higher memory usage (about 200MB * number of threads)

tadKeys · 2016-10-10T13:23:24Z

Sounds good. Thanks!

Best wishes,
Stephan

chambm · 2017-02-10T19:07:57Z

I increased buffer size to 128m after getting the buffer size warning, and the warning went away, but then the results I got (from sambamba v0.6.5) were drastically different from basic samtools (v1.3.1). Both output bcfs were 9.2GB, but the result of bcftools call --consensus-caller --output-type "v" --variants-only (v1.2) on the sambamba output was only 3.8MB vs. 50MB from the samtools output. Needless to say, there were a lot more variants in the 50MB VCF. I'm going to try rerunning with fewer threads and then again with a larger buffer.

On the other hand, the speedup vs. samtools is spectacular. Good job on that!

pjotrp · 2017-02-11T08:56:29Z

@chambm, you can use bio-vcf for viewing differences. One example exercise is here https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GATK_comparison.md

chambm · 2017-02-15T19:49:57Z

I tried 24 threads with a 128mb buffer and 30 threads with a 500mb buffer. Both results were about the same as my initial run (40 threads and 128mb). 9.2.gb output BCF, but only ~22,200 lines in the output from bcftools_call (vs. ~300,000 lines in the output from running bcftools_call on the vanilla mpileup BCF). With such a huge discrepancy in output, what do you want me to tell you about the specific differences in the VCFs?

lomereiter · 2017-02-15T20:31:35Z

@chambm please try to obtain a smaller test case by subsampling or extracting only the regions where you see the discrepancy. A good strategy to pin down the problem is bisection: split the file into two roughly equal ones by reference/coordinate, take the half for which the problem is more visible, and repeat this process.

chambm · 2017-02-16T22:34:56Z

Aha! The bulk of the difference seems to be that sambamba was using samtools 1.2 (in my PATH) and my vanilla mpileup job was using samtools 1.3. Wow, what a difference... In the "smaller" test case I put together (30GB BAM down to 170MB, 9GB mpileup down to 90MB), after bcftools call, I got 1890 SNPs with samtools 1.3 vs. 2420 SNPs with 1.2. Using samtools 1.2 for both sambaba and vanilla mpileup, the size difference was a mere 4 lines. Unfortunately the other lines were not completely overlapping, but it's less extreme than the difference between 1.3 and 1.2:

--- sambamba.txt
+++ mpileup.txt
@@ -1,7 +1,7 @@
 ##fileformat=VCFv4.2
 ##FILTER=<ID=PASS,Description="All filters passed">
 ##samtoolsVersion=1.2+htslib-1.2.1
-##samtoolsCommand=samtools mpileup -l /tmp/sambamba-pid17071-hady/1.bed -f /galaxy/data/hg38/sam_index/hg38.fa -C 0 -d 250 -q 20 -Q 20 -g -t DP /tmp/sambamba-pid17071-hady/1
+##samtoolsCommand=samtools mpileup -f /galaxy/data/hg38/sam_index/hg38.fa -C 0 -d 250 -q 20 -Q 20 --VCF --uncompressed --output-tags DP --output /opt/galaxy/galaxy-app/database/datasets/000/dataset_966.dat /opt/galaxy/galaxy-app/database/datasets/000/dataset_965.dat
 ##reference=file:///galaxy/data/hg38/sam_index/hg38.fa
 ##contig=<ID=chr1,length=248956422>
 ##contig=<ID=chr1_KI270706v1_random,length=175055>
@@ -483,7 +483,7 @@
 ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
 ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
 ##bcftools_callVersion=1.2+htslib-1.2.1
-##bcftools_callCommand=call --consensus-caller --output-type v --variants-only /opt/galaxy/galaxy-app/database/datasets/000/dataset_972.dat
+##bcftools_callCommand=call --consensus-caller --output-type v --variants-only /opt/galaxy/galaxy-app/database/datasets/000/dataset_966.dat
 #CHROM	POS	ID	REF	ALT
 chr1	1452091	.	A	G
 chr1	1452098	.	A	G
@@ -1312,6 +1312,7 @@
 chr1	97720521	.	T	C
 chr1	97740408	.	G	A
 chr1	97854852	.	C	T
 chr1	97883329	.	A	G
+chr1	99709114	.	C	T
 chr1	99709320	.	T	A
 chr1	99709686	.	A	G
 chr1	99709817	.	A	C
@@ -1396,6 +1397,8 @@
 chr1	108962477	.	G	A
 chr1	108962554	.	G	A
 chr1	108962891	.	T	A
 chr1	108963067	.	T	C
+chr1	108992271	.	A	G
+chr1	108992318	.	G	A
 chr1	109690516	.	G	C
 chr1	109734111	.	T	C
 chr1	109734781	.	T	C
@@ -1558,7 +1561,6 @@
 chr1	154608419	.	C	G
 chr1	154610494	.	A	G
 chr1	154615341	.	A	G
 chr1	154627896	.	A	G
-chr1	154962777	.	G	C
 chr1	154963791	.	C	T
 chr1	154965759	.	G	C
 chr1	154966186	.	T	C
@@ -1666,6 +1668,7 @@
 chr1	180188756	.	T	G
 chr1	180188807	.	T	C
 chr1	180188823	.	T	C
 chr1	180190519	.	C	T
+chr1	180190522	.	C	T
 chr1	180190593	.	C	G
 chr1	180191099	.	C	T
 chr1	180191349	.	C	G
@@ -1682,7 +1685,6 @@
 chr1	180195400	.	G	A
 chr1	180196261	.	G	T
 chr1	180197427	.	C	T
 chr1	180197429	.	C	T
-chr1	180197955	.	C	T
 chr1	185169695	.	C	T
 chr1	185170732	.	G	A
 chr1	185174589	.	A	G
@@ -1916,6 +1918,7 @@
 chr17	413596	.	T	C
 chr17	413626	.	C	T
 chr17	413831	.	A	G
 chr17	423507	.	G	A
+chr17	1464342	.	C	T
 chr17	1465784	.	G	A
 chr17	1465839	.	T	C
 chr17	1466044	.	T	C
@@ -1938,6 +1941,7 @@
 chr17	1476324	.	T	C
 chr17	1476337	.	T	C
 chr17	1478227	.	T	C
 chr17	1481283	.	C	A
+chr17	1483715	.	C	T
 chr17	1487205	.	T	C
 chr17	1489413	.	A	G
 chr17	1500383	.	G	A
@@ -2119,7 +2123,6 @@
 chr17	7941956	.	C	T
 chr17	7942131	.	G	A
 chr17	7943097	.	A	G
 chr17	7943366	.	G	A
-chr17	7943518	.	A	T
 chr17	7943795	.	A	G
 chr17	7943861	.	G	A
 chr17	7943886	.	A	C
@@ -2793,6 +2796,7 @@
 chr17	78221931	.	C	T
 chr17	78222597	.	A	G
 chr17	78222639	.	A	T
 chr17	78223510	.	G	A
+chr17	78223702	.	T	C
 chr17	78225221	.	C	A
 chr17	78225289	.	A	G
 chr17	78225347	.	T	C

pjotrp · 2017-02-23T17:45:34Z

Differences due to samtools. I think we can close this issue.

lomereiter pushed a commit that referenced this issue Oct 9, 2016

print a warning if buffer size is too small (#254)

a665886

pjotrp pushed a commit to pjotrp/sambamba that referenced this issue Dec 13, 2016

print a warning if buffer size is too small (biod#254)

28536ce

pjotrp closed this as completed Feb 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mpiluep: wrong results with very high coverage #254

mpiluep: wrong results with very high coverage #254

spabinger commented Oct 6, 2016

lomereiter commented Oct 9, 2016

tadKeys commented Oct 10, 2016

lomereiter commented Oct 10, 2016

tadKeys commented Oct 10, 2016

chambm commented Feb 10, 2017 •

edited

Loading

pjotrp commented Feb 11, 2017

chambm commented Feb 15, 2017

lomereiter commented Feb 15, 2017

chambm commented Feb 16, 2017

pjotrp commented Feb 23, 2017

mpiluep: wrong results with very high coverage #254

mpiluep: wrong results with very high coverage #254

Comments

spabinger commented Oct 6, 2016

lomereiter commented Oct 9, 2016

tadKeys commented Oct 10, 2016

lomereiter commented Oct 10, 2016

tadKeys commented Oct 10, 2016

chambm commented Feb 10, 2017 • edited Loading

pjotrp commented Feb 11, 2017

chambm commented Feb 15, 2017

lomereiter commented Feb 15, 2017

chambm commented Feb 16, 2017

pjotrp commented Feb 23, 2017

chambm commented Feb 10, 2017 •

edited

Loading