Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpiluep: wrong results with very high coverage #254

Closed
spabinger opened this issue Oct 6, 2016 · 10 comments
Closed

mpiluep: wrong results with very high coverage #254

spabinger opened this issue Oct 6, 2016 · 10 comments

Comments

@spabinger
Copy link

Hi,

I just stumbled across an issue with mpileup:
sambamba (0.6.4) mpileup is generating a wrong mpileup file when used with a very high coverage BAM file.

Script to reproduce:
samtools mpileup -A -B -Q 0 -d 1000000000 mpileup_problem.bam > samtools.mpileup
sambamba mpileup mpileup_problem.bam --samtools -A -B -Q 0 -d 1000000000 > sambamba.mpileup

Testfile
https://dl.dropboxusercontent.com/u/14071835/mpileup_problem.bam

Relevant different (subset)

Sambamba output:
chr7 55249052 N 251897
chr7 55249053 N 254582
chr7 55249054 N 255802
chr7 55249055 N 257090
chr7 55249056 N 133237
chr7 55249057 N 134006
chr7 55249058 N 134696
chr7 55249059 N 134969
chr7 55249060 N 135233

Samtools output:
chr7 55249052 N 272056
chr7 55249053 N 274741
chr7 55249054 N 275961
chr7 55249055 N 277249
chr7 55249056 N 279120
chr7 55249057 N 279889
chr7 55249058 N 280579
chr7 55249059 N 280852
chr7 55249060 N 281116

Thanks in advance,
Stephan

@lomereiter
Copy link
Contributor

Hi Stephan,

Thanks for the reproducible test case.

The issue here is that default memory usage of a single chunk is too small to fit all 270k-280k reads at the same position, adding --buffer-size=200000000 resolves it. I don't know any easy fix for that, but at least I added a warning if such situation is encountered.

@tadKeys
Copy link

tadKeys commented Oct 10, 2016

Hi,

thanks for the quick reply. Are there any negative effects when --buffer-size=200000000 is always used?

Thanks,
Stephan

@lomereiter
Copy link
Contributor

Shouldn't be any except for higher memory usage (about 200MB * number of threads)

@tadKeys
Copy link

tadKeys commented Oct 10, 2016

Sounds good. Thanks!

Best wishes,
Stephan

pjotrp pushed a commit to pjotrp/sambamba that referenced this issue Dec 13, 2016
@chambm
Copy link

chambm commented Feb 10, 2017

I increased buffer size to 128m after getting the buffer size warning, and the warning went away, but then the results I got (from sambamba v0.6.5) were drastically different from basic samtools (v1.3.1). Both output bcfs were 9.2GB, but the result of bcftools call --consensus-caller --output-type "v" --variants-only (v1.2) on the sambamba output was only 3.8MB vs. 50MB from the samtools output. Needless to say, there were a lot more variants in the 50MB VCF. I'm going to try rerunning with fewer threads and then again with a larger buffer.

On the other hand, the speedup vs. samtools is spectacular. Good job on that!

@pjotrp
Copy link
Member

pjotrp commented Feb 11, 2017

@chambm, you can use bio-vcf for viewing differences. One example exercise is here https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GATK_comparison.md

@chambm
Copy link

chambm commented Feb 15, 2017

I tried 24 threads with a 128mb buffer and 30 threads with a 500mb buffer. Both results were about the same as my initial run (40 threads and 128mb). 9.2.gb output BCF, but only ~22,200 lines in the output from bcftools_call (vs. ~300,000 lines in the output from running bcftools_call on the vanilla mpileup BCF). With such a huge discrepancy in output, what do you want me to tell you about the specific differences in the VCFs?

@lomereiter
Copy link
Contributor

@chambm please try to obtain a smaller test case by subsampling or extracting only the regions where you see the discrepancy. A good strategy to pin down the problem is bisection: split the file into two roughly equal ones by reference/coordinate, take the half for which the problem is more visible, and repeat this process.

@chambm
Copy link

chambm commented Feb 16, 2017

Aha! The bulk of the difference seems to be that sambamba was using samtools 1.2 (in my PATH) and my vanilla mpileup job was using samtools 1.3. Wow, what a difference... In the "smaller" test case I put together (30GB BAM down to 170MB, 9GB mpileup down to 90MB), after bcftools call, I got 1890 SNPs with samtools 1.3 vs. 2420 SNPs with 1.2. Using samtools 1.2 for both sambaba and vanilla mpileup, the size difference was a mere 4 lines. Unfortunately the other lines were not completely overlapping, but it's less extreme than the difference between 1.3 and 1.2:

--- sambamba.txt
+++ mpileup.txt
@@ -1,7 +1,7 @@
 ##fileformat=VCFv4.2
 ##FILTER=<ID=PASS,Description="All filters passed">
 ##samtoolsVersion=1.2+htslib-1.2.1
-##samtoolsCommand=samtools mpileup -l /tmp/sambamba-pid17071-hady/1.bed -f /galaxy/data/hg38/sam_index/hg38.fa -C 0 -d 250 -q 20 -Q 20 -g -t DP /tmp/sambamba-pid17071-hady/1
+##samtoolsCommand=samtools mpileup -f /galaxy/data/hg38/sam_index/hg38.fa -C 0 -d 250 -q 20 -Q 20 --VCF --uncompressed --output-tags DP --output /opt/galaxy/galaxy-app/database/datasets/000/dataset_966.dat /opt/galaxy/galaxy-app/database/datasets/000/dataset_965.dat
 ##reference=file:///galaxy/data/hg38/sam_index/hg38.fa
 ##contig=<ID=chr1,length=248956422>
 ##contig=<ID=chr1_KI270706v1_random,length=175055>
@@ -483,7 +483,7 @@
 ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
 ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
 ##bcftools_callVersion=1.2+htslib-1.2.1
-##bcftools_callCommand=call --consensus-caller --output-type v --variants-only /opt/galaxy/galaxy-app/database/datasets/000/dataset_972.dat
+##bcftools_callCommand=call --consensus-caller --output-type v --variants-only /opt/galaxy/galaxy-app/database/datasets/000/dataset_966.dat
 #CHROM	POS	ID	REF	ALT
 chr1	1452091	.	A	G
 chr1	1452098	.	A	G
@@ -1312,6 +1312,7 @@
 chr1	97720521	.	T	C
 chr1	97740408	.	G	A
 chr1	97854852	.	C	T
 chr1	97883329	.	A	G
+chr1	99709114	.	C	T
 chr1	99709320	.	T	A
 chr1	99709686	.	A	G
 chr1	99709817	.	A	C
@@ -1396,6 +1397,8 @@
 chr1	108962477	.	G	A
 chr1	108962554	.	G	A
 chr1	108962891	.	T	A
 chr1	108963067	.	T	C
+chr1	108992271	.	A	G
+chr1	108992318	.	G	A
 chr1	109690516	.	G	C
 chr1	109734111	.	T	C
 chr1	109734781	.	T	C
@@ -1558,7 +1561,6 @@
 chr1	154608419	.	C	G
 chr1	154610494	.	A	G
 chr1	154615341	.	A	G
 chr1	154627896	.	A	G
-chr1	154962777	.	G	C
 chr1	154963791	.	C	T
 chr1	154965759	.	G	C
 chr1	154966186	.	T	C
@@ -1666,6 +1668,7 @@
 chr1	180188756	.	T	G
 chr1	180188807	.	T	C
 chr1	180188823	.	T	C
 chr1	180190519	.	C	T
+chr1	180190522	.	C	T
 chr1	180190593	.	C	G
 chr1	180191099	.	C	T
 chr1	180191349	.	C	G
@@ -1682,7 +1685,6 @@
 chr1	180195400	.	G	A
 chr1	180196261	.	G	T
 chr1	180197427	.	C	T
 chr1	180197429	.	C	T
-chr1	180197955	.	C	T
 chr1	185169695	.	C	T
 chr1	185170732	.	G	A
 chr1	185174589	.	A	G
@@ -1916,6 +1918,7 @@
 chr17	413596	.	T	C
 chr17	413626	.	C	T
 chr17	413831	.	A	G
 chr17	423507	.	G	A
+chr17	1464342	.	C	T
 chr17	1465784	.	G	A
 chr17	1465839	.	T	C
 chr17	1466044	.	T	C
@@ -1938,6 +1941,7 @@
 chr17	1476324	.	T	C
 chr17	1476337	.	T	C
 chr17	1478227	.	T	C
 chr17	1481283	.	C	A
+chr17	1483715	.	C	T
 chr17	1487205	.	T	C
 chr17	1489413	.	A	G
 chr17	1500383	.	G	A
@@ -2119,7 +2123,6 @@
 chr17	7941956	.	C	T
 chr17	7942131	.	G	A
 chr17	7943097	.	A	G
 chr17	7943366	.	G	A
-chr17	7943518	.	A	T
 chr17	7943795	.	A	G
 chr17	7943861	.	G	A
 chr17	7943886	.	A	C
@@ -2793,6 +2796,7 @@
 chr17	78221931	.	C	T
 chr17	78222597	.	A	G
 chr17	78222639	.	A	T
 chr17	78223510	.	G	A
+chr17	78223702	.	T	C
 chr17	78225221	.	C	A
 chr17	78225289	.	A	G
 chr17	78225347	.	T	C

@pjotrp
Copy link
Member

pjotrp commented Feb 23, 2017

Differences due to samtools. I think we can close this issue.

@pjotrp pjotrp closed this as completed Feb 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants