-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mpiluep: wrong results with very high coverage #254
Comments
Hi Stephan, Thanks for the reproducible test case. The issue here is that default memory usage of a single chunk is too small to fit all 270k-280k reads at the same position, adding |
Hi, thanks for the quick reply. Are there any negative effects when Thanks, |
Shouldn't be any except for higher memory usage (about 200MB * number of threads) |
Sounds good. Thanks! Best wishes, |
I increased buffer size to 128m after getting the buffer size warning, and the warning went away, but then the results I got (from On the other hand, the speedup vs. samtools is spectacular. Good job on that! |
@chambm, you can use bio-vcf for viewing differences. One example exercise is here https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GATK_comparison.md |
I tried 24 threads with a 128mb buffer and 30 threads with a 500mb buffer. Both results were about the same as my initial run (40 threads and 128mb). 9.2.gb output BCF, but only ~22,200 lines in the output from bcftools_call (vs. ~300,000 lines in the output from running bcftools_call on the vanilla mpileup BCF). With such a huge discrepancy in output, what do you want me to tell you about the specific differences in the VCFs? |
@chambm please try to obtain a smaller test case by subsampling or extracting only the regions where you see the discrepancy. A good strategy to pin down the problem is bisection: split the file into two roughly equal ones by reference/coordinate, take the half for which the problem is more visible, and repeat this process. |
Aha! The bulk of the difference seems to be that --- sambamba.txt
+++ mpileup.txt
@@ -1,7 +1,7 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.2+htslib-1.2.1
-##samtoolsCommand=samtools mpileup -l /tmp/sambamba-pid17071-hady/1.bed -f /galaxy/data/hg38/sam_index/hg38.fa -C 0 -d 250 -q 20 -Q 20 -g -t DP /tmp/sambamba-pid17071-hady/1
+##samtoolsCommand=samtools mpileup -f /galaxy/data/hg38/sam_index/hg38.fa -C 0 -d 250 -q 20 -Q 20 --VCF --uncompressed --output-tags DP --output /opt/galaxy/galaxy-app/database/datasets/000/dataset_966.dat /opt/galaxy/galaxy-app/database/datasets/000/dataset_965.dat
##reference=file:///galaxy/data/hg38/sam_index/hg38.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr1_KI270706v1_random,length=175055>
@@ -483,7 +483,7 @@
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##bcftools_callVersion=1.2+htslib-1.2.1
-##bcftools_callCommand=call --consensus-caller --output-type v --variants-only /opt/galaxy/galaxy-app/database/datasets/000/dataset_972.dat
+##bcftools_callCommand=call --consensus-caller --output-type v --variants-only /opt/galaxy/galaxy-app/database/datasets/000/dataset_966.dat
#CHROM POS ID REF ALT
chr1 1452091 . A G
chr1 1452098 . A G
@@ -1312,6 +1312,7 @@
chr1 97720521 . T C
chr1 97740408 . G A
chr1 97854852 . C T
chr1 97883329 . A G
+chr1 99709114 . C T
chr1 99709320 . T A
chr1 99709686 . A G
chr1 99709817 . A C
@@ -1396,6 +1397,8 @@
chr1 108962477 . G A
chr1 108962554 . G A
chr1 108962891 . T A
chr1 108963067 . T C
+chr1 108992271 . A G
+chr1 108992318 . G A
chr1 109690516 . G C
chr1 109734111 . T C
chr1 109734781 . T C
@@ -1558,7 +1561,6 @@
chr1 154608419 . C G
chr1 154610494 . A G
chr1 154615341 . A G
chr1 154627896 . A G
-chr1 154962777 . G C
chr1 154963791 . C T
chr1 154965759 . G C
chr1 154966186 . T C
@@ -1666,6 +1668,7 @@
chr1 180188756 . T G
chr1 180188807 . T C
chr1 180188823 . T C
chr1 180190519 . C T
+chr1 180190522 . C T
chr1 180190593 . C G
chr1 180191099 . C T
chr1 180191349 . C G
@@ -1682,7 +1685,6 @@
chr1 180195400 . G A
chr1 180196261 . G T
chr1 180197427 . C T
chr1 180197429 . C T
-chr1 180197955 . C T
chr1 185169695 . C T
chr1 185170732 . G A
chr1 185174589 . A G
@@ -1916,6 +1918,7 @@
chr17 413596 . T C
chr17 413626 . C T
chr17 413831 . A G
chr17 423507 . G A
+chr17 1464342 . C T
chr17 1465784 . G A
chr17 1465839 . T C
chr17 1466044 . T C
@@ -1938,6 +1941,7 @@
chr17 1476324 . T C
chr17 1476337 . T C
chr17 1478227 . T C
chr17 1481283 . C A
+chr17 1483715 . C T
chr17 1487205 . T C
chr17 1489413 . A G
chr17 1500383 . G A
@@ -2119,7 +2123,6 @@
chr17 7941956 . C T
chr17 7942131 . G A
chr17 7943097 . A G
chr17 7943366 . G A
-chr17 7943518 . A T
chr17 7943795 . A G
chr17 7943861 . G A
chr17 7943886 . A C
@@ -2793,6 +2796,7 @@
chr17 78221931 . C T
chr17 78222597 . A G
chr17 78222639 . A T
chr17 78223510 . G A
+chr17 78223702 . T C
chr17 78225221 . C A
chr17 78225289 . A G
chr17 78225347 . T C |
Differences due to samtools. I think we can close this issue. |
Hi,
I just stumbled across an issue with mpileup:
sambamba (0.6.4) mpileup is generating a wrong mpileup file when used with a very high coverage BAM file.
Script to reproduce:
samtools mpileup -A -B -Q 0 -d 1000000000 mpileup_problem.bam > samtools.mpileup
sambamba mpileup mpileup_problem.bam --samtools -A -B -Q 0 -d 1000000000 > sambamba.mpileup
Testfile
https://dl.dropboxusercontent.com/u/14071835/mpileup_problem.bam
Relevant different (subset)
Sambamba output:
chr7 55249052 N 251897
chr7 55249053 N 254582
chr7 55249054 N 255802
chr7 55249055 N 257090
chr7 55249056 N 133237
chr7 55249057 N 134006
chr7 55249058 N 134696
chr7 55249059 N 134969
chr7 55249060 N 135233
Samtools output:
chr7 55249052 N 272056
chr7 55249053 N 274741
chr7 55249054 N 275961
chr7 55249055 N 277249
chr7 55249056 N 279120
chr7 55249057 N 279889
chr7 55249058 N 280579
chr7 55249059 N 280852
chr7 55249060 N 281116
Thanks in advance,
Stephan
The text was updated successfully, but these errors were encountered: