Benchmarks for Disq.
Running a count-reads job on a 136.78 GiB BAM file stored in Google Cloud Storage (GCS). The file contains 68,064,542 reads.
| Filesystem connector   | Library    | Time (s) |
|------------------------|------------|----------|
| GCS Connector          | Disq       | 144      |
| GCS Connector          | Hadoop-BAM | 278      |
| GCS NIO                | Disq       | 277      |
| GCS NIO                | spark-bam  | 273      |
| GCS NIO with pre-fetch | Disq       | 152      |
| HDFS                   | Disq       | 167      |
| HDFS                   | Hadoop-BAM | 173      |
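For context, each timing above is for a job that simply counts all the reads in the file. With Disq this is roughly the following (a minimal sketch using Disq's documented `HtsjdkReadsRddStorage` API; the driver class, its name, and the argument handling are illustrative rather than the actual benchmark code):

```java
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.disq_bio.disq.HtsjdkReadsRdd;
import org.disq_bio.disq.HtsjdkReadsRddStorage;

public class CountReads {
    public static void main(String[] args) throws Exception {
        String path = args[0]; // gs://, hdfs:// or local path to the BAM file

        SparkSession spark = SparkSession.builder().appName("count-reads").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Load the BAM as an RDD of SAMRecord and count the records.
        HtsjdkReadsRdd reads = HtsjdkReadsRddStorage.makeDefault(jsc).read(path);
        JavaRDD<SAMRecord> records = reads.getReads();
        System.out.println("read count: " + records.count());

        spark.stop();
    }
}
```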
Disq is faster than Hadoop-BAM when using the GCS Connector because it computes splits in parallel on the cluster, and because it caches blocks of data to allow efficient seeks, both forwards and backwards in the stream. On HDFS the difference is minimal, probably because HDFS itself does caching.
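To illustrate the first point, a parallel split computation can be sketched as below. This is conceptual only, not Disq's actual code; `findRecordStart` is a hypothetical placeholder for the record-boundary detection logic.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelSplits {
    /** Hypothetical placeholder: scan forward from 'start' for a valid BAM record boundary. */
    static long findRecordStart(String path, long start) {
        return start; // real logic would inspect BGZF blocks and validate candidate records
    }

    /** Compute record-aligned split starts, one Spark task per chunk of the file. */
    static List<Long> computeSplits(JavaSparkContext jsc, String path, long fileLength, long chunkSize) {
        List<Long> chunkStarts = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += chunkSize) {
            chunkStarts.add(offset);
        }
        // Each chunk's boundary search runs as its own task, so split
        // computation scales with the cluster rather than the driver.
        return jsc.parallelize(chunkStarts, chunkStarts.size())
                  .map(start -> findRecordStart(path, start))
                  .collect();
    }
}
```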
Disq is comparable to spark-bam when using the NIO filesystem connector for GCS, but is faster when pre-fetching is enabled. It may be possible to improve the time further by tuning the size of the pre-fetch buffer (4 MB in these benchmarks).
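To illustrate the pre-fetch idea (again a sketch under stated assumptions, not Disq's implementation), a background read-ahead wrapper over an `InputStream` could look like the following, where the block size plays the role of the 4 MB buffer mentioned above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative read-ahead stream: a background thread fetches fixed-size
 * blocks from the underlying stream while the caller consumes earlier ones.
 * This is a sketch of the general pre-fetch technique, not Disq's code.
 */
public class ReadAheadInputStream extends InputStream {
    private static final byte[] EOF = new byte[0]; // end-of-stream sentinel

    private final BlockingQueue<byte[]> blocks;
    private byte[] current = new byte[0];
    private int pos = 0;

    public ReadAheadInputStream(InputStream in, int blockSize, int queuedBlocks) {
        this.blocks = new ArrayBlockingQueue<>(queuedBlocks);
        Thread fetcher = new Thread(() -> {
            try (InputStream source = in) {
                while (true) {
                    byte[] buf = new byte[blockSize];
                    int n = source.read(buf, 0, blockSize);
                    if (n < 0) {
                        blocks.put(EOF); // signal end of stream to the reader
                        return;
                    }
                    byte[] block = new byte[n];
                    System.arraycopy(buf, 0, block, 0, n);
                    blocks.put(block);
                }
            } catch (IOException | InterruptedException e) {
                // A real implementation would surface this error to the reader;
                // here we discard pending blocks and mark the end of the stream.
                blocks.clear();
                blocks.offer(EOF);
            }
        }, "read-ahead");
        fetcher.setDaemon(true);
        fetcher.start();
    }

    @Override
    public int read() throws IOException {
        // A real implementation would also override read(byte[], int, int).
        while (pos >= current.length) {
            if (current == EOF) {
                return -1;
            }
            try {
                current = blocks.take();
                pos = 0;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while pre-fetching", e);
            }
        }
        return current[pos++] & 0xff;
    }
}
```

The bounded queue keeps memory use proportional to blockSize × queuedBlocks while hiding per-request latency to the object store.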
Hadoop-BAM is known to produce both false negatives and false positives when checking whether a virtual offset in a BAM file is a record start, whereas spark-bam does not produce any false readings. Running the `DisqCheckBam` program on the source data recorded no false negatives or false positives.
The `download.sh` script retrieves the source data and stores it in the cloud. The paths will need to be changed if you want to store the data in a different bucket.
The `run.sh` script assembles the benchmarking code and runs the benchmark commands. Some of the variables may need to be changed for your environment.
Timing results are written to `results/run.csv`, except for spark-bam, which writes its results to the console.