Longshot is a variant calling tool for diploid genomes using long error prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT). It takes as input an aligned BAM file and outputs a phased VCF file with variants and haplotype information. It can also genotype and phase input VCF files. It can output haplotype-separated BAM files that can be used for downstream analysis. Currently, it only calls single nucleotide variants (SNVs), but it can genotype indels if they are given in an input VCF.
If you use Longshot, please cite the publication:
Longshot has been tested using Ubuntu 16.04 and 18.04, CentOS 6.6, Manjaro Linux 17.1.11, and Mac OS 10.14.2 Mojave. It should work on any linux-based system that has Rust and Cargo installed.
- rust >= 1.40.0
- zlib >= 1.2.11
- xz >= 5.2.3
- clangdev >= 7.0.1
- gcc >= 7.3.0
- libc-dev
- make
- various rust dependencies (automatically managed by cargo)
(older versions may work but have not been tested)
It is recommended to install Longshot using Bioconda:
conda install longshot
This method supports Linux and Mac. If you do not have Bioconda, you can install it with these steps: First, install Miniconda (or Anaconda). Miniconda can be installed using the scripts here.
The Bioconda channel can then be added using these commands:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
If you are using Ubuntu 18.04, you can install the dependencies using apt. Then, the Rust cargo package manager is used to compile Longshot.
sudo apt-get install cargo zlib1g-dev xz-utils \
libclang-dev clang build-essential curl git # install dependencies
git clone https://github.com/pjedge/longshot # clone the Longshot repository
cd longshot # change directory
cargo install --path . # install Longshot
export PATH=$PATH:/home/$USER/.cargo/bin # add cargo binaries to path
Installation should take around 4 minutes on a typical desktop machine and will use between 400 MB (counting cargo) and 1.2 GB (counting all dependencies) of disk space.
It is recommended to add the line export PATH=$PATH:/home/$USER/.cargo/bin
to the end of your ~/.bashrc
file so that the longshot binary is in the PATH for future shell sessions.
After installation, execute the longshot binary as so:
$ longshot [FLAGS] [OPTIONS] --bam <BAM> --ref <FASTA> --out <VCF>
The directory example_data
contains a simulated toy dataset that can be used to test out Longshot:
- Reference genome containing 3 contigs each with length 200 kb (
example_data/genome.fa
) - 30x coverage simulated pacbio reads generated using SimLoRD (
example_data/pacbio_reads_30x.bam
) - The 714 "true" variants for validation (
example_data/ground_truth_variants.vcf
)
Run Longshot on the example data as so:
longshot --bam example_data/pacbio_reads_30x.bam --ref example_data/genome.fa --out example_data/longshot_output.vcf
Execution should take around 5 to 10 seconds on a typical desktop machine. The output can be compared to ground_truth_variants.vcf
for accuracy.
$ longshot --help
Longshot: variant caller (SNVs) for long-read sequencing data
USAGE:
longshot [FLAGS] [OPTIONS] --bam <BAM> --ref <FASTA> --out <VCF>
FLAGS:
-A, --auto_max_cov Automatically calculate mean coverage for region and set max coverage to mean_coverage +
5*sqrt(mean_coverage). (SLOWER)
-S, --stable_alignment Use numerically-stable (logspace) pair HMM forward algorithm. Is significantly slower but
may be more accurate. Tests have shown this not to be necessary for highly error prone
reads (PacBio CLR).
-F, --force_overwrite If output files (VCF or variant debug directory) exist, delete and overwrite them.
-x, --max_alignment Use max scoring alignment algorithm rather than pair HMM forward algorithm.
-n, --no_haps Don't call HapCUT2 to phase variants.
--output-ref print reference genotypes (non-variant), use this option only in combination with -v
option.
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-b, --bam <BAM> sorted, indexed BAM file with error-prone reads
-f, --ref <FASTA> indexed FASTA reference that BAM file is aligned to
-o, --out <VCF> output VCF file with called variants.
-r, --region <string> Region in format <chrom> or <chrom:start-stop> in which to call variants
(1-based, inclusive).
-v, --potential_variants <VCF> Genotype and phase the variants in this VCF instead of using pileup
method to find variants. NOTES: VCF must be gzipped and tabix indexed or
contain contig information. Use with caution because excessive false
potential variants can lead to inaccurate results. Every variant is used
and only the allele fields are considered -- Genotypes, filters,
qualities etc are ignored. Indel variants will be genotyped but not
phased. Triallelic variants and structural variants are currently not
supported.
-O, --out_bam <BAM> Write new bam file with haplotype tags (HP:i:1 and HP:i:2) for reads
assigned to each haplotype, any existing HP and PS tags are removed
-c, --min_cov <int> Minimum coverage (of reads passing filters) to consider position as a
potential SNV. [default: 6]
-C, --max_cov <int> Maximum coverage (of reads passing filters) to consider position as a
potential SNV. [default: 8000]
-q, --min_mapq <int> Minimum mapping quality to use a read. [default: 20]
-a, --min_allele_qual <float> Minimum estimated quality (Phred-scaled) of allele observation on read to
use for genotyping/haplotyping. [default: 7.0]
-y, --hap_assignment_qual <float> Minimum quality (Phred-scaled) of read->haplotype assignment (for read
separation). [default: 20.0]
-Q, --potential_snv_cutoff <float> Consider a site as a potential SNV if the original PHRED-scaled QUAL
score for 0/0 genotype is below this amount (a larger value considers
more potential SNV sites). [default: 20.0]
-e, --min_alt_count <int> Require a potential SNV to have at least this many alternate allele
observations. [default: 3]
-E, --min_alt_frac <float> Require a potential SNV to have at least this fraction of alternate
allele observations. [default: 0.125]
-L, --hap_converge_delta <float> Terminate the haplotype/genotype iteration when the relative change in
log-likelihood falls below this amount. Setting a larger value results in
faster termination but potentially less accurate results. [default:
0.0001]
-l, --anchor_length <int> Length of indel-free anchor sequence on the left and right side of read
realignment window. [default: 6]
-m, --max_snvs <int> Cut off variant clusters after this many variants. 2^m haplotypes must be
aligned against per read for a variant cluster of size m. [default: 3]
-W, --max_window <int> Maximum "padding" bases on either side of variant realignment window
[default: 50]
-I, --max_cigar_indel <int> Throw away a read-variant during allelotyping if there is a CIGAR indel
(I/D/N) longer than this amount in its window. [default: 20]
-B, --band_width <Band width> Minimum width of alignment band. Band will increase in size if sequences
are different lengths. [default: 20]
-D, --density_params <string> Parameters to flag a variant as part of a "dense cluster". Format
<n>:<l>:<gq>. If there are at least n variants within l base pairs with
genotype quality >=gq, then these variants are flagged as "dn" [default:
10:500:50]
-s, --sample_id <string> Specify a sample ID to write to the output VCF [default: SAMPLE]
--hom_snv_rate <float> Specify the homozygous SNV Rate for genotype prior estimation [default:
0.0005]
--het_snv_rate <float> Specify the heterozygous SNV Rate for genotype prior estimation [default:
0.001]
--ts_tv_ratio <float> Specify the transition/transversion rate for genotype grior estimation
[default: 0.5]
-P, --strand_bias_pvalue_cutoff <float> Remove a variant if the allele observations are biased toward one strand
(forward or reverse) according to Fisher's exact test. Use this cutoff
for the two-tailed P-value. [default: 0.01]
-d, --variant_debug_dir <path> write out current information about variants at each step of algorithm to
files in this directory
Call variants with default parameters:
longshot --bam pacbio.bam --ref ref.fa --out output.vcf
Call variants for chromosome 1 only using the automatic max coverage cutoff:
longshot -A -r chr1 --bam pacbio.bam --ref ref.fa --out output.vcf
Call variants in a 500 kb region and then output the reads into reads.bam
using a haplotype assignment threshold of 30:
longshot -r chr1:1000000-1500000 -y 30 -O reads.bam --bam pacbio.bam --ref ref.fa --out output.vcf
If a read has an assigned haplotype, it will get a tag HP:i:1
or HP:i:2
and tag PS:i:x
where x
is a phase set number of the variants it covers.
- It is highly recommended to use reads with at least 30x coverage.
- It is recommended to process chromosomes separately using the
--region
option. - Longshot has only been tested using data from humans. Results may vary with organisms with significantly higher or lower SNV rate.
- It is important to set a reasonable max read coverage cutoff (
-C
option) to filter out sites coinciding with genomic features such as CNVs which can be problematic for variant calling. If the-A
option is used, Longshot will estimate the mean read coverage and set the max coverage tomean_cov+5*sqrt(mean_cov)
, which we have found to be a reasonable filter in practice for humans. - CNVs and mapping issues can result in dense clusters of false positive SNVs. Longshot will attempt to find clusters like this and mark them as "dn" in the FILTER field. The
--density_params
option is used to control which variants are flagged as "dn". The default parameters have been found to be effective for human sequencing data, but this option may need to be tweaked for other organisms with SNV rates significantly different from human. - Oxford Nanopore Technology (ONT) SMS reads are now officially supported. It is recommended to use the default
--strand_bias_pvalue_cutoff
of 0.01 for ONT reads, since this option filters out false SNV sites prior to variant calling.
Check that the Rust version is 1.30.0 or higher:
rustc --version
If not, update Rust using this command:
rustup update
For example:
error: linking with `cc` failed: exit code: 1
...
...
...
= note: Non-UTF-8 output: /usr/bin/ld: /home/pedge/temp/longshot/target/release/build/longshot-347f3774e75b380c/out/libhapcut2.a(common.o)(.text.fprintf_time+0x81): unresolvable H\x89\\$\xe8H\x89l$\xf0H\x89\xf3L\x89d$\xf8H\x83\xec\x18H\x8bG\x10H\x89\xfdI\x89\xd4H\x89\xd6H\x8b;\xffPxH\x8bE\x10I\x8dt$\x08H\x8b{\x08\xffPxH\x8bE\x10H\x8b{\x10I\x8dt$\x10H\x8b\x1c$H\x8bl$\x08L\x8bd$\x10H\x8b@xH\x83\xc4\x18\xff\xe0f\x90H\x89\\$\xe8H\x89l$\xf0H\x89\xfbL\x89d$\xf8H\x83\xec\x18H\x8bG\x10I\x89\xd4H\x89\xf5H\x89\xf7\xffPhI\x89\x04$H\x8bC\x10H\x8d}\x08\xffPhH\x8b\x1c$I\x89D$\x08H\x8bl$\x08L\x8bd$\x10H\x83\xc4\x18\xc3\x0f\x1f relocation against symbol `time@@GLIBC_2.2.5\'\n/usr/bin/ld: BFD version 2.20.51.0.2-5.42.el6 20100205 internal error, aborting at reloc.c line 443 in bfd_get_reloc_size\n\n/usr/bin/ld: Please report this bug.\n\ncollect2: ld returned 1 exit status\n
...
...
...
Your system may have multiple versions of your linker that are causing a conflict. Rustc may be calling to a different or old version of the linker. In this case, specify the linker (in linux, gcc) as follows:
rustc -vV
Note the build target after "host: ", i.e. "x86_64-unknown-linux-gnu".
mkdir .cargo
nano .cargo/config
edit the config file to have these contents:
[target.<target-name>]
linker = "</path/to/linker>"
for example,
[target.x86_64-unknown-linux-gnu]
linker = "/opt/gnu/gcc/bin/gcc"
then,
cargo clean
cargo build --release