this will become version 3 at arXiv
lh3 committed Nov 6, 2017
1 parent 2191ac5 commit 1617b87
Showing 1 changed file with 34 additions and 19 deletions.
53 changes: 34 additions & 19 deletions tex/minimap2.tex
@@ -34,7 +34,7 @@ \section{Motivation:} Recent advances in sequencing technologies promise
length. Existing alignment programs are unable or inefficient to process such data
at scale, which presses for the development of new alignment algorithms.

\section{Results:} Minimap2 is a general-purpose mapper to align DNA or long
\section{Results:} Minimap2 is a general-purpose alignment program to map DNA or long
mRNA sequences against a large reference database. It works with accurate short
reads of $\ge$100bp in length, $\ge$1kb genomic reads at error rate $\sim$15\%,
full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
@@ -66,15 +66,22 @@ \section{Introduction}
approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
\citet{Suzuki130633} extended our work with a fast and novel algorithm on
generating base-level alignment, which in turn inspired us to develop minimap2
towards higher accuracy and more practical functionality.
with added functionality.

Both SMRT and ONT have been applied to the sequencing of spliced mRNAs (RNA-seq). While
traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
optimized for long noisy sequence reads and are tens of times slower than
dedicated long-read aligners. When developing minimap2 initially for aligning
genomic DNA only, we realized minor modifications could make it competitive for
aligning mRNAs as well. Minimap2 is a first RNA-seq aligner specifically
designed for long noisy reads.
genomic DNA only, we realized minor modifications could enable the base
algorithm to map mRNAs as well. Minimap2 becomes a first RNA-seq aligner
specifically designed for long noisy reads. We have also extended the original
algorithm to map short reads at a speed faster than several mainstream
short-read mappers.

In this article, we will describe the minimap2 algorithm and its applications
to different types of input sequences. We will evaluate the performance and
accuracy of minimap2 on several simulated and real data sets and demonstrate
the versatility of minimap2.

\begin{methods}
\section{Methods}
@@ -366,12 +373,12 @@ \subsection{Aligning spliced sequences}

In the spliced alignment mode, minimap2 further increases the density of
minimizers and disables banded alignment. Together with the two-round DP-based
alignment, spliced alignment is several times slower than DNA sequence
alignment, spliced alignment is several times slower than genomic DNA
alignment.

\subsection{Aligning short paired-end reads}

During chaining, minimap2 takes a pair of reads as one read with a gap of
During chaining, minimap2 takes a pair of reads as one fragment with a gap of
unknown length in the middle. It applies a normal gap cost between seeds on the
same read but a more permissive gap cost between seeds on different reads.
More precisely, the gap cost during chaining is:
@@ -423,9 +430,7 @@ \subsection{Aligning long genomic reads}
and LAMSA~\citep{Liu:2017aa} because they either
crashed or produced malformatted output. In this evaluation, minimap2 has
higher power to distinguish unique and repetitive hits, and achieves overall
higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
even if we skip DP-based alignment (data not shown), confirming chaining alone
is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and
NGMLR provide better mapping quality estimate: they rarely give repetitive hits
high mapping quality. Apparently, other aligners may
occasionally miss close suboptimal hits and be overconfident in wrong mappings.
@@ -498,10 +503,10 @@ \subsection{Aligning long spliced reads}
We have also evaluated spliced aligners on public Iso-Seq data (human Alzheimer
brain from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The
observation is similar: minimap2 is faster at higher junction accuracy.
On a private Nanopore Direct RNA data set with $>$20\% sequencing error rate
(M\"{u}ller et al, personal communication), minimap2 aligned 940,346 introns
from 239,976 mapped reads with 88.5\% of them consistent with human gene
annotations. In comparison, only 40.3\% of GMAP introns are found in known gene
On a private Nanopore Direct RNA data set with $\sim$17\% sequencing error rate
(N. Loman, personal communication), minimap2 aligned 96\,467 introns
from 37\,068 mapped reads with 95.4\% of them consistent with human gene
annotations. In comparison, only 74.8\% of GMAP introns are found in known gene
annotations.

We noted that GMAP and SpAln have not been optimized for noisy reads. We are
@@ -551,24 +556,23 @@ \subsection{Aligning short genomic reads}
ERR1341796. In this evaluation, minimap2 has higher SNP false negative rate
(FNR; 2.5\% of minimap2 vs 2.2\% of BWA-MEM), but fewer false positive SNPs per
million bases (FPPM; 3.0 vs 3.9), lower 2--50bp INDEL FNR (7.3\% vs 7.5\%) and
similar INDEL FPPM (both 1.0). In comparison, Bowtie2 has a SNP FNR of 4.7\%
and INDEL FNR of 10.4\%. Minimap2 is broadly similar to BWA-MEM in the context
of small variant calling.
similar INDEL FPPM (both 1.0). Minimap2 is broadly similar to BWA-MEM in the
context of small variant calling.
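
The two metrics quoted above can be illustrated with a minimal sketch. It assumes the conventional definitions (FNR as missed true variants over all true variants, FPPM as false positive calls scaled to one million assessed bases); the function names and the counts are hypothetical and are not taken from the evaluation in the text.

```python
def false_negative_rate(true_positives, false_negatives):
    """Fraction of true variants that the caller missed."""
    return false_negatives / (true_positives + false_negatives)


def false_positives_per_million(false_positives, assessed_bases):
    """False positive calls scaled to one million assessed bases."""
    return false_positives * 1_000_000 / assessed_bases


# Hypothetical counts chosen only to make the arithmetic easy to follow.
fnr = false_negative_rate(true_positives=975, false_negatives=25)
fppm = false_positives_per_million(false_positives=3, assessed_bases=1_000_000)
print(f"FNR = {fnr:.1%}, FPPM = {fppm:.1f}")
```

A lower FNR means fewer missed variants, while a lower FPPM means fewer spurious calls, so the two numbers describe sensitivity and specificity respectively.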

\subsection{Other applications}

Minimap2 retains minimap's functionality to find overlaps between long reads
and to search against large multi-species databases such as \emph{nt} from
NCBI. Minimap2 can also align similar genomes or different assemblies of the
same species. It took 7 wall-clock minutes over 8 CPU cores to align a human
SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times as fast as
SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster than
MUMmer4~\citep{Kurtz:2004zr}.

\section{Discussions}

Minimap2 is a versatile mapper and pairwise aligner for nucleotide sequences.
It works with short reads, assembly contigs and long noisy genomic and RNA-seq
reads. It can be used as a read mapper, long-read overlapper or a full-genome
reads, and can be used as a read mapper, long-read overlapper or a full-genome
aligner. Minimap2 is also accurate and efficient, often outperforming other
domain-specific alignment tools in terms of both speed and accuracy.

@@ -586,6 +590,17 @@ \section{Discussions}
spliced reads and multiple reads per fragment. This gives us the opportunity to
extend the same base algorithm to a variety of use cases.

Modern mainstream aligners often use a full-text index, such as suffix array or
FM-index, to index reference sequences. An advantage of this approach is that
we can use exact seeds of arbitrary lengths, which helps to increase seed
uniqueness and reduce unsuccessful extensions. Minimap2 indexes reference
k-mers with a hash table instead. Such fixed-length seeds are inferior to
variable-length seeds in theory, but can be computed much more efficiently in
practice. When a query sequence has multiple seed hits, we can afford to skip
some highly repetitive seeds without affecting the final accuracy. This further
alleviates the concern with the uniqueness of seeds. A hash table is the ideal
data structure for mapping long query sequences.
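
The idea in the paragraph above can be sketched as follows. This is a simplified illustration, not minimap2's implementation: it indexes every reference k-mer rather than a sampled subset such as minimizers, and the names `build_kmer_index`, `seed_hits` and the threshold `max_occ` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(ref, k):
    """Map each fixed-length k-mer of `ref` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def seed_hits(index, query, k, max_occ):
    """Collect (query_pos, ref_pos) seed hits, skipping repetitive k-mers.

    A k-mer that occurs more than `max_occ` times in the reference is
    ignored; the remaining seeds are assumed to be enough to anchor the query.
    """
    hits = []
    for j in range(len(query) - k + 1):
        positions = index.get(query[j:j + k], ())
        if len(positions) > max_occ:
            continue  # highly repetitive seed, skipped
        hits.extend((j, p) for p in positions)
    return hits


ref = "ACGTACGTTTGACCA"
index = build_kmer_index(ref, k=4)
# "ACGT" occurs twice in `ref`, so it is skipped when max_occ=1.
hits = seed_hits(index, "ACGTTTGAC", k=4, max_occ=1)
print(hits)
```

In this toy input the repetitive seed is dropped, yet the remaining hits all lie on the same diagonal (reference position minus query position), which is the sense in which skipping some repetitive seeds need not hurt a query with many seed hits.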

\section*{Acknowledgements}
We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their
masterpiece and insightful notes before formal publication. We thank M.