Skip to content

Latest commit

 

History

History
196 lines (127 loc) · 6.29 KB

README.rst

File metadata and controls

196 lines (127 loc) · 6.29 KB

Mappy: Minimap2 Python Binding

Mappy provides a convenient interface to minimap2, a fast and accurate C program to align genomic and transcribe nucleotide sequences.

Installation

Mappy depends on zlib. It can be installed with pip:

pip install --user mappy

or from the minimap2 github repo (Cython required):

git clone https://github.com/lh3/minimap2
cd minimap2
python setup.py install

Usage

The following Python script demonstrates the key functionality of mappy:

import mappy as mp
a = mp.Aligner("test/MT-human.fa")  # load or build index
if not a: raise Exception("ERROR: failed to load/build index")
s = a.seq("MT_human", 100, 200)     # retrieve a subsequence from the index
print(mp.revcomp(s))                # reverse complement
for name, seq, qual in mp.fastx_read("test/MT-orang.fa"): # read a fasta/q sequence
        for hit in a.map(seq): # traverse alignments
                print("{}\t{}\t{}\t{}".format(hit.ctg, hit.r_st, hit.r_en, hit.cigar_str))

APIs

Mappy implements two classes and two global function.

Class mappy.Aligner

mappy.Aligner(fn_idx_in=None, preset=None, ...)

This constructor accepts the following arguments:

  • fn_idx_in: index or sequence file name. Minimap2 automatically tests the file type. If a sequence file is provided, minimap2 builds an index. The sequence file can be optionally gzip'd. This option has no effect if seq is set.
  • seq: a single sequence to index. The sequence name will be set to N/A.
  • preset: minimap2 preset. Currently, minimap2 supports the following presets: sr for single-end short reads; map-pb for PacBio read-to-reference mapping; map-ont for Oxford Nanopore read mapping; splice for long-read spliced alignment; asm5 for assembly-to-assembly alignment; asm10 for full genome alignment of closely related species. Note that the Python module does not support all-vs-all read overlapping.
  • k: k-mer length, no larger than 28
  • w: minimizer window size, no larger than 255
  • min_cnt: mininum number of minimizers on a chain
  • min_chain_score: minimum chaing score
  • bw: chaining and alignment band width
  • best_n: max number of alignments to return
  • n_threads: number of indexing threads; 3 by default
  • extra_flags: additional flags defined in minimap.h
  • fn_idx_out: name of file to which the index is written. This parameter has no effect if seq is set.
  • scoring: scoring system. It is a tuple/list consisting of 4, 6 or 7 positive integers. The first 4 elements specify match scoring, mismatch penalty, gap open and gap extension penalty. The 5th and 6th elements, if present, set long-gap open and long-gap extension penalty. The 7th sets a mismatch penalty involving ambiguous bases.
mappy.Aligner.map(seq, seq2=None, cs=False, MD=False)

This method aligns seq against the index. It is a generator, yielding a series of mappy.Alignment objects. If seq2 is present, mappy performs paired-end alignment, assuming the two ends are in the FR orientation. Alignments of the two ends can be distinguished by the read_num field (see Class mappy.Alignment below). Argument cs asks mappy to generate the cs tag; MD is similar. These two arguments might slightly degrade performance and are not enabled by default.

mappy.Aligner.seq(name, start=0, end=0x7fffffff)

This method retrieves a (sub)sequence from the index and returns it as a Python string. None is returned if name is not present in the index or the start/end coordinates are invalid.

mappy.Aligner.seq_names

This property gives the array of sequence names in the index.

Class mappy.Alignment

This class describes an alignment. An object of this class has the following properties:

  • ctg: name of the reference sequence the query is mapped to
  • ctg_len: total length of the reference sequence
  • r_st and r_en: start and end positions on the reference
  • q_st and q_en: start and end positions on the query
  • strand: +1 if on the forward strand; -1 if on the reverse strand
  • mapq: mapping quality
  • blen: length of the alignment, including both alignment matches and gaps but excluding ambiguous bases.
  • mlen: length of the matching bases in the alignment, excluding ambiguous base matches.
  • NM: number of mismatches, gaps and ambiguous positions in the alignment
  • trans_strand: transcript strand. +1 if on the forward strand; -1 if on the reverse strand; 0 if unknown
  • is_primary: if the alignment is primary (typically the best and the first to generate)
  • read_num: read number that the alignment corresponds to; 1 for the first read and 2 for the second read
  • cigar_str: CIGAR string
  • cigar: CIGAR returned as an array of shape (n_cigar,2). The two numbers give the length and the operator of each CIGAR operation.
  • MD: the MD tag as in the SAM format. It is an empty string unless the MD argument is applied when calling mappy.Aligner.map().
  • cs: the cs tag.

An Alignment object can be converted to a string with str() in the following format:

q_st  q_en  strand  ctg  ctg_len  r_st  r_en  mlen  blen  mapq  cg:Z:cigar_str

It is effectively the PAF format without the QueryName and QueryLength columns (the first two columns in PAF).

Miscellaneous Functions

mappy.fastx_read(fn, read_comment=False)

This generator function opens a FASTA/FASTQ file and yields a (name,seq,qual) tuple for each sequence entry. The input file may be optionally gzip'd. If read_comment is True, this generator yields a (name,seq,qual,comment) tuple instead.

mappy.revcomp(seq)

Return the reverse complement of DNA string seq. This function recognizes IUB code and preserves the letter cases. Uracil U is complemented to A.