MMseqs2 Release 8-fac81
martin-steinegger released this on 01 Apr 01:18
At a glance: Faster searches and clustering through improved IO and better seeding. More search modes like tblastx, reciprocal best hit and linsearch. New output format SAM. Support for compressed databases to reduce hard disk and memory requirements.
Known Issues
- Iterative search only works up to 2 iterations
Breaking Changes
- MMseqs2 now saves a lot on IO by not merging result data files. There is still a single `.index` file, but the corresponding data files are split into multiple parts (as many as the number of threads used)
- MMseqs2 now uses the VTML80 [1,2] substitution matrix to speed up the prefiltering (changeable by `--seed-sub-mat`); the final alignment is still computed with Blosum62 (still changeable by `--sub-mat`)
- All databases now have a `.dbtype` file
- MMseqs2 Docker image is now based on Debian instead of Alpine
- Changed the ORF header format to be more space efficient. The new format is `orignIdentifer startPos(-/+)len flag`
- `prefilter` returns ungapped-alignment scores instead of E-values
- `createindex`: the file extension is now `.idx` instead of the previous `.[s]k[6,7]` format
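Scripts that expected one merged data file per result database may need adjusting now that the data is split. The following is a minimal sketch of how a script could count the parts of a result database. It assumes the parts are named `<prefix>.0`, `<prefix>.1`, and so on, which is not stated above and should be checked against your own output. The helper `count_data_parts` and the mock file names are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Count the data parts that belong to a result database prefix.
# Assumption: parts are named <prefix>.0, <prefix>.1, ...; verify on your own output.
count_data_parts() {
    prefix=$1
    count=0
    for part in "$prefix".[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$part" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}

# Demonstration on mock files in a temporary directory (no mmseqs run needed).
demo_dir=$(mktemp -d)
touch "$demo_dir/aln.index" "$demo_dir/aln.dbtype" \
      "$demo_dir/aln.0" "$demo_dir/aln.1" "$demo_dir/aln.2"
count_data_parts "$demo_dir/aln"
rm -rf "$demo_dir"
```

The glob deliberately excludes `aln.index` and `aln.dbtype`, so only the numbered parts are counted.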
Features
- Support for tblastx-style nucl-nucl translated searches: `mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 2`
- Support for nucleotide searches: `mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 3`
- `convertalis` has learned to return SAM formatted output (preview)
- Databases can be compressed by applying zstd to each entry (`--compressed 1`)
  - Also added `compress` and `decompress` modules
- Added `rbh` workflow for reciprocal best hit searches
- `linclust` can now cluster nucleotide sequences on both forward and reverse strands
- Added `linsearch`, a lightning fast search for protein and nucleotide sequences (preview; the easy workflow variant `easy-linsearch` was also added)
- `createlinindex` computes an index for `linsearch`
- `taxonomy` uses `--orf-start-mode 1` to annotate more sequences
- Added approximate 2bLCA to speed up computation; this is now the default. The old mode can be turned on with `--lca-mode 2`
- `createdb` recognizes sequences containing Uracil as DNA sequences
- `createdb` is now faster through speeding up its shuffle operations
- Added `view` module to view a single entry in an MMseqs2 database
- `align` module has learned the `--min-aln-len` parameter to filter by minimal alignment length
- Alignment modules (`rescorediagonal`, `align`) can align longer sequences now (not limited to 2^15 length)
- Input sequences can now be softmasked (lower letter masking) instead of only hard masked (replacing with X) using `--mask-lower-case`. The masking only applies to the prefilter stages `kmermatcher` or `prefilter` and can be combined with `--mask`
- `filterdb` has learned the `--filter-expression` parameter and mode, which allows filtering by simple mathematical expressions
- `alignbykmer` can be used for nucleotide searches
- MMseqs2 did-you-mean functionality gives better suggestions
- MMseqs2 does not repeat the whole parameter list for each submodule call anymore
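To show how the new search modes and workflows might be strung together, here is a hedged dry-run sketch. Only the two `mmseqs search` invocations are taken from the notes above. The argument order used for `rbh`, `createlinindex` and `linsearch` is an assumption modeled on `mmseqs search`, and the result names (`aln_tblastx`, `aln_nucl`, `aln_rbh`, `aln_lin`) and the `run` helper are hypothetical. By default the script only prints the commands; it executes them when `MMSEQS_RUN=1` is set and `mmseqs` is on the `PATH`.

```shell
#!/bin/sh
# Dry-run helper: print the command unless MMSEQS_RUN=1 and mmseqs is installed.
run() {
    if [ "${MMSEQS_RUN:-0}" = "1" ] && command -v mmseqs >/dev/null 2>&1; then
        "$@"
    else
        echo "$*"
    fi
}

# Placeholder database and directory names, as used in the notes above.
QUERY=nuclDB1
TARGET=nuclDb2
TMP=tmp

# tblastx-style translated nucleotide search.
translated_search() {
    run mmseqs search "$QUERY" "$TARGET" aln_tblastx "$TMP" --search-mode 2
}

# Direct nucleotide search.
nucleotide_search() {
    run mmseqs search "$QUERY" "$TARGET" aln_nucl "$TMP" --search-mode 3
}

# Reciprocal best hit workflow (argument order is an assumption).
reciprocal_best_hits() {
    run mmseqs rbh "$QUERY" "$TARGET" aln_rbh "$TMP"
}

# linsearch with its index built first (argument order is an assumption).
linear_search() {
    run mmseqs createlinindex "$TARGET" "$TMP"
    run mmseqs linsearch "$QUERY" "$TARGET" aln_lin "$TMP"
}

translated_search
nucleotide_search
reciprocal_best_hits
linear_search
```

Check each module's `-h` output for the authoritative argument list before switching off the dry run.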
Bugs
- Default parameters of `map` workflow are now set correctly
- Some modules were using the wrong coverage parameter
- Sliced profile search was losing high E-value hits
- Sliced profile search is now stable
- Profile-Sequence alignment E-values were slightly too high
- `result2msa` was crashing with profiles on the target side
- `result2msa` should not crash with `--allow-deletion` anymore
- Some parameters were never visible (with or without `-h`)
- Various issues with MPI were resolved
Developers
- Continuous integration now enforces no compile warnings
- Continuous integration now tries to build for AArch64 with Docker and Qemu
- We added a first draft of our developer guide to the wiki
References
[1] Müller T, Vingron M. Modeling amino acid replacement. J Comput Biol. 2000;7:761–76. doi: 10.1089/10665270050514918
[2] Müller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002;19:8–13. doi: 10.1093/oxfordjournals.molbev.a003985