MMseqs2 Release 13-45111
New Taxonomy Workflow (new feature and breaking change)
We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:
The nucleotide-to-protein taxonomic assignment is now much faster and is optimized towards annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the --orf-filter 0 parameter to disable the new filter stage, as it can reject too many short query sequences. MMseqs2 is still considerably faster with this parameter set.
As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (--lca-mode 4) as default. Approximate 2bLCA (see manuscript) is now the default again, and we automatically switch to top-hit if given nucleotide-to-nucleotide input.
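As a rough illustration of the two parameters mentioned above, the following Python sketch only assembles a taxonomy command line; it does not run MMseqs2. The helper name taxonomy_command and the database names (readsDB, refDB, taxResult, tmp) are placeholders invented for this example. The flags --orf-filter 0 and --lca-mode 4 are the ones named in these notes.

```python
import shlex


def taxonomy_command(query_db, target_db, result_db, tmp_dir,
                     short_reads=False, top_hit_lca=False):
    """Assemble (but do not execute) an `mmseqs taxonomy` command line."""
    cmd = ["mmseqs", "taxonomy", query_db, target_db, result_db, tmp_dir]
    if short_reads:
        # Disable the new ORF filter stage, which can reject short queries.
        cmd += ["--orf-filter", "0"]
    if top_hit_lca:
        # Explicitly request top-hit assignment instead of the default mode.
        cmd += ["--lca-mode", "4"]
    return cmd


# Hypothetical database names, used only for illustration.
reads_cmd = taxonomy_command("readsDB", "refDB", "taxResult", "tmp",
                             short_reads=True)
print(shlex.join(reads_cmd))
```

The resulting list can be passed to subprocess.run once the placeholder names are replaced with real databases.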
Breaking changes
- --slice-search is now called --exhaustive-search
- Unify --compress, --summarize, --omit-consensus (in result2msa) to --msa-format-mode
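Scripts written for earlier releases may still pass the old parameter names. The sketch below is one possible way to update an argument list: it renames --slice-search and only warns about the three result2msa flags, since these notes do not state which --msa-format-mode value replaces each of them. The function migrate_args and the example database names are hypothetical.

```python
# Flags renamed one-to-one in this release.
RENAMED = {"--slice-search": "--exhaustive-search"}
# Flags folded into --msa-format-mode; the matching values are not listed
# in these notes, so they are reported rather than rewritten.
UNIFIED = {"--compress", "--summarize", "--omit-consensus"}


def migrate_args(args):
    """Return (new_args, warnings) for an argument list from an older release."""
    new_args, warnings = [], []
    for arg in args:
        if arg in RENAMED:
            new_args.append(RENAMED[arg])
            continue
        new_args.append(arg)
        if arg in UNIFIED:
            warnings.append(
                f"{arg} was unified into --msa-format-mode; "
                "check the result2msa help for the matching value"
            )
    return new_args, warnings


old = ["mmseqs", "search", "queryDB", "targetDB", "resultDB", "tmp",
       "--slice-search"]
new, notes = migrate_args(old)
print(" ".join(new))
```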
Features
- Add GTDB and CDD to databases downloader #410
- Add nrtotaxmapping to create taxonomy mapping from NR
- Add unpackdb to split a database into separate files #406
- Add majoritylca module for majority voting based taxonomy from alignment results
- Add cpdb and lndb
- Taxonomy information is stored in binary format (a single db_taxonomy file, instead of db_{names,nodes,merged}.dmp, db_mapping) to speed up read-in. Old format is still supported.
- --exhaustive-search is usable with ungapped alignments (--alignment-mode 4)
- Allow sequence/result database input in taxonomyreport #401/#408
- msa2profile/result can skip the first sequence with --skip-query
- createtaxdb can create a taxdb by mapping through .source in addition to .lookup (--tax-mapping-mode 1)
- splitsequence can create a sequence database with original headers
- align can return short cluster format if only identifiers are required (--alignment-output-mode)
- tar2db can be used multi-threaded if input allows (e.g. .tar containing .gz files)
- Encode species names in taxonomy blocklist to make sure we don't block random nodes (e.g. in GTDB)
- Split non-index parts over additional files in split index case to reduce peak memory use
- proteinaln2nucl can now compute scores and e-values
- createdb can create a sequence database from a database containing fasta files (e.g. created by tar2db)
- Add MMSEQS_FORCE_MERGE environment variable to force generating fully merged databases
- Improved many descriptions, warnings and error messages
Bugs fixed
- Fix filterresult off by one issue removing wrong sequences
- Fix addtaxonomy always crashing due to invalid check #355
- Reduce numbers of calls to posix_memalign to fix lock contention on macOS
- extractorfs doesn't flood warnings due to short sequences anymore
- expand2profile --pca is correctly set to 0
- msa2profile always copies .lookup/source files instead of symlinking
- Clustering of clustering input would not work with set-cover or connected-component
- Short circuit --cluster-reassign if nothing can be reassigned
- Fix temporary files not getting removed in linclust/cluster with --remove-tmp-files
- Fix kmermatcher setting user k-mer pattern in auto k-mer selection and breaking
- Krona taxonomyreport was not working if no sequence was unclassified
- Make Matcher::resultToBuffer buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
- Fix multiple locations where Util::checkAllocation could never be called as it would have crashed before
- Parameters containing whitespace do not break workflows anymore (e.g. passing whitespaces to --sub-mat)
- taxonomyreport and addtaxonomy parameters were not adjustable in easy-taxonomy
- E-value parameters are now correctly parsed as doubles instead of floats #379
- Add symlinks to splitdb #376
- Increase maximum number of open files in DBReader
- Include file size and modified date of inputs in temporary file hash calculation #372
- --cov-mode 5 was not working #371
- Database downloader deals correctly with redirects now
- result2profile could crash if target database contained much longer sequences than query database
- Stop symlinking header database (and other ancillary files) in filterresult
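The E-value parsing fix listed above (#379) matters because single-precision floats cannot represent very small values. The Python sketch below illustrates the general precision difference only; it is not the MMseqs2 parser. The helper as_float32 is a hypothetical name that round-trips a value through 32-bit storage.

```python
import struct


def as_float32(value):
    """Round-trip a Python float (a 64-bit double) through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


strict_evalue = 1e-50

# As a double, the threshold keeps its value.
print(strict_evalue > 0.0)

# As a 32-bit float, it lies below the smallest representable magnitude
# and collapses to zero, so the threshold would no longer mean 1e-50.
print(as_float32(strict_evalue))

# Ordinary thresholds also pick up rounding error in single precision.
print(as_float32(1e-3) == 1e-3)
```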
Developer
- Add vector of predefined substitution matrices to add additional matrices in subprojects
- Don't create false __has_{builtin,attribute} macros (see simd-everywhere/simde#691 (comment))
- Add USE_SYSTEM_ZSTD cmake flag to use system provided zstd #411
- Replace texlive with tectonic for faster/prettier userguide
- Add more instructions to simd.h
- Add initial fixes to get MMseqs2 working on s390x (work in progress)
- Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
- Build ARM64/PPC64LE binaries by cross-compiling
- Add missing licenses and READMEs for vendored libraries #403
- Update ALP to 1.98
- Update xxhash to v0.8.0