MMseqs2 Release 13-45111 (soedinglab/MMseqs2)

New Taxonomy Workflow (new feature and breaking change)

We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, doi: 10.1101/2020.11.27.401018 (2020)

The nucleotide-to-protein taxonomic assignment is now much faster and is optimized towards annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the --orf-filter 0 parameter to disable the new filter stage as it can reject too many short query sequences. MMseqs2 is still considerably faster with this parameter set.
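For short-read input, the filter stage can be switched off on the command line. The following is a minimal sketch, assuming the `taxonomy` module is called directly; the database names (`readsDB`, `targetDB`, `taxResult`) and the `tmp` directory are hypothetical placeholders, and only `--orf-filter 0` is taken from the notes above.

```shell
#!/bin/sh
# Hypothetical names: a sequence database built from short reads and a
# protein reference database that already carries taxonomy information.
QUERY_DB="readsDB"
TARGET_DB="targetDB"
RESULT_DB="taxResult"
TMP_DIR="tmp"

# --orf-filter 0 disables the new filter stage (flag taken from the notes above).
CMD="mmseqs taxonomy $QUERY_DB $TARGET_DB $RESULT_DB $TMP_DIR --orf-filter 0"
echo "$CMD"

# Run only if mmseqs is installed and both input databases exist.
if command -v mmseqs >/dev/null 2>&1 && [ -e "$QUERY_DB" ] && [ -e "$TARGET_DB" ]; then
    $CMD
fi
```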

As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (--lca-mode 4) by default. Approximate 2bLCA (see manuscript) is now the default again, and we automatically switch to top-hit when given nucleotide-to-nucleotide input.
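If a pipeline relied on the previous top-hit default, the mode can be pinned explicitly so results stay comparable across releases. This is a sketch under the same assumptions as any `taxonomy` call: the database names and `tmp` directory are hypothetical, and `--lca-mode 4` is the top-hit mode named above.

```shell
#!/bin/sh
# Hypothetical names: queryDB holds nucleotide contigs, targetDB is a
# protein reference database with taxonomy information.
QUERY_DB="queryDB"
TARGET_DB="targetDB"
RESULT_DB="taxResult_tophit"
TMP_DIR="tmp"

# --lca-mode 4 selects top-hit assignment (the default of the previous release).
LCA_MODE=4
CMD="mmseqs taxonomy $QUERY_DB $TARGET_DB $RESULT_DB $TMP_DIR --lca-mode $LCA_MODE"
echo "$CMD"

# Run only if mmseqs is installed and both input databases exist.
if command -v mmseqs >/dev/null 2>&1 && [ -e "$QUERY_DB" ] && [ -e "$TARGET_DB" ]; then
    $CMD
fi
```

Leaving `--lca-mode` unset picks up the new default described above.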

Breaking changes

--slice-search is now called --exhaustive-search
Unify --compress --summarize --omit-consensus (in result2msa) to --msa-format-mode
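Existing scripts that still pass the old search flag need a one-line change. As a rough sketch, the rename can be applied mechanically with `sed`; the command line below is a hypothetical example of an older invocation. The `result2msa` flags are not covered here, because these notes do not state which `--msa-format-mode` value corresponds to each old flag.

```shell
#!/bin/sh
# Hypothetical command line written for an older release.
OLD_CMD="mmseqs search queryDB targetDB resultDB tmp --slice-search"

# Replace the old flag name with the new one; everything else is left untouched.
NEW_CMD=$(printf '%s\n' "$OLD_CMD" | sed 's/--slice-search/--exhaustive-search/g')

echo "$NEW_CMD"
# prints: mmseqs search queryDB targetDB resultDB tmp --exhaustive-search
```

The same `sed` expression can be pointed at a script file instead of a variable.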

Features

Add GTDB and CDD to databases downloader #410
Add nrtotaxmapping to create taxonomy mapping from NR
Add unpackdb to split a database into separate files #406
Add majoritylca module for majority voting based taxonomy from alignment results
Add cpdb and lndb
Taxonomy information is stored in binary format (a single db_taxonomy file, instead of db_{names,nodes,merged}.dmp and db_mapping) to speed up read-in. Old format is still supported.
--exhaustive-search is usable with ungapped alignments (--alignment-mode 4)
Allow sequence/result database input in taxonomyreport #401/#408
msa2profile/result can skip the first sequence with --skip-query
createtaxdb can create a taxdb by mapping through .source in addition to .lookup (--tax-mapping-mode 1)
splitsequence can create a sequence database with original headers
align can return short cluster format if only identifiers are required (--alignment-output-mode)
tar2db can be used multi-threaded if input allows (e.g. .tar containing .gz files)
Encode species names in taxonomy blocklist to make sure we don't block random nodes (e.g. in GTDB)
Split non-index parts over additional files in split index case to reduce peak memory use
proteinaln2nucl can now compute scores and e-values
createdb can create a sequence database from a database containing fasta files (e.g. created by tar2db)
Add MMSEQS_FORCE_MERGE environment variable to force generating fully merged databases
Improved many descriptions, warnings and error messages
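As an illustration of the extended downloader, the sketch below fetches GTDB with the `databases` module. The output name `gtdb` and the `tmp` directory are hypothetical, and `RUN_DOWNLOAD` is a variable made up for this example so that the large download only starts when explicitly requested.

```shell
#!/bin/sh
# GTDB is one of the newly supported database names (see the list above).
DB_NAME="GTDB"
# Hypothetical output database name and temporary directory.
OUT_DB="gtdb"
TMP_DIR="tmp"

DOWNLOAD_CMD="mmseqs databases $DB_NAME $OUT_DB $TMP_DIR"
echo "$DOWNLOAD_CMD"

# The download is large, so it only runs when RUN_DOWNLOAD=1 is set
# (a guard invented for this sketch) and mmseqs is installed.
if [ "${RUN_DOWNLOAD:-0}" = "1" ] && command -v mmseqs >/dev/null 2>&1; then
    $DOWNLOAD_CMD
fi
```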

Bugs fixed

Fix filterresult off by one issue removing wrong sequences
Fix addtaxonomy always crashing due to invalid check #355
Reduce number of calls to posix_memalign to fix lock contention on macOS
extractorfs doesn't flood warnings due to short sequences anymore
expand2profile --pca is correctly set to 0
msa2profile always copies .lookup/source files instead of symlinking
Clustering of clustering input would not work with set-cover or connected-component
Short circuit --cluster-reassign if nothing can be reassigned
Fix temporary files not getting removed in linclust/cluster with --remove-tmp-files
Fix kmermatcher setting user k-mer pattern in auto k-mer selection and breaking
Krona taxonomyreport was not working if no sequence was unclassified
Make Matcher::resultToBuffer buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
Fix multiple locations where Util::checkAllocation could never be called as it would have crashed before
Whitespace containing parameters do not break workflows anymore (e.g. passing whitespaces to --sub-mat)
taxonomyreport and addtaxonomy parameters were not adjustable in easy-taxonomy
E-value parameters are now correctly parsed as doubles instead of floats #379
Add symlinks to splitdb #376
Increase maximum number of open files in DBReader
Include file size and modified date of inputs in temporary file hash calculation #372
--cov-mode 5 was not working #371
Database downloader deals correctly with redirects now
result2profile could crash if target database contained much longer sequences than query database
Stop symlinking header database (and other ancillary files) in filterresult

Developer

Add vector of predefined substitution matrices to add additional matrices in subprojects
Don't create false _has_{builtin,attribute} macros (see simd-everywhere/simde#691)
Add USE_SYSTEM_ZSTD cmake flag to use system provided zstd #411
Replace texlive with tectonic for faster/prettier userguide
Add more instructions to simd.h
Add initial fixes to get MMseqs2 working on s390x (work in progress)
Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
Build ARM64/PPC64LE binaries by cross-compiling
Add missing licenses and READMEs for vendored libraries #403
Update ALP to 1.98
Update xxhash to v0.8.0
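For anyone building from source, the new zstd option is passed at configure time. A minimal sketch, assuming a source checkout in a hypothetical `MMseqs2` directory, a separate `build` directory, and a CMake version that supports `-S`/`-B` (3.13 or later); only the `USE_SYSTEM_ZSTD` flag name comes from the notes above.

```shell
#!/bin/sh
# Hypothetical locations: the source checkout and an out-of-tree build directory.
SRC_DIR="MMseqs2"
BUILD_DIR="build"

# -DUSE_SYSTEM_ZSTD=1 asks the build to use the zstd installed on the system
# instead of the vendored copy (1 is ordinary CMake boolean syntax).
CONFIGURE_CMD="cmake -S $SRC_DIR -B $BUILD_DIR -DCMAKE_BUILD_TYPE=Release -DUSE_SYSTEM_ZSTD=1"
echo "$CONFIGURE_CMD"

# Configure and build only if cmake is installed and the checkout is present.
if command -v cmake >/dev/null 2>&1 && [ -f "$SRC_DIR/CMakeLists.txt" ]; then
    $CONFIGURE_CMD && cmake --build "$BUILD_DIR"
fi
```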
