Mumemto identifies maximal unique matches (multi-MUMs) present across a collection of sequences. Multi-MUMs are defined as maximally matching substrings present in each sequence in a collection exactly once. Additionally, this tool can identify multi-MEMs, maximal exact matches present across sequences, without the uniqueness property. This method uses the prefix-free parse (PFP) algorithm for suffix array construction on large, repetitive collections of text.
This tool uses PFP to efficiently identify multi-MUMs and multi-MEMs. Note that this approach applies only to highly repetitive texts (such as a collection of closely related genomes, likely intra-species, such as a pangenome). We plan to support multi-MUM/MEM finding in more divergent sequences (inter-species, etc.) soon; however, this would be less efficient without the PFP pre-processing step.
The base code for this repo was adapted from the pfp-thresholds repository written by Massimiliano Rossi and the docprofiles repository written by Omar Ahmed.
Mumemto is available on Docker and Singularity:
### if using docker:
docker pull vshiv123/mumemto:latest
docker run vshiv123/mumemto:latest mumemto -h
### if using singularity:
singularity pull docker://vshiv123/mumemto:latest
./mumemto_latest.sif mumemto -h
To get started, use the commands below to download the repository and build the executable. After running the make command below, the mumemto executable will be found in the build/ folder. The build requires the following dependencies: cmake, g++, gcc, libboost, zlib.
git clone https://github.com/vshiv18/mumemto
cd mumemto
mkdir build
cd build && cmake ..
make install
The basic workflow with mumemto is to compute the PFP over a collection of sequences, and identify multi-MUMs while computing the suffix array (SA), longest common prefix (LCP) array, and Burrows-Wheeler transform (BWT) of the input collection.
mumemto mum -o <output_prefix> [input_fasta [...]]
Alternatively, you can find all multi-MEMs:
mumemto mem -o <output_prefix> [input_fasta [...]]
The commands above take a list of FASTA files as positional arguments and generate output files using the output prefix. Alternatively, you can provide a file-list, which specifies a list of FASTA files and which document/class each file belongs to. Passing in FASTA files as positional arguments will auto-generate a file-list that defines the order of the sequences.
Use the -h flag to list the options for each mode, for example: mumemto mum -h
Mumemto mode options enable the computation of different classes of exact matches: -k allows for partial multi-MUMs and multi-MEMs (appearing in at least N-k of the N input sequences), and --rare k finds multi-MEMs that appear at most k times in each sequence (can be used with -k to find rare partial multi-MEMs).
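The two options can be read as conditions on how often a candidate match occurs in each sequence. The toy predicate below is only a sketch of that reading, not Mumemto's implementation: it looks at occurrence counts alone and ignores maximality and strand. The names `counts`, `max_missing`, and `max_per_seq` are hypothetical and stand in for the per-sequence occurrence counts, the value given to -k, and the value given to --rare.

```python
from typing import Sequence


def passes_count_filter(
    counts: Sequence[int],
    max_missing: int = 0,   # hypothetical stand-in for the value passed to -k
    max_per_seq: int = 1,   # hypothetical stand-in for the value passed to --rare
) -> bool:
    """Check the occurrence-count conditions described above.

    counts[i] is the number of times a candidate match occurs in sequence i.
    The match must be present in at least N - max_missing sequences and may
    occur at most max_per_seq times in any single sequence.
    """
    n = len(counts)
    present = sum(1 for c in counts if c > 0)
    return present >= n - max_missing and all(c <= max_per_seq for c in counts)
```

With the defaults, the predicate accepts only matches that occur exactly once in every sequence, which corresponds to the strict multi-MUM condition.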
Format of the *.mums file:
[MUM length] [comma-delimited list of offsets within each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]
The *.mums file contains each MUM as a separate line, where the first value is the match length, the second is a comma-delimited list of positions where the match begins in each sequence, and the third is a comma-delimited list of strand indicators (+ or -). An empty entry indicates that the MUM was not found in that sequence (only applicable with the -k flag). The MUMs are sorted in the output file lexicographically based on the match sequence.
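A minimal sketch of reading this format in Python follows. It assumes the three fields are separated by whitespace (tabs or spaces) and that a sequence missing from a partial MUM has an empty entry in both the offset and strand lists. The names `Mum` and `parse_mums_line` are hypothetical helpers, not part of Mumemto.

```python
from typing import List, NamedTuple, Optional


class Mum(NamedTuple):
    length: int
    offsets: List[Optional[int]]   # None where the MUM is absent from a sequence
    strands: List[Optional[str]]   # "+" or "-", None where the entry is empty


def parse_mums_line(line: str) -> Mum:
    """Parse one line of a *.mums file (assumed whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    length = int(fields[0])
    # Empty entries between commas mark sequences without this MUM (-k only).
    offsets = [int(x) if x else None for x in fields[1].split(",")]
    strands = [s if s else None for s in fields[2].split(",")]
    return Mum(length, offsets, strands)
```

Applying `parse_mums_line` to every line of a *.mums file yields one record per MUM, with list positions following the order of the file-list.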
Format of the *.mems file:
[MEM length] [comma-delimited list of offsets for each occurrence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]
The *.mems file contains each MEM as a separate line with the following fields: (1) the match length, (2) a comma-delimited list of offsets within a sequence, (3) the corresponding sequence ID for each offset given in (2), and (4) the corresponding strand indicator (+ or -) for each offset given in (2). The MEMs are sorted in the output file lexicographically based on the match sequence.
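Because the offset, sequence ID, and strand lists are parallel, a line of this format can be unpacked into one record per occurrence. The sketch below assumes whitespace-separated fields and keeps sequence IDs as strings, since their exact numbering is not stated here. `MemHit` and `parse_mems_line` are hypothetical names used for illustration only.

```python
from typing import List, NamedTuple, Tuple


class MemHit(NamedTuple):
    seq_id: str   # sequence ID as written in the file
    offset: int   # start position of this occurrence within that sequence
    strand: str   # "+" or "-"


def parse_mems_line(line: str) -> Tuple[int, List[MemHit]]:
    """Parse one line of a *.mems file into (match length, occurrences)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    length = int(fields[0])
    offsets = [int(x) for x in fields[1].split(",")]
    seq_ids = fields[2].split(",")
    strands = fields[3].split(",")
    # The three lists describe the same occurrences, so lengths must agree.
    if not (len(offsets) == len(seq_ids) == len(strands)):
        raise ValueError(f"field lengths disagree: {line!r}")
    hits = [MemHit(s, o, t) for s, o, t in zip(seq_ids, offsets, strands)]
    return length, hits
```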
Example of file-list file:
/path/to/ecoli_1.fna 1
/path/to/salmonella_1.fna 2
/path/to/bacillus_1.fna 3
/path/to/staph_2.fna 4
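A file-list in this layout can be generated with a few lines of Python. The sketch below assumes one document ID per file, numbered from 1 in the order given and separated from the path by a single space, as in the example above. `filelist_lines` is a hypothetical helper, not part of Mumemto.

```python
from typing import Iterable, List


def filelist_lines(fasta_paths: Iterable[str]) -> List[str]:
    """Return one 'path document_id' line per FASTA file, with IDs from 1."""
    return [f"{path} {doc_id}" for doc_id, path in enumerate(fasta_paths, start=1)]


if __name__ == "__main__":
    # Example paths only; replace with real FASTA files.
    paths = ["/path/to/ecoli_1.fna", "/path/to/salmonella_1.fna"]
    print("\n".join(filelist_lines(paths)))
```

To place several files in the same document/class, the ID column would need to be assigned by hand rather than by enumeration.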
Potato pangenome (assemblies from [Tang et al., 2022])
Mumemto can visualize multi-MUMs in a synteny-like format, highlighting conservation and genomic structural diversity within a collection of sequences. After running mumemto mum on a collection of FASTAs, you can generate a visualization using:
/path/to/mumemto_repo/analysis/viz_mums.py (-i PREFIX | -m MUMFILE)
Use viz_mums.py -h to see the customization options. As of now, only strict and partial multi-MUMs are supported (rare multi-MEM support coming soon).