MITNANEX's main purpose is to extract mitochondrial Nanopore reads de novo from whole-genome sequencing (WGS) data, with no need for seeds or reference sequences. It also returns a draft assembly of the mitogenome using Flye.
- First, clone this repository and add it to your PATH:
git clone https://github.com/juanjo255/MITNANEX.git; cd MITNANEX; export PATH=$(pwd):$PATH
The best way to install MITNANEX's dependencies is through a conda/mamba environment. First, you must have Rust installed (https://www.rust-lang.org/tools/install).
- For Mac M1 using mamba (you can swap in conda):
CONDA_SUBDIR=osx-64 mamba create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
mamba activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex
The pip module utils-mitnanex may cause problems. In that case, uninstall it and build it locally:
pip uninstall utils-mitnanex
cd src/utils_rs; maturin develop
- For Linux:
conda create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
conda activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex
MITNANEX needs the following tools:
- Seqkit
- Seqtk
- fpa
- Minimap2
- Miniasm
- Flye
- Pandas
- Gfastats
- Samtools
- Filtlong
- Maturin
- Biopython
- scikit-learn
- utils-mitnanex
Notes:
- This has only been tested on macOS M1 using an x86 environment architecture.
- setup.sh will create a mamba environment with all the dependencies in the .yml file.
- Quick start
./mitnanex_cli.sh -i path/to/fastQ -p 15000 -m 1000 -t 8 -s 0.6 -g GenomeSize(g|m|k) -w path/to/output
Notes:
- It only accepts FASTQ files.
- For the help message:
./mitnanex_cli.sh -h
Options:
-i  Input file. [required]
-t  Threads. [4]
-p  Proportion. For sampling. It can be a proportion or a number of reads (0.3|10000). [0.3]
-m  Min-len. Filter reads by minimum length. Read seqkit seq documentation. [-1]
-M  Max-len. Filter reads by maximum length. Read seqkit seq documentation. [-1]
-w  Working directory. Path to create the folder which will contain all mitnanex information. [./mitnanex_results]
-r  Prefix name added to every produced file. [input file name]
-c  Coverage. Minimum coverage per cluster accepted. [-1]
-d  Different output directory. Create a different output directory every run (it uses the date and time). [False]
-s  Mapping identity. Minimum identity between two reads to be stored in the same cluster. [0.6]
-q  Min mapping quality (>=). This is for samtools. [-1]
-f  Flye mode. [--nano-hq]
-g  GenomeSize. This is your best estimation of the mitogenome for read correction with Canu. [required]
*   Help.
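When running MITNANEX over several datasets, it can help to build the command line programmatically. The following is a minimal sketch that only uses flags listed in the help message above; the function name `build_mitnanex_command`, the keyword names, and the example values (`reads.fastq`, `30k`) are hypothetical placeholders, not part of MITNANEX.

```python
import shlex

# Hypothetical keyword names mapped to flags documented in the help message.
FLAGS = {
    "proportion": "-p",
    "min_len": "-m",
    "max_len": "-M",
    "threads": "-t",
    "identity": "-s",
    "coverage": "-c",
    "min_mapq": "-q",
    "workdir": "-w",
    "prefix": "-r",
}


def build_mitnanex_command(fastq, genome_size, **options):
    """Return the argument list for mitnanex_cli.sh (-i and -g are required)."""
    command = ["./mitnanex_cli.sh", "-i", str(fastq), "-g", str(genome_size)]
    for name, value in options.items():
        if name not in FLAGS:
            raise ValueError(f"unknown option: {name}")
        command += [FLAGS[name], str(value)]
    return command


if __name__ == "__main__":
    # Placeholder input file and genome size, for illustration only.
    argv = build_mitnanex_command("reads.fastq", "30k", proportion=15000, threads=8)
    print(shlex.join(argv))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting problems with paths that contain spaces.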
- MITNANEX is a pipeline that depends on other open source tools (see dependencies).
- Throughout this section I show results from assembling the Talaromyces santanderensis mitogenome with MITNANEX, using a Nanopore run performed at EAFIT University.
- First, it uses seqkit and seqtk to subsample the reads. MITNANEX then runs minimap2 to find overlaps between reads and groups reads that share at least a certain level of identity (tweakable parameter). Each read counts toward the "coverage" of its group, and each cluster is represented only by its longest read.
- Once all reads are grouped, MITNANEX keeps only the groups with the highest coverage, at least 3 of them (tweakable parameter). Given the short length of the mitochondrial genome and its high coverage during WGS, we expect most of it to be in these clusters.
- With the selected clusters, MITNANEX takes the representative read of each cluster and computes its trinucleotide composition (codon), which is normalized by the read length and then reduced to 2 dimensions with a PCA, as in the classic strategy used in metagenomic binning. Given the difference between the mitochondrial and nuclear genomes, we expect the mitochondrial reads to have an oligonucleotide composition different enough to be separated from the nuclear ones. The known sensitivity of K-means to outliers made this clustering algorithm an attractive choice. K-means is therefore run with k set to 2, and the cluster with the highest coverage is selected. In the figure for this step, the cluster in yellow was selected.
- With the reads collected from the selected clusters, miniasm assembles unitigs, which we expect to cover most of the mitogenome (or even more than its length, given the problems that miniasm has). Miniasm is useful in this step for 2 main reasons:
- It can work with low coverage.
- It's extremely fast and the unitigs produced are enough for the next step.
- The unitigs are used to collect more reads from the full read set, and these reads go into a final assembly with Flye.
- Flye is almost the assembler par excellence for Nanopore reads, and it's among the best at circularizing genomes, an important characteristic for mitochondrial genomes.
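The read-selection steps described above (grouping by overlap identity, keeping the highest-coverage groups, and separating representatives by trinucleotide composition) can be sketched in plain Python. This is an illustration under stated assumptions, not MITNANEX's actual code: identity is approximated as matches divided by alignment block length (PAF columns 10 and 11), groups are formed with a union-find over overlapping pairs, the PCA step is left out to keep the sketch dependency-free, and a tiny hand-written 2-means stands in for a library K-means. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def cluster_reads(paf_lines, min_identity=0.6):
    """Group reads linked by overlaps with identity >= min_identity."""
    parent, length = {}, {}

    def find(read):
        while parent[read] != read:  # union-find lookup with path halving
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete PAF record
        query, target = cols[0], cols[5]
        length[query], length[target] = int(cols[1]), int(cols[6])
        parent.setdefault(query, query)
        parent.setdefault(target, target)
        matches, block_len = int(cols[9]), int(cols[10])
        if query != target and block_len and matches / block_len >= min_identity:
            parent[find(query)] = find(target)

    groups = defaultdict(list)
    for read in parent:
        groups[find(read)].append(read)
    return [
        {
            "representative": max(members, key=lambda r: (length[r], r)),
            "coverage": len(members),  # each read counts once
            "reads": sorted(members),
        }
        for members in groups.values()
    ]


def select_top_clusters(clusters, keep=3, min_coverage=-1):
    """Keep the `keep` clusters with the highest coverage."""
    eligible = [c for c in clusters if c["coverage"] >= min_coverage]
    return sorted(eligible, key=lambda c: c["coverage"], reverse=True)[:keep]


def trinucleotide_profile(sequence):
    """Return the 64 trinucleotide counts divided by the read length."""
    seq = sequence.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # windows with N or other symbols are skipped
            counts[kmer] += 1
    return [counts[k] / max(len(seq), 1) for k in TRINUCLEOTIDES]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=50):
    """Minimal K-means with k=2; starts from the first point and the one farthest from it."""
    centers = [points[0], max(points, key=lambda p: squared_distance(p, points[0]))]
    labels = None
    for _ in range(iterations):
        new_labels = [
            0 if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pick_cluster(profiles, coverages):
    """Return indices of the 2-means group with the highest summed coverage."""
    labels = two_means(profiles)
    totals = [0, 0]
    for label, coverage in zip(labels, coverages):
        totals[label] += coverage
    best = totals.index(max(totals))
    return [i for i, label in enumerate(labels) if label == best]
```

A typical call chain would be `cluster_reads` on the minimap2 PAF output, `select_top_clusters` on the result, then `trinucleotide_profile` on each representative read and `pick_cluster` on the profiles with their coverages. Summing coverage per K-means group is one reading of "the cluster with the highest coverage" and is an assumption of this sketch.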
So far, it has only been tested on Fungi. For animals and plants, I think it would require some adjustments.
The following datasets were retrieved from NCBI, assembled with MITNANEX, and validated with blastn against NCBI. All of them reached 100% query coverage with identity >99%.
- Datasets tested:
- Ascochyta lentis: SRR14075486
- Aspergillus fumigatus: ERR10820709
- Candida auris: SRR25455202