MITochondrial NANopore reads EXtractor

🧐 About

MITNANEX's main purpose is to extract mitochondrial Nanopore reads de novo from whole-genome sequencing (WGS) data, with no need for seeds or reference sequences. It also returns a draft assembly of the mitogenome built with Flye.

🏁 Getting Started

Installing

First, clone this repository and add it to your PATH:

  git clone https://github.com/juanjo255/MITNANEX.git; cd MITNANEX; export PATH=$(pwd):$PATH

Conda/mamba

The easiest way to install MITNANEX's dependencies is through a conda/mamba environment. You must first have Rust installed (https://www.rust-lang.org/tools/install).

For Mac M1 using mamba (you can change it for conda):

CONDA_SUBDIR=osx-64 mamba create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
mamba activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex

If the pip package utils-mitnanex causes problems, uninstall it and build it from source instead:

pip uninstall utils-mitnanex
cd src/utils_rs; maturin develop

For Linux

conda create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
conda activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex

Dependencies

MITNANEX needs the following tools:

Seqkit
Seqtk
fpa
Minimap2
Miniasm
Flye
Pandas
Gfastats
Samtools
Filtlong
Maturin
Biopython
scikit-learn
utils-mitnanex

Notes:

This has only been tested on macOS (M1) using an x86 environment.
setup.sh will create a mamba environment with all the dependencies listed in the .yml file.

🎈 Usage

Quick start

./mitnanex_cli.sh -i path/to/fastQ  -p 15000 -m 1000 -t 8 -s 0.6 -g GenomeSize(g|m|k) -w path/to/output

Notes:

Only FASTQ files are accepted as input.

To print the help message:
```
./mitnanex_cli.sh -h
```
```
  Options:
      -i        Input file. [required]
      -t        Threads. [4].
      -p        Proportion. For sampling. It can be a proportion or a number of reads (0.3|10000). [0.3].
      -m        Min-len. Filter reads by minimun length. Read seqkit seq documentation. [-1].
      -M        Max-len. Filter reads by maximun length. Read seqkit seq documentation. [-1].
      -w        Working directory. Path to create the folder which will contain all mitnanex information. [./mitnanex_results].
      -r        Prefix name add to every produced file. [input file name].
      -c        Coverage. Minimum coverage per cluster accepted. [-1].
      -d        Different output directory. Create a different output directory every run (it uses the date and time). [False]
      -s        Mapping identity. Minimun identity between two reads to be store in the same cluster.[0.6]
      -q        Min mapping quality (>=). This is for samtools. [-1].
      -f        Flye mode. [--nano-hq]
      -g        GenomeSize. This is your best estimation of the mitogenome for read correction with Canu. [required]
      *         Help.
```
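
The `-p` option accepts either a proportion or an absolute number of reads. One way such a value could be told apart is sketched below in Python. The rule used here (values below 1 are proportions, anything else must be a whole number of reads) and the function name `parse_sampling_value` are assumptions made for illustration, not necessarily how mitnanex_cli.sh decides.

```python
def parse_sampling_value(value):
    """Classify a -p style value as a proportion or a read count.

    Assumption for this sketch: 0 < value < 1 means a proportion of the
    reads, and any value >= 1 must be a whole number of reads.
    """
    number = float(value)
    if number <= 0:
        raise ValueError("sampling value must be positive")
    if number < 1:
        return ("proportion", number)
    if not number.is_integer():
        raise ValueError("a read count must be a whole number")
    return ("reads", int(number))


# The two forms shown in the help message, plus the quick-start value.
default_choice = parse_sampling_value("0.3")
count_choice = parse_sampling_value("10000")
quick_start_choice = parse_sampling_value("15000")
```
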
Algorithm overview

How does MITNANEX work?
- MITNANEX is a pipeline that depends on other open source tools (see dependencies).
- Throughout this overview, the results shown come from the assembly of the Talaromyces santanderensis mitogenome with MITNANEX, using a Nanopore run performed at EAFIT University.
- First, it uses seqkit and seqtk to subsample the reads. MITNANEX then runs minimap2 to find overlaps between reads and groups the reads that share at least a certain level of identity (tweakable parameter). Each read counts towards the "coverage" of its group, and each cluster is represented only by its longest read.
- Once all reads are grouped, MITNANEX keeps only the groups with the highest coverage (at least 3; tweakable parameter). Given the short length of the mitochondrial genome and its high coverage in WGS data, we expect most of it to be contained in these clusters.
- For the selected clusters, MITNANEX takes the representative read of each cluster and computes its trinucleotide (codon) composition, normalized by the read length. It then reduces this composition to 2 dimensions with PCA, following the classic strategy used in metagenomic binning. Given the difference between the mitochondrial and nuclear genomes, we expect the mitochondrial reads to have an oligonucleotide composition different enough to be separated from the nuclear reads. The known sensitivity of k-means to outliers made this clustering algorithm an attractive choice. K-means is therefore run with k set to 2, and the cluster with the highest coverage is selected. Below, the cluster in yellow was selected.
- With the reads collected from the selected clusters, miniasm assembles unitigs, in which we expect to recover most of the mitogenome (or even more than its length, given the problems that miniasm has). Miniasm is useful at this step for 2 main reasons:
  1. It can work with low coverage.
  2. It's extremely fast and the unitigs produced are enough for the next step.
- The unitigs are used to collect more reads from the total of reads to perform a final assembly with Flye.
- Flye is arguably the assembler par excellence for Nanopore reads, and it is among the best at circularizing genomes, an important characteristic for mitochondrial genomes.
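
The grouping, selection and composition steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MITNANEX's actual implementation: the function names, the `(read_a, read_b, identity)` tuples standing in for minimap2 overlaps, the transitive grouping and the use of overlapping 3-mers are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import product


def group_reads(overlaps, read_lengths, min_identity=0.6):
    """Group reads linked by overlaps with identity >= min_identity.

    overlaps: iterable of (read_a, read_b, identity) tuples (hypothetical format).
    read_lengths: dict mapping read name -> length in bases.
    """
    parent = {}

    def find(read):
        parent.setdefault(read, read)
        while parent[read] != read:
            parent[read] = parent[parent[read]]  # shorten the chain as we go
            read = parent[read]
        return read

    for read_a, read_b, identity in overlaps:
        root_a, root_b = find(read_a), find(read_b)
        if identity >= min_identity and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    members = {}
    for read in list(parent):
        members.setdefault(find(read), []).append(read)

    # Coverage is the number of reads; the representative is the longest read.
    return [
        {
            "reads": sorted(reads),
            "coverage": len(reads),
            "representative": max(reads, key=lambda r: read_lengths[r]),
        }
        for reads in members.values()
    ]


def top_clusters(clusters, keep=3):
    """Keep the `keep` clusters with the highest coverage."""
    return sorted(clusters, key=lambda c: c["coverage"], reverse=True)[:keep]


def trinucleotide_profile(sequence):
    """Count overlapping 3-mers and normalize each count by the read length."""
    kmers = ["".join(p) for p in product("ACGT", repeat=3)]
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    length = len(sequence) or 1  # avoid dividing by zero on empty input
    return [counts[kmer] / length for kmer in kmers]


# Toy data, made up for the example.
overlaps = [("r1", "r2", 0.9), ("r2", "r3", 0.7), ("r4", "r5", 0.8), ("r6", "r7", 0.3)]
read_lengths = {"r1": 1200, "r2": 5000, "r3": 800, "r4": 3000,
                "r5": 2500, "r6": 900, "r7": 1000}

clusters = group_reads(overlaps, read_lengths, min_identity=0.6)
best = top_clusters(clusters, keep=2)
profile = trinucleotide_profile("ACGTACGT")
```

In the pipeline, the 64-value profiles of the representative reads would then be passed to PCA and k-means (for instance scikit-learn's `PCA` and `KMeans`, since scikit-learn is among the dependencies); that part is left out here to keep the sketch free of third-party imports.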

Testing

So far, it has only been tested on fungi. For animals and plants, it would probably require some adjustments.

The following datasets were retrieved from the NCBI, assembled with MITNANEX and validated with blastn against the NCBI. All of them reached 100% query coverage with identity >99%.

Datasets tested:

Ascochyta lentis: SRR14075486
Aspergillus fumigatus: ERR10820709
Candida auris: SRR25455202
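
As an illustration of the validation criterion, the sketch below filters blastn tabular output for hits with 100% query coverage and identity above 99%. It assumes that "100% query" refers to query coverage and that blastn was run with `-outfmt "6 qseqid sseqid pident qcovs"`. The function name `passing_hits` and the sample lines are made up for the example, and this check is not part of MITNANEX itself.

```python
def passing_hits(blast_lines, min_identity=99.0, min_query_cover=100.0):
    """Return hits with identity > min_identity and coverage >= min_query_cover.

    Each line is expected to hold four tab-separated fields:
    qseqid, sseqid, pident, qcovs (an assumption about the chosen outfmt).
    """
    hits = []
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, qcovs = line.split("\t")[:4]
        if float(pident) > min_identity and float(qcovs) >= min_query_cover:
            hits.append((qseqid, sseqid, float(pident), float(qcovs)))
    return hits


# Hypothetical blastn lines, not real results.
example = [
    "contig_1\tsubject_A\t99.85\t100",
    "contig_1\tsubject_B\t97.10\t100",
    "contig_1\tsubject_C\t99.90\t82",
]
validated = passing_hits(example)
```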
