Pipeline

Rhett M. Rautsaw & Pedro G. Nachtigall

MitoSIS (Mitochondrial Species Identification System) is a wrapper for mitochondrial genome assembly and identification of sample contamination or mislabeling. Specifically, MitoSIS maps raw or trimmed reads to a database of reference mitochondrial sequences and uses Kallisto to calculate the percentage of reads that map to each species, which helps assess potential sample contamination. It then uses MITGARD to assemble and MitoZ to annotate the full mitochondrial genome, and BLASTs the resulting mitogenome or barcoding genes (e.g., CYTB, COX1, ND4, 16S) against the reference database to check for sample mislabeling. Finally, MitoSIS uses MAFFT and IQ-TREE to calculate phylogenetic distance to closely related species.

Pipeline

1. Map fastq reads to reference fasta using kallisto
   - Calculate total reads/tpm for each species in database
2. Identify the best reference sequence
3. Assemble the mitogenome using MITGARD
4. Annotate mitogenome using MitoZ
   - Extract protein coding/barcoding genes
5. Blast mitogenome or genes to reference database
   - Calculate mean percent identity for each species
6. Align sequences and build phylogeny
   - Calculate mean/minimum phylogenetic distance for each species
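
To make the first step concrete, here is a minimal sketch of how per-species read totals could be derived from kallisto output. It is not MitoSIS's own code. It assumes kallisto's standard `abundance.tsv` columns (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`) and a tab-delimited taxa id (.sp) file that maps each sequence id to a species. The function name `reads_per_species` and all numbers are hypothetical.

```python
import csv
import io
from collections import defaultdict


def reads_per_species(abundance_tsv, taxa_tsv):
    """Return (species, est_counts, tpm, percent_of_reads) tuples, largest first."""
    # Map each sequence id to its species using the tab-delimited .sp content.
    species_of = {}
    for line in taxa_tsv.splitlines():
        if line.strip():
            seq_id, _, species = line.partition("\t")
            species_of[seq_id] = species.strip()

    # Sum estimated counts and tpm over all reference sequences of a species.
    counts = defaultdict(float)
    tpm = defaultdict(float)
    for row in csv.DictReader(io.StringIO(abundance_tsv), delimiter="\t"):
        species = species_of.get(row["target_id"], "unknown")
        counts[species] += float(row["est_counts"])
        tpm[species] += float(row["tpm"])

    total = sum(counts.values())
    rows = [
        (sp, counts[sp], tpm[sp], 100.0 * counts[sp] / total if total else 0.0)
        for sp in counts
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Made-up numbers for illustration only (not real MitoSIS output).
EXAMPLE_ABUNDANCE = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "ID_1\t16000\t15800\t60\t600000\n"
    "ID_2\t16000\t15800\t30\t300000\n"
    "ID_3\t16000\t15800\t10\t100000\n"
)
EXAMPLE_TAXA = "ID_1\tGenus alpha\nID_2\tGenus alpha\nID_3\tGenus beta\n"

for species, n_reads, tpm_sum, pct in reads_per_species(EXAMPLE_ABUNDANCE, EXAMPLE_TAXA):
    print(f"{species}\t{n_reads:.0f}\t{tpm_sum:.0f}\t{pct:.1f}%")
```

A sample whose reads split substantially across unrelated species would be a candidate for closer inspection. Where to draw that line is a judgment call that this sketch does not make.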

Arguments

flag	description
-h, --help	Show this help message and exit.
-f1, --fastq1	fastq read pair 1 (forward). Default: None
-f2, --fastq2	fastq read pair 2 (reverse). Default: None
-s, --single	single-end fastq. Default: None
-r, --reference	`genbank` OR `fasta+sp` database. Default: None. See the section below on fasta & custom databases. We recommend downloading all mitochondrial data for your clade of interest (e.g., snakes; Genbank example) via Send to > Complete Record > Genbank.
-o, --output	Prefix for output files. Default: 'ZZZ'
-c, --cpu	Number of threads to be used in each step. Default: 8
-M, --memory	Max memory for Trinity (see Trinity for format). Default: '30G'
--clade	Clade used for MitoZ. Options: 'Chordata' or 'Arthropoda'. Default: 'Chordata'
--convert	Only perform Genbank to Fasta conversion and create a tab-delimited taxa id file
--version	Show program's version number and exit

Reference Databases

The user can download nucleotide sequences for the taxonomic group of interest from NCBI. For instance, the user can search for "snakes[porgn] AND mitochondrion[filter]" and send all complete records to a GenBank-formatted file. The GenBank format file is then passed to the -r option to serve as the reference in the MitoSIS pipeline.

The GenBank format file is converted into two files to generate a fasta+sp database, which is used in all steps of the MitoSIS workflow. To improve the reference database by adding custom/private sequences, see the section below.
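
For readers curious what such a conversion involves, the sketch below turns a GenBank flat file into fasta text plus tab-delimited id/species text using only the Python standard library. It illustrates the idea and is not the conversion MitoSIS performs. In particular, using the first ACCESSION value as the sequence identifier is an assumption, and the record shown is synthetic.

```python
def genbank_to_fasta_sp(genbank_text):
    """Return (fasta_text, sp_text) built from every record in a GenBank flat file."""
    fasta_records, sp_records = [], []
    accession = organism = None
    in_origin = False
    seq_parts = []

    for line in genbank_text.splitlines():
        if line.startswith("//"):
            # End of record: emit it if an id and a sequence were found.
            if accession and seq_parts:
                fasta_records.append(f">{accession}\n{''.join(seq_parts).upper()}\n")
                sp_records.append(f"{accession}\t{organism or 'unknown'}\n")
            accession = organism = None
            in_origin = False
            seq_parts = []
        elif in_origin:
            # Sequence lines look like "        1 acgtacgtac gtacgtacgt".
            seq_parts.extend(line.split()[1:])
        elif line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else None
        elif line.startswith("  ORGANISM"):
            organism = line[len("  ORGANISM"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True

    return "".join(fasta_records), "".join(sp_records)


# Synthetic record for illustration; EX000001 is not a real accession.
EXAMPLE_GB = (
    "LOCUS       EX000001      20 bp    DNA     linear   VRT 01-JAN-2000\n"
    "ACCESSION   EX000001\n"
    "SOURCE      mitochondrion Genus species\n"
    "  ORGANISM  Genus species\n"
    "            Eukaryota; Metazoa.\n"
    "ORIGIN\n"
    "        1 acgtacgtac gtacgtacgt\n"
    "//\n"
)

fasta_text, sp_text = genbank_to_fasta_sp(EXAMPLE_GB)
print(fasta_text, end="")
print(sp_text, end="")
```

For real databases, the `--convert` option described in the next section is the supported route.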

Fasta & Custom Reference Databases

Fasta reference databases must be accompanied by a tab-delimited taxa id (.sp) file. We refer to this combination of files as a fasta+sp database. The tab-delimited taxa id (.sp) file must occur in the same directory as the fasta file and have the same filename with .sp appended (i.e., ReferenceDB.fasta and ReferenceDB.fasta.sp).

If you have a Genbank database and only want to add additional or custom/private sequences, we recommend first running --convert.

MitoSIS.py -r ReferenceDB.gb --convert

--convert will convert your Genbank file to a fasta+sp database without running the rest of MitoSIS. Output will be:

ReferenceDB.fasta
ReferenceDB.fasta.sp

With the initial fasta+sp database created, manually add your additional or custom sequences to the fasta file and the identifier/taxa information to the .sp file.
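
If you prefer to script this step rather than edit the files by hand, a small helper along these lines could append one sequence to both files at once. This is a sketch under assumptions. `add_custom_sequence` is a hypothetical helper, not part of MitoSIS, and it only relies on the convention described in this README: the .sp file sits next to the fasta file and shares its name with `.sp` appended.

```python
import tempfile
from pathlib import Path


def _append_line(path, text):
    """Append text as its own line, adding a newline first if the file lacks one."""
    needs_newline = False
    if path.exists() and path.stat().st_size > 0:
        with path.open("rb") as handle:
            handle.seek(-1, 2)  # jump to the last byte of the file
            needs_newline = handle.read(1) != b"\n"
    with path.open("a") as handle:
        handle.write(("\n" if needs_newline else "") + text + "\n")


def add_custom_sequence(fasta_path, seq_id, species, sequence):
    """Append one record to <fasta> and its taxon to <fasta>.sp."""
    fasta = Path(fasta_path)
    taxa = Path(str(fasta) + ".sp")
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError("identifier must be non-empty and free of whitespace")
    existing = set()
    if taxa.exists():
        existing = {line.partition("\t")[0] for line in taxa.read_text().splitlines()}
    if seq_id in existing:
        raise ValueError(f"identifier already present: {seq_id}")
    _append_line(fasta, f">{seq_id}\n{sequence}")
    _append_line(taxa, f"{seq_id}\t{species}")


# Demonstration on throwaway files; names and sequences are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "ReferenceDB.fasta"
    fasta.write_text(">ID_1\nACTG\n")
    Path(str(fasta) + ".sp").write_text("ID_1\tGenus species")  # no trailing newline
    add_custom_sequence(fasta, "Custom_1", "Genus species", "ACTGACTG")
    fasta_after = fasta.read_text()
    sp_after = Path(str(fasta) + ".sp").read_text()

print(fasta_after, end="")
print(sp_after, end="")
```

Rejecting duplicate or whitespace-containing identifiers up front keeps the two files consistent with the format rules in the next section.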

`fasta+sp` Format

Each fasta sequence must have a unique identifier (similar to a Genbank Accession Number), and that identifier must match an entry in the tab-delimited taxa id file. Make sure the fasta headers contain no descriptions (i.e., no spaces " " in the header, only the sequence id).

{ReferenceDB}.fasta

>ID_1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ID_2
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG

{ReferenceDB}.fasta.sp

ID_1    Genus species
ID_2    Genus species
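
The rules above are easy to check before a run. The following sketch is a hypothetical helper (`check_fasta_sp`), not something MitoSIS ships. It reports fasta headers containing whitespace, duplicate identifiers, malformed .sp lines, and identifiers present in only one of the two files. It assumes the .sp file uses a single tab between the identifier and the species name.

```python
def check_fasta_sp(fasta_text, sp_text):
    """Return a list of problems in a fasta+sp pair; an empty list means none found."""
    problems = []

    # Collect fasta headers and flag any that are empty or contain whitespace.
    fasta_ids = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        header = line[1:]
        if not header or any(ch.isspace() for ch in header):
            problems.append(f"fasta header is empty or contains whitespace: {line!r}")
            continue
        if header in fasta_ids:
            problems.append(f"duplicate fasta identifier: {header}")
        fasta_ids.append(header)

    # Collect identifiers from the .sp file, expecting 'id<TAB>species' per line.
    sp_ids = set()
    for number, line in enumerate(sp_text.splitlines(), start=1):
        if not line.strip():
            continue
        seq_id, tab, species = line.partition("\t")
        if not tab or not seq_id or not species.strip():
            problems.append(f".sp line {number} is not 'id<TAB>species': {line!r}")
            continue
        sp_ids.add(seq_id)

    # Every identifier should appear in both files.
    for seq_id in sorted(set(fasta_ids) - sp_ids):
        problems.append(f"{seq_id} is in the fasta but not in the .sp file")
    for seq_id in sorted(sp_ids - set(fasta_ids)):
        problems.append(f"{seq_id} is in the .sp file but not in the fasta")
    return problems


GOOD_FASTA = ">ID_1\nACTGACTG\n>ID_2\nACTGACTG\n"
GOOD_SP = "ID_1\tGenus species\nID_2\tGenus species\n"
print(check_fasta_sp(GOOD_FASTA, GOOD_SP))

BAD_FASTA = ">ID_1 some description\nACTG\n>ID_2\nACTG\n>ID_2\nACTG\n"
BAD_SP = "ID_2\tGenus species\nID_3\tGenus species\n"
for problem in check_fasta_sp(BAD_FASTA, BAD_SP):
    print(problem)
```

To check real files, read them with `open(path).read()` and pass the contents to the function.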

Installation

System Requirement

Linux

Conda Installation

# Clone this Repository
git clone https://github.com/RhettRautsaw/MitoSIS.git
cd MitoSIS
echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile

# Clone MITGARD Repository 
git clone https://github.com/pedronachtigall/MITGARD.git
# Fix shebangs in MITGARD supporting scripts
sed -i '1 s/^.*$/\#\!\/usr\/bin\/env python/' MITGARD/bin/sam2msa.py
sed -i '1 s/^.*$/\#\!\/usr\/bin\/env python/' MITGARD/bin/msa2consensus.py
echo "export PATH=\$PATH:$PWD/MITGARD/bin" >> ~/.bash_profile

# Clone MitoZ Repository
git clone https://github.com/linzhi2013/MitoZ.git
tar -jxvf MitoZ/version_2.4-alpha/release_MitoZ_v2.4-alpha.tar.bz2
echo "export PATH=\$PATH:$PWD/release_MitoZ_v2.4-alpha" >> ~/.bash_profile

# Make sure everything has proper permissions and source your bash_profile
chmod -R 755 *
source ~/.bash_profile

# Create Conda Environment
conda env create -f mitosis_env.yml
conda activate mitosis_env

# Install dfply
pip install dfply

# Install Taxonomy Database for MitoZ
python MITGARD/install_NCBITaxa.py

# YOU'RE READY TO GO
# Check if MitoSIS.py is in your path
MitoSIS.py -h

Example

Before running, we recommend testing MitoSIS with our Tutorial dataset.

We also recommend trimming your own data prior to running this program. An example using Trim-Galore is shown below. Depending on whether you are working with DNA or RNA-Seq data, you may want to change the length/quality parameters.

# Trimming
trim_galore --paired --phred33 --length 30 -q 20 -o 02_trim 00_raw/{}_F.fastq.gz 00_raw/{}_R.fastq.gz &> {}_tg.log

Below are outlines for running MitoSIS.

# MitoSIS - paired-end
MitoSIS.py -f1 {}_F_trim.fastq.gz -f2 {}_R_trim.fastq.gz -r ReferenceDB.gb -o {} -c 16 -M 55G &> MitoSIS.log

# MitoSIS - single
MitoSIS.py -s {}_merged.fastq.gz -r ReferenceDB.gb -o {} -c 16 -M 55G &> MitoSIS.log

# MitoSIS - paired-end & fasta+sp reference database
# NOTE: MitoSIS expects ReferenceDB.fasta.sp to occur in the same directory as ReferenceDB.fasta
MitoSIS.py -f1 {}_F_trim.fastq.gz -f2 {}_R_trim.fastq.gz -r ReferenceDB.fasta -o {} -c 16 -M 55G &> MitoSIS.log
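
The `{}` placeholders above stand for a sample name. When processing many samples, the command lines can be generated programmatically. The sketch below builds the argument list for each sample using only the flags documented in the Arguments section. The helper name `mitosis_command`, the sample names, and the file-naming pattern are assumptions carried over from the examples above.

```python
import shlex


def mitosis_command(sample, reference, cpu=16, memory="55G", paired=True):
    """Build the MitoSIS.py argument list for one sample."""
    if paired:
        reads = ["-f1", f"{sample}_F_trim.fastq.gz", "-f2", f"{sample}_R_trim.fastq.gz"]
    else:
        reads = ["-s", f"{sample}_merged.fastq.gz"]
    return ["MitoSIS.py", *reads, "-r", reference, "-o", sample,
            "-c", str(cpu), "-M", memory]


# Hypothetical sample names; replace with your own.
samples = ["sampleA", "sampleB"]
for sample in samples:
    cmd = mitosis_command(sample, "ReferenceDB.gb")
    # Print the command for review. To execute it instead, one option is
    # subprocess.run(cmd, check=True) from the standard library.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them, or to paste them into a job script, before committing compute time.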

Output

The user can find detailed results in MitoSIS_summary_output.html, including potential contamination, percent identity and alignment distance across genes, and all phylogenetic trees built. In addition, MitoSIS prints messages to the terminal during processing that summarize all results, which the user can also use to check the results. Find below an example of the printed message and detailed information about all files generated by the MitoSIS pipeline that can be analyzed afterwards by the user.
