Skip to content
/ raspir Public

Raspir, the rare species identifier

License

Notifications You must be signed in to change notification settings

mmpust/raspir

Repository files navigation

Alt text

Background

In shotgun metagenomic sequencing experiments, the total DNA is extracted from complex samples. The DNA of rare species is scarce, and poor coverage of these genomes is often observed. Scientists commonly define thresholds to exclude 1-10% of the least abundant species in a given sample from further analyses. On the one hand, this filtering step allows for robust investigations of core communities. On the other hand, valuable information on the community structure is lost. The rare biosphere harbours more species than the core microbial community and hence provides the environment of interest with high functional flexibility.

However, if only a few short DNA reads are detected that are unique to species A, there are at least three explanations:
a) Sample contamination;
b) Rare species A was present in the environment of interest, so it is a true positive species. In this case, the reads are expected to spread across the entire reference genome in a fairly uniform manner due to random DNA sequencing; or
c) Rare species A was absent (false positive) but rare species B was present, which acquired genes of species A during past events. In this case, the reads are expected to cluster at specific sides of the reference genome of species A.

The raspir tool calculates a position-domain signal based on the distances of reads aligning to a circular reference genome and converts the information into a frequency signal through Discrete Fourier Transforms (DFT). In addition, a reference frequency signal is constructed with the same number of reads, but with an ideal uniform distribution of reads across the reference genome. Both frequency signals are compared using Pearson's correlation measures.

Implementation

Raspir is implemented in Python 3.7. Using this tool, a real-world dataset (5 GB) containing information on hundreds of species can be processed on a single node server. The input data must be structured in the following manner: genome length of the corresponding reference genome, organism, read position, read depth.

See the following section for further information on how to convert .FASTQ files into the .CSV input files to sucessfully execute raspir.

Get started

Set up the environment

Install conda packages

conda create --name raspir_env
conda activate raspir_env

# As combined commands: 
conda install -c bioconda trimmomatic samtools bwa 

# Install trimmomatic [1] 
conda install -c bioconda trimmomatic
# Install samtools [2] 
conda install -c bioconda samtools

# Install your alignment tool of choice
# Burrows-Wheeler aligner [3]
conda install -c bioconda bwa
# Bowtie2 [4]
conda install -c bioconda/label/cf201901 bowtie2

# Install python packages for raspir
conda install pandas
conda install -c conda-forge statsmodels matplotlib -y

Create working directory

mkdir raspir/
cd raspir/

YOURPATH=$PWD
echo "$YOURPATH"

mkdir reference_database/
mkdir run_raspir/

Set-up the reference database

The reference database has to be downloaded and indexed only once. The database should contain only complete, circular genomes with one strain per species. Note: You may use a customised reference database. It is however strongly recommended to avoid draft or low-quality reference genomes and use complete sequences of circular microorganisms only.

# Load database into your working directory
cd reference_database/
# Get high quality (default) database from Wochenende tool:
https://drive.google.com/drive/folders/1q1btJCxtU15XXqfA-iCyNwgKgQq0SrG4
# Source: https://github.com/MHH-RCUG/nf_wochenende.git

# Unzip the reference database 
gunzip 2021_12_human_bact_arch_fungi_vir.fa.gz

# Generate an index of the reference fasta depending on the alignment tool of your choice
samtools faidx 2021_12_human_bact_arch_fungi_vir.fa.gz
bwa index 2021_12_human_bact_arch_fungi_vir.fa.gz
bowtie2-build 2021_12_human_bact_arch_fungi_vir.fa.gz 2021_12_human_bact_arch_fungi_vir
cd ..

Data cleaning, alignment & sorting

Trimmomatic