RNA-seq is a special type of sequencing that aims to look at the expression patterns of genes. It focuses primarily on sequencing messenger RNA (mRNA), which will be translated into an amino acid sequence and yield a protein as the final product. This technique facilitates the observation of transcripts resulting from alternative splicing, post-transcriptional modification, gene changes and single nucleotide polymorphisms.
RNA-seq differential expression analysis practical demostration for undergrad students. This demonstration uses data from Rossini et al, 2019.
In short the data comes from Galdieria sulphuraria, a polyextremophile unicellular red alga that lives in hot, acidic, toxic metal-rich, volcanic environments, in Rossini's study that alga was expossed to cold stress to investigate the transcriptional changes this stress induced in the alga. Other stuff is evaluated regarding horizontal gene transfer from bacteria/archea to Galdieria sulphuraria. Reading the paper is strongly suggested.
In this demonstration a differential expression analysis between a "hot" normal state vs a "cold" state of Galdieria sulphuraria is done. The purpose of this is to confirm some of the observations reported in Rossini's paper. It should be considered that 5 out of 25 libraries available from Rossini's data are used in this demonstration 3 from a cold state and 2 from a hot state.
Note: This pipeline is probably far from ideal and is just presented as a good enough demonstration of the concept.
- OS:Linux (Debian, Ubuntu and similar), emulated Linux enviroments should be fine
- Software: conda or mamba package manager, and bash
- Disk space: around 10 Gb
- RAM: 4Gb (estimated), 8Gb should be enough
Some of the bash scripts(".sh" files) have a "threads" variable set to 4, you should change this paramater so it fits all the threads your machine has. The pipeline was tested with 4 threads, ~33 mins of runtime and ~2Gb of RAM usage were recorded .
First, you should download this repository and then open a terminal in the downloaded repository. Then you have to create a conda environment from the .yml file, this environment will contain the software required to run this demonstration. For that you can run the following command:
conda env create -f Workshop_conda_environment.yml
If you have mamba you may use:
mamba env create -f Workshop_conda_environment.yml
Before running the next scripts you should activate the created environment with
conda activate work_comp
First the genome fasta file and the annotation gtf file of Galdieria sulphuraria is downloaded. This data comes from the following NCBI assembly accession: GCF_000341285.1. To do this you can run the following command:
bash 1-Download_genome.sh
This will create a "genome" folder containing these files:
Galdieria_sulphuraria.fasta
Galdieria_sulphuraria.gtf
Then the RNA-seq pair-end data is downloaded. For this step you need ~7Gb of disk space and it may take quite a while. To do this you can run the following command:
2-Download_RNA-seq_data.sh
This will create a "data" folder containing these files:
SRR7734599_GSM3349831_Cold_T12_3_Rep_A_Galdieria_sulphuraria_RNA-Seq_1.fastq.gz
SRR7734599_GSM3349831_Cold_T12_3_Rep_A_Galdieria_sulphuraria_RNA-Seq_2.fastq.gz
SRR7734600_GSM3349832_Cold_T12_3_Rep_B_Galdieria_sulphuraria_RNA-Seq_1.fastq.gz
SRR7734600_GSM3349832_Cold_T12_3_Rep_B_Galdieria_sulphuraria_RNA-Seq_2.fastq.gz
SRR7734601_GSM3349833_Cold_T12_3_Rep_C_Galdieria_sulphuraria_RNA-Seq_1.fastq.gz
SRR7734601_GSM3349833_Cold_T12_3_Rep_C_Galdieria_sulphuraria_RNA-Seq_2.fastq.gz
SRR7734607_GSM3349839_Hot_T12_6_Rep_A_Galdieria_sulphuraria_RNA-Seq_1.fastq.gz
SRR7734607_GSM3349839_Hot_T12_6_Rep_A_Galdieria_sulphuraria_RNA-Seq_2.fastq.gz
SRR7734608_GSM3349840_Hot_T12_6_Rep_B_Galdieria_sulphuraria_RNA-Seq_1.fastq.gz
SRR7734608_GSM3349840_Hot_T12_6_Rep_B_Galdieria_sulphuraria_RNA-Seq_2.fastq.gz
After completing RNA-seq data download some Quality control of the data is advised, for this fastqc and Multiqc fit this purpose well. To do the quality control you can run the following command:
bash 3-QC_RNA-seq_data.sh
This will create a "qc" folder and within it a "multiqc" folder(among other files) containing these files:
multiqc_data
multiqc_report.html
The importan file is "multiqc_report.html" which summarizes all the quality control metrics from the RNA-seq fastq files.
In this step a reference needed to run RSEM and bowtie2 is created. To do this you can run the following command:
4-RSEM_reference.sh
This will create a "rsem_ref" folder containing the reference files created from the genome file and the annotation file.
In the next step RSEM with the aid of bowtie2 aligns the RNA-seq reads to the genome and then generates gene counts. This step is the most computationally intensive of the whole pipeline so it will take quite a while, patience is adviced. To do this step you can run the following command:
bash 5-RSEM_run.sh
Thus creating a "rsem_run" folder containing gene count results, in particular .genes.results files contain the data used in the next step:
In this step a reference needed to run RSEM and bowtie2 is created. To do this you can run the following command:
bash 6-DEseq2_run.sh
This will create a "DEseq" folder. In this folder, the most important are "qc" and "out". The first one will contain the QC of the libraries confirming the validity of the analysis. The second one contains the differential gene expression results, in particular "differential_expression.csv"
This workshop was heavily inspired by NASA'sGeneLab bioinformatics processing pipeline for Illumina RNA-sequencing data and this Differential gene expression workshop
Alessandro W Rossoni, Gerald Schönknecht, Hyun Jeong Lee, Ryan L Rupp, Samantha Flachbart, Tabea Mettler-Altmann, Andreas P M Weber, Marion Eisenhut. Cold Acclimation of the Thermoacidophilic Red Alga Galdieria sulphuraria: Changes in Gene Expression and Involvement of Horizontally Acquired Genes, Plant and Cell Physiology, Volume 60, Issue 3, March 2019, Pages 702–712, https://doi.org/10.1093/pcp/pcy240