
Pipelines for analysis of RNA sequencing data using bash scripting of command line tools, Python and R scripts

focyte/Bash-RNAseq

RNA Sequencing Analysis Pipelines

This GitHub repository contains two pipelines for RNA sequencing analysis. The Quality Control pipeline performs the initial analysis of RNA sequencing read data. The Main Pipeline aligns reads to a reference genome and counts reads per feature (gene). Each pipeline consists of a series of Bash scripts that automate key steps in RNA sequencing data analysis, along with additional Python and R scripts for downstream analysis.

Requirements

Software

Files

  • Sequencing read data in the fastq.gz format
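  • Index files for the reference genome of interest, in this case Human Genome hg38
  • Ideally build the index yourself. The mapping step uses HISAT2, so the index must be in HISAT2 format (built with hisat2-build); an index built by another aligner such as STAR cannot be used in its place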
  • A .gtf file of annotated features for your indexed genome
  • A splice site file for your indexed genome, which improves alignment accuracy across exon-exon boundaries
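
Before launching either pipeline, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of the repository: the function name `check_inputs` and its argument order are hypothetical, and it assumes the index was built for HISAT2, whose index files share a common prefix and end in `.ht2` (or `.ht2l` for large genomes).

```shell
#!/usr/bin/env bash
# Sketch: verify that the required input files exist before running a pipeline.
# check_inputs and its argument order are hypothetical, not part of the repository.

# Succeeds if the first argument is an existing path.
# An unmatched glob is passed through literally, so this fails when nothing matched.
first_match() {
    [ -e "$1" ]
}

check_inputs() {
    local input_dir="$1" index_prefix="$2" gtf_file="$3" splice_file="$4"
    local status=0

    # At least one gzipped FASTQ file must be present.
    if ! first_match "$input_dir"/*.fastq.gz; then
        echo "missing: no .fastq.gz files in $input_dir" >&2
        status=1
    fi
    # HISAT2 index files look like <prefix>.1.ht2 ... <prefix>.8.ht2
    if ! first_match "$index_prefix".*.ht2*; then
        echo "missing: no HISAT2 index files for prefix $index_prefix" >&2
        status=1
    fi
    # The annotation and splice site files must exist and be non-empty.
    [ -s "$gtf_file" ] || { echo "missing or empty: $gtf_file" >&2; status=1; }
    [ -s "$splice_file" ] || { echo "missing or empty: $splice_file" >&2; status=1; }

    return "$status"
}
```

A call such as `check_inputs reads/ index/hg38 hg38.gtf hg38_splice_sites.txt` (all paths are placeholders) returns a non-zero status and lists what is missing on standard error.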

Quality Control Usage

./runQC.sh <input_dir> <output_dir>

Individual Steps

  1. FastQC Analysis

    • Script: fastqc.sh
    • Usage:
    fastqc.sh "$INPUT_DIR" "$OUTPUT_DIR"
  2. Trimming with Trim Galore

    • Script: trim_fastq.sh
    • Usage:
    trim_fastq.sh "$INPUT_DIR" "$OUTPUT_DIR"
  3. FastQC Analysis on Trimmed Data

    • Script: fastqcTrimmed.sh
    • Usage:
    fastqcTrimmed.sh "$INPUT_DIR" "$OUTPUT_DIR"
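
The three steps above can be sketched for single-end data as follows. This is an illustration under stated assumptions, not the contents of the repository's scripts: the function name `qc_single_end`, the `run` helper, the `DRY_RUN` switch, and the output sub-directory names are all hypothetical. It relies on the `-o` output-directory option of `fastqc` and `trim_galore`, and on Trim Galore's default single-end output name, `<sample>_trimmed.fq.gz`.

```shell
#!/usr/bin/env bash
# Sketch of the three QC steps for single-end reads; not the repository's scripts.
# With DRY_RUN=1 (the default here) each command is printed instead of executed,
# so the sketch can be tried without FastQC or Trim Galore installed.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

qc_single_end() {
    local input_dir="$1" output_dir="$2"
    local fq base

    # Hypothetical layout: one sub-directory per step.
    run mkdir -p "$output_dir/fastqc_raw" "$output_dir/trimmed" "$output_dir/fastqc_trimmed"

    for fq in "$input_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue          # skip the literal pattern if nothing matched
        base=$(basename "$fq" .fastq.gz)

        # 1. FastQC on the raw reads
        run fastqc -o "$output_dir/fastqc_raw" "$fq"
        # 2. Adapter and quality trimming
        run trim_galore -o "$output_dir/trimmed" "$fq"
        # 3. FastQC on the trimmed reads (Trim Galore's default single-end name)
        run fastqc -o "$output_dir/fastqc_trimmed" "$output_dir/trimmed/${base}_trimmed.fq.gz"
    done
}
```

Calling `qc_single_end reads/ qc_out/` prints the planned commands; setting `DRY_RUN=0` executes them instead. Paired-end data would need `trim_galore --paired` with both mates, which this sketch does not cover.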

Main Pipeline Usage

./runPipeline.sh <input_dir> <output_dir> <index_path> <splice_sites_file> <gtf_file> <read_type> <data_type>
  1. Mapping to the Human Genome using HISAT2

    • Scripts: mapPP.sh, mapPU.sh, mapRP.sh, mapRE.sh
    • IF statements in runPipeline.sh choose the mapping script from the read_type and data_type arguments
    • read_type: Unpaired or Paired
    • data_type: Raw or Processed (Raw will use files processed by runQC.sh in the Quality Control step)
    • Usage (map.sh stands for the selected mapping script):
    map.sh "$INPUT_DIR" "$OUTPUT_DIR" "$INDEX_PATH" "$SPLICE_SITES"
  2. Conversion of SAM to BAM

    • Script: samToBam.sh
    • Usage:
    samToBam.sh "$INPUT_DIR" "$OUTPUT_DIR"
  3. Indexing BAM Files

    • Script: indexBam.sh
    • Usage:
    indexBam.sh "$INPUT_DIR" "$OUTPUT_DIR"
  4. Counting Reads for Each Gene Feature using featureCounts

    • Script: featureCount.sh
    • Usage:
    featureCount.sh "$INPUT_DIR" "$OUTPUT_DIR" "$GTF_FILE"
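
The four steps above can be sketched for one sample as a list of commands. This is an illustration, not the repository's scripts: the function name `pipeline_cmds`, its argument order, and the read file naming (`<sample>_1.fastq.gz`, `<sample>_2.fastq.gz`, `<sample>.fastq.gz`) are hypothetical. Only the read_type branch is modelled, because how data_type maps to the four mapping scripts is not documented here. The commands are printed rather than run, so the sketch works without HISAT2, samtools, or featureCounts installed.

```shell
#!/usr/bin/env bash
# Sketch of the four main-pipeline steps for one sample; not the repository's scripts.
# Commands are printed, not executed. Paths containing spaces are not handled.

pipeline_cmds() {
    local read_type="$1" sample="$2" index_path="$3"
    local splice_sites="$4" gtf_file="$5" out_dir="$6"
    local reads pair_flag name

    # IF statements pick the HISAT2 read arguments from the read type.
    if [ "$read_type" = "Paired" ]; then
        # Hypothetical naming: <sample>_1.fastq.gz and <sample>_2.fastq.gz
        reads="-1 ${sample}_1.fastq.gz -2 ${sample}_2.fastq.gz"
        pair_flag="-p "
    elif [ "$read_type" = "Unpaired" ]; then
        reads="-U ${sample}.fastq.gz"
        pair_flag=""
    else
        echo "unknown read_type: $read_type (expected Paired or Unpaired)" >&2
        return 1
    fi

    name=$(basename "$sample")

    # 1. Map reads, passing the known splice sites to HISAT2.
    echo "hisat2 -x $index_path --known-splicesite-infile $splice_sites $reads -S $out_dir/$name.sam"
    # 2. Convert SAM to a coordinate-sorted BAM (sorting is needed before indexing).
    echo "samtools sort -o $out_dir/$name.bam $out_dir/$name.sam"
    # 3. Index the BAM file.
    echo "samtools index $out_dir/$name.bam"
    # 4. Count reads per gene; -p marks the input as paired-end.
    #    Newer Subread releases also need --countReadPairs to count fragments.
    echo "featureCounts ${pair_flag}-a $gtf_file -o $out_dir/$name.counts.txt $out_dir/$name.bam"
}
```

For example, `pipeline_cmds Paired reads/s1 index/hg38 splice_sites.txt genes.gtf out` prints the four commands for a paired-end sample, and an unrecognised read type returns a non-zero status.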

Downstream Analysis

Python Script for Merging featureCounts Results

  • Script: merge_featureCounts.py
  • Usage:
 merge_featureCounts.py file_paths output_path
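
The internals of merge_featureCounts.py are not shown here, so the following is a shell and awk sketch of the same idea rather than the repository's method. It assumes each input is a default featureCounts output for a single BAM file: a comment line starting with `#`, a header line starting with `Geneid`, and the count in column 7. The function name `merge_counts` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: merge per-sample featureCounts tables into one gene-by-sample matrix.
# Assumes one BAM per input file, so the count sits in column 7.

merge_counts() {
    # Usage: merge_counts sample1.counts.txt sample2.counts.txt ... > merged.tsv
    awk -F '\t' '
        FNR == 1 { nfiles++; names[nfiles] = FILENAME }  # first line of each input file
        /^#/ { next }                                    # featureCounts comment line
        $1 == "Geneid" { next }                          # column header line
        {
            # Remember genes in order of first appearance.
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            count[$1, nfiles] = $7
        }
        END {
            printf "Geneid"
            for (f = 1; f <= nfiles; f++) printf "\t%s", names[f]
            printf "\n"
            for (g = 1; g <= ngenes; g++) {
                printf "%s", order[g]
                for (f = 1; f <= nfiles; f++) {
                    key = order[g] SUBSEP f
                    # Genes absent from a file are reported as 0.
                    printf "\t%s", ((key in count) ? count[key] : 0)
                }
                printf "\n"
            }
        }
    ' "$@"
}
```

The output is tab-separated, with one column per input file named after its path. A matrix in this shape is the kind of count table typically loaded for differential expression analysis.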

R Script for DESeq2 Analysis

  • Script: DSeq2_analysis.R
  • Usage: Execute in an R environment
