Documentation for TCGA Analysis Pipeline

Introduction

Dockerized pipeline to retrieve gene expression (including pseudogenes and noncoding RNA genes) of sgRNA targets from the TCGA database.

The pipeline is implemented using Nextflow, a data-driven computational workflow framework. It utilizes Docker containers for managing software dependencies, ensuring reproducibility and portability.

The attached "expression_matrix.txt" file contains the expression of genes in specific TCGA samples ("TCGA-A7-A13D-01A-13R-A12P-07" and "TCGA-E9-A1RH-11A-34R-A169-07" from the TCGA-BRCA dataset). The genes were found by mapping sgRNAs to the human genome GRCh38 with Ensembl annotation v. 109. The first column contains the Ensembl IDs of the mapped genes.

In the requested comparison between sgRNA FASTA names and the genes to which these sgRNAs mapped, I did not apply any filtering: the expression matrix includes all genes to which an sgRNA mapped. This could reveal possible off-target effects. See the file "compared_genes.txt", which contains 3 columns: 1st: original FASTA file gene name of the sgRNA, 2nd: name of the gene to which the sgRNA was mapped, 3rd: Ensembl ID of the gene to which the sgRNA was mapped.

The attached files also include expression_matrix_annotated.txt, the expression matrix with additional gene name annotations from the input sgRNAs (as well as the gene names of the mapped genes). Unfortunately, I was not able to make my script work in Nextflow for that process; I was only able to run it on the command line. In the Nextflow version, every reference to a column needs a "\" in front of the "$" to be recognized as such. The Nextflow version is shown here and will give an error on the command line, so remove the extra "\" before each "$" to run it there.

gawk 'BEGIN{FS=OFS="\t"; getline expression_column_names < "'${expression_matrix}'"; getline compared_column_names < "'${compared_genes}'"; print compared_column_names, expression_column_names} NR==FNR && NR>1{a[\$3]=(\$3 in a) ? a[\$3]"\n"\$1"\t"\$2 : \$1"\t"\$2} NR!=FNR && \$1 in a{split(a[\$1], mappings, "\n"); for (i in mappings) print mappings[i], \$0}' ${compared_genes} ${expression_matrix} | gawk '!seen[\$0]++' > expression_matrix_annotated.txt

Results from running the Nextflow pipeline are saved to the "results" directory.

Prerequisites

Install Docker
Install Nextflow
Download the human genome Bowtie2 index files from the directory at this link: https://drive.google.com/drive/folders/1Ez6-pZeoBVSJuzAhqKcNI31nkJrZALwv?usp=share_link These indexes were built from the human genome sequence GRCh38 primary assembly downloaded from Ensembl: https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
The index files should be saved in the TCGA_Pipeline directory.
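
Before running the pipeline, it can help to confirm that the index files are in place. This is a minimal sketch, assuming the standard Bowtie2 index layout (six files per prefix, ending in .bt2, or .bt2l for a large index). check_bowtie2_index is a hypothetical helper, not part of the repository.

```shell
# Return 0 if all six Bowtie2 index files exist for the given prefix,
# otherwise report the first missing file and return 1.
check_bowtie2_index() {
    prefix=$1
    for part in 1 2 3 4 rev.1 rev.2; do
        # Accept either the small (.bt2) or the large (.bt2l) index format.
        if [ ! -f "$prefix.$part.bt2" ] && [ ! -f "$prefix.$part.bt2l" ]; then
            echo "missing index file: $prefix.$part.bt2" >&2
            return 1
        fi
    done
}
```

For example, check_bowtie2_index grch38prim checks the prefix used in the run command in the Usage section.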

Note: I was not able to test the pipeline with the indexing stage. On my laptop the indexing step took over 3 hours, so I built the index on another workstation with 20 threads and copied the index files.

Dependencies containerized:

This pipeline depends on several bioinformatics tools, including:

Bowtie2
SAMtools
Gawk
Bedtools
R with Bioconductor and packages (TCGAbiolinks, GenomicFeatures and SummarizedExperiment)

Overview

MapSequences: This process maps sequences using bowtie2. The sequences are read from a FASTA file and mapped against a reference genome.
FilterSequneces1Mismatch: This process filters out sequences from the mapped sequences that have more than 1 mismatch.
ExtractInfo: This process converts the filtered SAM file into a BAM file and sorts it.
ExtractGFF: This process extracts GFF3 annotation from a gzipped file.
GetGeneAnnotations: This process extracts gene annotations from the GFF3 file and creates a BED file.
AnnotateGenes: This process annotates the sorted BAM file with the gene annotations from the BED file.
CompareGeneNames: This process compares gene names between different sources.
ExtractGeneIDs: This process extracts gene IDs from the compared genes.
RetrieveExpression: This process retrieves gene expression data for the extracted gene IDs.
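
The actual implementation of each process lives in main.nf. As an illustration only, a mismatch filter of the kind described above can be sketched with the XM:i tag that Bowtie2 writes to each aligned record (the number of mismatches in the alignment). The function name is hypothetical, and dropping records without an XM:i tag (unaligned reads) is an assumption of this sketch.

```shell
# Keep SAM header lines and alignments with at most one mismatch (XM:i <= 1).
# Usage: filter_max_one_mismatch aligned.sam > filtered.sam
filter_max_one_mismatch() {
    awk -F '\t' '
        /^@/ { print; next }                 # SAM header lines pass through
        {
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^XM:i:/) {
                    if (substr($i, 6) + 0 <= 1) print
                    next
                }
            }
            # Records without an XM:i tag (unaligned reads) are dropped.
        }
    ' "$1"
}
```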

Input

The pipeline requires multiple input files:

FASTA file: Contains the sequences to be mapped.
Index Prefix: Prefix of the index files for the reference genome.
GFF3.gz file: Contains the gene annotations in GFF3 format.
Expression Script: An R script to retrieve expression data.
Samples: A list of samples to be analyzed.

Output

The pipeline produces several output files:

aligned.sam: Contains the sequences aligned to the reference genome.
filtered.sam: Contains the sequences with 1 or fewer mismatches.
sorted.bam: Contains the sorted sequences in BAM format.
reference.gff3: Contains the gene annotations in GFF3 format.
genes.bed: Contains the gene annotations in BED format.
annotated.bed: Contains the sequences annotated with gene information.
compared_genes.txt: Contains the compared gene names.
gene_ids.txt: Contains the extracted gene IDs.
expression_matrix.txt: Contains the gene expression data.

Usage

Clone the Repository:

git clone https://github.com/rafalwoycicki/TCGA_Pipeline.git

Navigate to the Repository Directory:
```
cd TCGA_Pipeline
```
Place the index_prefix files in the directory.

Docker Setup

This pipeline uses Docker to manage its software dependencies, so make sure Docker is running. A Dockerfile is provided in the repository; the steps below build and use the Docker image.

Build Docker Image: Navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

docker build -f Dockerfile -t tcga_pipeline .

This command will create a Docker image named 'tcga_pipeline' that includes all the software dependencies needed for the pipeline.

Run Docker Image: You can test the Docker image by running the following command:

docker run -it tcga_pipeline

This command starts a new Docker container using the 'tcga_pipeline' image and opens an interactive terminal inside the container.

Run the Pipeline:

nextflow run main.nf --fasta library.fa --index_prefix grch38prim --gffzipped Homo_sapiens.GRCh38.109.gff3.gz --expression_script retrieve_expression.R --samples TCGA-A7-A13D-01A-13R-A12P-07,TCGA-E9-A1RH-11A-34R-A169-07

Adjust the parameters if necessary.
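
The --samples parameter in the command above is a comma-separated list of TCGA barcodes. When the sample list is long, it may be easier to keep one barcode per line in a file and join them. A small sketch, where samples.txt and join_samples are hypothetical names:

```shell
# Join one-barcode-per-line input into the comma-separated form used by --samples.
# Blank lines are skipped.
join_samples() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}
```

The result can then be passed as --samples "$(join_samples samples.txt)".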

Results

The pipeline will output several files, including an expression matrix file. This file contains gene expression values for the specified TCGA samples. The rows represent genes, and the columns represent samples.
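
A quick sanity check of the result is to count its rows and columns. This sketch assumes expression_matrix.txt is tab-separated, has one header line, and carries the Ensembl ID in the first column of each data row. summarize_matrix is a hypothetical helper.

```shell
# Print the number of gene rows and sample columns in an expression matrix.
# Usage: summarize_matrix expression_matrix.txt
summarize_matrix() {
    awk -F '\t' '
        # Skip the header; every other non-empty row is one gene.
        NR > 1 && NF > 1 { genes++; samples = NF - 1 }
        END { printf "genes=%d samples=%d\n", genes, samples }
    ' "$1"
}
```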

Conclusion

This documentation should provide a solid foundation to understand the TCGA analysis pipeline. If you have any further questions or run into issues, please refer to the code or consider reaching out for support.
