Skip to content

Two-Tier Ensemble Aggregation Gene Co-expression Network (TEA-GCN): Capturing tissue/condition-specific co-expression from large transcriptomic datasets

License

Notifications You must be signed in to change notification settings

pengkenlim/TEA-GCN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

58 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

banner

TEA-GCN: Two-Tier Ensemble Aggregation Gene Co-expression Network

banner

What does this pipeline do?

This pipeline generates high-quality Gene Co-expression Networks (TEA-GCN ) that capture tissue/condition-specific co-expression.

For in-depth explanation and evaluation of the TEA-GCN methodology, please refer to our preprint πŸ“°

Navigation

Generate TEA-GCN from your transcriptomic dataset

banner

Step 1. Setting up

Clone repository to your local machine

$ git clone https://github.com/pengkenlim/TEA-GCN.git

Create an environment, and install packages

$ cd TEA-GCN
$ virtualenv -p python3 .venv
$ source ./.venv/bin/activate
$ pip install --upgrade pip
$ pip install -r ./setup/requirements.txt

Activate environment for subsequent use

$ cd TEA-GCN
$ source ./.venv/bin/activate

Step 2. Generating partitions for your dataset

The TEA-GCN method uses of k-means clustering algorithm to divide gene expression data into partitions before gene co-expression determination. Expression data must be provided in the form of an expression matrix where 1) expression abundances are in the form of Transcript per Million (TPM), 2) rows correspond to different genes, and 3) columns correspond to different samples. The first row of the input expression matrix must consist of column headers (i.e. sample names) and the first column must consist of unique indices (i.e. gene identifiers).

Downloading sample data

We provide sample data of a gene expression matrices containing 500 and 5,000 Arabidopsis thaliana public RNA-seq samples, respectively. (used in our preprint ).

You can download it from your browser using these links:

  1. 500 Arabidopsis thaliana public RNA-seq samples
  2. 5,000 Arabidopsis thaliana public RNA-seq samples

Simplest implementation

$ python main/Generate_partitions.py --output_dir /path/to/output_directory --input_matrix_path /path/to/taxid3702_500n_expression_matrix.tsv

Full options

usage: Generate_partitions.py [-h] [-ks] [-ke] [-st] [-t] -o  -im  [-de] [-pc] [-rs]

Generate_partitions.py Load expression matrix. Perform sample standardization to correct batch effect. Embed samples from gene-space into PC-space. Perform k-means clustering across a
range of K. Outputs principal component embeddings of samples (PCA data), mean silhouette coefficients from each clustering iteration, k-means cluster assignments of samples in pickle
format.

options:
  -h, --help            show this help message and exit
  -ks , --k_start       Starting k for K-means clustering. The default is 10.
  -ke , --k_end         Ending k for K-means clustering. The Default is 50
  -st , --step          Step to increase between each k. The user is suggested to increase the step if the range is very big to decrease search space.
  -t , --threads        Number of threads to use for PCA and Kmeans clustering. Handled by Sklearn's parallelization. Try to specify lower than 64 for now.
  -o , --output_dir     Working directory to output data.
  -im , --input_matrix_path
                        Path of expression matrix to input
  -de , --delimiter     Delimiter for expression matrix. -de="t" for tab-separated (.tsv). -de="c" for comma separated (.csv). TSV by default.
  -pc , --pc_no         Number of Principal components to keep. Default = 100. It cannot be more than the number of samples and the number of genes in the expression matrix.
  -rs , --randomstate   randomstate (seed) for random seeding during k-means clustering. By default, 42 will be used.

Step 3. Building TEA-GCN

After data partitioning, you can start building TEA-GCN by determining co-expression strength between every gene pair. Said co-expression strength is calculated based on measured correlation coefficients between genes from every dataset partition. If you would like more information, please refer to our preprint.

Simplest implementation

$ python main/Run_TEA-GCN.py  --output_dir /path/to/output_directory --input_matrix_path /path/to/taxid3702_500n_expression_matrix.tsv

Intermediate TEA-GCN will be output to /path/to/output_directory/RAW_GCN/<METHOD>/<NETWORK_NAME>. <NETWORK_NAME> is generated automatically based on the <METHOD>_<K>k_<AGGREGATION_FUNCTION> format (e.g. TEA_8k_RAvg)

Full options

usage: Run_TEA-GCN.py [-h] [-w] [-t] -o  [-de] [-cc] [-am] [-k] -im

Run_TEA-GCN.py. Construct TEA-GCN from expression data (in the form of an expression matrix). Requires prior data partitioning using Generate_partitions.py in order to run.

options:
  -h, --help            show this help message and exit
  -w , --workers        Number of workers for parallelization. Affects precalc of all coefficients. But only affects the calc of bicor.
  -t , --threads        Number of threads for numpy linear algebra operations. Affects PCC and SCC calc.
  -o , --output_dir     Directory to output. Must be the same as for Generate_partitions.py
  -de , --delimiter     Delimiter for expression matrix. -de="t" for tab-separated (.tsv). -de="c" for comma-separated (.csv). TSV by default.
  -cc , --correlation_coefficient
                        select 'TEA' to enable Coefficient Aggregation. Select "PCC","SCC","bicor" for constructing Partition Aggregation-only GCN
  -am , --aggregation_method
                        The default is RAvg.
  -k , --k_clusters     Number of clusters to partition the expression data. For k=0, will use best k as determined in Generate_partitions.py step
  -im , --input_matrix_path
                        Path of expression matrix to input

Step 4. Post-processing TEA-GCN

After building a TEA-GCN that describes the co-expression strengths between every gene pair (edges). We can then load back in all edges for rank transformation (into Mutual Ranks) and generate standardized ranks for meaningful cross-comparison of TEA-GCN across different species.

Simplest implementation

$ python ./main/Rank_transform.py --output_dir /path/to/output_directory

Completed TEA-GCN will be located in /path/to/output_directory/Completed_GCN/<NETWORK_NAME>.

Full options

usage: Rank_transform.py [-h] -o  [-n] [-d] [-full]

Rank_transform.py calculates Mutual Ranks (MR) and Highest reciprocal ranks (HRR) and their standardized forms raw correlation scores. Rank_transform.py outputs the final complete TEA-GCN
as directory separate from the intermediate TEA-GCN generated by Run_TEA_GCN.py

options:
  -h, --help            show this help message and exit
  -o , --output_dir     Working directory containing required data of which to output data. Must be same as output_dir for Run_TEA_GCN.py and Generate_partitions.py
  -n , --net_name       name of the network to calculate MR and HRR for. By default, all networks will be calculated
  -d , --delete_net     If set to True, Rank_transform.py will delete intermediate TEA-GCN generated by Run_TEA_GCN.py. False by default.
  -full , --full_attributes
                        If set to True, Rank_transform.py will output a full set of attributes for every edge which includes: raw co-exp. strengths, MR, HRR, and their Z score variants.
                        False by default where only raw co-exp, MR, and Z(MR) are reported.

Data structure of Network

The resultant network is contained within a directory consisting of files that each describe edges connecting to each gene (source gene). The files are named after genes.

Example of one file:

$ less /path/to/<NETWORK_NAME>/AT5G65360.1
banner

Description of columns:

  • Target
  • Describes the target gene of the edge
  • Co-exp_Str
  • Co-expression Strength of edge calculated by Run_TEA-GCN.py
  • Co-exp_Str_MR
    • Co-expression Strength of edge transformed into Mutual Ranks (MR)
  • zScore(Co-exp_Str_MR)
    • z-score standardized MR -- Helpful for cross-comparing GCNs

Back to Navigation

Gene Function Prediction using TEA-GCN

banner

Step 1. Generating Co-expression Neighbourhoods of your genes-of-interest

Simplest implementation

$ python ./main/Prepare_neighbourhoods.py --output_dir /path/to/output_directory --gene_list AT3G02480.1

Full Options

usage: Prepare_neighbourhoods.py [-h] -o  -g  [-n]

Prepare_neighbourhoods.py. Prepare co-expression neighbourhoods of Genes-of-intrests from completed TEA-GCNs for gene function prediction using the GSEA algorithm.

options:
  -h, --help          show this help message and exit
  -o , --output_dir   Directory to output. Must be the same as for Generate_partitions.py, Run_TEA-GCN.py, and Rank_transform.py.
  -g , --gene_list    list of comma-separated genes. spaces will be removed. E.g. 'GENE1,GENE2,GENE3'
  -n , --net_name     name of the TEA-GCN to extract. Set to "auto" if you only generated one TEA-GCN. "auto" by default

Format of Co-expression Neighbourhood files

Example of one file:

$ less /path/to/output_directory/Co-exp_Neighbourhoods/AT5G65360.1
banner

Description of columns:

  • Target
  • Describes the target gene of the edge
  • Score
  • z-score standardized MR determined by Rank_transform.py
  • Centered_rank
    • Score ranked inversely (higher values, have larger ranks) and centered (median rank = 0).

Step 2. GSEA using Google Colab notebook

Refer to the Colab notebook (linked below) to predict the biological function of your gene-of-interest based on their co-expression neighbourhood:

TEA-GCN_Function_pred_using_GSEA.ipynb

note_1

note_2

note_3

note_4

Back to Navigation

Discover experimental contexts underpinning TEA-GCN co-expression edges

banner

Step 1. Downloading Metadata of RNA-seq samples from the European Nucleotide Archive (ENA)

This script downloads the metadata of RNA-seq samples from the European Nucleotide Archive (ENA).

The metadata downloaded includes these search fields*:

  • sample_description
  • dev_stage
  • study_title
  • sample_title
  • tissue_lib
  • tissue_type

*for more information regarding search fields and metadata hosted by ENA, click here

Simplest implementation

$ python ./main/Download_metadata.py --output_dir /path/to/output_directory --input_matrix_path /path/to/taxid3702_500n_expression_matrix.tsv

Full options

usage: Download_metadata.py [-h] -o  [-de] -im

Download_metadata.py Download metadata of SRR accessions from European Nucleotide Archive (ENA).

options:
  -h, --help            show this help message and exit
  -o , --output_dir     Directory to output. Must be the same as for Generate_partitions.py, Run_TEA-GCN.py and Rank_transform.py.
  -de , --delimiter     Delimiter for expression matrix. -de="t" for tab-separated (.tsv). -de="c" for comma separated (.csv). TSV by default.
  -im , --input_matrix_path
                        Path of expression matrix to input

Step 2. Annotating Partitions with overrepresented lemmas

banner

Simplest implementation

Download and install your spaCy model of choice for Natural Language Processing (NLP). Below we chose the en_core_web_sm model. This model will be used to lemmatize the metadata of RNA-seq samples downloaded from ENA.

$ python -m spacy download en_core_web_sm

Running this script will find overrepresented lemmas to annotate partitions

$ python ./main/Annotate_partitions.py  --output_dir /path/to/output_directory --input_matrix_path /path/to/taxid3702_500n_expression_matrix.tsv

Full options

usage: Annotate_partitions.py [-h] -o  [-de] -im  -k  [-m] [-p]

Annotate_partitions.py Lemmatize metadata of RNA-seq samples. Annotate dataset partitions with overrepresented lemmas

options:
  -h, --help            show this help message and exit
  -o , --output_dir     Directory to output. Must be the same as for Generate_partitions.py, Run_TEA-GCN.py, and Rank_transform.py.
  -de , --delimiter     Delimiter for expression matrix. -de="t" for tab-separated (.tsv). -de="c" for comma separated (.csv). TSV by default.
  -im , --input_matrix_path
                        Path of expression matrix to input
  -k , --k_clusters     Number of clusters to partition the expression data. For k=0, will use best k as determined in Generate_partitions.py step.
  -m , --model          spaCy model to use for lemmatization. 'en_core_web_sm' will be used by default.
  -p , --permutations   Number of permutations to shuffle during the calculation of overrepresentation p-values. Default = 10,000

Step 3. Generating Partition Rankings for your edges-of-interest

banner

Defining edges-of-interest

Prepare edges-of-interest as input like so.

$ less /path/to/edges_of_interest.csv
banner

Each row describes one edge. Two comma-separated columns for the two genes that connect the edge.

Simplest implementation

python ./main/Get_Partition_Rankings.py -k 93 --input_matrix_path /path/to/taxid3702_500n_expression_matrix.tsv --edges_path /path/to/edges_of_interest.tsv --output_dir /path/to/output_directory

Full options

usage: Get_Partition_Rankings.py [-h] [-w] [-t] -o  [-de] [-k] -im  -ep  [-on]

Get_Partition_Rankings.py.

options:
  -h, --help            show this help message and exit
  -w , --workers        Number of workers for parallelization. Affects precalc of all coefficients. But only affects the calc of bicor.
  -t , --threads        Number of threads for numpy linear algebra operations.
  -o , --output_dir     Directory to output.
  -de , --delimiter     Delimiter for expression matrix. -de="t" for tab-separated (.tsv). -de="c" for comma separated (.csv). TSV by default.
  -k , --k_clusters     Number of clusters to partition the expression data. For k=0, will use best k as determined in Generate_partitions.py step.
  -im , --input_matrix_path
                        Path of expression matrix to input
  -ep , --edges_path    Path to list of edges to calculate.
  -on , --outname_prefix
                        prefix of to output calculations.

Step 4. Experimental context discovery using Google Colab notebook

banner

Removing irrelevant lemmas

To discover experimental contexts underpinning gene co-expression, we use a GSEA-like approach to detect overrepresented lemmas that annotate partitions with high co-expression. However, not all overrepresented lemmas are informative or relevant to experimental conditions. Including them in the downstream analysis will only confound the interpretation of experimental contexts and decrease the sensitivity via harsher multiple-testing correction.

Step 2 generates a list of overrepresented lemmas that are detected in your dataset partitions in /path/to/output_directory/Lemma_annotations/overrepresented_lemmas.tsv. You can remove irrelevant lemmas from this file. It will be used as input in the Colab notebook below to exclude them from the analysis.

$ less /path/to/output_directory/Lemma_annotations/overrepresented_lemmas.tsv
banner

Colab notebook

Refer to the Colab notebook (linked below) to uncover the Experimental contexts underpinning gene co-expression:

Uncover_the_Experimental_contexts_Underpinning_Gene_Co-expression.ipynb

banner

Back to Navigation

Citing TEA-GCN

You can cite our preprint for now :)

Lim, P. K., Wang, R., Velankanni, J. P. A., & Mutwil, M. (2024).
Constructing Ensemble Gene Functional Networks Capturing Tissue/condition-specific Co-expression from Unlabled Transcriptomic Data with TEA-GCN
(p. 2024.07.22.604713). bioRxiv.
https://doi.org/10.1101/2024.07.22.604713

Complementary tools

  1. LSTrAP-Kingdom
    • Download and process public RNA-seq samples to generate gene expression matrices for your species of interest
    • Github page
    • Publication
    • Cite
      Goh W, Mutwil M. LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life.
      Bioinforma Oxf Engl. 2021;37(18):3053-3055.
      doi:10.1093/bioinformatics/btab168
      
  2. LSTrAP-denovo
    • Download and process public RNA-seq samples to generate gene expression matrices for your species of interest
    • Designed for use for eukaryotic species without sequenced genomes. Employs de novo transcriptome assembly to derive reference transcripts.
    • Github page
    • Publication
    • Cite
      Lim, P. K., Wang, R., & Mutwil, M. (2024).
      LSTrAP-denovo: Automated Generation of Transcriptome Atlases for Eukaryotic Species Without Genomes.
      Physiologia Plantarum, 176(4), e14407.
      https://doi.org/10.1111/ppl.14407
      

Back to Navigation

Author and contact information

Marek Mutwil πŸ§”

banner
I am the project supervisor. I advise the students and buy them meals sometimes.

Principal Investigator and Assoc. Prof.,
School of Biological Sciences, Nanyang Technological University
mutwil@ntu.edu.sgπŸ“§
Lab websiteπŸ”—

Peng Ken Lim πŸ‘¦

banner
I am the project co-supervisor. I wrote most of the code.

PhD student,
School of Biological Sciences, Nanyang Technological University
pengken001@e.ntu.edu.sgπŸ“§

Ruoxi Wang πŸ‘Ά

I wrote some of the code and conducted comparative-GCN analysis for this project.

Undergraduate student,
Shanghai Jiao Tong University

Jenet Princy Antony Velankanni πŸ‘Ά

I helped with mining Human and Yeast gene annotations for this project.

Undergraduate student,
School of Biological Sciences, Nanyang Technological University

Back to Navigation

About

Two-Tier Ensemble Aggregation Gene Co-expression Network (TEA-GCN): Capturing tissue/condition-specific co-expression from large transcriptomic datasets

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published