This repository contains custom R functions for comprehensive pathway analysis visualisation. They are inteded to be used on the GSEA, gseGO and other pathway analysis methods results from the ClusterProfiler
R library.
The functions are:
- gsea_comparison_plot
- custom_dotplot
- custom_dotplot_comparison
- analyze_pathway_volcano
Some parts of the README.md are written by Claude 3.5 Sonnet.
This repository contains an R function gsea_comparison_plot
for comparing and visualizing Gene Set Enrichment Analysis (GSEA) results between two datasets.
The gsea_comparison_plot
function generates a comprehensive comparison of GSEA results from two datasets, including a scatter plot visualization, data preparation, and extraction of common pathways.
x
,y
: GSEA result dataframes for comparisonx_label
,y_label
: Labels for the two datasetsuse_normalized
: Boolean to use Normalized Enrichment Score (NES) instead of Enrichment Score (default: TRUE, i.e. use NES)percentile_threshold
: Threshold for determining significant pathways (default: 0.95)color
: Color scheme for the plot ("all", "common", or "distinct")max_overlaps
: Maximum number of label overlaps in the plot (default: 20)
The function requires the following R packages:
- dplyr
- ggplot2
- ggrepel
- tidyr
source("gsea_comparison_plot.R")
# Assuming you have two GSEA result dataframes named 'gsea_result1' and 'gsea_result2'
comparison_results <- gsea_comparison_plot(gsea_result1, gsea_result2,
x_label = "Dataset1", y_label = "Dataset2",
use_normalized = TRUE,
percentile_threshold = 0.95,
color = "all")
# Access the plot
print(comparison_results$plot)
# Access the results dataframe
head(comparison_results$data)
# Access common pathways
common_pathways <- comparison_results$common_pathways
# Access GMT-formatted dataframes
gmt_data <- comparison_results$gmt
res <- gsea_comparison_plot(
gsea_sum_GOMF_th1,
gsea_sum_GOMF_th17,
"Th1",
"Th17",
color = "all",
max_overlaps = Inf,
percentile_threshold = 0.90,
use_normalized = TRUE
)
# Display the plot
print(res$plot)
- Generates a scatter plot comparing enrichment scores or NES between two datasets
- Identifies and categorizes common and distinct pathways
- Provides flexible color schemes for visualizing results
- Implements a colorblind-friendly palette
- Labels top pathways using ggrepel for improved readability
- Generates GMT-formatted dataframes for further analysis
- Returns a comprehensive list object with plot, data, common pathways, and GMT-formatted results
You can customize the plot and analysis by adjusting various parameters:
use_normalized
to toggle between Enrichment Score and Normalized Enrichment Scorepercentile_threshold
to adjust the significance thresholdcolor
to change the color scheme of the plotmax_overlaps
to control the density of labels in the plot
The function returns a list containing:
data
: A dataframe with combined results from both datasetsplot
: The ggplot object for the comparison plotcommon_pathways
: A list of common pathways (positive, negative, mixed, and all). This may be used downstream for further visualisation methodsgmt
: A list of GMT-formatted dataframes for different pathway categories.
Note!: Various packages might expect the gmt matrix to be in specific formats to work with.
For example, to feed the GMT matrix to the library TcGSA
I had to convert the output gmt into a specific format. I used this script of mine to convert it:
# First, convert your matrix to a data.frame
gmt_df <- as.data.frame(res$gmt$gmt_mix) # Taking the GMT dataframe out of the NESscatter results function
# Assuming your GMT-like dataframe is `gmt_df`
gmt_list <- gmt_df %>%
# Gather all the gene columns into a long format, removing NAs
pivot_longer(cols = starts_with("Gene"), names_to = NULL, values_to = "Gene", values_drop_na = TRUE) %>%
# Group by ID and Description to form the list for genesets
group_by(ID, Description) %>%
summarise(genesets = list(Gene), .groups = 'drop') %>%
ungroup() %>%
# Convert the result into a list containing the three required components
summarise(
genesets = list(genesets),
geneset.names = list(ID),
geneset.descriptions = list(Description)
) %>%
# Turn it into a named list that TcGSA can work with
{
list(
genesets = .$genesets[[1]],
geneset.names = .$geneset.names[[1]],
geneset.descriptions = .$geneset.descriptions[[1]]
)
}
For the GSVA
package I had to convert it once again using this little data tyding script below:
gmt_df <- as.data.frame(res$gmt$gmt_mix)
# Ensure the necessary libraries are loaded
require(dplyr)
# Assuming your GMT dataframe is called `gmt_df`
gmt_list <- gmt_df %>%
# Pivot the gene columns into a long format
pivot_longer(cols = starts_with("Gene"), names_to = NULL, values_to = "Gene", values_drop_na = TRUE) %>%
# Group by the ID (gene set names)
group_by(ID) %>%
# Summarize into a list of gene identifiers for each gene set
summarise(geneset = list(Gene)) %>%
# Turn into a named list where the element names are the gene set names
tibble::deframe()
# Now `gmt_list` is a list where each element is a vector of gene identifiers, and the names are the gene set names.
The R function custom_dotplot
is a custom ggplot2
GSEA dotplot. I was unsatisied with the standard ClusterProfiler
functon dotplot
, and so this function was born.
The custom_dotplot
function generates a visually appealing and informative dotplot from GSEA results, allowing for various customizations including text processing, filtering, and sorting options.
gsea_obj
: A GSEA result objectshowCategory
: Number of categories to display (default: 10)font.size
: Base font size for the plot (default: 10)title
: Plot title (default: "GSEA Dotplot")replace_
: Boolean to replace underscores with spaces in descriptions (default: TRUE)capitalize_1
: Boolean to capitalize the first word in descriptions (default: TRUE)capitalize_all
: Boolean to capitalize all words in descriptions (default: FALSE)filterBy
: Method to filter results ("qvalue", "NES", "NES_positive", "NES_negative")sortBy
: Method to sort results ("GeneRatio" or "qvalue")q_cut
: Q-value cutoff for significance (default: 0.05)min.dotSize
: Minimum dot size in the plot (default: 2)
The function requires the following R packages:
- ggplot2
- dplyr
- stringr
- scales
source("custom_dotplot.R")
# Assuming you have a GSEA result object named 'gsea_result'
plot <- custom_dotplot(gsea_result, showCategory = 15, filterBy = "NES", sortBy = "GeneRatio")
print(plot)
- Flexible filtering options based on q-value, NES (Normalized Enrichment Score), or direction of enrichment
- Customizable sorting of results by Gene Ratio or q-value
- Text processing options for pathway descriptions (capitalization, underscore replacement)
- Visually appealing dotplot with:
- Dot size representing significance (-log10(q-value))
- Color indicating direction of enrichment (positive or negative NES)
- Customizable appearance (font size, title, etc.)
You can customize the plot by adjusting various parameters:
showCategory
to control the number of pathways displayedfilterBy
andsortBy
for different result filtering and sorting strategiesq_cut
to set the significance thresholdfont.size
andtitle
for visual customization
Feel free to fork this repository and submit pull requests with improvements or bug fixes.
This repository contains an R function custom_dotplot_comparison
for creating a comparative dotplot visualization of Gene Set Enrichment Analysis (GSEA) results from two samples.
The custom_dotplot_comparison
function generates a side-by-side dotplot comparison of GSEA results from two samples, allowing for easy visualization of enriched pathways across both datasets.
gsea_obj_x
,gsea_obj_y
: GSEA result objects for comparisonpathway_ids
: Vector of pathway IDs to include in the plotfont.size
: Base font size for the plot (default: 10)title
: Plot title (default: "Combined GSEA Dotplot")replace_
: Boolean to replace underscores with spaces in descriptions (default: TRUE)capitalize_1
: Boolean to capitalize the first word in descriptions (default: FALSE)capitalize_all
: Boolean to capitalize all words in descriptions (default: FALSE)min.dotSize
: Minimum dot size in the plot (default: 2)sample_x_name
,sample_y_name
: Names for the two samples (default: "Sample X" and "Sample Y")sortBy
: Method to sort pathways ("importance_score", "qvalue", "x", or "y") (default: "importance_score") Note! Choosing the sorting method might be tricky, so just take an look to see what works best for your specific use case. The x and y methods sort either for x or y samples. Theqvalue
sorting usually looks the most awkward. Theimportance_score
sorting is a multiply ofqvalue
andGeneRatio
and so results in a visual hybrid sorting.
The function requires the following R packages:
- ggplot2
- dplyr
- stringr
- tidyr
source("custom_dotplot_comparison.R")
# Assuming you have two GSEA result objects named 'gsea_result1' and 'gsea_result2'
# and a vector of pathway IDs you want to compare
plot <- custom_dotplot_comparison(
gsea_obj_x = gsea_result1,
gsea_obj_y = gsea_result2,
pathway_ids = c("PATHWAY1", "PATHWAY2", "PATHWAY3"),
sample_x_name = "Treatment",
sample_y_name = "Control",
sortBy = "importance_score"
)
print(plot)
Note! That I`m using the common_mix
pathways from the NESscatter function as an input. Thus I define the pathways, which I want to visually compare between two GSEA runs. You may choose, whether you want to compare the common_mix
(top 10 negative & top 10 positive pathways), or just negative or positive pathways only.
# Example usage with your GSEA objects
plot <- custom_dotplot_comparison(
gsea_obj_x = gsea_GOMF_th1,
gsea_obj_y = gsea_GOMF_th17,
pathway_ids = res$common_pathways$common_mix,
font.size = 10,
title = "Combined common GSEA pathways dotplot
Top 10 common negative & Top 10 common positive pathways",
sortBy = "y",
replace_ = TRUE,
capitalize_1 = FALSE,
capitalize_all = FALSE,
min.dotSize = 3,
sample_x_name = "Th1 Sample",
sample_y_name = "Th17 Sample"
)
# Display the plot
print(plot)
- Generates a side-by-side dotplot comparison of GSEA results from two samples
- Allows customization of pathway descriptions (replacing underscores, capitalization)
- Provides flexible sorting options for pathways
- Uses color to distinguish between positive and negative Normalized Enrichment Scores (NES)
- Represents significance with dot size (-log10(q-value))
- Handles missing pathways gracefully with warnings
You can customize the plot by adjusting various parameters:
font.size
for changing the base font sizereplace_
,capitalize_1
, andcapitalize_all
for pathway description formattingmin.dotSize
to set the minimum dot sizesortBy
to change the ordering of pathways in the plot
The function returns a ggplot object representing the comparative dotplot. The plot includes:
- Two facets, one for each sample
- Dots representing pathways, with size indicating significance and color indicating NES direction
- Customizable pathway descriptions on the y-axis
- Gene Ratio on the x-axis
R function analyze_pathway_volcano
creates customized volcano plots that highlight genes from a specific GSEA analysis pathway within the context of differential expression results. The function offers multiple styling options to suit different visualization preferences. Its usefeul if you need to show the \
behaviour` of pathway genes against the universe of all the background genes.
The analyze_pathway_volcano
function generates visually informative volcano plots that emphasize genes from a specified pathway, allowing for various customizations including significance thresholds, labeling methods, visual parameters, and plotting styles.
pathway_name
: Name of the pathway to highlight. Must be equal to the DESCRIPTION of the pathway in the GSEA resultsgsea_results
: GSEA results object containing pathway informationde_results
: Differential expression results data framep_cutoff
: P-value cutoff for significance (default: 0.05)fc_cutoff
: Fold change cutoff for significance (default: 2.0)label_method
: Method to determine which genes to label ("default", "fc", "p", or "all")max_overlaps
: Maximum number of label overlaps allowed (default: 100)style
: Plotting style to use ("clean", "claude", or "gpt")
The function requires the following R packages:
- ggplot2
- dplyr
- stringr
- ggrepel
source("analyze_pathway_volcano.R")
# Assuming you have GSEA results and DE results objects
plot <- analyze_pathway_volcano("Pathway Name", gsea_results, de_results,
p_cutoff = 0.05, fc_cutoff = 2.0,
label_method = "all", style = "clean")
print(plot)
p <- analyze_pathway_volcano(
pathway_name = "single-stranded DNA binding",
gsea_results = gsea_sum_GOMF_th1,
de_results = limma_r_Th17.39_vs_Th1.39,
p_cutoff = 0.05,
fc_cutoff = 2.0,
label_method = "p", # or "fc"
max_overlaps = 10,
style = "claude"
)
print(p)
- Highlights genes from a specific pathway within the context of background DE genes. Be carefull choosing your background genes, and DE results, so that the visuals would make biological sense.
- Flexible labeling options for genes of interest
- Customizable significance thresholds for p-value and fold change
- Multiple styling options: "clean", "claude", "gpt"
- Visually appealing volcano plots with:
- Color-coded points for pathway genes and significant genes
- Dashed lines indicating significance thresholds
- Labels for selected genes using
ggrepel
for optimal placement - Customizable appearance (title, subtitle, axis labels)
You can customize the plot by adjusting various parameters:
p_cutoff
andfc_cutoff
to set different significance thresholdslabel_method
to control which genes are labeledmax_overlaps
to adjust label density and prevent overcrowdingstyle
to choose between different visual presentations
-
Clean Style ("clean"):
- Minimalist design with no legend
- Pathway genes highlighted in distinct colors based on significance
- Non-pathway genes in grey
-
Claude Style ("claude"):
- Enhanced volcano-style legend
- Colorblind-friendly palette
- Legend positioned in the top-right corner
- Generated by Claude 3.5 Sonnet
-
GPT Style ("gpt"):
- Distinct colors for different gene statuses
- Legend positioned near the top-right corner
- Generated by ChatGPT-4o
Feel free to fork this repository and submit pull requests with improvements or bug fixes. We welcome contributions for new styling options or additional customization features.
MIT License
Copyright (c) [2024] [Anton Zhelonkin]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.