Skip to content

Practical 2

Nicolas Alcala edited this page Nov 16, 2023 · 81 revisions

Practical 2: Performing a multi-omic analysis of cancer data with R

  • performing uni-omic and multi-omic molecular classifications
  • interpreting the results and finding clinical implications

Intro

The slides, along with the relevant slides from lecture 2 are available here.

Prerequisites

  • The data for the practical is in the github repository, in medical_genomics/Practical2/Data (e.g., retrieved with the command line git clone https://github.com/IARCbioinfo/medical_genomics_course)
  • R (v>=4.0) and python (v>3.0) installed
  • MOFA+ installed

Preferred way: local machine

MOFA should already be installed in your local machine. Check the installation of mofapy2 in python3:

pip3 install mofapy2==0.6.7

Alternative: use rstudio cloud

On https://rstudio.cloud/, login and join the practical's workspace at https://posit.cloud/spaces/448143/join?access_code=g7t64WIH7yhm0Id-n0WFyNVuUEVCxiQ6dkTWGOfe or create a session and install relevant packages

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MOFA2")
 system("pip install mofapy2")

Note: To retrieve the data for the practical within rstudio server cloud, you can type inside your rstudio session:

system("git clone https://github.com/IARCbioinfo/medical_genomics_course")

Alternative 2: Singularity

You can use the singularity container to avoid installing R and MOFA2

singularity pull docker://iarcbioinfo/medical_genomics_course:practical2

Then run the container and R within the container:

singularity shell medical_genomics_course_practical2.sif R

Note: if you want a particular folder to be visible within singularity (in particular shared drives not in your working computer), use the singularity option "-B path:/mnt", where path is the original path of the folder in your computer, and /mnt is the location you will see within the singularity container

Check MOFA2 install

Tu make sure that MOFA2 is well installed with all dependencies, you can run the simple code from the help from the run_mofa function:

library(MOFA2)
library(reticulate)
use_python(system("which python3",intern=T), required=TRUE)
file <- system.file("extdata", "test_data.RData", package = "MOFA2")
load(file)
# Create the MOFA object
MOFAmodel <- create_mofa(dt)
# Prepare the MOFA object with default options
MOFAmodel <- prepare_mofa(MOFAmodel)
# Run the MOFA model
MOFAmodel <- run_mofa(MOFAmodel)

If this does not work, try using the use_basilisk = T option to install all necessary dependencies:

file <- system.file("extdata", "test_data.RData", package = "MOFA2")
load(file)
# Create the MOFA object
MOFAmodel <- create_mofa(dt)
# Prepare the MOFA object with default options
MOFAmodel <- prepare_mofa(MOFAmodel)
# Run the MOFA model
MOFAmodel <- run_mofa(MOFAmodel,use_basilisk = T)

Steps

Data preprocessing

Question 1: Loading the data Get the clinical data and the methylation data https://github.com/IARCbioinfo/medical_genomics_course/tree/main/Practical2/Data

a) Load the tab-separated-values files containing the clinical data (medical_genomics_course/Practical2/Data/Data.Clin.txt), the RNA expression data matrix (medical_genomics_course/Practical2/Data/RNA.tsv), and the 3 DNA mehtylation matrices (medical_genomics_course/Practical2/Data/DNAMeth_promoter.tsv , DNAMeth_genebody.tsv , and DNAMeth_enhancer.tsv) using your favorite file reading function (e.g., the read.table function from base R with sep = "\t" as argument, or the read_tsv function from package dplyr)

b) Check the objects you just created by using head and dim functions. What does each dimension represent? Does it matter if the dimensions of the matrices are different?

Question 2: Create a MOFA object

a) Visually check that each data set distribution approximately follows a Gaussian or a mixture of Gaussian distributions. Why is this assumption important?

b) Using the create_mofa function, create a MOFA object containing the resulting data sets.

Training the model

Question 3: MOFA run settings and training the MOFA model

a) By using get_default_xxx_options functions, set the data, model, and training options to the default values except for the number of factors, choose 5 factors and the convergence mode, choose “slow” mode.

b) Using the prepare_mofa function, prepare the created MOFA object for proper training.

c) Run MOFA using the run_mofa function and save the resulting MOFA model locally. What is printed during the model training? When does the training stop?

Note: if MOFA crashes or does not run, load the already trained model in the github repo with MOFAmodel <- load_model("medical_genomics_course/Practical2/Data/MOFA.hdf5")

Quality control

Question 4: MOFA run quality controls

a) Using the plot_variance_explained function, get the contribution of each data set in the MOFA axes definition. Write a short description about the resulting plot.

b) Check for outliers on all the MOFA axes, in your opinion, which axes should be removed from the downstream analyses? What can be the cause of such outliers?

c) Using the plot_factor_cor function, assess the uniqueness of each MOFA axis. What would you do if some factors were highly correlated?

Bonus question: Comparison with uni-omic unsupervised analyses

a) Using the ade4 package, perform PCA using, first, RNA-seq data (PCA-exp) and second, DNA methylation data (PCA-meth).

b) Compare the resulting PCA-exp and PCA-meth with MOFA. Complete your quality control on MOFA.

Analysis and interpretation

Question 5: Data visualisation

a) Plot the LF1-LF2 space and describe the distribution of the samples. Note: mofa factors can either be plotted using the plot_factors function or accessed using the get_factors function

b) Using the clinical data in medical_genomics_course/Practical2/Data/Data.Clin.txt, plot the LFs using different colors or shapes for each variable. Are histopathological types separated on the different LFs? What about molecular clusters? Note: Typical carcinoids is the lowest grade, least aggressive type, while atypical carcinoids are higher grade and more aggressive

c) Using the run_umap function, compare your first observation of the LF1-LF2 space with umap using the 5 first LFs defined by MOFA. Are histopathological types better separated? What about molecular clusters?

Question 6: Biological interpretation of the latent factors

a) Using the plot_weights, plot_top_weights, and get_weights functions, and the gene annotation file medical_genomics_course/Practical2/Data/RNA_gene_annotation.tsv, identify the 10 genes for which the expression data contribute the most to LF1. Using google and google scholar, what can you say about the gene with the strongest weight?

b) Same with the top 10 enhancer CpGs contributing the most to LF1 (annotations in "medical_genomics_course/Practical2/Data/DNAMeth_enhancer_annotation.tsv")?

c) Using the run_enrichment function with the pathway database MSigDB_v6.0_C2_human from the MOFAdata package (data("MSigDB_v6.0_C2_human")) and the RNA-seq data set, perform a gene set enrichment analysis giving the most contributing pathways to LF1. What are the most significant pathways? What is your interpretation given the tumor type we are investigating?

Question 7: Clinical profile and latent factors

Provide the statistical association test results between clinical annotations given in the Clinical table and the first 3 LFs defined by MOFA. Which clinical associations can you underline?

Resources

Clone this wiki locally