-
Notifications
You must be signed in to change notification settings - Fork 0
/
MOFA_to_COSMOS.Rmd
1341 lines (1068 loc) · 57.9 KB
/
MOFA_to_COSMOS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "MOFA-COSMOS pipeline"
author: "Pascal Lafrenz and Aurelien Dugourd"
date: "`r Sys.Date()`"
output: md_document
---
## MOFA-COSMOS pipeline
```{r Setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Installation and dependency
In this tutorial, we use the new Meta-Footprint Analysis MOON to score a mechanistic network.
```{r COSMOS_installation, message=FALSE}
#install cosmosR from github directly to obtain the newest version
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
if (!requireNamespace("cosmosR", quietly = TRUE)) devtools::install_github("saezlab/cosmosR")
```
DecoupleR is used to score various regulatory interactions.
```{r decoupleR_installation}
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("decoupleR", quietly = TRUE)) BiocManager::install("saezlab/decoupleR")
```
We are using the LIANA package to process the LR interactions.
```{r LIANA_installation}
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
if (!requireNamespace("liana", quietly = TRUE)) remotes::install_github('saezlab/liana')
```
We are using MOFA2 (Argelaguet et al., 2018) to find decompose variance across
different omics and use COSMOS to find mechanistic interpretations of its
factors. However, since we are using the python version of MOFA2, please make
sure to have [mofapy2](https://github.com/bioFAM/mofapy2) as well as panda and
numpy installed in your (conda) environment
(e.g. by using reticulate: reticulate::use_condaenv("base", required = T) %\>% reticulate::conda_install(c("mofapy2","panda","numpy")).
For downstream analysis, the R version of MOFA2 is used.
```{r MOFA2_installation, message=FALSE}
# Install MOFA2 from bioconductor (stable version)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("MOFA2", quietly = TRUE)) BiocManager::install("MOFA2")
```
General information (tutorials) on how to use COSMOS and MOFA2,
see [COSMOS](https://saezlab.github.io/cosmosR/articles/tutorial.html)
and [MOFA2](https://biofam.github.io/MOFA2/tutorials.html).
For data manipulation, visualization, etc. further packages are needed which
are loaded here along with MOFA2 and COSMOS.
```{r Loading_packages, message=FALSE}
#imports
library(readr)
#core methods
library(cosmosR)
library(MOFA2)
library(liana)
library(decoupleR)
#data manipulations
library(dplyr)
library(reshape2)
library(GSEABase)
library(tidyr)
#plotting
library(ggplot2)
library(ggfortify)
library(pheatmap)
library(gridExtra)
library(RColorBrewer)
# library(RCy3)
```
### From omics data to MOFA ready input
Since extensive pre-processing is beyond the scope of this tutorial, here only
the transformation of pre-processed single omics data to a long data.frame
(columns: "sample", "group", "feature", "view", "value") is shown. For more
details and information regarding appropriate input, please refer to the MOFA
tutorial [MOFA2: training a model in R](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/getting_started_R.html).
The raw data is accessed through [NCI60 cellminer](https://discover.nci.nih.gov/cellminer/home.do)
and the scripts for pre-processing can be found in the associated folder.
Here, we have the following views (with samples as columns and genes as rows):
transcriptomics (RNA-seq [<PMID:31113817>](https://pubmed.ncbi.nlm.nih.gov/31113817/)),
proteomics (SWATH (Mass spectrometry) [<PMID:31733513>](https://pubmed.ncbi.nlm.nih.gov/31733513/>)}
and metabolomics (LC/MS & GC/MS (Mass spectrometry) [DTP NCI60 data](https://wiki.nci.nih.gov/display/NCIDTPdata/Molecular+Target+Data)).
The group information is stored inside the RNA metadata and based on
transcription factor clustering (see scripts/RNA/Analyze_RNA.R). Further, only
samples are kept for which each omic is available (however this is not
necessary since MOFA2 is able to impute data).
```{r Preparing_MOFA_input, message=FALSE}
# Transcriptomics
## Load data
RNA <- as.data.frame(read_csv("data/RNA/RNA_log2_FPKM_clean.csv"))
rownames(RNA) <- RNA[,1]
RNA <- RNA[,-1]
## Remove genes with excessive amount of NAs (only keep genes with max. amount
#of NAs = 33.3 % across cell lines)
RNA <- RNA[rowSums(is.na(RNA))<(dim(RNA)[2]/3),]
## Only keep highly variable genes (as suggested by MOFA): here top ≈70% genes
#to keep >75% of TFs (103 out of 133). For stronger selection, we can further
#reduce the number of genes (e.g. top 50%)
RNA_sd <- sort(apply(RNA, 1, function(x) sd(x,na.rm = T)), decreasing = T)
hist(RNA_sd, breaks = 100)
hist(RNA_sd[1:6000], breaks = 100)
RNA_sd <- RNA_sd[1:6000]
RNA <- RNA[names(RNA_sd),]
# Proteomics
## Load data
proteo <- as.data.frame(read_csv("data/proteomic/Prot_log10_SWATH_clean.csv"))
rownames(proteo) <- proteo[,1]
proteo <- proteo[,-1]
## Only keep highly variable proteins (as suggested by MOFA): here top ≈60%.
#For stronger selection, we can further reduce the number of proteins (e.g. only keep top 50%)
proteo_sd <- sort(apply(proteo, 1, function(x) sd(x,na.rm = T)), decreasing = T)
hist(proteo_sd, breaks = 100)
hist(proteo_sd[1:round(dim(proteo)[1]*3/5)], breaks = 100)
proteo_sd <- proteo_sd[1:round(dim(proteo)[1]*3/5)]
proteo <- proteo[names(proteo_sd),]
# Metabolomics
## Load data
metab <- as.data.frame(read_csv("data/metabolomic/metabolomic_clean_vsn.csv"))
rownames(metab) <- metab[,1]
metab <- metab[,-1]
## Since we have only a limited number of metabolite measurements,
#all measurements are kept here
metab_sd <- apply(metab, 1, function(x) sd(x,na.rm=T))
hist(metab_sd, breaks = 100)
# Create long data frame
## Only keep samples with each view present
overlap_patients <- intersect(intersect(names(RNA),names(proteo)),names(metab))
RNA <- RNA[,overlap_patients]
proteo <- proteo[,overlap_patients]
metab <- metab[,overlap_patients]
## Create columns required for MOFA
### RNA
RNA <- melt(as.data.frame(cbind(RNA,row.names(RNA))))
RNA$view <- "RNA"
RNA <- RNA[,c(2,1,4,3)]
names(RNA) <- c("sample","feature","view","value")
### Proteomics
proteo <- melt(as.data.frame(cbind(proteo,row.names(proteo))))
proteo$view <- "proteo"
proteo <- proteo[,c(2,1,4,3)]
names(proteo) <- c("sample","feature","view","value")
### Metabolomics
metab <- melt(as.data.frame(cbind(metab,row.names(metab))))
metab$view <- "metab"
metab <- metab[,c(2,1,4,3)]
names(metab) <- c("sample","feature","view","value")
## Merge long data frame to one
mofa_ready_data <- as.data.frame(do.call(rbind,list(RNA,proteo,metab)))
# Save MOFA metadata information and MOFA input
write_csv(mofa_ready_data, file = "data/mofa/mofa_data.csv")
```
The final data frame has the following structure:
```{r MOFA_input}
head(mofa_ready_data)
```
We have successfully combined single omic data sets to a long multi-omics data
frame that can be used as a MOFA input.
In this part we assess the correlation between the RNA and proteomic data. We
both look for correlation across samples AND across features. When we check
across sample, we interrogate the correlation at the level of single genes, that
is for given gene, how is a change is transcript abundance is reflected in the
the level of its protein. This correlation can be different between genes, as
there are all subject to different regulatory mechanisms. when we check across
features, we interrogate the correlation at the level of single samples. that is,
we check if overall the protein abundance are good proxy of transcript abundance,
and vice-versa. The imperfect correlation between trnaxcripts and proteins
reflects the complexity of the translation process, as well as protein and RNA
stability.
```{r Supp_figure_condition_prot_RNA_correlation}
condition_prot_RNA_correlation <- sapply(unique(mofa_ready_data$sample), function(condition){
merged_data <- merge(mofa_ready_data[which(mofa_ready_data$sample == condition & mofa_ready_data$view == "RNA"),-c(1,3)],
mofa_ready_data[which(mofa_ready_data$sample == condition & mofa_ready_data$view == "proteo"),-c(1,3)],
by = "feature")
return(cor(merged_data[,2],merged_data[,3], use = "pairwise.complete.obs"))
})
hist(condition_prot_RNA_correlation, breaks = 20)
```
```{r Supp_figure_single_prot_RNA_correlation}
overlap_genes <- unique(intersect(proteo$feature,RNA$feature))
single_prot_RNA_correlation <- sapply(overlap_genes, function(gene){
merged_data <- merge(mofa_ready_data[which(mofa_ready_data$feature == gene & mofa_ready_data$view == "RNA"),-c(2,3)],
mofa_ready_data[which(mofa_ready_data$feature == gene & mofa_ready_data$view == "proteo"),-c(2,3)],
by = "sample")
return(cor(merged_data[,2],merged_data[,3], use = "pairwise.complete.obs"))
})
hist(single_prot_RNA_correlation, breaks = 100)
```
### MOFA run
After pre-processing the data into MOFA appropriate input, let's perform the
MOFA analysis. To change specific MOFA options the function
write_MOFA_options_file is used. For more information regarding option
setting see [MOFA option tutorial](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/getting_started_R.html).
Since we don't know how many factors are needed to explain our data well,
various maximum numbers of factors are tested.
The final results can be found in the results/mofa/ folder.
```{r MOFA_options_function, include=FALSE}
write_MOFA_options_file <- function(input_file,
output_file,
scale_groups = "False",
scale_views = "False",
likelihoods = c('gaussian','gaussian','gaussian','gaussian'),
factors = "5",
spikeslab_weights = "True",
spikeslab_factors = "False",
ard_factors = "True",
ard_weights = "True",
iter = "1000",
convergence_mode = "fast",
startELBO = "1",
freqELBO = "1",
dropR2 = "0.001",
gpu_mode = "True",
verbose = "False",
seed = "1") {
wd <- getwd()
options_MOFA <- data.frame("input_file" = paste0(wd,input_file),
"output_file" = paste0(wd,output_file),
scale_groups,
scale_views,
factors,
spikeslab_weights,
spikeslab_factors,
ard_factors,
ard_weights,
iter,
convergence_mode,
startELBO,
freqELBO,
dropR2,
gpu_mode,
verbose,
seed,
"length" = length(likelihoods))
for (i in 1:length(likelihoods)){
cache <- paste0("likelihood","_",i)
options_MOFA[,cache] <- likelihoods[i]
}
write_csv(options_MOFA,
"options_MOFA.csv"
)
}
```
Only run this chunk if you want to re-perform the mofa optimisation.
The results are already available in the
results/mofa folder. We are testing various amount of maximum number of factors
that MOFA is allowed to explore.
You need to instal mofapy2 in python3 to run this chunk. On windows, you can simply
run "python3" the terminal to make sure its installed or be automatically redirected
to the windows store page if it's not. Then, you can run "python3 -m pip install --user mofapy2"
in a terminal to install the module to python3.
```{r MOFA_run, eval=FALSE}
for(i in c(5:15,20,length(unique(mofa_ready_data$sample)))){
wd <- getwd()
write_MOFA_options_file(input_file = '/data/mofa/mofa_data.csv', output_file = paste0('/results/mofa/mofa_res_',i,'factor.hdf5'), factors = i, convergence_mode = "slow", likelihoods = c('gaussian','gaussian','gaussian'))
system(paste0(paste('python3', wd, sep = " "), paste('/scripts/mofa/mofa2.py', "options_MOFA.csv")))
}
```
### Investigate MOFA output
We first load the different MOFA models containing the MOFA results together
with the metadata information. Since we don't know a-priori which is the ideal
number of factors, we can settle on the highest number of factor where the
model stabilises.
```{r Loading_MOFA_output, message=FALSE, warning=F}
# Load MOFA output
for(i in c(5:15,20,length(unique(mofa_ready_data$sample)))){
filepath <- paste0('results/mofa/mofa_res_',i,'factor.hdf5')
model <- load_model(filepath)
meta_data <- read_csv("data/metadata/RNA_metadata_cluster.csv")[,c(1,2)]
colnames(meta_data) <- c("sample","cluster")
mofa_metadata <- merge(samples_metadata(model), meta_data, by = "sample")
names(mofa_metadata) <- c("sample","group","cluster")
samples_metadata(model) <- mofa_metadata
assign(paste0("model",i),model)
}
```
Then, we can compare the factors from different MOFA models and check on consistency (here: model 7-13).
```{r Compare_MOFA_models}
compare_factors(list(model7,model8,model9,model10,model11,model12,model13), cluster_rows = F, cluster_cols = F,)
```
Here, we can see that the inferred MOFA factor weights do not change drastically
around 9. For the sake of this tutorial, we choose the 9-factor MOFA model.
```{r Select_MOFA_model}
model <- model10
```
The next part deals with the analysis and interpretation of the MOFA factors
with specific focus on the factor weights. The classic tutorial on how to
perform downstream analysis after MOFA model training can be found here [MOFA+: downstream analysis (in R)](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/downstream_analysis.html).
In this context, it is important to emphasize that the factors can be
interpreted similarly to principal components (PCs) in a principal component
analysis (PCA).
An initial metric we can explore is the total variance ($R^2$) explained per
view by our MOFA model. This helps us understand how well the model represents the data.
```{r Total_variance_per_factor}
# Investigate factors: Explained total variance per view
plot_variance_explained(model, x="group", y="factor", plot_total = T)[[2]]
calculate_variance_explained(model)
```
The bar plot demonstrates that the nine factors for the view RNA explain more
than 50% of the variance. In contrast, for both metabolomics as well as
proteomics around 20% of the variance can be explained by all factors.
It isn't unexpected that more variance of the RNA dataset can be reconstructed.
Usually, more features available in a given view will lead to
a higher % of variance explained. It is interesting to notice though that
similar amount of variance are explained for metabolic and proteomic views,
despite having very different number of features. This may be due to the fact
that metabolite abundances are in general more cross-correlated than protein
abundances (due to the underlying regulatory structures, e.i. reaction networks
vs protein synthesis and degratation regulatory mechanisms).
```{r Check_cross_correlation_of_omics_compared_to_variance_epxlained}
#Get the cross correlation for RNA and proteomic data
RNA <- dcast(mofa_ready_data[which(mofa_ready_data$view == "RNA"),-3], feature~sample, value.var = "value")
row.names(RNA) <- RNA[,1]
RNA <- RNA[,-1]
RNA_crosscor <- cor(t(RNA), use = "pairwise.complete.obs")
# hist(RNA_crosscor[upper.tri(RNA_crosscor)], breaks = 1000)
prot <- dcast(mofa_ready_data[which(mofa_ready_data$view == "proteo"),-3], feature~sample, value.var = "value")
row.names(prot) <- prot[,1]
prot <- prot[,-1]
prot_crosscor <- cor(t(prot), use = "pairwise.complete.obs")
# hist(prot_crosscor[upper.tri(prot_crosscor)], breaks = 1000)
metabo <- dcast(mofa_ready_data[which(mofa_ready_data$view == "metab"),-3], feature~sample, value.var = "value")
row.names(metabo) <- metabo[,1]
metabo <- metabo[,-1]
metabo_crosscor <- cor(t(metabo), use = "pairwise.complete.obs")
# hist(metabo_crosscor[upper.tri(metabo_crosscor)], breaks = 1000)
# plot(calculate_variance_explained(model)$r2_total$single_group, c(mean(RNA_crosscor),mean(metabo_crosscor),mean(prot_crosscor)))
plot(calculate_variance_explained(model)$r2_total$single_group, c(dim(RNA_crosscor)[1],dim(metabo_crosscor)[1],dim(prot_crosscor)[1]))
df_variance_crosscor <- as.data.frame(cbind(calculate_variance_explained(model)$r2_total$single_group,
c(mean(RNA_crosscor),mean(metabo_crosscor),mean(prot_crosscor)),
c(dim(RNA_crosscor)[1],dim(metabo_crosscor)[1],dim(prot_crosscor)[1])))
names(df_variance_crosscor) <- c("var_expl","mean_crosscor","n_features")
df_variance_crosscor$prod <- sqrt(df_variance_crosscor$mean_crosscor) * df_variance_crosscor$n_features
#Just plot them as scatter plot with regression lines
mofa_solved <- ggplot(df_variance_crosscor, aes(x = prod, y = var_expl)) +
geom_point() +
geom_smooth(method='lm', formula= y~x) +
theme_minimal() +
xlab("sqrt(mean cross-corelation of view) * number of features") +
ylab("variance explained for each view") +
ggtitle("MOFA size and cross cor influence")
mofa_solved
mofa_nfeatures <- ggplot(df_variance_crosscor, aes(x = n_features, y = var_expl)) +
geom_point() +
geom_smooth(method='lm', formula= y~x) +
theme_minimal() +
xlab("sqrt(mean cross-corelation of view) * number of features") +
ylab("variance explained for each view") +
ggtitle("MOFA size influence")
mofa_nfeatures
#Get the actual coeficient signficance
summary(lm(data= df_variance_crosscor, var_expl~prod))
summary(lm(data= df_variance_crosscor, var_expl~n_features))
```
Since we would like to use the found correlation for downstream COSMOS analysis,
we first investigate the variance ($R^2$) each factor can explain per view as
well as investigate the factor explaining the most variance across all views.
```{r Variance_per_view_per_factor}
# Investigate factors: Explained variance per view for each factor
pheatmap(model@cache$variance_explained$r2_per_factor[[1]], display_numbers = T, angle_col = "0", legend_labels = c("0","10", "20", "30", "40", "Variance\n\n"), legend = T, main = "", legend_breaks = c(0,10, 20, 30, 40, max(model@cache$variance_explained$r2_per_factor[[1]])), cluster_rows = F, cluster_cols = F, color = colorRampPalette(c("white","red"))(100), fontsize_number = 10)
pheatmap(model@cache$variance_explained$r2_per_factor[[1]], display_numbers = T, angle_col = "0", legend_labels = c("0","10", "20", "30", "40", "Variance\n\n"), legend = T, main = "", legend_breaks = c(0,10, 20, 30, 40, max(model@cache$variance_explained$r2_per_factor[[1]])), cluster_rows = F, cluster_cols = F, color = colorRampPalette(c("white","red"))(100), fontsize_number = 10,filename = "results/mofa/variance_heatmap.pdf",width = 4, height = 2.5)
```
By analyzing this heatmap, we can observe that Factor 1 and 5 explain a large
proportion of variance of the RNA, Factor 2 and 4 mainly explains the variance of
the metabolomics and Factor 3 highlights variability from the proteomics view.
Consequently, no factor explains a large portion of the variance across all
views, but factor 2 and 4 capture variance across all views. We can also
analyze the correlation between factors. The next plot highlights the spearman
correlation between each factor.
```{r Correlation_between_factors}
# Investigate factors: Correlation between factors
pdf(file="results/mofa/plot_factor_cor.pdf", height = 5, width = 5)
plot_factor_cor(model, type = 'upper', method = "spearman", addCoef.col = "black")
```
Next, we can characterize the factors with the available metadata of samples.
Indeed, it can be interesting to see if some factors are associated with
metadata classes, as they can help to guide our intepretation of the factors.
This part rely on an interpretation of the Z matrix of the mofa model, while
ignoring the factor weight themselves. The factor weights will be analysed subsequently.
To begin simply, we can check if some tissue of origin are associated with any
of the factors. There are different ways thsi can be done, but we find that
using a simple linear association between a given tissues category in a factor
compared to any other tissue categories (e.g. breast tissue vs every other
tissue types) provide an easy to interpret insight that serves our purpose
well. To do so, we use the Univariate Linear Model (ULM) function of decoupleR.
```{r tissue_type_analysis}
NCI_60_metadata <- as.data.frame(read_csv(file = "data/metadata/RNA_metadata_cluster.csv"))
#discretize age category so that regression can be applied
NCI_60_metadata$`age over 50` <- NCI_60_metadata$`age a` > 50
tissues <-NCI_60_metadata[,c(3,1)]
names(tissues) <- c("source","target")
Z_matrix <- as.data.frame(model@expectations$Z$single_group)
#Check specifically tissue meta-data association with Z matrix
tissue_enrichment <- decoupleR::run_ulm(Z_matrix, tissues)
tissue_enrichment <- reshape2::dcast(tissue_enrichment, source~condition, value.var = "score")
row.names(tissue_enrichment) <- tissue_enrichment$source
tissue_enrichment <- tissue_enrichment[,-1]
#plot them as heatmaps
palette <- make_heatmap_color_palette(tissue_enrichment)
pheatmap(t(tissue_enrichment), show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315,display_numbers = F, filename = "results/mofa/mofa_tissue_enrichment.pdf", width = 3, height = 2.6)
pheatmap(tissue_enrichment, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315,display_numbers = T)
```
Here, we can see that factor 4 seems to be strongly associated with eukemia, and factor 2 with melanoma.
```{r all_metadata_analysis}
#format meta-data names
all_metadata <- reshape2::melt(NCI_60_metadata[,c(1,3,17,5,7,8,10,13,14)], id.vars = "cell_line")
all_metadata$variable <- gsub(" a$","",all_metadata$variable)
all_metadata$variable <- gsub(" a,c$","",all_metadata$variable)
all_metadata$variable <- gsub(" d$","",all_metadata$variable)
all_metadata$variable <- gsub("histology","histo",all_metadata$variable)
all_metadata$variable <- gsub("tissue of origin","tissue",all_metadata$variable)
all_metadata$value <- gsub("[(]81-103[)]","",all_metadata$value)
all_metadata$value <- gsub("[(]35-57[)]","",all_metadata$value)
all_metadata$value <- gsub("Memorial Sloan Kettering Cancer Center","Memorial SKCC",all_metadata$value)
all_metadata$value <- gsub("Malignant melanotic melanoma","Malig. melanotic melan.",all_metadata$value)
all_metadata$value <- gsub("Ductal carcinoma-mammary gland","Duct. carci-mam. gland",all_metadata$value)
all_metadata$target <- paste(all_metadata$variable, all_metadata$value, sep = ": ")
all_metadata <- all_metadata[,c(4,1)]
names(all_metadata) <- c("source","target")
Z_matrix <- as.data.frame(model@expectations$Z$single_group)
#run ULM on the Z matrix meta data to find associated metadata features
all_metadata_enrichment <- decoupleR::run_ulm(Z_matrix, all_metadata, minsize = 3)
all_metadata_enrichment <- reshape2::dcast(all_metadata_enrichment, source~condition, value.var = "score")
row.names(all_metadata_enrichment) <- all_metadata_enrichment$source
all_metadata_enrichment <- all_metadata_enrichment[,-1]
all_metadata_enrichment_top <- all_metadata_enrichment[
apply(all_metadata_enrichment, 1, function(x){max(abs(x)) > 2}),
]
#Plot the heatmap
palette <- make_heatmap_color_palette(all_metadata_enrichment_top)
pheatmap(t(all_metadata_enrichment_top), show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315,display_numbers = F, filename = "results/mofa/mofa_metadata_enrichment.pdf", width = 4.5, height = 4)
pheatmap(t(all_metadata_enrichment_top), show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315,display_numbers = T)
```
Further, we can inspect the composition of the factors by plotting the top 10 weights per factor and view.
```{r Factor_weights_per_view, fig.height=2.4, fig.width=10}
## Top 20 factor weights per view
grid.arrange(
### RNA
plot_top_weights(model,
view = "RNA",
factor = 4,
nfeatures = 10) +
ggtitle("RNA"),
### Proteomics
plot_top_weights(model,
view = "proteo",
factor = 4,
nfeatures = 10) +
ggtitle("Proteomics"),
### Metabolomics
plot_top_weights(model,
view = "metab",
factor = 4,
nfeatures = 10) +
ggtitle("Metabolomics"), ncol =3
)
```
Using this visualization, features with strong association with the factor
(large absolute values) can be easily identified and further inspected by
literature investigation.
Further tools to investigate the latent factors are available inside the MOFA framework ([MOFA+: downstream analysis (in R)](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/downstream_analysis.html)).
### From MOFA output to TF activity, ligand-receptor activity
In the next step, using the weights of Factor 4 (highest shared $R^2$ across
views), different databases (liana and dorothea) and decoupleR, the
transcription factor activities, ligand-receptor scores are estimated.
First, we have to extract the weights from our MOFA model and keep the
weights from Factor 4 in the RNA view.
```{r Extract_weights_from_Model}
weights <- get_weights(model, views = "all", factors = "all")
save(weights, file = "results/mofa/mofa_weights.RData")
# Extract Factor with highest explained variance in RNA view
RNA <- data.frame(weights$RNA[,4])
row.names(RNA) <- gsub("_RNA","",row.names(RNA))
```
Then we load the consensus networks from our databases, as well as the TF-targets regulons from collectri.
```{r Load_networks, message=FALSE, warning=FALSE, error=FALSE, results='hide'}
# Load LIANA (receptor and ligand) consensus network
# ligrec_ressource <- distinct(liana::decomplexify(liana::select_resource("Consensus")[[1]]))
# save(ligrec_ressource, file = "support/ligrec_ressource.RData")
#process the LR interaction prior knowledge to make suitable input for decoupleR
load(file = "support/ligrec_ressource.RData")
ligrec_geneset <- format_LR_ressource(ligrec_ressource)
# Load Dorothea (TF) network
# dorothea_df <- decoupleR::get_collectri()
# save(dorothea_df, file = "support/dorothea_df.RData")
load(file = "support/dorothea_df.RData")
```
We can display what are the top weights of the model and to which factor they are associated.
```{r RNA_across_factors, fig.width=4, fig.height=4.3}
RNA_all <- as.data.frame(weights$RNA)
row.names(RNA_all) <- gsub("_RNA","",row.names(RNA_all))
#Extract the top weights
RNA_top <- RNA_all[order(apply(RNA_all,1,function(x){max(abs(x))}), decreasing = T)[1:25],]
#plot them as heatmaps
palette <- make_heatmap_color_palette(RNA_top)
pheatmap(RNA_top, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315, filename = "results/mofa/mofa_top_RNA.pdf", width = 4, height = 4.3)
pheatmap(RNA_top, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315)
```
We can do the same for metabolites
```{r Metabs_across_factors, fig.width=4, fig.height=2.2}
metabs_all <- as.data.frame(weights$metab)
row.names(metabs_all) <- gsub("_metab","",row.names(metabs_all))
#extract top metab weights
metabs_top <- metabs_all[order(apply(metabs_all,1,function(x){max(abs(x))}), decreasing = T)[1:10],]
#plot them as heatmaps
palette <- make_heatmap_color_palette(metabs_top)
pheatmap(metabs_top, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315, filename = "results/mofa/mofa_top_metabs.pdf", width = 4, height = 2.2)
pheatmap(metabs_top, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315)
```
Then, by using decoupleR and prior knowledge networks, we calculate the
different regulatory activities inferred by the ULM approach. Depending
on the task, the minimum number of targets per source varies required to
maintain the source in the output (e.g. two targets by definition for
ligand-receptor interactions). Here can display them as heatmaps.
```{r LR_score_estimations_across_factors, fig.width=4, fig.height=4.3}
# Calculate regulatory activities from Receptor and Ligand network
ligrec_factors <- run_ulm(mat = as.matrix(RNA_all), network = ligrec_geneset, .source = set, .target = gene, minsize = 2)
ligrec_factors_df <- wide_ulm_res(ligrec_factors)
ligrec_factors_df <- ligrec_factors_df[order(apply(ligrec_factors_df,1,function(x){max(abs(x))}), decreasing = T)[1:25],]
#plot them as heatmaps
palette <- make_heatmap_color_palette(ligrec_factors_df)
pheatmap(ligrec_factors_df, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315, filename = "results/mofa/mofa_top_ligrec.pdf", width = 4, height = 4.3)
pheatmap(ligrec_factors_df, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315)
```
```{r TF_activity_estimations_across_factors, fig.width=4, fig.height=4.3}
# Calculate regulatory activities from TF network
TF_factors <- run_ulm(mat = as.matrix(RNA_all), network = dorothea_df, minsize = 10)
TF_factors <- wide_ulm_res(TF_factors)
TF_factors <- TF_factors[order(apply(TF_factors,1,function(x){max(abs(x))}), decreasing = T)[1:25],]
#Plot them as heatmaps
palette <- make_heatmap_color_palette(TF_factors)
pheatmap(TF_factors, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315, filename = "results/mofa/mofa_top_TF.pdf", width = 4, height = 4.3)
pheatmap(TF_factors, show_rownames = T, cluster_cols = F, cluster_rows = F,color = palette, angle_col = 315)
```
These activity estimations are saved in an input file for further downstream
analysis as well.
```{r Activity_estimations_for_factor_4}
# Calculate regulatory activities from Receptor and Ligand network
ligrec_high_vs_low <- run_ulm(mat = as.matrix(RNA), network = ligrec_geneset, .source = set, .target = gene, minsize = 2)
ligrec_high_vs_low <- ligrec_high_vs_low[ligrec_high_vs_low$statistic == "ulm",]
ligrec_high_vs_low_vector <- ligrec_high_vs_low$score
names(ligrec_high_vs_low_vector) <- ligrec_high_vs_low$source
ligrec_high_vs_low_top <- ligrec_high_vs_low[order(abs(ligrec_high_vs_low$score),decreasing = T)[1:15],c(2,4)]
ligrec_high_vs_low_top$source <- factor(ligrec_high_vs_low_top$source, levels = ligrec_high_vs_low_top$source)
ggplot(ligrec_high_vs_low_top, aes(x=source, y = score)) + geom_bar(stat= "identity",position = "dodge") + theme_minimal() + theme(axis.text.x = element_text(angle = 315, vjust = 0.5, hjust=0))
# Calculate regulatory activities from TF network
TF_high_vs_low <- run_ulm(mat = as.matrix(RNA), network = dorothea_df, minsize = 10)
TF_high_vs_low <- TF_high_vs_low[TF_high_vs_low$statistic == "ulm",]
TF_high_vs_low_vector <- TF_high_vs_low$score
names(TF_high_vs_low_vector) <- TF_high_vs_low$source
# Prepare moon analysis input
TF_high_vs_low <- as.data.frame(TF_high_vs_low[,c(2,4)])
row.names(TF_high_vs_low) <- TF_high_vs_low[,1]
TF_high_vs_low_top_10 <- TF_high_vs_low[order(abs(TF_high_vs_low$score), decreasing = T)[1:10],]
TF_high_vs_low_top_10$source <- factor(TF_high_vs_low_top_10$source, levels = TF_high_vs_low_top_10$source)
ggplot(TF_high_vs_low_top_10, aes(x=source, y = score)) + geom_bar(stat= "identity",position = "dodge") + theme_minimal()
# Combine results
ligrec_TF_moon_inputs <- list("ligrec" = ligrec_high_vs_low_vector,
"TF" = TF_high_vs_low_vector)
save(ligrec_TF_moon_inputs, file = "data/cosmos/ligrec_TF_moon_inputs.Rdata")
```
We have now successfully used the MOFA weights (Factor 4, RNA view) to infer
not only TF activities but also the ligand-receptor interactions.
### Extracting network with COSMOS to formulate mechanistic hypotheses
In this next parts the COSMOS run is going to be performed. We can use the
output of decoupleR (the activities) next to the factor weights as an input
for COSMOS to specifically focus the analysis on pre-computed data-driven results.
Here, in the first step, the data is loaded. Besides the activity estimations
and the factor weights, the previous filtered out expressed gene names are loaded.
```{r Load_COSMOS_input, message=FALSE, warning=FALSE, error=FALSE, results='hide'}
# Load data
load(file = "data/cosmos/ligrec_TF_moon_inputs.Rdata")
signaling_input <- ligrec_TF_moon_inputs$TF
ligrec_input <- ligrec_TF_moon_inputs$ligrec
RNA_input <- weights$RNA[,4]
prot_input <- weights$proteo[,4]
metab_inputs <- weights$metab[,4]
#remove MOFA tags from feature names
names(RNA_input) <- gsub("_RNA","",names(RNA_input))
names(prot_input) <- gsub("_proteo","",names(prot_input))
RNA_log2_FPKM <- as.data.frame(read_csv("data/RNA/RNA_log2_FPKM_clean.csv"))$Genes #since only complete cases were considered in the first part, here the full gene list is loaded
```
We can be interested in looking how RNA and corresponding proteins correlate in
each factor. We can plot this using scatter plots for each factors.
```{r correlation_RNA_prot, fig.width=7, fig.height=7}
weights_RNA <- as.data.frame(weights$RNA)
weights_prot <- as.data.frame(weights$proteo)
weights_prot$ID <- gsub("_proteo$","",row.names(weights_prot))
weights_RNA$ID <- gsub("_RNA$","",row.names(weights_RNA))
plot_list <- list()
r2_list <- list()
for(i in 1:(dim(weights_prot)[2]-1))
{
merged_weights <- merge(weights_RNA[,c(i,10)],weights_prot[,c(i,10)], by = "ID")
names(merged_weights) <- c("ID","RNA","prot")
r2_list[[i]] <- cor(merged_weights[,2],merged_weights[,3])^2
plot_list[[i]] <- ggplot(merged_weights,aes(x = RNA,y = prot)) + geom_point() +
geom_smooth(method='lm', formula= y~x) +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
}
r2_vec <- unlist(r2_list)
r2_vec
ggsave(filename = "results/mofa/RNA_prot_cor.pdf",plot = do.call("grid.arrange", c(plot_list, ncol = 3)), device = "pdf")
```
Coherently, only the factors where there is variance explained for both RNA and
proteins are correlated.
Next, we perform filtering steps in order to identify top deregulated features.
Here, the individual filtering is based on absolute factor weight or activity
score. The first step involves the visualization of the respective distribution
to define an appropriate threshold.
First, the RNA factor weights are filtered based on a threshold defined by their
distribution as well a threshold defined by the distribution of the protein
factor weights.
```{r Plot_RNA_weights_and_protein_weights}
##RNA
{plot(density(RNA_input))
abline(v = -0.2)
abline(v = 0.2)}
{plot(density(prot_input))
abline(v = -0.05)
abline(v = 0.05)}
```
Based on the plots and indicated by the straight lines, the RNA weights
threshold is set to -0.2 and 0.2, and the protein weights threshold to -0.05
and 0.05. RNA factor weight values lying inside this threshold are set to 0
and previously filtered out genes (before MOFA analysis) are also included with
value 0.
```{r Filtering_RNA_weights}
for(gene in names(RNA_input)) {
if (RNA_input[gene] > -0.2 & RNA_input[gene] < 0.2)
{
RNA_input[gene] <- 0
} else
{
RNA_input[gene] <- sign(RNA_input[gene]) * 10
}
if (gene %in% names(prot_input)) #will need to add a sign consistency check between RNA and prot as well
{
if (prot_input[gene] > -0.05 & prot_input[gene] < 0.05)
{
RNA_input[gene] <- 0
} else
{
RNA_input[gene] <- sign(RNA_input[gene]) * 10
}
}
}
RNA_log2_FPKM <- RNA_log2_FPKM[!(RNA_log2_FPKM %in% names(RNA_input))]
genes <- RNA_log2_FPKM
RNA_log2_FPKM <- rep(0,length(RNA_log2_FPKM))
names(RNA_log2_FPKM) <- genes
RNA_input <- c(RNA_input, RNA_log2_FPKM)
```
The same procedure is repeated for the transcription factor activities.
```{r Plot_TF_activity_estimates}
## Transcription factors
{plot(density(signaling_input))
abline(v = -0.5)
abline(v = 3.5)}
```
The threshold is set to -0.5 and 3.5 respectively. Further, the
TF_to_remove variable is later used to remove transcription factors with
activities below this threshold from the prior knowledge network.
```{r Filtering_TF_activity_estimates}
TF_to_remove <- signaling_input[signaling_input > -0.5 & signaling_input < 3.5]
signaling_input <- signaling_input[signaling_input < -0.5 | signaling_input > 3.5]
```
The same procedure is repeated for the ligand-receptor activities.
```{r Plot_Ligand_receptor_activity_estimates}
## Ligand-receptor interactions
{plot(density(ligrec_input))
abline(v = -0.5)
abline(v = 2.5)}
```
Here, only activities higher than 2.5 or lower than -0.5 are kept
(see straight lines in plot). This is an arbitrary theshold aimed at keeping a
relativelly high number of interesting fetures to connect in the rest of the
pipeline. Since at this point we are interested in the receptors, the names
are adjusted accordingly and if there are multiple mentions of a receptor,
its activity is calculated by the mean of the associated ligand-receptor
interactions.
```{r Filtering_ligand_receptor_activity_estimates}
ligrec_input <- ligrec_input[ligrec_input > 2.5 | ligrec_input < -0.5]
rec_inputs <- ligrec_input
names(rec_inputs) <- gsub(".+___","",names(rec_inputs))
rec_inputs <- tapply(rec_inputs, names(rec_inputs), mean)
receptors <- names(rec_inputs)
rec_inputs <- as.numeric(rec_inputs)
names(rec_inputs) <- receptors
```
Finally, the same procedure is repeated for the metabolite factor weights.
```{r Plot_Metabolite_factor_weights}
## Metabolites
{plot(density(metab_inputs))
abline(v = -0.2)
abline(v = 0.2)}
```
The threshold is set to 0.2 and -0.2 respectively. Further, the metab_to_exclude
variable is later used to remove metabolites with activities below this
threshold from the prior knowledge network. Moreover, metabolite names are
translated into HMDB format and metabolic compartment codes
(m = mitochondria, c = cytosol) as well as metab\_\_ prefix is added to HMDB
IDs. More information regarding compartment codes can be
found [here](http://bigg.ucsd.edu/compartments).
```{r Filtering_metabolite_factor_weights, message=FALSE, warning=FALSE, error=FALSE, results='hide'}
metab_to_HMDB <- as.data.frame(
read_csv("data/metabolomic/MetaboliteToHMDB.csv"))
metab_to_HMDB <- metab_to_HMDB[metab_to_HMDB$common %in% names(metab_inputs),]
metab_inputs <- metab_inputs[metab_to_HMDB$common]
names(metab_inputs) <- metab_to_HMDB$HMDB
metab_inputs <- cosmosR::prepare_metab_inputs(metab_inputs, compartment_codes = c("m","c"))
metab_to_exclude <- metab_inputs[abs(metab_inputs) < 0.2]
metab_inputs <- metab_inputs[abs(metab_inputs) > 0.2]
```
Next, the prior knowledge network (PKN) is loaded. To see how the full meta
PKN was assembled, see [PKN](https://github.com/saezlab/meta_PKN_BIGG.git).
Since we are not interested in including transcription factors and metabolites
that were previously filtered out, we also remove the respective nodes from
the PKN.
```{r Meta_network, warning=F}
## Load and filter meta network
# data("meta_network")
# save(meta_network, file = "support/meta_network.RData")
load("support/meta_network.RData")
meta_network <- meta_network_cleanup(meta_network)
meta_network <- meta_network[!(meta_network$source %in% names(TF_to_remove))
& !(meta_network$target %in% names(TF_to_remove)),]
meta_network <- meta_network[!(meta_network$source %in% names(metab_to_exclude))
& !(meta_network$target %in% names(metab_to_exclude)),]
```
Then the datasets are merged, entries that are not included in the PKN are
removed and the filtering steps are completed. The merging in this case is
based on the idea of deriving the forward network via the receptor to
transcription factor/metabolite direction. We will perform first the first
cosmos run between recptors and downstream TFs and metabolites. The second
run will focus on connecting TFs awith downstream ligands.
Thus, we first define the upstream input as receptors, and the downstream input
as TFs and metabolites.
Since the scoring of the network is sensitive to the scale of the downstream
measurments, we are multiplying
the metabolite weights by 10 so that they are on a similar scale as the TF
scores. Other scaling methods can be performed at the user'S discretion,
this one is just a rough scaling, but good enough in this case.
```{r Merge_datasets_to_input}
# upstream_inputs <- c(rec_inputs)
# downstream_inputs <- c(metab_inputs*10, signaling_input)
#
# upstream_inputs_filtered <- cosmosR:::filter_input_nodes_not_in_pkn(upstream_inputs, meta_network)
# downstream_inputs_filtered <- cosmosR:::filter_input_nodes_not_in_pkn(downstream_inputs, meta_network)
#
# rec_to_TF_cosmos_input <- list(upstream_inputs_filtered, downstream_inputs_filtered)
# save(rec_to_TF_cosmos_input, file = "data/cosmos/rec_to_TF_cosmos_input.Rdata")
```
### MOON (meta footprint analysis)
In this part, we will run the network scoring to generate mechanistic hypotheses
connecting ligands, receptors, TFs and metabolites, while pruning the
network with information from both RNA and protein abundance to constrain the
flux of signal consistent with the molecular data available.
In this approach, the network is compressed to avoid having redundant paths in
the network leading to inflating the importance of some input measurements.
```{r Prepare_moon_input_for_receptor_to_TFs_and_metabs_MOON}
##filter expressed genes from PKN
# data("meta_network")
# save(meta_network, file = "support/meta_network.RData")
# load("support/meta_network.RData")
# meta_network <- meta_network_cleanup(meta_network)
expressed_genes <- as.data.frame(read_csv("data/RNA/RNA_log2_FPKM_clean.csv"))
expressed_genes <- setNames(rep(1,length(expressed_genes[,1])), expressed_genes$Genes)
meta_network_filtered <- cosmosR:::filter_pkn_expressed_genes(names(expressed_genes), meta_pkn = meta_network)
##format metab inputs
metab_inputs <- as.numeric(scale(weights$metab[,4], center = F))
names(metab_inputs) <- row.names(weights$metab)
metab_to_HMDB <- as.data.frame(
read_csv("data/metabolomic/MetaboliteToHMDB.csv"))
metab_to_HMDB <- metab_to_HMDB[metab_to_HMDB$common %in% names(metab_inputs),]
metab_inputs <- metab_inputs[metab_to_HMDB$common]
names(metab_inputs) <- metab_to_HMDB$HMDB
metab_inputs <- cosmosR::prepare_metab_inputs(metab_inputs, compartment_codes = c("m","c"))
#prepare upstream inputs
upstream_inputs <- c(rec_inputs) #the upstream input should be filtered for most significant
TF_inputs <- scale(ligrec_TF_moon_inputs$TF, center = F)
TF_inputs <- setNames(TF_inputs[,1], row.names(TF_inputs))
downstream_inputs <- c(metab_inputs, TF_inputs) #the downstream input should be complete
upstream_inputs_filtered <- cosmosR:::filter_input_nodes_not_in_pkn(upstream_inputs, meta_network_filtered)
downstream_inputs_filtered <- cosmosR:::filter_input_nodes_not_in_pkn(downstream_inputs, meta_network_filtered)
#Filter inputs and prune the meta_network to only keep nodes that can be found downstream of the inputs
#The number of step is quite flexible, 7 steps already covers most of the network
n_steps <- 6
# in this step we prune the network to keep only the relevant part between upstream and downstream nodes
meta_network_filtered <- cosmosR:::keep_controllable_neighbours(meta_network_filtered
, n_steps,
names(upstream_inputs_filtered))
downstream_inputs_filtered <- cosmosR:::filter_input_nodes_not_in_pkn(downstream_inputs_filtered, meta_network_filtered)
meta_network_filtered <- cosmosR:::keep_observable_neighbours(meta_network_filtered, n_steps, names(downstream_inputs_filtered))
upstream_inputs_filtered <- cosmosR:::filter_input_nodes_not_in_pkn(upstream_inputs_filtered, meta_network_filtered)
write_csv(meta_network_filtered, file = "results/cosmos/moon/meta_network_filtered.csv")
#compress the network to avoid redundant master controllers
meta_network_compressed_list <- compress_same_children(meta_network_filtered, sig_input = upstream_inputs_filtered, metab_input = downstream_inputs_filtered)
meta_network_compressed <- meta_network_compressed_list$compressed_network
meta_network_compressed <- meta_network_cleanup(meta_network_compressed)
load(file = "support/dorothea_df.RData")
# RNA_input <- as.numeric(weights$RNA[,4])
# names(RNA_input) <- gsub("_RNA$","",row.names(weights$RNA))
```
In this step, we format the MOFA weight and activity score information so that it can be appended to the moon network result afterward.
```{r Add_activities_and_weights_to_network}
data("HMDB_mapper_vec")