-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.qmd
1576 lines (1234 loc) · 147 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Cell type-specific contextualisation of the human phenome: towards developing treatments for all rare diseases"
author:
- name: Brian M. Schilder
orcid: 0000-0001-5949-2191
corresponding: true
email: brian_schilder@alumni.brown.edu
roles:
- Investigation
- Project administration
- Software
- Visualization
affiliations:
- name: Imperial College London
city: London
country: United Kingdom
department: Department of Brain Sciences
- name: UK Dementia Research Institute
city: London
country: United Kingdom
- name: Kitty B. Murphy
orcid: 0000-0002-8669-3076
corresponding: false
roles:
- Investigation
- Software
- Visualization
affiliations:
- name: Imperial College London
city: London
country: United Kingdom
department: Department of Brain Sciences
- name: UK Dementia Research Institute
city: London
country: United Kingdom
- name: Robert Gordon-Smith
orcid: 0000-0001-6698-7387
corresponding: false
roles:
- Investigation
- Software
- Visualization
affiliations:
- name: Imperial College London
city: London
country: United Kingdom
department: Department of Brain Sciences
- name: UK Dementia Research Institute
city: London
country: United Kingdom
- name: Jai Chapman
corresponding: false
roles:
- Investigation
- Software
- Visualization
affiliations:
- name: Imperial College London
city: London
country: United Kingdom
department: Department of Brain Sciences
- name: UK Dementia Research Institute
city: London
country: United Kingdom
- name: Momoko Otani
corresponding: false
roles:
- Investigation
affiliations:
- name: Imperial College London
city: London
country: United Kingdom
department: National Heart and Lung Institute
- name: Nathan G. Skene
orcid: 0000-0002-6807-3180
corresponding: true
email: n.skene@imperial.ac.uk
roles:
- Project administration
affiliations:
- name: Imperial College London
city: London
country: United Kingdom
department: Department of Brain Sciences
- name: UK Dementia Research Institute
city: London
country: United Kingdom
keywords:
- rare disease
- phenotype
- single-cell
- gene therapy
key-points:
- We used the Human Phenotype Ontology and single-cell RNA-seq references to characterise the phenome.
- We then demonstrated how these results can be applied to clinical diagnosis, prognosis and therapeutics development.
date: last-modified
citation:
container-title: medRxiv
bibliography: references.bib
---
```{r setup, cache=FALSE, include=TRUE}
library(data.table)
# ?formatdown::formatdown_options()
options(digits=2)
formatter <- function(x, ...) {
# prettyNum(x, digits=2)
# paste0("$",knitr:::format_sci(x),"$")
if(is.numeric(x)){
if(!is.integer(x)){
formatdown::format_sci(x = x, digits = 2)
} else {
formatC(x, big.mark = ",", mode = "integer")
}
} else {
x
}
}
knitr::knit_hooks$set(document = formatter,
chunk = formatter,
inline = formatter)
```
```{r load_phenotype_to_genes}
tag <- "v2024-02-08" ## For version control
p2g <- HPOExplorer::load_phenotype_to_genes(tag=tag)
annot <- HPOExplorer::load_phenotype_to_genes(3, tag=tag)
per_disease <- p2g[,list(ng=data.table::uniqueN(gene_symbol),
np=data.table::uniqueN(hpo_id)),by="disease_id"]
per_phenotype <- p2g[,list(ng=data.table::uniqueN(gene_symbol),
nd=data.table::uniqueN(disease_id)),by="hpo_id"]
```
```{r load_example_results}
hpo <- HPOExplorer::get_hpo(tag=tag)
## Import precomputed results for reporting summaries
results <- MSTExplorer::load_example_results()
results <- HPOExplorer::add_hpo_name(results, hpo = hpo)
results <- HPOExplorer::add_ont_lvl(results)
results <- HPOExplorer::add_ancestor(results, hpo = hpo)
results <- MSTExplorer::map_celltype(results)
MSTExplorer::add_logfc(results)
results[,effect:=estimate]
## Substitute B for \beta for now since Quarto doesn't seem to support
## Greek letters after they've been stored in a data.table...
results[,summary:=paste0(
"$",
"FDR=",q,",",
"B=",estimate,
"$"
)]
```
```{r summarise_results}
res_summ <- MSTExplorer::summarise_results(results = results)
res_summ_all <- res_summ$tmerged[ctd=="all"]
```
```{r load_example_ctd, cache=FALSE}
## Must use `cache.lazy=FALSE` because sparse matrices not yet supported for caching
ctd_list <- MSTExplorer::load_example_ctd(c("ctd_DescartesHuman.rds",
"ctd_HumanCellLandscape.rds"),
multi_dataset=TRUE)
```
```{r validate_associations_mkg}
validate_associations_mkg_out <- MSTExplorer::validate_associations_mkg(results = results)
kg_hp <- validate_associations_mkg_out$kg[grepl("^HP:",from)]
```
{{< pagebreak >}}
## Abstract
Rare diseases (RDs) are an extremely heterogeneous and underserved category of medical conditions. While the majority of RDs are strongly genetic, it remains largely unknown via which physiological mechanisms genetics cause RD. Therefore, we sought to systematically characterise the cell type-specific mechanisms underlying all RD phenotypes with a known genetic cause by leveraging the Human Phenotype Ontology and transcriptomic single-cell atlases of the entire human body from embryonic, foetal, and adult samples. In total we identified significant associations between `r res_summ_all[["cell types significant"]]` cell types and `r res_summ_all[["phenotypes significant"]]`/`r res_summ_all[["phenotypes tested"]]` (`r res_summ_all[["phenotypes significant (%)"]]`%) unique phenotypes across `r res_summ_all[["diseases significant"]]` RDs. This greatly the collective knowledge of RD phenotype-cell type mechanisms. Next, developed a pipeline to identify cell type-specific targets for phenotypes ranked by metrics of severity (e.g. lethality, motor/mental impairment) and compatibility with gene therapy (e.g. filtering out physical malformations). Furthermore, we have made these results entirely reproducible and freely accessible to the global community to maximise their impact. To summarise, this work represents a significant step forward in the mission to treat patients across an extremely diverse spectrum of serious RDs.
## Introduction {#sec-introduction}
While rare diseases (RDs) are individually uncommon, they collectively account for an enormous global disease burden with over 10,000 recognised RDs affecting at least 300-400 million people globally [@Ferreira2019-jp] (1 in 10-20 people) [@Zhu2020-vo] . Over 75% of RDs primarily affect children with a 30% mortality rate by 5 years of age [@noauthor_undated-kp]. Despite the prevalence and severity of RDs, patients suffering from these conditions are vastly underserved due to several contributing factors. First, diagnosis is extremely challenging due to the highly variable clinical presentations of many of these diseases. The diagnostic odyssey can take patients and their families decades, with an average time to diagnosis of 5 years [@Marwaha2022-uy]. Of those, \~46% receive at least one incorrect diagnosis and over 75% of all patients never receive any diagnosis @Molster2016-da. Second, prognosis is also made difficult by high variability in disease course and outcomes which makes matching patients with effective and timely treatment plans even more challenging. Finally, even for patients who receive an accurate diagnosis/prognosis, treatments are currently only available for less than 5% of all RDs [@Halley2022-pd]. In addition to the scientific challenges of understanding RDs, there are strong financial disincentives for pharmaceutical and biotechnology companies to develop expensive therapeutics for exceedingly small RD patient populations with little or no return on investment [@Institute_of_Medicine_US_Committee_on_Accelerating_Rare_Diseases_Research_and_Orphan_Product_Development2010-vj; @Yates2022-ra]. Those that have been produced are amongst the world’s most expensive drugs, greatly limiting patients’ ability to access it [@Nuijten2022-yc; @Thielen2022-ud]. New high-throughput approaches for the development of rare disease therapeutics could greatly reduce costs (for manufacturers and patients) and accelerate the timeline from discovery to delivery.
A major challenge in both healthcare and scientific research is the lack of standardised medical terminology. Even in the age of electronic healthcare records (EHR) much of the information about an individual’s history is currently fractured across healthcare providers, often with differing nomenclatures for the same conditions. The Human Phenotype Ontology (HPO) is a hierarchically organised set of controlled clinical terms that provides a much needed common framework by which clinicians and researchers can precisely communicate patient conditions [@Gargano2024-fc; @Kohler2019-pc @Robinson2008-ys; @Kohler2021-wk]. The HPO spans all domains of human physiology and currently describes `r hpo@n_terms` phenotypes across 10,300 RDs. Each phenotype and disease is assigned its own unique identifier and organised as a hierarchical graph, such that higher-level terms describe broad phenotypic categories or *branches* (e.g. *HP:0033127*: 'Abnormality of the musculoskeletal system' which contains `r data.table::uniqueN(KGExplorer::get_ontology_descendants(ont = hpo, terms = "HP:0033127")|>unlist())` unique phenotypes) and lower-level terms describe increasingly precise phenotypes (e.g. *HP:0030675*: 'Contracture of proximal interphalangeal joints of 2nd-5th fingers'). It has already been integrated into healthcare systems and clinical diagnostic tools around the world, with increasing adoption over time [@Gargano2024-fc]. Standardised frameworks like the HPO also allow us to aggregate relevant knowledge about the molecular mechanisms underlying each RD.
Over 80% of RDs have a known genetic cause [@Nguengang_Wakap2020-cz; @noauthor_2022-ok]. Since 2008, the HPO has been continuously updated using curated knowledge from the medical literature, as well as by integrating databases of expert validated gene-phenotype relationships, such as OMIM [@Amberger2019-vl; @Amberger2017-tg; @McKusick2007-di], Orphanet [@Maiella2013-oo; @Weinreich2008-wm], and DECIPHER @Firth2009-qg. Mappings between HPO terms to other commonly used medical ontologies (e.g. SNOMED CT [@Chang2021], UMLS [@Kim2020; @Humphreys2020], ICD-9/10/11 [@Krawczyk2020]) make the HPO even more valuable as a clinical resource (provided in Mappings section of Methods). Many of these gene annotations are manually or semi-manually curated by expert clinicians from case reports of rare disease patients in which the causal gene is identified through whole exome or genome sequencing. Currently, the HPO contains gene annotations for `r res_summ_all[["phenotypes"]]` phenotypes across `r res_summ_all[["diseases"]]` diseases. Yet genes alone do not tell the full story of how RDs come to be, as their expression and functional relevance varies drastically across the multitude of tissues and cell types contained within the human body. Our knowledge of the physiological mechanisms via which genetics cause pathogenesis is lacking for most RDs, severely hindering our ability to effectively diagnose, prognose and treat RD patients.
Our knowledge of cell type-specific biology has exploded over the course of the last decade and a half, with numerous applications in both scientific and clinical practices [@Baysoy2023-vt; @Haque2017-bn; @Qi2023-ev]. In particular, single-cell RNA-seq (scRNA-seq) has allowed us to quantify the expression of every gene (i.e. the transcriptome) in individual cells. More recently, comprehensive single-cell transcriptomic atlases across tissues have also emerged [@CZI_Single-Cell_Biology_Program2023-fs; @Svensson2020-lg]. In particular, the Descartes Human @Cao2020-qz and Human Cell Landscape @Han2020-iq projects provide comprehensive multi-system scRNA-seq atlases in embryonic, foetal, and adult human samples from across the human body. These datasets provide data-driven gene signatures for hundreds of cell subtypes. Given that many disease-associated genes are expressed in some cell types but not others, we can infer that disruptions to these genes will have varying impact across cell types. By comparing the aggregated disease gene annotations with cell type-specific expression profiles, we can therefore uncover the cell types and tissues via which diseases mediate their effects.
Here, we combine and extend several of the most comprehensive genomic and transcriptomic resources currently available to systematically uncover the cell types underlying granular phenotypes across `r res_summ_all[["diseases significant"]]` diseases [Fig. @fig-study-design]. This information is essential for the development of novel therapeutics, especially gene therapy modalities such as adeno-associated viral (AAV) vectors in which advancement have been made in their ability selectively target specific cell types [@kawabataImprovingCellspecificRecombination2024; @ocarrollAAVTargetingGlial2021]. Precise knowledge of relevant cell types and tissues causing the disease can improve safety by minimising harmful side effects in off-target cell types and tissues. It can also enhance efficacy by efficiently delivering expensive therapeutic payloads to on-target cell types and tissues. For example, if a phenotype primarily effects retinal cells, then the gene therapy would be optimised for delivery to retinal cells of the eye. Using this information, we developed a high-throughput pipeline for comprehensively nominating cell type-resolved gene therapy targets across thousands of RD phenotypes. As a prioritisation tool, we sorted these targets based on the severity of their respective phenotypes, using a generative AI-based approach [@murphyHarnessingGenerativeAI2024]. Together, our study dramatically expands the available knowledge of the cell types, organ systems and life stages underlying RD phenotypes.
::: {#fig-study-design}
![](img/study_design.png)
Multi-modal data fusion reveals the cell types underlying thousands of human phenotypes. Schematic overview of study design in which we numerically encoded the strength of evidence linking each gene and each phenotype (using the Human Phenotype Ontology and GenCC databases). We then created gene signature profiles for all cell types in the Descartes Human and Human Cell Landscape scRNA-seq atlases. Finally, we iteratively ran generalised linear regression tests between all pairwise combinations of phenotype gene signatures and cell type gene signatures. The resulting associations were then used to nominate cell type-resolved gene therapy targets for thousands of rare diseases.
:::
## Results {#sec-results}
### Phenotype-cell type associations
```{r run_phenomix, eval=FALSE}
## Create phenotype-gene matrix filled with aggregated GenCC evidence scores
ymat <- HPOExplorer::hpo_to_matrix(formula = "gene_symbol ~ hpo_id")
## Run phenomix with DescartesHuman CellTypeDataset
lm_res1 <- MSTExplorer::run_phenomix(ctd_name = "DescartesHuman",
annotLevel = 2,
test_method = "glm_univariate",
ymat = ymat)
## Run phenomix with HumanCellLandscape CellTypeDataset
lm_res2 <- MSTExplorer::run_phenomix(ctd_name = "HumanCellLandscape",
annotLevel = 3,
test_method = "glm_univariate",
ymat = ymat)
## Merge results
results <- data.table::rbindlist(list(DescartesHuman=lm_res1,
HumanCellLandscape=lm_res2),
idcol = "ctd")
## Apply multiple testing correction
results[,q:=stats::p.adjust(p,method="fdr")]
```
In this study we systematically investigated the cell types underlying phenotypes across the HPO. We hypothesised that genes which are specifically expressed in certain cell types will be most relevant for the proper functioning of those cell types. Thus, phenotypes caused by disruptions to specific genes will have greater or lesser effects across different cell types. To test this, we computed associations between the weighted gene lists for each phenotype with the gene expression specificity for each cell type in our transcriptomic reference atlases.
More precisely, for each phenotype we created a list of associated genes weighted by the strength of the evidence supporting those associations, imported from the Gene Curation Coalition (GenCC) [@distefanoGeneCurationCoalition2022]. Analogously, we created gene expression profiles for each cell type in our scRNA-seq atlases and then applied normalisation to compute how specific the expression of each gene is to each cell type. To assess consistency in the phenotype-cell type associations, we used multiple scRNA-seq atlases: Descartes Human (\~4 million single-nuclei and single-cells from `r length(unique(MSTExplorer::tissue_maps[ctd=="DescartesHuman"]$uberon_name))` fetal tissues) [@Cao2020-qz] and Human Cell Landscape (\~703,000 single-cells from `r length(unique(MSTExplorer::tissue_maps[ctd=="HumanCellLandscape"]$uberon_name))` embryonic, fetal and adult tissues) [@Han2020-iq]. We ran a series of linear regression models to test for the relationship between every unique combination of phenotype and cell type. We applied multiple testing correction to control the false discovery rate (FDR) across all tests.
Within the results using the Descartes Human single-cell atlas, `r res_summ$tmerged[ctd=="DescartesHuman"][["tests significant"]]`/`r res_summ$tmerged[ctd=="DescartesHuman"][["tests"]]` (`r res_summ$tmerged[ctd=="DescartesHuman"][["tests significant (%)"]]`%) tests across `r res_summ$tmerged[ctd=="DescartesHuman"][["cell types significant"]]`/`r res_summ$tmerged[ctd=="DescartesHuman"][["cell types"]]` (`r res_summ$tmerged[ctd=="DescartesHuman"][["cell types significant (%)"]]`%) cell types and `r res_summ$tmerged[ctd=="DescartesHuman"][["phenotypes significant"]]`/`r res_summ$tmerged[ctd=="DescartesHuman"][["phenotypes"]]` (`r res_summ$tmerged[ctd=="DescartesHuman"][["phenotypes significant (%)"]]`%) phenotypes revealed significant phenotype-cell type associations after multiple-testing correction (FDR\<0.05). Using the Human Cell Landscape single-cell atlas, `r res_summ$tmerged[ctd=="HumanCellLandscape"][["tests significant"]]`/`r res_summ$tmerged[ctd=="HumanCellLandscape"][["tests"]]` (`r res_summ$tmerged[ctd=="HumanCellLandscape"][["tests significant (%)"]]`%) tests across `r res_summ$tmerged[ctd=="HumanCellLandscape"][["cell types significant"]]`/`r res_summ$tmerged[ctd=="HumanCellLandscape"][["cell types"]]` (`r res_summ$tmerged[ctd=="HumanCellLandscape"][["cell types significant (%)"]]`%) cell types and `r res_summ$tmerged[ctd=="HumanCellLandscape"][["phenotypes significant"]]`/`r res_summ$tmerged[ctd=="HumanCellLandscape"][["phenotypes"]]` (`r res_summ$tmerged[ctd=="HumanCellLandscape"][["phenotypes significant (%)"]]`%) phenotypes showed significant phenotype-cell type associations (FDR\<0.05). The median number of significantly associated phenotypes per cell type was `r res_summ$tmerged[ctd=="DescartesHuman"][["phenotypes per cell type (median)"]]` (Descartes Human) and `r res_summ$tmerged[ctd=="HumanCellLandscape"][["phenotypes per cell type (median)"]]` (Human Cell Landscape), respectively. Overall, using the Human Cell Landscape reference yielded a greater percentage of phenotypes with at least one significant cell type association than the Descartes Human reference. This is expected at the Human Cell Landscape contains a greater diversity of cell types across multiple life stages (embryonic, fetal, adult).
Across both single-cell references, the median number of significantly associated cell types per phenotype was `r res_summ_all[["cell types per phenotype (median)"]]`, suggesting reasonable specificity of the testing strategy. Within the HPO, `r res_summ_all[["diseases significant"]]`/`r res_summ_all[["diseases"]]` (\~`r res_summ_all[["diseases significant (%)"]]`%) of diseases gene annotations showed significant cell type associations for at least one of their respective phenotypes. A summary of the phenome-wide results stratified by single-cell atlas can be found in @tbl-summary.
### Validation of expected phenotype-cell type relationships
```{r plot_bar_dendro, cache=FALSE}
plot_bar_dendro_out <- MSTExplorer::plot_bar_dendro(
results = results,
hpo = hpo,
show_plot = FALSE)
overrep_dat <- plot_bar_dendro_out$ggbars_out$data_summary
#
# overrep_dat[,summary:=paste0(ancestor_name,": ",
# n_celltypes_sig,"/",n_celltypes,
# " types of ",shQuote(target_celltypes),
# " were overrepresented",
# " ($N_{p}$=",phenotypes_per_ancestor,").")]
overrep_df <- overrep_dat[,list("HPO branch"=ancestor_name,
"Phenotypes (total)"=phenotypes_per_ancestor,
"CL branch"=target_celltypes,
"Cell types (overrepresented)"=n_celltypes_sig,
"Cell types (total)"=n_celltypes)]|>
data.frame(check.names=FALSE)
```
We intuitively expect that abnormalities of an organ system will often be driven by cell types within that system. The HPO has broad categories at the higher level of the ontology, enabling us to systematically test this. For example, phenotypes associated with the heart should generally be caused by cell types of the heart (i.e. cardiocytes), while abnormalities of the nervous system should largely be caused by neural cells. There will of course be exceptions to this. For example, some immune disorders can cause intellectual disability through neurodegeneration. Nevertheless, it is reasonable to expect that abnormalities of the nervous system will be most often associated with neural cells. All cell types in our single-cell reference atlases were mapped onto the Cell Ontology (CL); a controlled vocabulary of cell types organised into hierarchical branches (e.g. neural cell include neurons and glia, which in turn include their respective subtypes).
Here, we consider a cell type to be *on-target* relative to a given HPO branch if it belongs to one of the matched CL branches (see @tbl-ontarget-celltypes). Within each high-level branch in the HPO shown in [Fig. @fig-summary]b, we tested whether each cell type was more often associated with phenotypes in that branch relative to those in all other branches (including those not shown). We then checked whether each cell type was overrepresented (at FDR\<0.05) within its respective on-target HPO branch, where the number of phenotypes within that branch. Indeed, we found that all `r length(unique(overrep_df[["HPO branch"]]))` HPO branches were disproportionately associated with on-target cell types from their respective organ systems.
{{< pagebreak >}}
::: {#tbl-ontarget-celltypes}
```{r tbl-ontarget-celltypes}
#| label: tbl-ontarget-celltypes
knitr::kable(overrep_df)
```
Cross-ontology mappings between HPO and CL branches. The last two columns represent the number of cell types that were overrepresented in the on-target HPO branch and the total number of cell types in that branch. A disaggregated version of this table with all descendant cell type names is available in @tbl-celltypes.
:::
```{r cor_pct_p}
cor.pct_p <- stats::cor.test(plot_bar_dendro_out$ggprop_out$data$pct,
plot_bar_dendro_out$ggprop_out$data$minus_log_p)
nervous_pct <- plot_bar_dendro_out$ggprop_out$data[branch=="Abnormality of the\n nervous system" & minus_log_p==6]$pct
```
In addition to binary metrics of a cell type being associated with a phenotype or not, we also used association test p-values as a proxy for the strength of the association. We hypothesized that the more significant the association between a phenotype and a cell type, the more likely it is that the cell type is on-target for its respective HPO branch. To evaluate whether this, we grouped the association $-log_{10}(\text{p-values})$ into 6 bins. For each HPO-CL branch pairing, we then calculated the proportion of on-target cell types within each bin. We found that the proportion of on-target cell types increased with increasing significance of the association ($rho=$`r cor.pct_p$estimate`, $p=$`r cor.pct_p$p.value`). For example, abnormalities of the nervous system with $-log_{10}(\text{p-values}) = 1$, only `r plot_bar_dendro_out$ggprop_out$data[branch=="Abnormality of the\n nervous system" & minus_log_p==1]$pct`% of the associated cell types were neural cells. Whereas for those with $-log_{10}(\text{p-values}) = 6$, `r plot_bar_dendro_out$ggprop_out$data[branch=="Abnormality of the\n nervous system" & minus_log_p==6]$pct`% were neural cells despite the fact that this class of cell types only constituted `r round(plot_bar_dendro_out$ggprop_out$data[branch=="Abnormality of the\n nervous system" & minus_log_p==6]$pct_celltype)`% of the total cell types tested (i.e. the baseline). This shows that the more significant the association, the more likely it is that the cell type is on-target.
::: {#fig-summary}
```{r fig-summary}
#| label: fig-summary
#| fig-cap: High-throughput analysis reveals cell types underlying thousands of rare disease phenotypes. **a**, Some cell types are much more commonly associated with phenotypes than others. Bar height indicates the total number of significant phenotype enrichments per cell type ($FDR<0.05$) across all branches of the HPO. **b**, Analyses reveal expected and novel cell type associations within high-level HPO branches. Asterisks above each bar indicate whether that cell type was significantly more often enriched in that branch relative to all other HPO branches, including those not shown here, as a proxy for how specifically that cell type is associated with that branch; FDR<0.0001 (\*\*\*\*), FDR<0.001 (\*\*\*), FDR<0.01 (\*\*), FDR<0.05 (\*). **c**, Ontological relatedness of cell types in the Cell Ontology (CL) [@Diehl2016-gt]. **d**, The proportion of on-target associations (*y-axis*) increases with greater test significance (*x-axis*). Percentage of significant phenotype associations with on-target cell types (second row of facet labels), respective to the HPO branch.
#| fig-height: 15
#| fig-width: 13
plot_bar_dendro_out$plot
```
:::
### Validation of inter- and intra-dataset consistency
```{r validate_associations_correlate_ctd}
library(data.table)
## Across CTD
validate_associations_correlate_ctd_out <- MSTExplorer::validate_associations_correlate_ctd(
results=results,
group_var="ctd")
## Replace p-values of exactly 0 with smallest number R can compute
validate_associations_correlate_ctd_out$data_stats$p.all$summary_data$p.value <- max(validate_associations_correlate_ctd_out$data_stats$p.all$summary_data$p.value,
.Machine$double.xmin)
## Within CTD: across developmental stages
validate_associations_correlate_ctd_out_hcl <- MSTExplorer::validate_associations_correlate_ctd(
results=results,
filters= list(ctd=c("HumanCellLandscape"),
stage=c("Fetus","Adult")),
group_var="stage")
```
If our methodology works, it should yield consistent phenotype-cell type associations across different datasets. We therefore tested for the consistency of our results across the two single-cell reference datasets (Descartes Human vs. Human Cell Landscape) across the subset of overlapping cell types [Fig. @fig-ctd-correlation]. In total there were `r nrow(validate_associations_correlate_ctd_out$data$all)` phenotype-cell type associations to compare across the two datasets (across `r length(unique(validate_associations_correlate_ctd_out$data$all$hpo_id))` phenotypes and `r length(unique(validate_associations_correlate_ctd_out$data$all$cl_name))` cell types annotated to the exact same CL term. We found that the correlation between p-values of the two datasets was high ($rho$=`r validate_associations_correlate_ctd_out$data_stats$p.all$summary_data$estimate`, $p$=`r validate_associations_correlate_ctd_out$data_stats$logFC.significant$summary_data$p.value`). Within the subset of results that were significant in both single-cell datasets (FDR\<0.05), we found that degree of correlation between the association effect sizes across datasets was even stronger ($rho=$`r validate_associations_correlate_ctd_out$data_stats$logFC.significant$summary_data$estimate`, $p=$`r validate_associations_correlate_ctd_out$data_stats$logFC.significant$summary_data$p.value`). We also checked for the intra-dataset consistency between the p-values of the foetal and adult samples in the Human Cell Landscape, showing a very similar degree of correlation as the inter-dataset comparison ($rho=$`r validate_associations_correlate_ctd_out_hcl$data_stats$p.all$summary_data$estimate`, $p=$`r validate_associations_correlate_ctd_out_hcl$data_stats$logFC.significant$summary_data$p.value`). Together, these results suggest that our approach to identifying phenotype-cell type associations is highly replicable and generalisable to new datasets.
### More specific phenotypes are associated with fewer genes and cell types
```{r plot_ontology_levels, cache=FALSE}
plot_ontology_levels_out <- MSTExplorer::plot_ontology_levels(
results = results,
ctd_list = ctd_list,
x_vars = c("genes","cell types","p"),
sig_vars= c(FALSE, TRUE, FALSE),
log_vars = c(FALSE, FALSE, FALSE),
nrow = 1,
show_plot = FALSE)
plot_ontology_levels_out_stats <- plot_ontology_levels_out$data_stats$summary_data|>
data.table::setkeyv("parameter2")
## replace pvalues of exactly 0 with the minimum computable number in R
## This avoids creating -Inf when logging values.
plot_ontology_levels_out_stats[p.value==0, p.value:=.Machine$double.xmin]
plot_ontology_levels_out_stats[q.value==0, q.value:=.Machine$double.xmin]
```
Higher levels of the ontology are broad classes of phenotype (e.g. 'Abnormality of the nervous system') while the lower levels can get very detailed (e.g. 'Spinocerebellar atrophy'). The higher level phenotypes inherit all genes associated with lower level phenotypes, so naturally they have more genes than the lower level phenotypes ([Fig. @fig-ontology-lvl]a; $rho=$`r plot_ontology_levels_out_stats["genes"]$estimate`, $p=$`r plot_ontology_levels_out_stats["genes"]$p.value`).
Next, we reasoned that the more detailed and specific a phenotype is, the more likely it is to be driven by one cell type. For example, while 'Neurodevelopmental abnormality' could plausibly be driven by any/all cell types in the brain, it is more likely that 'Impaired visuospatial constructive cognition' is driven by a single cell type. This was indeed the case, as we observed a strongly significant negative correlation between the two variables ([Fig. @fig-ontology-lvl]b; $rho=$`r plot_ontology_levels_out_stats["cell types"]$estimate`, $p=$`r plot_ontology_levels_out_stats["cell types"]$p.value`). We also found that the phenotype-cell type association p-values increased with greater phenotype specificity, reflecting the decreasing overall number of associated cell types at each ontological level ([Fig. @fig-ontology-lvl]c; $rho=$`r plot_ontology_levels_out_stats["p"]$estimate`, $p=$`r plot_ontology_levels_out_stats["p"]$p.value`).
::: {#fig-ontology-lvl}
```{r fig-ontology-lvl}
#| label: fig-ontology-lvl
#| fig-cap: More specific phenotypes are associated with fewer, more specific genes and cell types. Box plots showing relationship between HPO phenotype level and **a**, the number of genes annotated to each phenotype, **b**, the number of significantly enriched cell types, **c**, the p-values of phenotype-cell type association tests. Ontology level 0 represents the most inclusive HPO term 'All', while higher ontology levels (max=16) indicate progressively more specific HPO terms (e.g. 'Contracture of proximal interphalangeal joints of 2nd-5th fingers'). Boxes are coloured by the mean value (respective to the subplot) within each HPO level.
#| fig-height: 7
#| fig-width: 15
plot_ontology_levels_out$plot
```
:::
### Hepatoblasts have a unique role in recurrent Neisserial infections
```{r plot_bar_dendro_facets-rni}
target_branches <- list("Recurrent bacterial infections"="leukocyte")
lvl <- subset(hpo@elementMetadata,name==names(target_branches)[1])$ontLvl
results_tmp <- HPOExplorer::add_ancestor(data.table::copy(results),
lvl = lvl,
force_new = TRUE)
infections_out <- MSTExplorer::plot_bar_dendro_facets(
results=results_tmp,
target_branches=target_branches,
facets = "hpo_name",
lvl=lvl+1,
ncol=2,
vlines="hepatoblast",
legend.position="top",
facets_n=NULL,
q_threshold=0.05,
background_full=FALSE)
remove(results_tmp)
staph_res <- infections_out$data[hpo_name=="Recurrent staphylococcal infections"]
staph_res_top <- staph_res[,.SD[p %in% head(sort(p), 1)], by=c("hpo_id")]
recurrent_infections_ids <- KGExplorer::get_ontology_descendants(
ont = hpo,
terms = "Recurrent infections")[[1]]
hepatoblast_res <- results[q<0.05 &
hpo_id %in% recurrent_infections_ids &
cl_name=="hepatoblast"]
hepatocyte_res <- results[q<0.05 &
ancestor_name=="Abnormality of the immune system" &
grepl("hepatocyte",CellType,ignore.case = TRUE)]
rni_res <- infections_out$data[hpo_name=="Recurrent Neisserial infections" & cl_name!="hepatoblast"]
```
We selected the HPO term 'Recurrent bacterial infections' and all of its descendants (`r data.table::uniqueN(infections_out$data$hpo_id)-1` phenotypes) as an example of how investigations at the level of granular phenotypes can reveal different cell type-specific mechanisms ([Fig. @fig-rni]). As expected, these phenotypes are primarily associated with immune cell types (e.g. macrophages, dendritic cells, T cells, monocytes, neutrophils). Some associations confirm relationships previously suggested in the literature, such as that between 'Recurrent staphylococcal infections' and myeloid cells [@Heim2014-du; @Pidwill2020-le; @Stoll2018-dc; @Tebartz2015-xs]. Specifically, our results pinpoint `r staph_res_top$cl_name`s as the most strongly associated cell subtypes (FDR=`r staph_res_top$q`, $\beta$=`r staph_res_top$effect`).
In contrast to all other recurrent infection types, 'Recurrent Neisserial infections' highlighted a novel association with hepatoblasts (Descartes Human : FDR=`r hepatoblast_res[hpo_name=="Recurrent Neisserial infections" & CellType=="Hepatoblasts"]$q`, $\beta$=`r hepatoblast_res[hpo_name=="Recurrent Neisserial infections" & CellType=="Hepatoblasts"]$estimate`). Whilst unexpected, a convincing explanation involves the complement system, a key driver of innate immune response to Neisserial infections. Hepatocytes, which derive from hepatoblasts, produce the majority of complement proteins [@Zhou2016-kq], and Kupffer cells express complement receptors @Dixon2013-ok. In addition, individuals with deficits in complement are at high risk for Neisserial infections [@Ladhani2019-nf; @Rosain2017-ih], and a genome-wide association study in those with a Neisserial infection identified risk variants within complement proteins [@The_International_Meningococcal_Genetics_Consortium2010-if]. While the potential of therapeutically targeting complement in RDs (including Neisserial infections) has been proposed previously [@Lung2019-il; @Reis2015-yz], performing this in a gene- and cell type-specific manner may help to improve efficacy and reduce toxicity (e.g. due to off-target effects). Importantly, there are over 56 known genes within the complement system [@Seal2023-pa], highlighting the need for a systematic, evidence-based approach to identify effective gene targets.
Also of note, despite the fact that our datasets contain both hepatoblasts and their mature counterpart, hepatocytes, only the hepatoblasts showed this association. This suggests that the genetic factors that predispose individuals for risk of Neisserial infections are specifically affecting hepatoblasts before they become fully differentiated. It is also notable that these phenotypes were the only ones within the 'Recurrent bacterial infections' branch, or even the broader 'Recurrent infections' branch, perhaps indicating a unique role for hepatoblasts in recurrent infectious disease. The only phenotypes within the even broader 'Abnormality of the immune system' HPO branch that significantly associated with mature hepatocytes were `r shQuote(hepatocyte_res[1,]$hpo_name)` (FDR=`r hepatocyte_res[1,]$q`, $\beta$=`r hepatocyte_res[1,]$estimate`) and `r shQuote(hepatocyte_res[2,]$hpo_name)` (FDR=`r hepatocyte_res[2,]$q`, $\beta$=`r hepatocyte_res[2,]$estimate`) both of which are well-known to involve the liver [@Al-Hamoudi2009-le; @Brewer2018-dg; @Eshchar1973-tz].
::: {#fig-rni}
```{r fig-rni}
#| label: fig-rni
#| fig-cap: Association tests reveal that hepatoblasts have a unique role in recurrent Neisserial infections. Significant phenotype-cell type tests for phenotypes within the branch 'Recurrent bacterial infections'. Amongst all different kinds of recurrent bacterial infections, hepatoblasts (highlighted by vertical dotted lines) are exclusively enriched in 'Recurrent gram−negative bacterial infections'. Note that terms from multiple levels of the same ontology branch are shown as separate facets (e.g. 'Recurrent bacterial infections' and 'Recurrent gram−negative bacterial infections').
#| fig-height: 11
#| fig-width: 13
infections_out$plot +
ggplot2::guides(fill=ggplot2::guide_legend(ncol=2,
title = NULL))
```
:::
```{r prioritise_targets_network-RNI, cache.lazy=FALSE}
## Annotate results with disease/symptom-level and gene-level information
## filtering q-values at this step yields the same results as filtering at the next step,
## albeit with much fast computation.
results_annot <- HPOExplorer::add_disease(results[q<0.05],
add_descriptions = TRUE,
allow.cartesian = TRUE)
results_annot <- MSTExplorer::add_symptom_results(results = results_annot,
ctd_list = ctd_list)
## Plot multi-scale mechanisms as an interactive network
phenotype <- "Recurrent Neisserial infections"
results_annot[,estimate_mult:=estimate*25]
vn_rni <- MSTExplorer::prioritise_targets_network(
top_targets = results_annot[hpo_name==phenotype],
edge_size_var = "estimate_mult",
mediator_var = list(c(3,4),c(1,3),c(1,4)),
main = NULL,
submain = NULL)
graph_dat <- KGExplorer::graph_to_dt(vn_rni$data, what = "nodes")
# graph_dat[node_type=="gene_symbol"]$name
# unique(graph_dat$CellType)
results_annot_sub <- MSTExplorer:::add_driver_genes(results = results_annot[hpo_id %in% graph_dat$hpo_id & disease_id %in% graph_dat$disease_id & gene_symbol %in% graph_dat[node_type=="gene_symbol"]$name],
ctd_list = ctd_list,
metric = "specificity")
results_annot_sub[,mean_specificity:=mean(specificity), by=c("cl_name","gene_symbol")]
results_annot_sub[,gene_symbol:=factor(gene_symbol, levels=unique(graph_dat[node_type=="gene_symbol"]$name), ordered=TRUE)]
plot_gene_heat <- ggplot2::ggplot(results_annot_sub,
ggplot2::aes(x=gene_symbol, y=cl_name,fill=specificity_quantile)) +
ggplot2::geom_tile() +
ggplot2::theme_classic() +
ggplot2::labs(x="Gene",y="Cell type",fill="Expression\nspecificity\nquantile") +
# ggplot2::scale_fill_viridis_c() +
ggplot2::scale_fill_distiller(palette = "BuGn",direction = 1) +
ggplot2::guides(fill = ggplot2::guide_colorbar(barheight = 3)) +
ggplot2::theme(text = ggplot2::element_text(size=8))
```
Phenotypes can be associated with multiple diseases, cell types and genes. In addition to hepatoblasts, 'Recurrent Neisserial infections' were also associated with `r rni_res[cl_name==unique(rni_res$cl_name)[1]]$cl_name[1]`s (FDR=`r rni_res[cl_name==unique(rni_res$cl_name)[1]]$q[1]`, $\beta$=`r rni_res[cl_name==unique(rni_res$cl_name)[1]]$effect[1]`), `r rni_res[cl_name==unique(rni_res$cl_name)[2]]$cl_name[1]`s (FDR=`r rni_res[cl_name==unique(rni_res$cl_name)[2]]$q[1]`, $\beta$=`r rni_res[cl_name==unique(rni_res$cl_name)[2]]$effect[1]`), and `r rni_res[cl_name==unique(rni_res$cl_name)[3]]$cl_name[1]`s (FDR=`r rni_res[cl_name==unique(rni_res$cl_name)[3]]$q[1]`, $\beta$=`r round(rni_res[cl_name==unique(rni_res$cl_name)[3]]$effect[1],3)`). 'Recurrent Neisserial infections' is a phenotype of `r length(sort(unique(results_annot[hpo_name=="Recurrent Neisserial infections"]$disease_name)))` different diseases (`r paste(shQuote(sort(unique(results_annot[hpo_name=="Recurrent Neisserial infections"]$disease_name))), collapse=", ")`).
Next, we sought to link multi-scale mechanisms at the levels of disease, phenotype, cell type, and gene by visualising putative causal relationships between them as a network ([Fig. @fig-network-rni]). The phenotype 'Recurrent Neisserial infections' was connected to cell types through the aforementioned association test results (FDR\<0.05). Genes that were primarily driving these associations (i.e. genes that were both strongly linked with 'Recurrent Neisserial infections' and were highly specifically expressed in the given cell type) were designated as "driver genes" and retained for plotting. Diseases that have 'Recurrent Neisserial infections' as a phenotype were collected from the HPO annotation files. Genes that were annotated to a given disease via a particular 'Recurrent Neisserial infections' constituted "symptom"-level gene sets. Only diseases whose symptom-level gene sets had \>25% overlap with the driver gene sets for at least one cell type were retained in the network plot. Using this approach, we were able to construct and refine causal networks tracing multiple scales of disease biology.
In the case of 'Recurrent Neisserial infections', this revealed that genetic deficiencies in various complement system genes (e.g. *C5*, *C8*, and *C7*) are primarily mediated by different cell types (hepatoblasts, stratified epithelial cells, and stromal cells, respectively). While genes of the complement system are expressed throughout many different tissues and cell types, these results indicate that different subsets of these genes may mediate their effects through different cell types. This finding suggests that investigating (during diagnosis) and targeting (during treatment) different cell types may be critical for the diagnosis and treatment of these closely related, yet mechanistically distinct, diseases.
::: {#fig-network-rni}
```{r fig-network-rni}
#| label: fig-network-rni
#| fig-cap: Constructing a multi-scale causal network of disease biology for the phenotype 'Recurrent Neisserial infections' (RNI). **a**, Starting from the bottom of the plot, one can trace how genes causal for RNI (yellow boxes) mediate their effects through cell types (orange circles) and diseases (blue cylinders). Cell types are connected to RNI via association testing (FDR<0.05). Only the top driver genes are shown, defined as genes with both strong evidence for a causal role in RNI and high expression specificity in an associated cell type. Only diseases with >25% overlap with the driver genes of at least one cell type are shown. Nodes were spatially arranged using the Sugiyama algorithm [@Sugiyama1981-ev]. **b** Expression specificity quantiles (on a scale from 1-40) of each driver gene in each cell type (darker corresponds to greater specificity).
#| fig-height: 4
#| fig-width: 7
if(knitr::is_html_output()){
knitr::opts_current$set(results="asis")
vn_rni$plot
plot_gene_heat
} else {
fig_network_rni <- patchwork::wrap_plots(
magick::image_read(here::here("manuscript/img/fig-rni.png"), density=400)|>
magick::image_ggplot(interpolate=TRUE),
plot_gene_heat,
ncol = 1,
# guides = "collect",
heights = c(5,1)) +
patchwork::plot_annotation(tag_levels = "a") &
ggplot2::theme(plot.margin = ggplot2::margin(0,0,0,0))
fig_network_rni
# ggplot2::ggsave("~/Downloads/fig_network_rni.png", fig_network_rni, dpi = 400, height = 6, width = 7)
}
```
:::
### Monarch Knowledge Graph recall
The Monarch Knowledge Graph (MKG) is a comprehensive, standardised database that aggregates up-to-date knowledge about biomedical concepts and the relationships between them. This currently includes `r nrow(kg_hp)` well-established phenotype-cell type relationships [@Putman2024-et]. We used the MKG as a proxy for the field's current state of knowledge of phenotype-cell type associations. We evaluated the proportion of MKG associations that were recapitulated by our results [Fig. @fig-monarch-recall]. For each phenotype-cell type association in the MKG, we computed the percent of cell types recovered in our association results at a given ontological distance according to the CL ontology. An ontological distance of 0 means that our nominated cell type was as close as possible to the MKG cell type after adjusting for the cell types available in our single-cell references. Instances of exact overlap of terms between the MKG and our results would qualify as an ontological distance of 0 (e.g. 'monocyte' vs. 'monocyte'). Greater ontological distances indicate further divergence between the MKG cell type and our nominated cell type. A distance of 1 indicating that the MKG cell type was one step away from our nominated cell type in the CL ontology graph (e.g. 'monocyte' vs. 'classical monocyte'). The maximum possible percent of recovered terms is capped by the percentage of MKG ground-truth phenotypes we were able to find at least one significant cell type association for at $FDR_{pc}$.
In total, our results contained at least one significant cell type associations for `r round(validate_associations_mkg_out$proportion_phenotypes_captured*100,1)`% of the phenotypes described in the MKG. Of these phenotypes, we captured `r round(validate_associations_mkg_out$cl_distance_results$data[1,]$pct,1)`% of the MKG phenotype-cell associations at an ontological distance of `r validate_associations_mkg_out$cl_distance_results$data[1,]$dist` (i.e. the closest possible Cell Ontology term match). Recall increased with greater flexibility in the matching of cell type annotations. At an ontological distance of `r validate_associations_mkg_out$cl_distance_results$data[2,]$dist` (e.g. 'monocyte' vs. 'classical monocyte'), we captured `r round(validate_associations_mkg_out$cl_distance_results$data[2,]$pct,1)`% of the MKG phenotype-cell associations. Recall reached a maximum of `r round(max(validate_associations_mkg_out$cl_distance_results$data$pct),1)`% at a ontological distance of `r max(validate_associations_mkg_out$cl_distance_results$data$dist)`. This recall percentage is capped by the proportion of phenotypes for which we were able to find at least one significant cell type association for. It should be noted that we were unable to compute precision as the MKG (and other knowledge databases) only provide true positive associations. Identifying true negatives (e.g. a cell type is definitely never associated with a phenotype) is a fundamentally more difficult task to resolve as it would require proving the null hypothesis. Regardless, these benchmarking tests suggests that our results are able to recover the majority of known phenotype-cell type associations while proposing many new associations.
### Annotation of phenotypes using generative large language models
```{r gpt_annot_codify}
gpt_check <- HPOExplorer::gpt_annot_check()
gpt_annot <- HPOExplorer::gpt_annot_codify()
gpt_annot$annot_weighted[,hpo_name:=gsub("^obsolete ","",hpo_name)]
least_severe_phenotype <- gpt_annot$annot_weighted[hpo_name=="Thin toenail" & severity_score_gpt==0,]
```
Some phenotypes are more severe than others and thus could be given priority for developing treatments. For example, 'Leukonychia' (white nails) is much less severe than 'Leukodystrophy' (white matter degeneration in the brain). Given the large number of significant phenotype-cell type associations, we needed a way of prioritising phenotypes for further investigation. We therefore used the large language model GPT-4 to systematically annotate the severity of all HPO phenotypes [@murphyHarnessingGenerativeAI2024].
Severity annotations were gathered from GPT-4 for `r length(unique(gpt_check$annot$hpo_id))`/`r hpo@n_terms` (`r length(unique(gpt_check$annot$hpo_id))/hpo@n_terms*100`%) HPO phenotypes in our companion study [@murphyHarnessingGenerativeAI2024]. Benchmarking tests of these results using ground-truth HPO branch annotations. For example, phenotypes within the 'Blindness' HPO branch (*HP:0000618*) were correctly annotated as causing blindness by GPT-4. Across all annotations, the recall rate of GPT-4 annotations was `r mean(gpt_check$true_pos_rate)*100`% (min=`r min(gpt_check$true_pos_rate)*100`%, max=`r max(gpt_check$true_pos_rate)*100`%, SD=`r sd(gpt_check$true_pos_rate)*100`) with a mean consistency score of `r mean(gpt_check$consistency_rate)*100`% (min=`r min(gpt_check$consistency_rate)*100`%, max=`r max(gpt_check$consistency_rate)*100`%, SD=`r sd(gpt_check$consistency_rate)*100`) for phenotypes whose annotation were collected more than once. This clearly demonstrates the ability of GPT-4 to accurately annotate phenotypes. This allowed us to begin using these annotations to compute systematically collected severity scores for all phenotypes in the HPO.
From these annotations we computed a weighted severity score metric for each phenotype ranging from 0-100 (100 being the theoretical maximum severity of a phenotype that always causes every annotation). Within our annotations, the most severe phenotype was `r shQuote(gpt_annot$annot_weighted[1,]$hpo_name)` (*`r gpt_annot$annot_weighted[1,]$hpo_id`*) with a severity score of `r gpt_annot$annot_weighted[1,]$severity_score_gpt`, followed by `r shQuote(gpt_annot$annot_weighted[2,]$hpo_name)` (*`r gpt_annot$annot_weighted[2,]$hpo_id`*) with a severity score of `r gpt_annot$annot_weighted[2,]$severity_score_gpt`. There were `r data.table::uniqueN(gpt_annot$annot_weighted[severity_score_gpt==0]$hpo_name)` phenotypes with a severity score of 0 (e.g. `r paste(shQuote(least_severe_phenotype$hpo_name),collapse=", ")`). The mean severity score across all phenotypes was `r mean(gpt_annot$annot_weighted$severity_score_gpt)` (median=`r median(gpt_annot$annot_weighted$severity_score_gpt)`, standard deviation=`r sd(gpt_annot$annot_weighted$severity_score_gpt)`).
```{r plot_celltype_severity}
## Identify cell types associated with the most severe phenotypes
plot_celltype_severity_out <- MSTExplorer::plot_celltype_severity(results = results,
top_n=3)
```
We next sought to answer the question "are disruptions to certain cell types more likely to cause severe phenotypes?". To address this, we merged the GPT annotations with the significant (FDR<0.05) phenotype-cell type association results and computed the frequency of each severity annotation per cell type (Fig. @fig-celltype-severity-bar). We found that `r plot_celltype_severity_out$bar$data$cl_name[1]`s were associated with phenotypes that had the highest average composite severity scores, followed by `r plot_celltype_severity_out$bar$data$cl_name[2]`s and `r plot_celltype_severity_out$bar$data$cl_name[3]`s. This suggests that disruptions to these cell types are more likely to cause generally severe phenotypes. Meanwhile, `r plot_celltype_severity_out$bar$data$cl_name[length(plot_celltype_severity_out$bar$data$cl_name)]`s were associated with phenotypes that had the lowest average composite severity scores, suggesting a that disruptions to these cell types can be better tolerated than.
Of course, different aspects of phenotype severity will be more associated with some cells than others. Therefore, after encoding the GPT annotations numerically (0="never", 1="rarely", 2="often", 3="always") we computed the mean value per cell type within each annotation. For each annotation category (death, intellectual disability, impaired mobility, etc.) we plotted the top/bottom three cell types that had the highest/bottom mean encoded values within each annotation [Fig. @fig-celltype-severity-dot]. This consistently yielded expected relationships between cell types (e.g. `r plot_celltype_severity_out$dot$data[variable=="blindness" & group=="top"]$cl_name[1]`) and phenotype characteristics (e.g. blindness). Similarly, phenotypes that more commonly cause death are most commonly associated with `r plot_celltype_severity_out$dot$data[variable=="death" & group=="top"]$cl_name[1]` and `r plot_celltype_severity_out$dot$data[variable=="death" & group=="top"]$cl_name[2]`s, and least commonly associated with `r plot_celltype_severity_out$dot$data[variable=="death" & group=="bottom"]$cl_name[3]`s and `r plot_celltype_severity_out$dot$data[variable=="death" & group=="bottom"]$cl_name[2]`s. This pattern is shown consistently across all annotations (e.g. fertility-reducing phenotypes associated with `r plot_celltype_severity_out$dot$data[variable=="reduced\nfertility" & group=="top"]$cl_name[2]`s, immunodeficiency-causing phenotypes associated with `r plot_celltype_severity_out$dot$data[variable=="immunodeficiency" & group=="top"]$cl_name[1]`s,
mobility-impairing phenotypes associated with `r plot_celltype_severity_out$dot$data[variable=="impaired\nmobility" & group=="top"]$cl_name[1]`s, cancer-causing phenotypes associated with `r plot_celltype_severity_out$dot$data[variable=="cancer" & group=="top"]$cl_name[1]`s etc.).
::: {#fig-celltype-severity-dot}
```{r fig-celltype-severity-dot}
#| label: fig-celltype-severity-dot
#| fig-cap: Cell types are differentially associated with phenotypes of varying severity. The dot plot shows the mean encoded frequency value for a given annotation (0="never", 1="rarely", 2="often", 3="always"), aggregated by associated cell type. After sorting by the mean encoded frequency value for each annotation, the top three and bottom three cell types are shown. Cell types are ordered according to their ontological similarity within the Cell Ontology (CL).
#| fig-height: 8
#| fig-width: 7
plot_celltype_severity_out$dot$plot
```
### Congenital phenotypes are associated with foetal cell types
```{r plot_congenital_annotations}
plot_congenital_annotations_out <- MSTExplorer::plot_congenital_annotations(
results = results[ctd=="HumanCellLandscape"],
x_var="fetal_celltype",
hpo=hpo)
plot_congenital_annotations_out_fo <- MSTExplorer::plot_congenital_annotations(
results = results[ctd=="HumanCellLandscape"],
x_var="fetal_only",
hpo=hpo)
plot_congenital_annotations_out_data <-
data.table::data.table(plot_congenital_annotations_out$data,
key="congenital_onset")
plot_congenital_annotations_out_data[,summary:=paste0(
shQuote(congenital_onset),"=",.label,
" (n=",counts," associations)"
)]
plot_congenital_annotations_out_stats <- data.table::data.table(
plot_congenital_annotations_out$data_stats$summary_data
)
```
To further verify the biological relevance of our results, we examined the association of foetal cell types with phenotypes annotated as congenital in onset. As expected, the frequency of congenital onset with each phenotype (as determined by GPT-4 annotations) was strongly predictive with the proportion of significantly associated foetal cell types in our results ($p=$`r plot_congenital_annotations_out_stats$p.value`, $\chi^2_{Pearson}=$`r plot_congenital_annotations_out_stats$statistic`, $\hat{V}_{Cramer}=$`r plot_congenital_annotations_out_stats$estimate`). Furthermore, we observed the same pattern when investigating the relationship between the proportion of phenotypes *only* associated with the foetal version of a cell type and not the adult version of the same cell type ($p=$`r plot_congenital_annotations_out_fo$data_stats$summary_data$p.value`, $\chi^2_{Pearson}=$`r plot_congenital_annotations_out_fo$data_stats$summary_data$statistic`, $\hat{V}_{Cramer}=$`r plot_congenital_annotations_out_fo$data_stats$summary_data$estimate`). See [Fig. @fig-congenital-fetal-only]. These results are consistent with the expected role of foetal cell types in development and the aetiology of congenital disorders.
::: {#fig-congenital}
```{r fig-congenital}
#| label: fig-congenital
#| fig-cap: Congenital phenotypes are more often associated with foetal cell types. As a phenotype is more often congenital in nature, the greater proportion of foetal cell types are significantly associated with it. The summary statistics in the plot title are the results of a $\chi^2$ tests of independence between the ordinal scale of congenital onset and the proportion of foetal cell types associated with each phenotype. The p-values above each bar are the results of an additional series of $\chi^2$ tests to determine whether the proportion of foetal vs. non-foetal cell types significantly different differs from the proportions expected by chance (the dashed line). The feotal silhouette was generated with DALL-E. The adult silhouette is from phylopic.org and is freely available via CC0 1.0 Universal Public Domain Dedication.
#| fig-height: 5
#| fig-width: 7
legend_colors <- ggplot2::ggplot_build(
plot_congenital_annotations_out$plot)$data[[1]][["fill"]]|>unique()
img_fetus <- magick::image_read(here::here("manuscript/img/fetus.png"))
img_adult <- magick::image_read(here::here("manuscript/img/adult.png"))
fetus <- magick::image_ggplot(img_fetus, interpolate=TRUE) +
ggplot2::geom_line(data = data.frame(x = 1:2, y = 1:2),
linewidth=20,
ggplot2::aes(x, y,
color = 'Foetal\ncell\ntypes')) +
ggplot2::scale_color_manual(NULL, values = legend_colors[1]) +
ggplot2::theme(legend.position = 'bottom',
legend.key.width = ggplot2::unit(2.5, 'cm'))
adult <- magick::image_ggplot(img_adult, interpolate=TRUE) +
ggplot2::geom_line(data = data.frame(x = 0:1, y = 1:2),
linewidth=20,
ggplot2::aes(x, y,
color = 'Adult\ncell\ntypes')) +
ggplot2::scale_color_manual(NULL, values = legend_colors[2]) +
ggplot2::theme(legend.position = 'top',
legend.key.width = ggplot2::unit(2.5, 'cm'))
plot_congenital_annotations_out$plot$labels$subtitle <- NULL
plot_congenital_annotations_out$plot$labels$caption <- NULL
plot_congenital_annotations_out$plot$labels$group <- NULL
((
plot_congenital_annotations_out$plot +
ggplot2::labs(x="Frequency of congenital onset in phenotype",
y="% phenotype-cell type associations",
fill=NULL, group=NULL) +
ggplot2::theme(axis.text.x = ggplot2::element_text(size=12),
legend.position = "none")
) |
(
fetus /adult
)) +
patchwork::plot_layout(widths = c(1, .25))
```
:::
```{r plot_congenital_annotation_branch}
# Which branches of the HPO are most enriched for phenotypes with fetal cell type associations?
plot_congenital_annotations_branch_out <- MSTExplorer::plot_congenital_annotations(
results = results,
by_branch=TRUE,
keep_descendants="Phenotypic abnormality",
hpo=hpo)
if(plot_congenital_annotations_branch_out$data_stats$summary_data$p.value==0){
plot_congenital_annotations_branch_out$data_stats$summary_data$p.value <- .Machine$double.xmin
}
top_congenital <- unique(plot_congenital_annotations_branch_out$data[,c("ancestor_name","perc")])
```
Upon exploring these results more deeply, we also found that some branches of the HPO were more commonly enriched in foetal cell types compared to others ($\hat{V}_{Cramer}$=`r plot_congenital_annotations_branch_out$data_stats$summary_data$estimate`, $p$\<`r plot_congenital_annotations_branch_out$data_stats$summary_data$p.value`). The branch with the greatest proportion of foetal cell type enrichments was `r shQuote(top_congenital$ancestor_name[1])` (`r top_congenital$perc[1]`%), followed by `r shQuote(top_congenital$ancestor_name[2])` (`r top_congenital$perc[2]`%) and `r shQuote(top_congenital$ancestor_name[3])` (`r top_congenital$perc[3]`%). These results align well with the fact that physical malformations tend to be developmental in origin.
```{r run_congenital_enrichment}
## Identify which phenotypes are more often associated with foetal version of
## cell types than the adult versions.
run_congenital_enrichment_out <- MSTExplorer::run_congenital_enrichment(
results = results[ctd=="HumanCellLandscape"]
)
```
Finally, we designed a metric to identify which phenotypes were more often associated with foetal cell types than adult cell types. For each phenotype, we calculated the difference in the association p-values between the foetal and adult version of the equivalent cell type. The resulting metric ranges from 1 (indicating the phenotype is only associated with the foetal version of the cell type) and -1 (indicating the phenotype is only associated with the adult version of the cell type). To summarise the most foetal-biased phenotype categories, we ran an ontological enrichment test with the HPO graph. To identify foetal cell type-biased phenotype categories, we fed the top `r formals(MSTExplorer::run_congenital_enrichment)$top_n_hpo` phenotypes with the greatest foetal cell type bias (closer to 1) into the enrichment function. Conversely, we used the top `r formals(MSTExplorer::run_congenital_enrichment)$top_n_hpo` phenotypes with the greatest adult cell type bias (closer to -1) to identify adult cell type-biased phenotype categories.
The phenotype categories with the greatest bias towards foetal cell types were `r shQuote(run_congenital_enrichment_out$hpo_enrich_top$name[1])` (p=`r run_congenital_enrichment_out$hpo_enrich_top$p_value[1]`, $log_2(\text{fold-change})$=`r run_congenital_enrichment_out$hpo_enrich_top$log2_fold_enrichment[1]`) and `r shQuote(run_congenital_enrichment_out$hpo_enrich_top$name[2])` (p=`r run_congenital_enrichment_out$hpo_enrich_top$p_value[2]`, $log_2(\text{fold-change})$=`r run_congenital_enrichment_out$hpo_enrich_top$log2_fold_enrichment[2]`). Specific examples of such phenotypes include `r shQuote(unique(run_congenital_enrichment_out$fetal_pdiff_top$hpo_name)[1])`, `r shQuote(unique(run_congenital_enrichment_out$fetal_pdiff_top$hpo_name)[2])`, and `r shQuote(unique(run_congenital_enrichment_out$fetal_pdiff_top$hpo_name)[3])`. Indeed, these phenotypes are morphological defects apparent at birth caused by abnormal developmental processes.
Conversely, the most adult cell type-biased phenotype categories were `r shQuote(run_congenital_enrichment_out$hpo_enrich_bottom$name[1])` (p=`r run_congenital_enrichment_out$hpo_enrich_bottom$p_value[1]`, $log_2(\text{fold-change})$=`r run_congenital_enrichment_out$hpo_enrich_bottom$log2_fold_enrichment[1]`) and `r shQuote(run_congenital_enrichment_out$hpo_enrich_bottom$name[2])` (p=`r run_congenital_enrichment_out$hpo_enrich_bottom$p_value[2]`, $log_2(\text{fold-change})$=`r run_congenital_enrichment_out$hpo_enrich_bottom$log2_fold_enrichment[2]`). Specific examples of such phenotypes include 'Excessive wrinkled skin' and 'Paroxysmal supraventricular tachycardia'. It is well known that ageing naturally causes a loss of skin elasticity (due to decreasing collagen production) and vascular degeneration [@liAgingAgeRelated2021].
Next, we were interested whether some cell type tend to show strong differences in their phenotype associations between their foetal and adult forms. To test this, we performed an analogous enrichment procedure as with the phenotypes, except using Cell Ontology terms and the Cell Ontology graph. This analysis identified the cell type category `r run_congenital_enrichment_out$cl_enrich_top$name[1]` (p=`r run_congenital_enrichment_out$cl_enrich_top$p_value[1]`, $log_2(\text{fold-change})$=`r run_congenital_enrichment_out$cl_enrich_top$log2_fold_enrichment[1]`) as the most foetal-biased cell type. No cell type categories were significantly enriched for the most adult-biased cell types (@tbl-cl_enrich).
See @tbl-hpo_enrich for the full enrichment results, and @tbl-foetal_examples for specific examples of the most foetal/adult-biased phenotypes.
Together, these findings serve to further validate our methodology as a tool for identifying the causal cell types underlying a wide range of phenotypes.
### Therapeutic target identification
```{r prioritise_targets}
prioritise_targets_out <- MSTExplorer::prioritise_targets(
results = results,
ctd_list = ctd_list,
phenotype_to_genes = p2g,
hpo = hpo,
keep_deaths=NULL,
keep_onsets=NULL,
keep_specificity_quantiles = seq(30,40), ## NULL:70, 30-40:64
keep_mean_exp_quantiles = seq(30,40), ## NULL:65, 10:55
info_content_threshold=8, ## 8:55, 5:64
effect_threshold=NULL, ## 1:39
severity_score_gpt_threshold=NULL, ## 10:78, NULL:82
symptom_intersection_threshold=.25, ## .25:57
evidence_score_threshold=3, ## 5:47, 4:47, 3:64
top_n = 10, ## 5:38, 20:42, 30:45, 40:52, 50:55
group_vars = "hpo_id")
genes_per_pheno <- prioritise_targets_out$top_targets[,list(n=data.table::uniqueN(gene_symbol)),
by="hpo_id"]
p2g <- HPOExplorer::add_ont_lvl(p2g)
min_ont_lvl <- 3
genes_per_pheno_all <- p2g[ontLvl>min_ont_lvl,
list(n=data.table::uniqueN(gene_symbol)),by="hpo_id"]
top_celltypes <- prioritise_targets_out$top_targets[,list(
np=data.table::uniqueN(hpo_id)),
by="cl_name"]|>data.table::setorderv("np",-1)
top_ancestors <- prioritise_targets_out$top_targets[,list(
np=data.table::uniqueN(hpo_id),
nc=data.table::uniqueN(CellType),
ng=data.table::uniqueN(gene_symbol)
),
by="ancestor_name"]|>
data.table::setorderv("np",-1)
```
While the phenotype-cell type associations are informative on their own, we wished to take these results further in order to have a more translational impact. Therapeutic targets with supportive genetic evidence have 2.6x higher success rates in clinical trials [@nelsonSupportHumanGenetic2015; @ochoaHumanGeneticsEvidence2022; @minikelRefiningImpactGenetic2024]. We therefore systematically identified putative cell type-specific gene targets for severe phenotypes. This yielded putative therapeutic targets for `r prioritise_targets_out$report[step=="end"][["Phenotypes"]]` phenotypes across `r prioritise_targets_out$report[step=="end"][["Diseases"]]` diseases in `r prioritise_targets_out$report[step=="end"][["Cell types"]]` cell types and `r prioritise_targets_out$report[step=="end"][["Genes"]]` genes ([Fig. @fig-therapy-filter]). While this constitutes a large number of genes in total, each phenotype was assigned a median of `r median(genes_per_pheno$n)` gene targets (mean=`r mean(genes_per_pheno$n)`, min=`r min(genes_per_pheno$n)`, max=`r max(genes_per_pheno$n)`). Relative to the number of genes annotations per phenotype in the HPO overall (median=`r median(genes_per_pheno_all$n)`, mean=`r mean(genes_per_pheno_all$n)`, min=`r min(genes_per_pheno_all$n)`, max=`r max(per_phenotype$ng)`) this represents a substantial decrease in the number of candidate target genes, even when excluding high-level phenotypes (HPO level\>`r min_ont_lvl`). It is also important to note that the phenotypes in the prioritised targets list are ranked by their severity, allowing us to distinguish between phenotypes with a high medical urgency (e.g. `r shQuote(unique(prioritise_targets_out$top_targets[order(severity_score_gpt, decreasing = TRUE)]$hpo_name)[1])`) from those with lower medical urgency (e.g. `r shQuote(unique(prioritise_targets_out$top_targets[order(severity_score_gpt)]$hpo_name)[1])`). This can be useful for both clinicians, biomedical scientists, and pharmaceutical manufacturers who wish to focus their research efforts on phenotypes with the greatest need for intervention.
Across all phenotypes, `r top_celltypes[1,]$cl_name` were most commonly implicated (`r top_celltypes[1,]$np` phenotypes), followed by `r top_celltypes[2,]$cl_name` (`r top_celltypes[2,]$np` phenotypes), `r top_celltypes[2,]$cl_name` (`r top_celltypes[2,]$np` phenotypes), `r top_celltypes[3,]$cl_name` (`r top_celltypes[3,]$np` phenotypes), `r top_celltypes[4,]$cl_name` (`r top_celltypes[4,]$np` phenotypes), and `r top_celltypes[5,]$cl_name` (`r top_celltypes[5,]$np` phenotypes). Grouped by higher-order ontology category, `r shQuote(top_ancestors[1,]$ancestor_name)` had the greatest number of enriched phenotypes (`r top_ancestors[1,]$np` phenotypes, `r top_ancestors[1,]$ng` genes), followed by `r shQuote(top_ancestors[2,]$ancestor_name)` (`r top_ancestors[2,]$np` phenotypes, `r top_ancestors[2,]$ng` genes), `r shQuote(top_ancestors[3,]$ancestor_name)` (`r top_ancestors[3,]$np` phenotypes, `r top_ancestors[3,]$ng` genes), `r shQuote(top_ancestors[4,]$ancestor_name)` (`r top_ancestors[4,]$np` phenotypes, `r top_ancestors[4,]$ng` genes), and `r shQuote(top_ancestors[5,]$ancestor_name)` (`r top_ancestors[5,]$np` phenotypes, `r top_ancestors[5,]$ng` genes).
```{r plot_report}
plot_report_out <- MSTExplorer::plot_report(
results = results,
rep_dt =prioritise_targets_out$report[(Rows_diff!=0) | step %in% c("start","end")],
label.size = .1,
add_tiers = FALSE,
show_plot = FALSE,
save_path = NULL)
```
### Therapeutic target validation
```{r ttd_check}
remove_status <- c("Application submitted","Patented","Preregistration","Registered","Investigative","Preclinical",
"Clinical trial","Phase 0",NA)
run_map_genes = FALSE
## Gene therapy only
ttd_check_out <- MSTExplorer::ttd_check(
top_targets=prioritise_targets_out$top_targets,
phenotype_to_genes=p2g,
run_map_genes = run_map_genes,
drug_types = "Gene therapy",
# remove_status=remove_status,
allow.cartesian = TRUE,
show_plot = FALSE)
## All therapy types
ttd_check_all_out <- MSTExplorer::ttd_check(
top_targets=prioritise_targets_out$top_targets,
phenotype_to_genes=p2g,
run_map_genes = run_map_genes,
# remove_status=remove_status,
allow.cartesian = TRUE,
show_plot = FALSE)
```
To determine whether the genes prioritised by our therapeutic targets pipeline were plausible, we checked what percentage of gene therapy targets we recapitulated. Data on therapeutic approval status was gathered from the Therapeutic Target Database (TTD; release `r Sys.Date()`) [@Liu2011-qd]. Overall, we prioritised `r sum(ttd_check_out$data[failed==FALSE]$prioritised)/nrow(ttd_check_out$data[failed==FALSE])*100`% of all non-failed existing gene therapy targets. A hypergeometric test confirmed that our prioritised targets were significantly enriched for non-failed gene therapy targets ($p=$`r ttd_check_out$ttd_hypergeo_out$nonfailed_enrichment`). Importantly, we did not prioritise any of the failed therapeutics (0%), defined as having been terminated or withdrawn from the market. The hypergeometric test for depletion of failed targets did not reach significance ($p=$`r ttd_check_out$ttd_hypergeo_out$failed_depletion`), but this is to be expected as there was only one failed gene therapy target in the TTD database.
Even when considering therapeutics of any kind ([Fig. @fig-therapy-validate-all]), not just gene therapies, we recapitulated `r sum(ttd_check_all_out$data[failed==FALSE]$prioritised)/nrow(ttd_check_all_out$data[failed==FALSE])*100`% of the non-failed therapeutic targets and 0% of the terminated/withdrawn therapeutic targets (n=`r nrow(ttd_check_all_out$data[failed==TRUE])`). Here we found that our prioritised targets were highly significantly depleted for failed therapeutics ($p=$`r ttd_check_all_out$ttd_hypergeo_out$failed_depletion`). This suggests that our multi-scale evidence-based prioritisation pipeline is capable of selectively identifying genes that are likely to be effective therapeutic targets.
::: {#fig-therapy-validate}
```{r fig-therapy-validate}
#| label: fig-therapy-validate
#| fig-cap: Validation of prioritised therapeutic targets. The proportion of existing gene therapy targets (documented in the Therapeutic Target Database) recapitulated by our prioritisation pipeline. Therapetics are stratified by the stage of clinical development they were at during the time of writing.
#| fig-height: 5
#| fig-width: 8
ttd_check_out$plot
```
:::
### Selected example targets
```{r prioritise_targets_grid}
res_class <- HPOExplorer::gpt_annot_class()
prioritise_targets_grid_out <- MSTExplorer::prioritise_targets_grid(
top_targets = prioritise_targets_out$top_targets,
res_class = res_class,
keep_severity_class = c("profound","severe"),
keep_physical_malformations = c(0,1),
show_plot=FALSE)
```
```{r top_targets}
severity_score_gpt_threshold <- 15
top_targets <- prioritise_targets_out$top_targets[severity_score_gpt>severity_score_gpt_threshold,]
top_phenotypes <- unique(top_targets$hpo_name)
```
```{r prioritise_targets_network-examples}
height <- "100vh"
width <- "100vw"
phenotypes_network <- c(
"GM2-ganglioside accumulation",
"Spinocerebellar atrophy",
"Neuronal loss in central nervous system")
vn_therapy <- lapply(stats::setNames(phenotypes_network, phenotypes_network), function(phenotype){
MSTExplorer::prioritise_targets_network(
top_targets = top_targets[hpo_name==phenotype],
main = NULL,
height = height,
width = width,
submain = NULL)$plot
})
```
::: {#fig-therapy-examples}
```{r fig-therapy-examples}
#| label: fig-therapy-examples
#| fig-cap: Top 40 prioritised gene therapy targets at multiple biological scales, stratified by congenital (top row) vs. non-congential phenotypes (bottom row) as well as severity class ("profound" or "severe"). In this plot, only the top 10 most severe phenotypes within a given strata/substrata are shown **a,c**, Severity annotation generated by GPT-4. **b,d**, Composite severity scores computed across all severity metrics. **e,g**, Top mediator disease and cell type-specific target for each phenotype. **f,h** top target gene for each phenotype within humans (*Homo sapiens*). We also include the 1:1 ortholog of each human gene in several commonly used animal models, including monkey (*Macaca mulatta*), mouse (*Mus musculus*), zebrafish (*Danio rerio*), fly (*Drosophila melanogaster*) and nematode (*Caenorhabditis elegans*). Boxes are empty where no 1:1 ortholog is known. **i-k** Example cell type-specific gene therapy targets for several severe phenotypes and their associated diseases. Each disease (blue cylinders) is connected to its phenotype (purple cylinders) based on well-established clinical observations recorded within the HPO [@Gargano2024-fc]. Phenotypes are connected to cell types (red circles) via association testing between weighted gene sets (FDR<0.05). Each cell type is connected to the prioritised gene targets (yellow boxes) based on the driver gene analysis.The thickness of the edges connecting the nodes represent the (mean) fold-change from the bootstrapped enrichment tests. Nodes were spatially arranged using the Sugiyama algorithm [@Sugiyama1981-ev].
#| fig-height: 19
#| fig-width: 20
if(knitr::is_html_output()){
knitr::opts_current$set(results="asis")
prioritise_targets_grid_out$plot
vn_therapy
} else {
img_paths <- stats::setNames(
file.path("img/fig-therapy-examples",c("tay-sachs.png","spinocerebellar-atrophy.png","neuronal-loss.png")),
c("GM2-ganglioside accumulation",
"Spinocerebellar atrophy",
"Neuronal loss in central nervous system")
)
fig_networks <- lapply(img_paths, function(x){
i = which(x==img_paths)
magick::image_read(x, density=300)|>magick::image_ggplot(interpolate=TRUE) +
ggplot2::theme_void() +
ggplot2::labs(subtitle=names(img_paths)[i]) +
ggplot2::theme(plot.margin = ggplot2::unit(c(0,0,0,0),"pt"),
panel.background = ggplot2::element_rect(fill = "grey50"))
}) |> patchwork::wrap_plots(nrow=1, widths=c(.5,.5,1))
fig_therapy_examples <- ( (
(patchwork::plot_spacer()|prioritise_targets_grid_out$plot) +patchwork::plot_layout(widths=c(0,1))
) /
fig_networks )+
patchwork::plot_annotation(tag_levels="a")
fig_therapy_examples
# ggplot2::ggsave(fig_therapy_examples, filename="~/Downloads/fig_therapy_examples.png", height=15, width=20, dpi=300)
}
```
:::
From our prioritised targets, we selected the following four sets of phenotypes or diseases as examples: `r paste(shQuote(phenotypes_network), collapse=", ")`. Only phenotypes with a GPT severity score greater than `r severity_score_gpt_threshold` were considered to avoid overplotting and to focus on the more clinically relevant phenotypes. These examples were then selected partly on the basis of severity rankings, and partly for their relatively smaller, simpler networks than lent themselves to compact visualisations.
Tay-Sachs disease (TSD) is a devastating hereditary condition in which children are born appearing healthy, which gradually degrades leading to death after 3-5 years. The underlying cause is the toxic accumulation of gangliosides in the nervous system due to a loss of the enzyme produced by *HEXA*. While this could in theory be corrected with gene editing technologies, there remain some outstanding challenges. One of which is identifying which cell types should be targeted to ensure the most effective treatments. Here we identified alternatively activated macrophages as the cell type most strongly associated with 'GM2-ganglioside accumulation'. The role of aberrant macrophage activity in the regulation of ganglioside levels is supported by observation that gangliosides accumulate within macrophages in TSD [@fendersonChapterDevelopmentalGenetic2009], as well as experimental evidence in rodent models [@vilcaesGangliosideSynthesisPlasma2020.; @yoheGangliosideAlterationsStimulated1985; @demirGM2GangliosideAccumulation2020]. Our results not only corroborate these findings, but propose macrophages as the primary causal cell type in TSD, making it the most promising cell type to target in therapies.
Another challenge in TSD is early detection and diagnosis, before irreversible damage has occurred. Our pipeline implicated extravillous trophoblasts of the placenta in 'GM2-ganglioside accumulation'. While not necessarily a target for gene therapy, checking these cells *in utero* for an absence of *HEXA* may serve as a viable biomarker as these cells normally express the gene at high levels. Early detection of TSD may lengthen the window of opportunity for therapeutic intervention [@solovyevaNewApproachesTaySachs2018], especially when genetic sequencing is not available or variants of unknown significance are found within *HEXA* [@hoffmanNextgenerationDNASequencing2013].
```{r sca}
sca <- results[q<0.05 & hpo_name=='Spinocerebellar atrophy']
```
Spinocerebellar atrophy is a debilitating and lethal phenotype that occurs in diseases such as Spinocerebellar ataxia and Boucher-Nenhauser syndrome. These diseases are characterised by progressive degeneration of the cerebellum and spinal cord, leading to severe motor and cognitive impairments. Our pipeline identified M2 macrophages as the only causal cell type associated with 'Spinocerebellar atrophy'. This strongly suggests that degeneration of cerebellar Purkinje cells are in fact downstream consequences of macrophage dysfunction, rather than being the primary cause themselves. This is consistent with the known role of macrophages, especially microglia, in neuroinflammation and other neurodegenerative conditions such as Alzheimer's and Parkinsons' disease [@ferroRoleMicrogliaAtaxias2019; @holMicroglialTranscriptomicsMeets2022; @lopesAtlasGeneticEffects2020a]. While experimental and postmortem observational studies have implicated microglia in spinocerebellar atrophy previously [@ferroRoleMicrogliaAtaxias2019], our results provide a statistically-supported and unbiased genetic link between known risk genes and this cell type. Therefore, targeting M2 microglia in the treatment of spinocerebellar atrophy may therefore represent a promising therapeutic strategy. This is aided by the fact that there are mouse models that perturb the ortholog of human spinocerebellar atrophy risk genes (e.g. [*Atxn1*](https://www.informatics.jax.org/marker/MGI:104783), [*Pnpla6*](https://www.informatics.jax.org/marker/MGI:1354723)) and reliably recapitulate the effects of this diseases at the cellular (e.g. loss of Purkinje cells), morphological (e.g. atrophy of the cerebellum, spinal cord, and muscles), and functional (e.g. ataxia) levels.
```{r neuronal_loss}
neuronal_loss <- results[q<0.05 & hpo_name=='Neuronal loss in central nervous system']
```
Next, we investigated the phenotype 'Neuronal loss in the central nervous system'. Despite the fact that this is a fairly broad phenotype, we found that it was only significantly associated with `r length(unique(neuronal_loss$cl_name))` cell types (`r paste(unique(neuronal_loss$cl_name), collapse=', ')`), specifically M2 macrophages and sinusoidal endothelial cells.
Skeletal dysplasia is a heterogeneous group of over 450 disorders that affect the growth and development of bone and cartilage. This phenotype can be lethal when deficient bone growth leads to the constriction of vital organs such as the lungs. Even after surgical interventions, these complications continue to arise as the child develops. Pharmacological interventions to treat this condition have largely been ineffective. While there are various cell types involved in skeletal system development, our pipeline nominated chondrocytes as the causal cell type underlying the lethal form of this condition ([Fig. @fig-therapy-examples-supp]). Assuringly, we found that the disease 'Achondrogenesis Type 1B' is caused by the genes *SLC26A2* and *COL2A1* via chondrocytes. We also found that 'Platyspondylic lethal skeletal dysplasia, Torrance type'. Thus, in cases where surgical intervention is insufficient, targeting these genes within chondrocytes may prove a viable long-term solution for children suffering from lethal skeletal dysplasia.
Alzheimer's disease (AD) is the most common neurodegenerative condition. It is characterised by a set of variably penetrant phenotypes including memory loss, cognitive decline, and cerebral proteinopathy. Interestingly, we found that different forms of early onset AD (which are defined by the presence of a specific disease gene) are each associated with different cell types via different phenotypes ([Fig. @fig-therapy-examples-supp]). For example, AD 3 and AD 4 are primarily associated with cells of the digestive system (`r paste(shQuote(unique(top_targets[disease_name=="Alzheimer disease 4"]$cl_name)), collapse=", ")`) and are implied to be responsible for the phenotypes `r paste(shQuote(unique(top_targets[disease_name=="Alzheimer disease 4"]$hpo_name)), collapse=", ")`. Meanwhile, AD 2 is primarily associated with immune cells (`r paste(shQuote(unique(top_targets[disease_name %in% c("Early-onset autosomal dominant Alzheimer disease","Alzheimer disease 2")]$cl_name)[1]), collapse=", ")`) and is implied to be responsible for the phenotypes `r paste(shQuote(unique(top_targets[disease_name %in% c("Early-onset autosomal dominant Alzheimer disease","Alzheimer disease 2")]$hpo_name)), collapse=", ")`. This suggests that different forms of AD may be driven by different cell types and phenotypes, which may help to explain its variability in onset and clinical presentation.
Finally, Parkinson's disease (PD) is characterised by motor symptoms such as tremor, rigidity, and bradykinesia. However there are a number of additional phenotypes associated with the disease that span multiple physiological systems. PD 19a and PD 8 seemed to align most closely with the canonical understanding of PD as a disease of the central nervous system in that they implicated oligodendrocytes and neurons ([Fig. @fig-therapy-examples-supp]). Though the reference datasets being used in this study were not annotated at sufficient resolution to distinguish between different subtypes of neurons, in particular dopaminergic neurons. PD 19a/8 also suggested that risk variants in *LRRK2* mediate their effects on PD through both myeloid cells and oligodendrocytes by causing gliosis of the substantia nigra. The remaining clusters of PD mechanisms revolved around chondrocytes (PD 20), amacrine cells of the eye (hereditary late-onset PD), and the respiratory/immune system (PD 14). While the diversity in cell type-specific mechanisms is somewhat surprising, it may help to explain the wide variety of cross-system phenotypes frequently observed in PD.
It should be noted that the HPO only includes gene annotations for the monogenic forms of AD and PD. However it has previously been shown that there is at least partial overlap in their phenotypic and genetic aetiology with respect to their common forms. Thus understanding the monogenic forms of these diseases may shed light onto their more common counterparts.
### Experimental model translatability
```{r map_upheno_data}
pheno_map_genes_match <- KGExplorer::map_upheno_data(pheno_map_method = "upheno")
pheno_map_targets <- pheno_map_genes_match[
id1 %in% unique(prioritise_targets_out$top_targets$hpo_id)
]|>
data.table::setorderv("phenotype_genotype_score",-1)
taxa_count <- sort(table(pheno_map_targets$gene_taxon_label2), decreasing = TRUE)
pheno_map_targets_severe <- pheno_map_targets[
id1 %in% unique(prioritise_targets_out$top_targets[severity_score_gpt>10,]$hpo_id)
]
pheno_map_targets_severe[,summary:=paste0(
shQuote(object_label1),
" ($SIM_{og}=",round(phenotype_genotype_score,options()$digits),"$)"
)]
```
We computed interspecies translatability scores using a combination of both ontological ($SIM_{o}$) and genotypic ($SIM_{g}$) similarity relative to each homologous human phenotype and its associated genes [Fig. @fig-animal-models]. In total, we mapped `r length(unique(pheno_map_genes_match$id2))` non-human phenotypes (in `r paste0("*",sort(unique(pheno_map_genes_match$gene_taxon_label2)),"*",collapse=", ")`) to `r length(unique(pheno_map_genes_match$id1))` homologous human phenotypes. Amongst the `r length(unique(prioritise_targets_out$top_targets$hpo_id))` phenotype within our prioritised therapy targets, `r length(intersect(prioritise_targets_out$top_targets$hpo_id, pheno_map_genes_match$id1))` had viable animal models in at least on non-human species. Per species, the number of homologous phenotypes was: `r paste0("*",names(taxa_count),"*"," (n=",taxa_count,")")`. Amongst our prioritised targets with a GPT-4 severity score of \>10, the phenotypes with the greatest animal model similarity were `r paste(pheno_map_targets_severe$summary[1:5],collapse = ", ")`.
### Mappings
::: {#tbl-mappings}
```{r tbl-mappings}
#| label: tbl-mappings
maps <- HPOExplorer::get_mappings(max_dist = NULL)
maps_agg <- (
data.table::rbindlist(maps, idcol = "source")
)[,list("source terms"=data.table::uniqueN(curie_id),
"HPO terms"=data.table::uniqueN(mapped_curie),
"mappings"=.N),
keyby=c("source","distance")]
maps_agg_top <- maps_agg[distance==1][get("HPO terms")==max(get("HPO terms")),]
knitr::kable(maps_agg)
```
Mappings between HPO phenotypes and other medical ontologies. "source" indicates the medical ontology and "distance" indicates the cross-ontology distance. "source terms" and "HPO terms" indicates the number of unique IDs mapped from the source ontology and HPO respectively. "mappings" is the total number of cross-ontology mappings within a given distance. Some IDs may have more than one mapping for a given source due to many-to-many relationships.
:::
Mappings from HPO phenotypes and other commonly used medical ontologies were gathered in order to facilitate use of the results in this study in both clinical and research settings. Direct mappings, with a cross-ontology distance of 1, are the most precise and reliable. Counts of mappings at each distance are shown in @tbl-mappings. In total, there were `r sum(maps_agg[distance==1][["HPO terms"]])` direct mappings between the HPO and other ontologies, with the largest number of mappings coming from the `r maps_agg_top$source[1]` ontology (`r maps_agg_top[["source terms"]]` UMLS terms).
The mappings files can be accessed with the function `HPOExplorer::get_mappings` or directly via the `HPOExplorer` Releases page on GitHub (<https://github.com/neurogenomics/HPOExplorer/releases/tag/latest>).
## Discussion {#sec-discussion}
Investigating RDs at the level of phenotypes offers numerous advantages in both research and clinical medicine. First, the vast majority of RDs only have one associated gene (`r nrow(per_disease[ng==1])`/`r nrow(per_disease)` diseases = `r format(nrow(per_disease[ng==1])/nrow(per_disease)*100,digits=1)`%). Aggregating gene sets across diseases into phenotype-centric "buckets" permits sufficiently well-powered analyses, with an average of \~`r mean(per_phenotype$ng)` genes per phenotype (median=`r median(per_phenotype$ng)`) [see Fig. @fig-diagram]. Second, we hypothesised that these phenotype-level gene sets converge on a limited number of molecular and cellular pathways. Perturbations to these pathways manifest as one or more phenotypes which, when considered together, tend to be clinically diagnosed as a certain disease. Third, RDs are often highly heterogeneous in their clinical presentation across individuals, leading to the creation of an ever increasing number of disease subtypes (some of which only have a single documented case). In contrast, a phenotype-centric approach enables us to more accurately describe a particular individual’s version of a disease without relying on the generation of additional disease subcategories. By characterising an individual’s precise phenotypes over time, we may better understand the underlying biological mechanisms that have caused their condition. However, in order to achieve a truly precision-based approach to clinical care, we must first characterise the molecular and cellular mechanisms that cause the emergence of each phenotype. Here, we provide a highly reproducible framework that enables this at the scale of the entire phenome.
Across the `r res_summ_all[["cell types"]]` cell types and `r res_summ_all[["phenotypes"]]` RD-associated phenotypes investigated, more than `r res_summ_all[["tests significant"]]` significant phenotype-cell type relationships were discovered. This presents a wealth of opportunities to trace the mechanisms of rare diseases through multiple biological scales. This in turn enhances our ability to study and treat causal factors in disease with deeper understanding and greater precision. These results recapitulate well-known relationships, while providing additional cellular context to many of these known relationships, and discovering novel relationships.
It was paramount to the success of this study to ensure our results were anchored in ground-truth benchmarks, generated falsifiable hypotheses, and rigorously guarded against false-positive associations. Extensive validation using multiple approaches demonstrated that our methodology consistently recapitulates expected phenotype-cell type associations ([Fig. @fig-summary]-[Fig. @fig-congenital]). This was made possible by the existence of comprehensive, structured ontologies for all phenotypes (the Human Phenotype Ontology) and cell types (the Cell Ontology), which provide an abundance of clear and falsifiable hypotheses for which to test our predictions against. Several key examples include 1) strong enrichment of associations between cell types and phenotypes within the same anatomical systems ([Fig. @fig-summary]b-d), 2) a strong relationship between phenotype-specificity and the strength and number of cell type associations ([Fig. @fig-ontology-lvl]), 3) identification of the precise cell subtypes involved in susceptibility to various subtypes of recurrent bacterial infections ([Fig. @fig-rni]), 4) a strong positive correlation between the frequency of congenital onset of a phenotype and the proportion of developmental cell types associated with it ([Fig. @fig-congenital])), and 5) consistent phenotype-cell type associations across multiple independent single-cell datasets ([Fig. @fig-ctd-correlation]).
Unfortunately, there are currently only treatments available for less than 5% of RDs [@Halley2022-pd]. Novel technologies including CRISPR, prime editing, antisense oligonucleotides, viral vectors, and/or lipid nanoparticles, have been undergone significant advances in the last several years [@Bueren2023-ma; @Bulaklak2020-ta; @Godbout2023-uo; @Kohn2023-vh; @Zhao2023-qy] and proven remarkable clinical success in an increasing number of clinical applications [@Darrow2019-om; @Mendell2017-kg; @Mueller2017-fz; @Russell2017-dh]. The U.S. Food and Drug Administration (FDA) recently announced an landmark program aimed towards improving the international regulatory framework to take advantage of the evolving gene/cell therapy technologies [@Lu2024-kl] with the aim of bringing dozens more therapies to patients in a substantially shorter timeframe than traditional pharmaceutical product development (typically 5-20 years with a median of 8.3 years) [@Brown2022-ye]. While these technologies have the potential to revolutionise RD medicine, their successful application is dependent on first understanding the mechanisms causing each disease.
To address this critical gap in knowledge, we used our results to create a reproducible and customisable pipeline to nominate cell type-resolved therapeutic targets ([Fig. @fig-therapy-filter]-[Fig. @fig-therapy-examples]). Targeting cell type-specific mechanisms underlying granular RD phenotypes can improve therapeutic effectiveness by treating the causal root of an individual's conditions [@Bulaklak2020-ta; @Moffat2017-al]. A cell type-specific approach also helps to reduce the number of harmful side effects caused by unintentionally delivering the therapeutic to off-target tissues/cell types (which may induce aberrant gene activity), especially when combined with technologies that can target cell surface antigens (e.g viral vectors) [@Zhou2013-wx]. This has the additional benefit of reducing the minimal effective dose of a therapeutic, which can be both immunogenic and extremely financially costly [@Bueren2023-ma; @Kohn2023-vh; @Nuijten2022-yc; @Thielen2022-ud]. Here, we demonstrate the utility of a high-throughput evidence-based approach to RD therapeutics discovery by highlighting several of the most promising therapeutic candidates. Our pipeline takes into account a myriad of factors, including the strength of the phenotype-cell type associations, symptom-cell type associations, cell type-specificity of causal genes, the severity and frequency of the phenotypes, suitability for gene therapy delivery systems (e.g. recombinant adeno-associated viral vectors (rAAV)), as well as a quantitative analysis of phenotypic and genetic animal model translatability ([Fig. @fig-animal-models]). We validated these candidates by comparing the proportional overlap with gene therapies that are presently in the market or undergoing clinical trials, in which we recovered `r sum(ttd_check_out$data[failed==FALSE]$prioritised)/nrow(ttd_check_out$data[failed==FALSE])*100`% of all active gene therapies and `r sum(ttd_check_out$data[failed==TRUE]$prioritised)/nrow(ttd_check_out$data[failed==TRUE])*100`% of failed gene therapies ([Fig. @fig-therapy-validate], [Fig. @fig-therapy-validate-all]). Despite nominating a large number of putative targets, hypergeometric tests confirmed that our targets were strongly enriched for targets of existing therapies that are either approved or currently undergoing clinical trials.
From our target prioritisation pipeline results, we highlight cell type-specific mechanisms for 'GM2-ganglioside accumulation' in Tay-Sachs disease, spinocerebellar atrophy in spinocerebellar ataxia, and 'Neuronal loss in central nervous system' in a variety of diseases ([Fig. @fig-therapy-examples]). Of interest, all three of these neurodegenerative phenotypes involved alternatively activated (M2) macrophages. The role of macrophages in neurodegeneration is complex, with both neuroprotective and neurotoxic functions, including the clearance of misfolded proteins, the regulation of the blood-brain barrier, and the modulation of the immune response [@gaoMicrogliaNeurodegenerativeDiseases2023]. We also recapitulated prior evidence that microglia, the resident macrophages of the nervous system, are causally implicated in Alzheimer's disease (AD) ([Fig. @fig-therapy-examples-supp]) [@mcquadeMicrogliaAlzheimerDisease2019]. An important contribution of our current study is that we were able to pinpoint the specific phenotypes of AD caused by macrophages to neurofibrillary tangles and long-tract signs (reflexes that indicate the functioning of spinal long fiber tracts). Other AD-associated phenotypes were caused by other cell types (e.g. gastric goblet cells, enterocytes).
It should be noted that our study has several key limitations. First, while our cell type datasets are amongst the most comprehensive human scRNA-seq references currently available, they are nevertheless missing certain tissues, cell types (e.g. spermatocytes, oocytes), and life stages (post-natal childhood, senility). It is also possible that we have not captured certain cell state signatures that only occur in disease (e.g. disease-associated microglia [@keren-shaulUniqueMicrogliaType2017; @deczkowskaDiseaseAssociatedMicrogliaUniversal2018]). Though we reasoned that using only control cell type signatures would mitigate bias towards any particular disease, and avoid degradation of gene signatures due to loss of function mutations. Second, the collective knowledge of gene-phenotype and gene-disease associations is far from complete and we fully anticipate that these annotations will continue to expand and change well into the future. It is for this reason we designed this study to be easily reproduced within a single containerised script so that we (or others) may rerun it with updated datasets at any point. Finally, causality is notoriously difficult to prove definitively from associative testing alone, and our study is not exempt from this rule. Despite this, there are several reasons to believe that our approach is able to better approximate causal relationships than traditional approaches. First, we did not intentionally preselect any subset of phenotypes or cell types to investigate here. Along with a scaling prestep during linear modelling, this means that all the results are internally consistent and can be directly compared to one another (in stark contrast to literature meta-analyses). Furthermore, for the phenotype gene signatures we used expert-curated GenCC annotations [@DiStefano2022-ao; @DiStefano2023-np] to weight the current strength of evidence supporting a causal relationship between each gene and phenotype. This is especially important for phenotypes with large genes lists (thousands of annotations) for which some of the relationships may be tenuous. Within the cell type references, we deliberately chose to use specificity scores (rather than raw gene expression) as this normalisation procedure has previously been demonstrated to better distinguish between signatures of highly similar cell types/subtypes [@Skene2016-rb].
Common ontology-controlled frameworks like the HPO open a wealth of new opportunities, especially when addressing RDs. Services such as the Matchmaker Exchange [@Osmond2022-ml; @Philippakis2015-dq] have enabled the discovery of hundreds of underlying genetic etiologies, and led to the diagnosis of many patients. This also opens the possibility of gathering cohorts of geographically dispersed patients to run clinical trials, the only viable option for treatment in many individuals. To further increase the number of individuals who qualify for these treatments, as well as the trial sample size, proposals have been made deviate from the traditional single-disease clinical trial model and instead perform basket trials on groups of RDs with shared molecular etiologies (SaME) [@Zanello2023-zd].
Moving forward, we are now actively seeking industry and academic partnerships to begin experimentally validating our multi-scale target predictions and exploring their potential for therapeutic translation. Nevertheless, there are more promising therapeutic targets here than our research group could ever hope to pursue by ourselves. In the interest of accelerating research and ensuring RD patients are able to benefit from this work as quickly as possible, we have decided to publicly release all of the results described in this study. These can be accessed in multiple ways, including through a suite of R packages as well as a web app, the Rare Disease Celltyping Portal (https://neurogenomics.github.io/rare_disease_celltyping_apps/home/). The latter allows our results to be easily queried, filtered, visualised, and downloaded without any knowledge of programming. Through these resources we aim to make our findings useful to a wide variety of RD stakeholders including subdomain experts, clinicians, advocacy groups, and patients.
## Conclusions {#sec-conclusions}
In this study we aimed to develop a methodology capable of generating high-throughput phenome-wide predictions while preserving the accuracy and clinical utility typically associated with more narrowly focused studies. With the rapid advancement of gene therapy technologies, and a regulatory landscape that is evolving to better meet the needs of a large and diverse patient population, there is finally momentum to begin to realise the promise of genomic medicine. This has especially important implications for the global RD community which has remained relatively neglected. Here, we have provided a scalable, cost-effective, and fully reproducible means of resolving the multi-scale, cell-type specific mechanisms of virtually all rare diseases.
## Methods {#sec-methods}
### Human Phenotype Ontology
```{r get_gencc}
gencc <- KGExplorer::get_gencc(agg_by = NULL)
gencc_version <- KGExplorer::get_version(gencc, return_version = TRUE)
```
The latest version of the HPO (release `r KGExplorer::get_version(hpo, return_version = TRUE)`) was downloaded from the EMBL-EBI Ontology Lookup Service [@Cote2010-gp] and imported into R using the `HPOExplorer` package. This R object was used to extract ontological relationships between phenotypes as well as to assign absolute and relative ontological levels to each phenotype. The latest version of the HPO phenotype-to-gene mappings and phenotype annotations were downloaded from the official HPO GitHub repository and imported into R using `HPOExplorer`. This contains lists of genes associated with phenotypes via particular diseases, formatted as three columns in a table (gene, phenotype, disease).
However, not all genes have equally strong evidence of causality with a disease or phenotype, especially when considering that the variety of resources used to generate these annotations (OMIM, Orphanet, DECIPHER) use variable methodologies (e.g. expert-curated review of the medical literature vs. automated text mining of the literature). Therefore we imported data from the Gene Curation Coalition (GenCC) [@DiStefano2022-ao; @DiStefano2023-np], which (as of `r gencc_version`) `r nrow(gencc)` evidence scores across `r data.table::uniqueN(gencc$disease_curie)` diseases and `r data.table::uniqueN(gencc$gene_curie)` genes. Evidence scores are defined by GenCC using a standardised ordinal rubric which we then encoded as a semi-quantitative score ranging from 0 (no evidence of disease-gene relationship) to 6 (strongest evidence of disease-gene relationship) (see @tbl-gencc). As each Disease-Gene pair can have multiple entries (from different studies) with different levels of evidence, we then summed evidence scores per Disease-Gene pair to generate aggregated Disease-by-Gene evidence scores. This procedure can be described as follows.
Let us denote:
- $D$ as diseases.
- $P$ as phenotypes in the HPO.
- $G$ as genes
- $S$ as the evidence scores describing the strength of the relationship between each Disease-Gene pair.
- $M_{ij}$ as the aggregated Disease-by-Gene evidence score matrix.
$$
M_{ij} = \sum_{k=1}^{\text{f}} D_i G_j S_k
$$
Next, we extracted Disease-Gene-Phenotype relationships from the annotations file distributed by the HPO (*phenotype_to_genes.txt*). This provides a list of genes associated with phenotypes via particular diseases, but does not include any strength of evidence scores.
Here we define: - $A_{ijk}$ as the Disease-Gene-Phenotype relationships. - $D_i$ as the $i$th disease. - $G_j$ as the $j$th gene. - $P_k$ as the $k$th phenotype.
$$
A_{ijk} = D_i G_j P_k
$$
In order to assign evidence scores to each Phenotype-Gene relationship, we combined the aforementioned datasets from GenCC ($M_{ij}$) and HPO ($A_{ijk}$) by merging on the gene and disease ID columns. For each phenotype, we then computed the mean of Disease-Gene scores across all diseases for which that phenotype is a symptom. This resulted in a final 2D tensor of Phenotype-by-Gene evidence scores ($L_{ij}$):
::: {#eq-evidence-scores .content-hidden unless-format="html"}
![](equations/eq1.png){height="300px"}
\
Construction of the tensor of Phenotype-by-Gene evidence scores.
:::
\
\
::: {.content-visible unless-format="html"}
```{=tex}
$$
\eqnmarkbox[NavyBlue]{n1}{ L_{ij} } =
\begin{cases}
\frac{\sum_{k=1}^{\text{f}}
\eqnmarkbox[Cerulean]{n3a}{D_i G_j}
\eqnmarkbox[BlueViolet]{n5}{P_k}
}{\text{f}},
\text{if } \eqnmarkbox[Cerulean]{n3b}{D_i G_j}
\in \eqnmarkbox[blue]{n4a}{A},
\\
1, \hspace{2.2cm}
\text{if } \eqnmarkbox[Cerulean]{n3c}{D_i G_j}
\notin\eqnmarkbox[blue]{n4b}{A}
\end{cases}
$$
\annotate[yshift=1em]{left}{n1}{Tensor of Phenotype-by-Gene \\evidence scores}
\annotate[yshift=2.5em]{right}{n3a,n3b,n3c}{Tensor of Disease-by-Gene \\evidence scores}
\annotate[yshift=-2.5em]{below,left}{n4a,n4b}{Disease-by-Gene-by-Phenotype \\relationships}
\annotate[yshift=1em, xshift=-1em]{right}{n5}{Phenotype}
```
\
Construction of the tensor of Phenotype-by-Gene evidence scores.
:::
\
Histograms of evidence score distributions at each step in processing can be found in [Fig. @fig-evidence-histograms].
### Single-cell transcriptomic atlases
In this study, the gene by cell type specificity matrix was constructed using the Descartes Human transcriptome atlas of foetal gene expression, which contains a mixture of single-nucleus and single-cell RNA-seq data (collected with sci-RNA-seq3) [@Cao2020-qz]. This dataset contains 377,456 cells representing 77 distinct cell types across 15 tissues. All 121 human foetal samples ranged from 72 to 129 days in estimated postconceptual age. To independently replicate our findings, we also used the Human Cell Landscape which contains single-cell transcriptomic data (collected with microwell-seq) from embryonic, foetal, and adult human samples across 49 tissues [@Han2020-iq].
Specificity matrices were generated separately for each transcriptomic atlas using the R package `EWCE` (v1.11.3) [@Skene2016-rb]. Within each atlas, cell types were defined using the authors’ original freeform annotations in order to preserve the granularity of cell subtypes as well as incorporate expert-identified rare cell types. Cell types were only aligned and aggregated to the level of corresponding Cell Ontology (CL) [@Diehl2016-gt] annotations afterwards when generating summary figures and performing cross-atlas analyses. Using the original gene-by-cell count matrices from each single-cell atlas, we computed gene-by-cell type expression specificity matrices as follows. Genes with very no expression across any cell types were considered to be uninformative and were therefore removed from the input gene-by-cell matrix $F(g,i,c)$.
Next, we calculated the mean expression per cell type and normalised the resulting matrix to transform it into a gene-by-cell type expression specificity matrix ($S_{g,c}$). In other words, each gene in each cell type had a 0-1 score where 1 indicated the gene was mostly specifically expressed in that particular cell type relative to all other cell types. This procedure was repeated separately for each of the single-cell atlases and can be summarised as:
::: {#eq-ctd-specificity .content-hidden unless-format="html"}
![](equations/eq3.png){height="300px"}
Construction of the gene-by-cell type specificity matrix.
:::
\
::: {.content-visible unless-format="html"}
```{=tex}
\begin{equation*}
\eqnmarkbox[orange]{s1}{S_{gc}}
=
\frac{
\eqnmarkbox[purple]{s3a}{
\frac{
\sum_{i=1}^{|L|} F_{gic}
}{