-
Notifications
You must be signed in to change notification settings - Fork 18
/
Introduction.Rmd
1317 lines (1001 loc) · 73.4 KB
/
Introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Introduction to the myTAI package"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
fig_caption: yes
vignette: >
%\VignetteIndexEntry{Introduction to the myTAI package}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
---
```{r, echo = FALSE, message = FALSE}
library(myTAI)
options(width = 750)
knitr::opts_chunk$set(
comment = "#>",
error = FALSE,
tidy = FALSE)
```
# Table of Contents
1. [Scientific Introduction](#scientific-introduction-performing-evolutionary-transcriptomics-with-r)
2. [Installation](#installation)
3. [Motivation](#motivation)
4. [Retrieval of phylogenetic or taxonomic information](#retrieval-of-phylogenetic-or-taxonomic-information)
5. [Defining input data standards](#defining-input-data-standards)
6. [Performing a Standard Workflow for Evolutionary Transcriptomics Analyses](#performing-a-standard-workflow-for-evolutionary-transcriptomics-analyses)
## Scientific Introduction: Performing Evolutionary Transcriptomics with R
Quantifying `transcriptome conservation` or `transcriptome diversity` patterns within a biological process of interest allows researchers to extend a classic transcriptome study by addressing the question __why__ a particular developmental stage or environmental response timing is organized in the observed way. This is done by retrieving evolutionary age estimates of protein coding genes from a variety of gene age inference methods (see below) and adding this gene age information to the classic gene expression table. Dedicated transcriptome age quantification procedures implemented in the `myTAI` package then allow to capture the average transcriptome age for each developmental stage or environmental response stage at hand. Finally, a rigorous statistical assessment of observed transcriptome age patterns allows researchers to gain greater confidence in the robustness of observed patterns, ultimately facilitating the selection of promising candidate genes for further molecular functional studies.
## Installation
### Package dependencies
If users are interested in performing differential gene expression analyses with `myTAI`, they may install the `edgeR` package.
Users can download `myTAI` from [CRAN](https://cran.r-project.org/package=myTAI) :
```r
# install myTAI from CRAN
install.packages("myTAI", dependencies = TRUE)
```
## Motivation
Using embryo development of the plant _Arabidopsis thaliana_ as an example, we ask the user to imagine how one would investigate the differences of developmental transcriptomes across developmental stages.
![](Figures/Fig1.png)
<p style="font-size:10px;"> Figure 1: Gene expression distributions (= developmental transcriptome) throughout seven stages of _A. thaliana_ embryo development. Embryo development is divided into three phases: early embryogenesis (purple), mid-embryogenesis (green), and late embryogenesis (brown). This boxplot illustrates that the overall distributions of log2 expression levels (y-axis) hardly differ between developmental stages (x-axis) although the difference on the global scale is statistically significant (Kruskal-Wallis Rank Sum Test: p < 2e-16). Hence, a clear visual pattern of gene expression differences between early, mid, and late embryogenesis on the global scale can not be inferred. Adapted from [Drost, 2016](https://digital.bibliothek.uni-halle.de/urn/urn:nbn:de:gbv:3:4-18221). </p>
The objective of performing evolutionary transcriptomics studies is to classify a transcriptome into different categories of genes sharing similar evolutionary origins (detectable homologs) or genes that share similar phylogenetic relationships (orthologous genes) and to study the overall expression patterns of these classified genes throughout the biological process of interest. Thus, by introducing a phylogenetic or taxonomic variable to a transcriptome dataset, we can determine stages or time points that are under stronger constraints than others, indicating switches between biological programs or functions.
![](Figures/Fig2.png)
<p style="font-size:10px;"> Figure 2: Gene expression distributions (= developmental transcriptome) throughout seven stages of _A. thaliana_ embryo development classified into distinctive age categories. Each box represents the developmental stage during _A. thaliana_ embryogenesis, the y-axis denotes the log2 expression levels of genes that fall into the corresponding age category shown on the x-axis. Hence, each boxplot represents the gene expression distribution of genes that are classified into the corresponding age class during a specific developmental stage. The gene age distribution of _A. thaliana_ genes range from PS1 to PS12 where PS1 represents the evolutionarily most distant age category (cellular org.) and PS12 the evolutionary most recent age category (_A. thaliana_ specific). Yellow dots in the boxplots denote the mean expression level of the corresponding expression distribution. This visualization illustrates that although the global gene expression distributions do not change visually between developmental stages (Fig. 1), the global gene expression distributions of age categories differ between stages of _A. thaliana_ embryo development, and thus, allow studying the effect of transcriptome evolution and conservation on embryo development. Adapted from [Drost, 2016](https://digital.bibliothek.uni-halle.de/urn/urn:nbn:de:gbv:3:4-18221). </p>
Conceptually, the idea behind evolutionary transcriptomics studies is to combine the phylogenetic relationship between species (usually retrieved from comparative genomics studies in terms of gene homology confirmed by sequence identity) with transcriptome data of a reference species quantifying a particular biological process of interest (e.g. mutant gene expression versus WT gene expression, stress responses, cell differentiation, development, etc.). Usually, transcriptome data comes from `Next Generation Sequencing` technologies such as RNA-Seq or from Microarray experiments.
<p style="text-align: center; font-size:20px;"> Phylogenetic Information + Transcriptome Data </p>
Or in other words:
<p style="text-align: center; font-size:20px;"> Comparative Genomics + Transcriptomics = Evolutionary Transcriptomics </p>
In theory, any published or newly generated transcriptome dataset can be used to capture evolutionary signatures with `myTAI`.
`myTAI` is designed to receive phylogenetic information obtained from comparative genomics data and transcriptome data as input and internally combines these datasets to perform evolutionary transcriptomics analyses.
![](Figures/Fig3.png)
<p style="font-size:10px;"> Figure 3: Workflow describing the input and output of the myTAI package. The myTAI package takes phylogenetic information such as phylogenetic trees (see [Dunn, 2013](https://www.academia.edu/27635691/The_Comparative_Biology_of_Gene_Expression) ), genomic phylostratigraphy based gene age inference (see [Domazet-Loso et al., 2007](https://www.sciencedirect.com/science/article/pii/S0168952507002995); [Capra et al., 2013](https://www.sciencedirect.com/science/article/pii/S016895251300111X); [Liebeskind et al., 2016](https://academic.oup.com/gbe/article/8/6/1812/2574026) ), by dNdS estimation of orthologous genes (see [Quint, Drost et al., 2012](https://www.nature.com/articles/nature11394) and [Drost et al., 2015](https://academic.oup.com/mbe/article/32/5/1221/1125964)), or phylogenetic reconciliation (see [Doyon et al, 2011](https://pubmed.ncbi.nlm.nih.gov/21949266/) ) and a RNA-Seq or Microarray based transcriptome dataset as input. Internally, myTAI then combines the phylogenetic data and the transcriptome data an provides numerous functions to perform evolutionary transcriptomics analyses. Here, we exemplify the output of the functions `PlotSignature()`, `PlotRE()` and `PlotCategoryExpr()`. </p>
## Retrieval of phylogenetic or taxonomic information
For the comparative genomics part there are different methods and tools to quantify sequence homology between genes, miRNAs, lncRNAs etc of a reference species and related species. For example, for phylogenetic or taxonomic information retrieval such as phylogenetic trees, genomic phylostratigraphy based gene age inference, dNdS estimation of orthologous genes or phylogenetic reconciliation can be used. Below users can find the most recent tools and resources for retrieving or computing phylogenetic or taxonomic relationships for an organism of interest.
__Generate or retrieve phylostratigraphic maps:__
- [GenEra](https://github.com/josuebarrera/GenEra): a fast, easy-to-use and highly customizable command-line tool that estimates _gene-family founder events_. Please consult the paper for more information on _homology detection failure_ and other methodological considerations compared to previous approaches ([Barrera-Redondo et al., 2023](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02895-z)). Users can also use GenEra for generating phylostratigraphic maps based on _protein structure_ similarity, using [Foldseek](https://github.com/steineggerlab/foldseek). See [here](https://github.com/josuebarrera/GenEra/wiki/Downstream-analyses) for processing the output for `myTAI`.
- [orthomap](https://github.com/kullrich/orthomap): a python package to extract _orthologous_ maps (in other words the evolutionary age of a given orthologous group) from [OrthoFinder](https://github.com/davidemms/OrthoFinder)/[eggNOG](http://eggnog5.embl.de/#/app/home)/[PLAZA](https://bioinformatics.psb.ugent.be/plaza/) results. See also how one can [use some myTAI function within orthomap in python](https://orthomap.readthedocs.io/en/latest/tutorials/mytai.html).
- [GenOrigin](http://chenzxlab.hzau.edu.cn/GenOrigin/#!/): GenOrigin inferred gene age information of 9,102,113 genes from 565 species
- [R package: phylostratr](https://github.com/arendsee/phylostratr) which implements `Phylostratigraphy` as an `R package` - Please consult the [Vignettes](https://github.com/arendsee/phylostratr/tree/master/vignettes) for examples.
- [R package: fagin](https://github.com/arendsee/fagin): a synteny-based phylostratigraphy and finer classification of young genes (see also the [corresponding manuscript](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3023-y) and [example applications](https://github.com/arendsee/fagin-case-studies))
- [createPSmap.pl](https://github.com/AlexGa/Phylostratigraphy): generate a phylostratigraphic map (implemented by Alexander Gabel)
- [phylostratigraphy.pl](https://figshare.com/articles/dataset/A_developmental_hourglass_in_fungi/1284022): generate a phylostratigraphic (implemented by Cheng et al. 2015)
- [phylo_pipeline.py](https://mynotebook.labarchives.com/share/Shuqing_Xu/NTIuMHw2NDQyMi80MC9UcmVlTm9kZS8zNDEyODM1NjE2fDEzMi4w): a Python script to generate a phylostratigraphic map (implemented by Shuqing Xu)
- [Genomic-phylostratigraphy-tool](https://github.com/longjunwu/Genomic-phylostratigraphy-tool): a Python script to generate a phylostratigraphic map (implemented by Longjun Wu)
- [ORFanFinder](https://bcb.unl.edu/orfanfinder/): generate a phylostratigraphic map
- `Protein Historian`: generate a gene age map
- download pre-computed and published [phylostratigraphic maps](https://github.com/drostlab/published_phylomaps)
- [phylomapR](https://github.com/LotharukpongJS/phylomapr): quick retrieval of precomputed gene age maps (phylomaps) in R for easy integration with `myTAI`.
- [Liebeskind et al., 2016](https://github.com/marcottelab/Gene-Ages): use a gene age consensus approach to estimating gene ages for model organisms
- [orthoscape](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1427-5): a cytoscape application for grouping and visualization KEGG based gene networks by taxonomy and homology principles (implemented by Mustafin et al., 2017)
- [RecBlast](http://biorxiv.org/content/early/2017/03/02/112946): Cloud-Based Large Scale Orthology Detection (by Efrat Rapoport and Moran Neuhof)
- [protTrace](https://github.com/BIONF/protTrace): A simulation based framework to estimate the evolutionary traceability of protein
### dNdS estimation of orthologous genes
We recently proposed to use the classical dNdS measure to quantify the sequence conservation of protein coding genes between closely related species. This way, we combine the information about the selective pressure acting on a particular gene with its expression level during a particular time point or condition. We refer to this approach as Divergence Stratigraphy ([Drost et al., 2015 _Mol. Biol. Evol._](https://academic.oup.com/mbe/article/32/5/1221/1125964)). Analogous to gene age inference methods,
divergence stratigraphy generates a table storing the sequence conservation estimate in the first column and the corresponding gene id of the organism of interest in the second column. This table is named _divergence stratigraphic map_.
__Generate or retrieve divergence stratigraphic maps:__
- [orthologr](https://github.com/drostlab/orthologr): generate a [divergence stratigraphic map](https://github.com/drostlab/orthologr/blob/master/vignettes/divergence_stratigraphy.Rmd) (implemented by Hajk-Georg Drost)
- [compute_dNdS.pl](https://figshare.com/articles/dataset/A_developmental_hourglass_in_fungi/1284022): generate a divergence stratigraphic map (implemented by Cheng et al. 2015)
- [MetaPhOrs](https://orthology.phylomedb.org/): retrieve phylogeny-based orthology and paralogy predictions
- download pre-computed and published [divergence stratigraphic maps](https://github.com/drostlab/published_phylomaps)
- [orthoscape](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1427-5): a cytoscape application for grouping and visualization KEGG based gene networks by taxonomy and homology principles (implemented by Mustafin et al., 2017)
- [RecBlast](http://biorxiv.org/content/early/2017/03/02/112946): Cloud-Based Large Scale Orthology Detection (by Efrat Rapoport and Moran Neuhof)
### Generate custom table
In general, users can construct their own gene age assignment methods and are not limited to the methods listed above.
After formatting the corresponding results to the _age map_
specification (age assignment in the first column and gene id in the second column), users can use
any function in myTAI with their custom gene age assignment table.
## Getting Started with `myTAI`
`myTAI` takes an _age map_ and an expression dataset as input and combines both tables to the quantify transcriptome conservation for the biological process of interest.
### Defining input data standards
The following code illustrates an example structure of an `age map`. Here we choose genomic phylostratigraphy and dNdS estimation as method to generate a _phylostratigraphic map_ and _divergence stratigraphic map_:
```{r,eval=FALSE}
# load myTAI
library(myTAI)
# load example data sets (stored in myTAI)
data(PhyloExpressionSetExample)
data(DivergenceExpressionSetExample)
# show an example phylostratigraphic map of Arabidopsis thaliana
head(PhyloExpressionSetExample[ , c("Phylostratum","GeneID")])
```
```
Phylostratum GeneID
1 1 at1g01040.2
2 1 at1g01050.1
3 1 at1g01070.1
4 1 at1g01080.2
5 1 at1g01090.1
6 1 at1g01120.1
```
In detail, a _phylostratigraphic map_ stores the gene age assignment generated with e.g. [phylostratigraphy](https://github.com/AlexGa/Phylostratigraphy) in the
first columns and the corresponding gene id in the second column.
Analogously, a divergence stratigraphic map stores the gene age assignment generated with e.g. [divergence stratigraphy](https://github.com/drostlab/orthologr/blob/master/vignettes/divergence_stratigraphy.Rmd) in the
first column and the corresponding gene id in the second column:
```{r,eval=FALSE}
# show an example structure of a Divergence Map
head(DivergenceExpressionSetExample[ , c("Divergence.stratum","GeneID")])
```
```
Divergence.stratum GeneID
1 1 at1g01050.1
2 1 at1g01120.1
3 1 at1g01140.3
4 1 at1g01170.1
5 1 at1g01230.1
6 1 at1g01540.2
```
Hence, `myTAI` relies on pre-computed _age maps_ fulfilling the aforementioned standard for all downstream analyses. It does not matter whether or not age maps contain categorized age values like in `phylostratigraphic maps` or e.g. phylogenetic distance values generated by phylogenetic inference.
### Expression dataset specification
The aim of any evolutionary transcriptomics study is to quantify transcriptome conservation in biological processes.
For this purpose, users need to provide the transcriptome dataset of their studied biological process.
In the following examples we will use a `gene expression dataset` covering seven stages of _Arabidopsis thaliana_ embryo development. This data format is defined as `ExpressionMatrix` in the `myTAI` data format specification.
```
# gene expression set
GeneID Zygote Quadrant Globular Heart Torpedo Bent Mature
1 at1g01040.2 2173.6352 1911.2001 1152.5553 1291.4224 1000.2529 962.9772 1696.4274
2 at1g01050.1 1501.0141 1817.3086 1665.3089 1564.7612 1496.3207 1114.6435 1071.6555
3 at1g01070.1 1212.7927 1233.0023 939.2000 929.6195 864.2180 877.2060 894.8189
4 at1g01080.2 1016.9203 936.3837 1181.3381 1329.4734 1392.6429 1287.9746 861.2605
5 at1g01090.1 11424.5667 16778.1685 34366.6493 39775.6405 56231.5689 66980.3673 7772.5617
6 at1g01120.1 844.0414 787.5929 859.6267 931.6180 942.8453 870.2625 792.7542
```
The function `MatchMap()` allows users to join a _phylostratigraphic map_ with an _ExpressionMatrix_ to obtain a joined table referred to as _PhyloExpressionSet_. In some cases, the GeneIDs stored in the `ExpressionMatrix` and in the _phylostratigraphic map_ do not match. This is due to GeneID mappings between different databases and annotations. To map non matching GeneIDs between databases and annotations, please consult the [Functional Annotation Vignette](https://github.com/ropensci/biomartr/blob/master/vignettes/Functional_Annotation.Rmd) in the [biomartr](https://github.com/ropensci/biomartr) package. The `biomartr` package allows users to map GeneIDs between database annotations.
After matching a _phylostratigraphic map_ with an _ExpressionMatrix_ using the `MatchMap()` function, a standard _PhyloExpressionSet_ is returned storing the phylostratum assignment of a given gene in the first column, the _gene id_ of the corresponding gene in the second column, and the entire gene expression set (time series or treatments) starting with the third column. This format is crucial for all functions that are implemented in the `myTAI` package.
```{r,eval=FALSE}
library(myTAI)
# load the example data set
data(PhyloExpressionSetExample)
# construct an example Phylostratigraphic Map
Example.PhylostratigraphicMap <- PhyloExpressionSetExample[ , 1:2]
# construct an example ExpressionMatrix
Example.ExpressionMatrix <- PhyloExpressionSetExample[ , 2:9]
# join a PhylostratigraphicMap with an ExpressionMatrix using MatchMap()
Example.PhyloExpressionSet <- MatchMap(Example.PhylostratigraphicMap, Example.ExpressionMatrix)
# look at a standard PhyloExpressionSet
head(Example.PhyloExpressionSet, 3)
```
```
Phylostratum GeneID Zygote Quadrant Globular Heart Torpedo Bent Mature
1 4 at1g01010.1 878.2158 828.2301 776.0703 753.9589 775.3377 756.2460 999.9118
2 2 at1g01020.1 1004.9710 1106.2621 1037.5141 939.0830 961.5249 871.4684 997.5953
3 3 at1g01030.1 819.4880 771.6396 810.8717 866.7780 773.7893 747.9941 785.6105
```
Analogous to a standard _PhyloExpressionSet_, a standard _DivergenceExpressionSet_ is a `data.frame` storing the divergence stratum assignment of a given gene in the first column, the gene id of the corresponding gene in the second column, and the entire gene expression set (time series or treatments) starting with the third column.
The following `DivergenceExpressionSet` example illustrates the standard `DivergenceExpressionSet` data set format.
```{r,eval=FALSE}
# head of an example standard DivergenceExpressionSet
head(DivergenceExpressionSetExample, 3)
```
```
Divergence.stratum GeneID Zygote Quadrant Globular Heart Torpedo Bent Mature
1 1 at1g01050.1 1501.0141 1817.3086 1665.3089 1564.761 1496.3207 1114.6435 1071.6555
2 1 at1g01120.1 844.0414 787.5929 859.6267 931.618 942.8453 870.2625 792.7542
3 1 at1g01140.3 1041.4291 908.3929 1068.8832 967.749 1055.1901 1109.4662 825.4633
```
A _DivergenceExpressionSet_ defines the joined table between a _divergence stratigraphic map_ and a _Expression Set_. A _DivergenceExpressionSet_ can be generated analogous to a _PhyloExpressionSet_ by joining a _divergence stratigraphic map_ with an _ExpressionMatrix_ using the `MatchMap()` function. In some cases, the GeneIDs stored in the _ExpressionMatrix_ and in the _divergence stratigraphic map_ do not match. This is due to GeneID mappings between different databases and annotations. To map non matching GeneIDs between databases and annotations, please consult the [Functional Annotation Vignette](https://github.com/ropensci/biomartr/blob/master/vignettes/Functional_Annotation.Rmd) in the [biomartr](https://github.com/ropensci/biomartr) package.
Each function implemented in `myTAI` checks internally whether or not the _PhyloExpressionSet_ or _DivergenceExpressionSet_ standard is fulfilled.
```{r,eval=FALSE}
# used by all myTAI functions to check the validity of the PhyloExpressionSet standard
is.ExpressionSet(PhyloExpressionSetExample)
```
```
[1] TRUE
```
In case the PhyloExpressionSet standard is violated, the `is.ExpressionSet()` function will return `FALSE` and the corresponding function within the `myTAI` package will return an error message.
```{r,eval=FALSE}
# used a non standard PhyloExpressionSet
head(PhyloExpressionSetExample[ , 2:5], 2)
```
```
GeneID Zygote Quadrant Globular
1 at1g01040.2 2173.635 1911.200 1152.555
2 at1g01050.1 1501.014 1817.309 1665.309
```
```{r, error = TRUE,eval=FALSE}
is.ExpressionSet(PhyloExpressionSetExample[ , 2:5])
```
```
Error in is.ExpressionSet(PhyloExpressionSetExample[, 2:5]) :
The present input object does not fulfill the ExpressionSet standard.
```
__It might be that you work with a `tibble` object which will not be recognized by `is.ExpressionSet`. In that case, please convert your `tibble` object
to a `data.frame` using the function `as.data.frame()`.__
```{r, eval = FALSE}
# convert any tibble to a data.frame
PhyloExpressionSetExample <- as.data.frame(PhyloExpressionSetExample)
# now is.ExpressionSet() should return TRUE
is.ExpressionSet(PhyloExpressionSetExample)
```
__The PhyloExpressionSet and DivergenceExpressionSet formats are crucial for all functions that are implemented in the `myTAI` package__.
Keeping these standard data formats in mind will provide users with the most important requirements to get started with the `myTAI` package.
__Note__, that within the code of each function, the argument `ExpressionSet` always refers to either a PhyloExpressionSet or a DivergenceExpressionSet, whereas in specialized functions some arguments are specified as _PhyloExpressionSet_ when they take an PhyloExpressionSet as input data set, or specified as _DivergenceExpressionSet_ when they take an _DivergenceExpressionSet_ as input data set.
## Performing a Standard Workflow for Evolutionary Transcriptomics Analyses
The main goal of any evolutionary transcriptomics study is to quantify transcriptome conservation at a particular stage or treatment. This is achieved by computing the average age of genes that contribute to the transcriptome at that stage or treatment. In other words, by multiplying the gene age value with the expression level of the corresponding gene and averaging over all genes, we obtain the mean age of the transcriptome. Hence, we can say that at a particular stage `genes that are most expressed at this stage or treatment have (on average) the evolutionary age XY`.
To obtain this mean age value, several measures were introduced:
#### Transcriptome Age Index
The first measure named _Transcriptome Age Index_ (TAI) was introduced by Domazet-Loso and Tautz, 2010 and represents a weighted arithmetic mean of the transcriptome age during a corresponding developmental stage _s_.
$TAI_s = \sum_{i = 1}^n \frac{ps_i * e_{is}}{\sum_{i = 1}^n e_{is}}$
where $ps_i$ denotes the phylostratum assignment of gene $i$ and $e_{is}$ denotes
the gene expression level of gene $i$ at developmental time point $s$. A lower value of TAI describes an older transcriptome age, whereas a higher value of TAI denotes a younger transcriptome age.
The following figure shows the TAI computations for the seven stages of _A. thaliana_ embryo development.
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
data(PhyloExpressionSetExample)
# Plot the Transcriptome Age Index of a given PhyloExpressionSet
# Test Statistic : Flat Line Test (default)
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "FlatLineTest",
xlab = "Ontogeny",
ylab = "TAI" )
```
The x-axis shows the seven stages of _A. thaliana_ embryo development and the y-axis shows the corresponding mean transcriptome age (TAI) value. The lower the TAI value the older the mean transcriptome age and the higher the TAI value the younger the mean transcriptome age.
The interpretation of the TAI values on the y-axis is given by the next figure.
![](Figures/Fig4.png)
In this example, a TAI value of 3.5 quantifies that genes that contribute most the transcriptome at a particular stage emerged on average between phylostratum 3 and phylostratum 4. Due to the nature of the arithmetic mean, this value does not represent the true origin of individual genes, and thus the TAI measure is only helpful to screen for stages that express (on average) older or younger genes. Subsequent analyses such as mean expression of age categories, relative expression levels, and gene expression level distributions for each age category will then reveal which exact genes or age categories generate the overall TAI value.
To obtain a more detailed overview of which age categories contribute how much to
each developmental stage, the gene expression level distributions for each age category and each developmental stage can be visualized (using the `PlotCategoryExpr()` function).
```{r, fig.width= 9, fig.height= 5,eval=FALSE}
data(PhyloExpressionSetExample)
# category-centered visualization of PS
# specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
legendName = "PS",
test.stat = TRUE,
type = "category-centered",
distr.type = "boxplot",
log.expr = TRUE)
```
This figure shows that in all developmental stages, genes coming from PS1-3 are (on average) more expressed than genes coming from PS4-12. Interestingly, the gene expression level distributions of PS4-12 become more equally distributed towards the Torpedo stage which has been marked as the most conserved stage by TAI analysis. This general trend can be visualized using the `PlotMeans()` function.
```{r, fig.width= 9, fig.height= 5,eval=FALSE}
data(PhyloExpressionSetExample)
# plot evolutionary old PS (PS1-3) vs
# evolutionary young PS (PS4-12)
PlotMeans(PhyloExpressionSetExample,
Groups = list(c(1:3), c(4:12)),
legendName = "PS",
adjust.range = TRUE)
```
Here, users will observe that indeed PS1-3 genes are (on average) higher expressed than PS4-12 genes.
Using a linear transformation of the mean expression levels into the interval $[0,1]$ (Quint et al., 2012 and Drost et al., 2015) we can compare mean expression patterns between Phylostrata independent from their actual mean expression magnitude. A relative expression level of 0 denotes the minimum mean expression level compared to all other stages and a relative expression level of 1 denotes the maximum mean expression level compared to all other stages.
The following figure illustrates the average gene expression profile for each phylostratum.
```{r, fig.width= 9, fig.height= 5,eval=FALSE}
data(PhyloExpressionSetExample)
# plot evolutionary old PS (PS1-3) vs
# evolutionary young PS (PS4-12)
PlotRE(PhyloExpressionSetExample,
Groups = list(c(1:3), c(4:12)),
legendName = "PS",
adjust.range = TRUE)
```
Users will observe, that PS4-12 genes are down-regulated towards the Torpedo stage (marked as most conserved by TAI analysis) and up-regulated after the Torpedo stage.
To cluster gene expression levels of PS4-12 genes we categorize _A. thaliana_ embryogenesis into three developmental modules (early, mid, and late) and cluster young genes (PS4-12) according to their fold-change pattern: High-Low-Low, High-Low-High, and Low-Low-High.
```{r,eval=FALSE,echo=FALSE,fig.width= 7, fig.height= 15}
# select PS4-12 genes
PhyloExpressionSetExample.PS4_12 <-
dplyr::filter(PhyloExpressionSetExample, Phylostratum %in% c(4:12))
# categorize A. thaliana embryogenesis into three
# developmental modules (early, mid, and late)
Ath.Embryogenesis.DiffGenes <-
DiffGenes(
ExpressionSet = PhyloExpressionSetExample.PS4_12,
nrep = c(2, 3, 2),
stage.names = c("Early", "Mid", "Late")
)
# cluster young genes (PS4-12) according to their fold-change pattern: High-Low-Low
Ath.Embryo.High_Low_Low <-
PhyloExpressionSetExample.PS4_12[which((Ath.Embryogenesis.DiffGenes[, "Early->Mid"] > 3) &
(dplyr::between(Ath.Embryogenesis.DiffGenes[, "Mid->Late"], 1, 2))),]
# cluster young genes (PS4-12) according to their fold-change pattern: High-Low-High
Ath.Embryo.High_Low_High <-
PhyloExpressionSetExample.PS4_12[which((Ath.Embryogenesis.DiffGenes[, "Early->Mid"] > 3) &
(Ath.Embryogenesis.DiffGenes[, "Mid->Late"] < 0.2)),]
# cluster young genes (PS4-12) according to their fold-change pattern: Low-Low-High
Ath.Embryo.Low_Low_High <-
PhyloExpressionSetExample.PS4_12[which((dplyr::between(Ath.Embryogenesis.DiffGenes[, "Early->Mid"], 1, 2)) &
(Ath.Embryogenesis.DiffGenes[, "Mid->Late"] < 0.2)),]
par(mfrow = c(3, 1))
matplot(
t(Ath.Embryo.High_Low_Low[, 3:9]),
type = "l",
lty = 1,
lwd = 2,
col = "lightblue",
xlab = "Ontogeny",
ylab = "Expression Level",
xaxt = "n",
cex.lab = 1.5,
cex.axis = 1.5,
main = "High-Low-Low"
)
lines(
colMeans(Ath.Embryo.High_Low_Low[, 3:9]),
lwd = 6 ,
col = "red",
cex.lab = 1.5,
cex.axis = 1.5,
cex = 1.5
)
axis(1, 1:7, names(PhyloExpressionSetExample)[3:9])
text(
4,
max(Ath.Embryo.High_Low_Low[, 3:9]) - 6000,
labels = paste0(nrow(Ath.Embryo.High_Low_Low), " Genes"),
cex = 2
)
matplot(
t(Ath.Embryo.High_Low_High[, 3:9]),
type = "l",
lty = 1,
lwd = 2,
col = "lightblue",
xlab = "Ontogeny",
ylab = "Expression Level",
xaxt = "n",
cex.lab = 1.5,
cex.axis = 1.5,
main = "High-Low-High"
)
lines(
colMeans(Ath.Embryo.High_Low_High[, 3:9]),
lwd = 6 ,
col = "red",
cex.lab = 1.5,
cex.axis = 1.5,
cex = 1.5
)
axis(1, 1:7, names(PhyloExpressionSetExample)[3:9])
text(
4,
max(Ath.Embryo.High_Low_High[, 3:9]) - 6000,
labels = paste0(nrow(Ath.Embryo.High_Low_High), " Genes"),
cex = 2
)
matplot(
t(Ath.Embryo.Low_Low_High[, 3:9]),
type = "l",
lty = 1,
lwd = 2,
col = "lightblue",
xlab = "Ontogeny",
ylab = "Expression Level",
xaxt = "n",
cex.lab = 1.5,
cex.axis = 1.5,
main = "Low-Low-High"
)
lines(
colMeans(Ath.Embryo.Low_Low_High[, 3:9]),
lwd = 6 ,
col = "red",
cex.lab = 1.5,
cex.axis = 1.5,
cex = 1.5
)
axis(1, 1:7, names(PhyloExpressionSetExample)[3:9])
text(
4,
max(Ath.Embryo.Low_Low_High[, 3:9]) - 6000,
labels = paste0(nrow(Ath.Embryo.Low_Low_High), " Genes"),
cex = 2
)
```
As a result, we find that there are two distinct sets of young genes: High-high-low and low-low-high. Almost none of the genes have a high-low-high pattern.
Finally, users can perform KEGG or GO Term enrichment analyses to obtain the annotated functions of these gene sets.
#### Transcriptome Divergence Index
Analogous to the TAI measure, the _Transcriptome Divergence Index_ (TDI) was introduced by Quint et al., 2012 and Drost et al., 2015 as a measure of average transcriptome selection pressure where $s$ denotes the corresponding developmental stage.
$TDI_s = \sum_{i = 1}^n \frac{ds_i * e_{is}}{\sum_{i = 1}^n e_{is}}$
where $ds_i$ denotes the divergence stratum assignment of gene $i$ and $e_{is}$ denotes
the gene expression level of gene $i$ at developmental time point $s$.
A lower value of TDI describes an more conserved transcriptome (in terms of sequence dissimilarity), whereas a higher value of TDI denotes a more variable transcriptome.
To assess the statistical significance of all introduced measures and analyses, we developed several test statistics that are introduced in the following sections.
## Transcriptome Age Index Analyses
Evolutionary signatures of transcriptomes can be captured by computing transcriptome indices at different measured stages of development, combining these computed values to a transcriptome index profile across the measured stages, and comparing the resulting profile with a flat line. A profile not significantly deviating from a flat line indicates the absence of significant variations of the computed transcriptome index from stage to stage. In contrast, a profile significantly deviating from a flat line indicates the presence of significant variations from stage to stage. We refer to any transcriptome index profile significantly deviating from a flat line as phylotranscriptomic pattern or evolutionary signature.
Previously, we introduced three statistical tests to quantify the significance of observed TAI or TDI patterns: `Flat Line Test`, `Reductive Hourglass Test`, and `Reductive Early Conservation Test` ([Drost et al., 2015](https://academic.oup.com/mbe/article/32/5/1221/1125964)).
The `PlotPattern()` function introduced in the following sections is the main analytics function of myTAI. `PlotPattern()` allows users to visualize TAI or TDI patterns and internally performs the following statistical tests to assess their significance.
### Flat Line Test
The `PlotSignature()` function with option `TestStatistic = "FlatLineTest",` first computes the `TAI` (given a PhyloExpressionSet and argument specification `measure = "TAI"`) or the `TDI` (given a DivergenceExpressionSet and argument specification `measure = "TDI"`) profile as well as their standard deviation, and statistical significance.
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
data(PhyloExpressionSetExample)
# Plot the Transcriptome Age Index of a given PhyloExpressionSet
# Test Statistic : Flat Line Test (default)
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "FlatLineTest",
xlab = "Ontogeny",
ylab = "TAI" )
```
The p-value (`p_flt`) above the TAI curve is returned by the `FlatLineTest`. As described in the
documentation of `PlotSignature()` (`?PlotSignature` or `?FlatLineTest`), the `FlatLineTest` is the default statistical test to quantify the statistical significance of the observed phylotranscriptomic pattern. In detail, the test quantifies any statistically significant deviation of the phylotranscriptomic pattern from a flat line. Here, we define any significant deviation of a phylotranscriptomic pattern from a flat line as evolutionary signature Furthermore, we define corresponding stages of deviation as evolutionary conserved or variable (less conserved) depending on the magnitude of `TAI` and corresponding p-values.
### Reductive Hourglass Test
In case the observed phylotranscriptomic pattern not only significantly deviates from a flat line but also visually resembles an _hourglass_ shape,
one can obtain a p-value quantifying the statistical significance of a visual _hourglass_ pattern based on the `ReductiveHourglassTest` (`?ReductiveHourglassTest`).
Since the `ReductiveHourglassTest` has been defined for a priori biological knowledge ([Drost et al., 2015](https://academic.oup.com/mbe/article/32/5/1221/1125964)), the `modules` argument within the `ReductiveHourglassTest()` function needs to be specified.
Three modules need to be specified: an __early-module__, __phylotypic module__ (mid), and a __late-module__.
For this example we divide _A. thaliana_ embryo development stored within the _PhyloExpressionSetExample_ into the following three modules:
* early = stages 1 - 2 (Zygote and Quadrant)
* mid = stages 3 - 5 (Globular, Heart, and Torpedo)
* late = stages 6 - 7 (Bent and Mature)
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Plot the Transcriptome Age Index of a given PhyloExpressionSet
# Test Statistic : Reductive Hourglass Test
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "ReductiveHourglassTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
xlab = "Ontogeny",
ylab = "TAI" )
```
The corresponding p-value `p_rht` now denotes the p-value returned by the `ReductiveHourglassTest`
which is different from the p-value returned by the `FlatLineTest` (`p_flt`).
To make sure that correct modules have been selected to perform the `ReductiveHourglassTest`, users can use the `shaded.area` argument to visualize chosen modules:
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Visualize the phylotypic period used for the Reductive Hourglass Test
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "ReductiveHourglassTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
shaded.area = TRUE,
xlab = "Ontogeny",
ylab = "TAI" )
```
__Note__ that for defining a priori knowledge for the `ReductiveHourglassTest` using the `modules` argument, modules need to start at stage 1, ..., N and do not correspond to the column position in the _PhyloExpressionSet/DivergenceExpressionSet_
which in contrast would start at position 3, ... N + 2.
In a biological context, it is not always clear which stages could define the early, mid, and late
modules. For animal embryogenesis, it has been suggested to choose early, mid, and late modules
according to the morphological conservation of animal embryos (e.g. mid-stage vertebrate embryos seem to be morphologically conserved). For plants, in contrast no morphological conservation has been reported and thus the average expression of embryo defective genes has been used to define modules (Drost et al. 2015).
### Reductive Early Conservation Test
The third test statistic that is implemented in the `myTAI` package is the `EarlyConservationTest`.
The `EarlyConservationTest` tests whether an observed phylotranscriptomic pattern follows
a low-high-high pattern (monotonically increasing function) supporting the Early Conservation Model of embryogenesis.
Analogous to the `ReductiveHourglassTest`, the `EarlyConservationTest` needs a priori biological knowledge [Drost et al., 2015](https://academic.oup.com/mbe/article/32/5/1221/1125964).
So again three `modules` have to be specified for the `EarlyConservationTest()` function.
Three modules need to be specified: an __early-module__, __phylotypic module__ (mid), and a __late-module__.
For this example we divide _A. thaliana_ embryo development stored within the _PhyloExpressionSetExample_ into the following three modules:
* early = stages 1 - 2 (Zygote and Quadrant)
* mid = stages 3 - 5 (Globular, Heart, and Torpedo)
* late = stages 6 - 7 (Bent and Mature)
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Plot the Transcriptome Age Index of a given PhyloExpressionSet
# Test Statistic : Reductive Early Conservation Test
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "EarlyConservationTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
xlab = "Ontogeny",
ylab = "TAI" )
```
The corresponding p-value `p_ect` now denotes the p-value returned by the `EarlyConservationTest`
which is different from the p-value returned by the `FlatLineTest` (`p_flt`) and `ReductiveHourglassTest` (`p_rht`).
Since the present TAI pattern of the _PhyloExpressionSetExample_ doesn't support the Early Conservation Hypothesis,
the p-value `p_ect` = 1.
Again __note__ that for defining a priori knowledge for the `EarlyConservationTest` using the `modules` argument, modules need to start at stage 1, ..., N and do not correspond to the column position in the _PhyloExpressionSet/DivergenceExpressionSet_
which in contrast would start at position 3, ... N + 2.
To obtain the numerical _TAI_ values, the `TAI()` function can be used:
```{r,eval=FALSE}
# Compute the Transcriptome Age Index values of a given PhyloExpressionSet
TAI(PhyloExpressionSetExample)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334
```
### Reverse Hourglass Test
In case the observed phylotranscriptomic pattern not only significantly deviates from a flat line but also visually resembles an _reverse hourglass_ shape (low-high-low pattern),
one can obtain a p-value quantifying the statistical significance of a visual _reverse hourglass_ pattern based on the `ReverseHourglassTest` (`?ReverseHourglassTest`).
Since the `ReverseHourglassTest` has been defined for a priori biological knowledge ([Drost et al., 2015](https://academic.oup.com/mbe/article/32/5/1221/1125964)), the `modules` argument within the `ReverseHourglassTest()` function needs to be specified.
Three modules need to be specified: an __early-module__, __phylotypic module__ (mid), and a __late-module__.
For this example we divide _A. thaliana_ embryo development stored within the _PhyloExpressionSetExample_ into the following three modules:
* early = stages 1 - 2 (Zygote and Quadrant)
* mid = stages 3 - 5 (Globular, Heart, and Torpedo)
* late = stages 6 - 7 (Bent and Mature)
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Plot the Transcriptome Age Index of a given PhyloExpressionSet
# Test Statistic : Reverse Hourglass Test
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "ReverseHourglassTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
xlab = "Ontogeny",
ylab = "TAI" )
```
The corresponding p-value `p_reverse_hourglass` now denotes the p-value returned by the `ReverseHourglassTest`
which is different from the p-value returned by the `FlatLineTest` (`p_flt`).
To make sure that correct modules have been selected to perform the `ReverseHourglassTest`, users can use the `shaded.area` argument to visualize chosen modules:
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Visualize the phylotypic period used for the Reductive Hourglass Test
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure = "TAI",
TestStatistic = "ReverseHourglassTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
shaded.area = TRUE,
xlab = "Ontogeny",
ylab = "TAI" )
```
__Note__ that for defining a priori knowledge for the `ReverseHourglassTest` using the `modules` argument, modules need to start at stage 1, ..., N and do not correspond to the column position in the _PhyloExpressionSet/DivergenceExpressionSet_
which in contrast would start at position 3, ... N + 2.
In a biological context, it is not always clear which stages could define the early, mid, and late
modules. For animal embryogenesis, it has been suggested to choose early, mid, and late modules
according to the morphological conservation of animal embryos (e.g. mid-stage vertebrate embryos seem to be morphologically conserved). For plants, in contrast no morphological conservation has been reported and thus the average expression of embryo defective genes has been used to define modules (Drost et al. 2015).
## Transcriptome Divergence Index Analyses
Analogous to the _TAI_ computations and visualization, the _TDI_ computations can be performed in a similar fashion:
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
data(DivergenceExpressionSetExample)
# Plot the Transcriptome Divergence Index of a given DivergenceExpressionSet
# Test Statistic : Flat Line Test (default)
PlotSignature( ExpressionSet = DivergenceExpressionSetExample,
measure = "TDI",
TestStatistic = "FlatLineTest",
xlab = "Ontogeny",
ylab = "TDI" )
```
Again, for the __ReductiveHourglassTest__ we divide _A. thaliana_ embryo development into three modules:
* early = stages 1 - 2 (Zygote and Quadrant)
* mid = stages 3 - 5 (Globular, Heart, and Torpedo)
* late = stages 6 - 7 (Bent and Mature)
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
data(DivergenceExpressionSetExample)
# Plot the Transcriptome Divergence Index of a given DivergenceExpressionSet
# Test Statistic : Reductive Hourglass Test
PlotSignature( ExpressionSet = DivergenceExpressionSetExample,
measure = "TDI",
TestStatistic = "ReductiveHourglassTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
xlab = "Ontogeny",
ylab = "TDI" )
```
And for the __EarlyConservationTest__ we again divide _A. thaliana_ embryo development into three modules:
* early = stages 1 - 2 (Zygote and Quadrant)
* mid = stages 3 - 5 (Globular, Heart, and Torpedo)
* late = stages 6 - 7 (Bent and Mature)
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
data(DivergenceExpressionSetExample)
# Plot the Transcriptome Divergence Index of a given DivergenceExpressionSet
# Test Statistic : Reductive Early Conservation Test
PlotSignature( ExpressionSet = DivergenceExpressionSetExample,
measure = "TDI",
TestStatistic = "EarlyConservationTest",
modules = list(early = 1:2, mid = 3:5, late = 6:7),
xlab = "Ontogeny",
ylab = "TDI" )
```
To obtain the numerical TDI values for a given DivergenceExpressionSet simply run:
```{r,eval=FALSE}
# Compute the Transcriptome Divergence Index values of a given DivergenceExpressionSet
TDI(DivergenceExpressionSetExample)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
4.532029 4.563200 4.485705 4.500868 4.466477 4.530704 4.690292
```
### `Phylostratum` or `Divergence Stratum` specific contribution to the global transcriptome index profile
Another way to visualize the cumulative contribution of each `Phylostratum` or `Divergence Stratum` to the global Transcriptome Age Index or Transcriptome Divergence Index profile was introduced by Domazet-Loso and Tautz, 2010 (Fig. 1b). The advantage of visualizing the cumulative contribution of each `Phylostratum` or `Divergence Stratum` to the global pattern is to study how the final (global) TAI or TDI
profile emerges from the cumulative TAI/TDI distribution of each `Phylostratum` or `Divergence Stratum`. This `Phylostratum` or `Divergence Stratum` specific contribution on the global TAI or TDI pattern can be visualized using `PlotContribution()`:
#### Example: `Phylostrata`
```{r, fig.width= 7, fig.height= 5, eval = FALSE}
data(PhyloExpressionSetExample)
# visualize phylostrata contribution to the global TAI pattern
PlotContribution( ExpressionSet = PhyloExpressionSetExample,
legendName = "PS",
xlab = "Ontogeny",
ylab = "Transcriptome Age Index",
y.ticks = 10)
```
The `y.ticks` argument allows users to to adjust the number of ticks that shall be visualized on
the y-axis.
The exact values of the `Phylostratum` specific cumulative TAI profiles can be obtained
using the `pTAI()` function:
```{r,eval = FALSE}
pTAI(PhyloExpressionSetExample)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
1 0.3929533 0.3935308 0.4142106 0.4115399 0.4216806 0.4178302 0.3883815
2 0.9521021 0.9547833 0.9748522 0.9674842 0.9632213 0.9285336 0.8661883
3 1.0502814 1.0477016 1.0576104 1.0556300 1.0549156 1.0187303 0.9722031
4 1.3861830 1.3810595 1.3837548 1.3928030 1.3876006 1.3632945 1.3656170
5 1.5527473 1.5489829 1.5360986 1.5500708 1.5468767 1.5531699 1.5769758
6 1.8114772 1.8171167 1.7806803 1.7968463 1.8020203 1.8223456 1.9207232
7 1.8766090 1.8739449 1.8317327 1.8530276 1.8573216 1.8776292 1.9952325
8 1.9417254 1.9357499 1.8877436 1.9089144 1.9117438 1.9478303 2.0778287
9 2.0339192 2.0294517 1.9823091 1.9982678 1.9915249 2.0413791 2.1878074
10 2.4215868 2.4361137 2.3489524 2.3347670 2.3028346 2.3797112 2.5142773
11 2.4900201 2.5079107 2.4104647 2.3980760 2.3651571 2.4422093 2.5961316
12 3.2299418 3.2256139 3.1071348 3.1166934 3.0739935 3.1765113 3.3903336
```
#### Example: `Divergence Strata`
Analogously, the `Divergence Stratum` specific influence on the global TDI pattern can be visualized using:
```{r, fig.width= 7, fig.height= 5, eval = FALSE}
data(DivergenceExpressionSetExample)
# visualize divergence stratum contribution to global TDI
PlotContribution( ExpressionSet = DivergenceExpressionSetExample,
legendName = "DS")
```
The exact values of the `Divergence Stratum` specific cumulative TDI values can be obtained
using the `pTDI()` function:
```{r,eval = FALSE}
pTDI(DivergenceExpressionSetExample)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
1 0.2174378 0.2207644 0.2309211 0.2214881 0.2195601 0.2047938 0.1704023
2 0.4800055 0.4762352 0.4821244 0.4752145 0.4799418 0.4695871 0.4288670
3 0.7742988 0.7597816 0.7797646 0.7826244 0.8033702 0.8034316 0.7769026
4 1.1463780 1.1285545 1.1408903 1.1528497 1.1694967 1.1784113 1.1933157
5 1.5652686 1.5489672 1.5545589 1.5698173 1.5934158 1.6003011 1.6330444
6 2.0899161 2.0609661 2.0666314 2.1015199 2.1163138 2.1306059 2.1932280
7 2.6262330 2.6035894 2.6012719 2.6453427 2.6477490 2.6664143 2.8085434
8 3.2211299 3.1990804 3.1767694 3.2319604 3.2292835 3.2693046 3.4412353
9 3.8396769 3.8299654 3.7793476 3.8353941 3.8272391 3.8971210 4.0753877
10 4.5320286 4.5632002 4.4857052 4.5008685 4.4664774 4.5307040 4.6902921
```
In both cases (`Phylostrata` and `Divergence Strata`) the `pTAI()` and `pTDI()`
functions return a numeric matrix storing the cumulative TAI or TDI values
for each `Phylostratum` and `Divergence Stratum`. Note, that the TAI values
of `Phylostratum` 12 (in the `pTAI()` matrix) are equivalent to `TAI(PhyloExpressionSetExample)`.
```{r,eval = FALSE}
# show that the cumulative TAI value of PS 12 is
# equivalent to the global TAI() values
pTAI(PhyloExpressionSetExample)[12 , ]
# > Zygote Quadrant Globular Heart Torpedo Bent Mature
# > 3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334
TAI(PhyloExpressionSetExample)
# > Zygote Quadrant Globular Heart Torpedo Bent Mature
# > 3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334
```
This can be explained by the definition of the TAI. Here the sum of all
partial TAI values over all `Phylostrata` is equals the global TAI values:
```{r,eval = FALSE}
# show that the colSum() of partial TAI values
# over all Phylostrata equals the global TAI() values
apply(pStrata(PhyloExpressionSetExample) , 2 , sum)
# > Zygote Quadrant Globular Heart Torpedo Bent Mature
# > 3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334
```
Now, the `PlotContribution()` only differs from `apply(pStrata(PhyloExpressionSetExample) , 2 , sum)` by exchanging the `sum()` by `cumsum()`.
```{r,eval = FALSE}
# show that apply(pStrata(PhyloExpressionSetExample) , 2 , cumsum)
# is equivalent to pTAI()
apply(pStrata(PhyloExpressionSetExample) , 2 , cumsum)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
1 0.3929533 0.3935308 0.4142106 0.4115399 0.4216806 0.4178302 0.3883815
2 0.9521021 0.9547833 0.9748522 0.9674842 0.9632213 0.9285336 0.8661883
3 1.0502814 1.0477016 1.0576104 1.0556300 1.0549156 1.0187303 0.9722031
4 1.3861830 1.3810595 1.3837548 1.3928030 1.3876006 1.3632945 1.3656170
5 1.5527473 1.5489829 1.5360986 1.5500708 1.5468767 1.5531699 1.5769758
6 1.8114772 1.8171167 1.7806803 1.7968463 1.8020203 1.8223456 1.9207232
7 1.8766090 1.8739449 1.8317327 1.8530276 1.8573216 1.8776292 1.9952325
8 1.9417254 1.9357499 1.8877436 1.9089144 1.9117438 1.9478303 2.0778287
9 2.0339192 2.0294517 1.9823091 1.9982678 1.9915249 2.0413791 2.1878074
10 2.4215868 2.4361137 2.3489524 2.3347670 2.3028346 2.3797112 2.5142773
11 2.4900201 2.5079107 2.4104647 2.3980760 2.3651571 2.4422093 2.5961316
12 3.2299418 3.2256139 3.1071348 3.1166934 3.0739935 3.1765113 3.3903336
```
```{r,eval = FALSE}
pTAI(PhyloExpressionSetExample)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
1 0.3929533 0.3935308 0.4142106 0.4115399 0.4216806 0.4178302 0.3883815
2 0.9521021 0.9547833 0.9748522 0.9674842 0.9632213 0.9285336 0.8661883
3 1.0502814 1.0477016 1.0576104 1.0556300 1.0549156 1.0187303 0.9722031
4 1.3861830 1.3810595 1.3837548 1.3928030 1.3876006 1.3632945 1.3656170
5 1.5527473 1.5489829 1.5360986 1.5500708 1.5468767 1.5531699 1.5769758
6 1.8114772 1.8171167 1.7806803 1.7968463 1.8020203 1.8223456 1.9207232
7 1.8766090 1.8739449 1.8317327 1.8530276 1.8573216 1.8776292 1.9952325
8 1.9417254 1.9357499 1.8877436 1.9089144 1.9117438 1.9478303 2.0778287
9 2.0339192 2.0294517 1.9823091 1.9982678 1.9915249 2.0413791 2.1878074
10 2.4215868 2.4361137 2.3489524 2.3347670 2.3028346 2.3797112 2.5142773
11 2.4900201 2.5079107 2.4104647 2.3980760 2.3651571 2.4422093 2.5961316
12 3.2299418 3.2256139 3.1071348 3.1166934 3.0739935 3.1765113 3.3903336
```
This `pTAI()` matrix is what is being visualized inside `PlotContribution()`.
__Note__ that the `pStrata()` function returns the partial TAI or TDI values
for each `Phylostratum` or `Divergence Stratum`, whereas `pMatrix()` returns
the partial TAI or TDI value for each gene.
```{r,eval = FALSE}
# compute partial TAI values for each Phylostratum
pStrata(PhyloExpressionSetExample)
```
```
Zygote Quadrant Globular Heart Torpedo Bent Mature
1 0.39295331 0.39353082 0.41421061 0.41153988 0.42168061 0.41783018 0.38838151
2 0.55914875 0.56125250 0.56064163 0.55594427 0.54154068 0.51070341 0.47780681
3 0.09817938 0.09291830 0.08275821 0.08814585 0.09169432 0.09019669 0.10601483
4 0.33590159 0.33335787 0.32614435 0.33717297 0.33268496 0.34456426 0.39341385
5 0.16656429 0.16792339 0.15234382 0.15726786 0.15927610 0.18987539 0.21135880
6 0.25872990 0.26813378 0.24458170 0.24677547 0.25514358 0.26917571 0.34374736
7 0.06513175 0.05682826 0.05105236 0.05618130 0.05530130 0.05528356 0.07450933
8 0.06511643 0.06180498 0.05601089 0.05588683 0.05442226 0.07020109 0.08259620
9 0.09219381 0.09370185 0.09456557 0.08935340 0.07978112 0.09354879 0.10997873
10 0.38766762 0.40666192 0.36664326 0.33649918 0.31130963 0.33833215 0.32646984
11 0.06843326 0.07179705 0.06151227 0.06330899 0.06232254 0.06249801 0.08185437
12 0.73992167 0.71770319 0.69667019 0.71861745 0.70883635 0.73430205 0.79420194
```
```{r,eval = FALSE}
# compute partial TAI values for each gene
dplyr::glimpse(pMatrix(PhyloExpressionSetExample))
```
```
Observations: 25260
Variables:
$ Zygote (dbl) 3.597950e-05, 2.484581e-05, 2.007498e-05, 1.683276e-05, 1.891073e-...
$ Quadrant (dbl) 3.203218e-05, 3.045853e-05, 2.066542e-05, 1.569402e-05, 2.812062e-...
$ Globular (dbl) 2.013329e-05, 2.909027e-05, 1.640631e-05, 2.063608e-05, 6.003300e-...
$ Heart (dbl) 2.311604e-05, 2.800871e-05, 1.663988e-05, 2.379714e-05, 7.119708e-...
$ Torpedo (dbl) 1.813626e-05, 2.713080e-05, 1.566972e-05, 2.525094e-05, 1.019572e-...
$ Bent (dbl) 1.685284e-05, 1.950712e-05, 1.535178e-05, 2.254055e-05, 1.172208e-...
$ Mature (dbl) 2.684331e-05, 1.695727e-05, 1.415911e-05, 1.362810e-05, 1.229886e-...
```
You can receive all gene specific partial TAI values by typing `pMatrix(PhyloExpressionSetExample)`.
Analogously, `pStrata()` and `pMatrix()` can be used for `Divergence Strata` by substituting `PhyloExpressionSetExample` by `DivergenceExpressionSetExample`.
### Mean Expression and Relative Expression of Single Phylostrata or Divergence Strata
__TAI__ or __TDI__ patterns are very useful to gain a first insight into the mean transcriptome age or mean sequence divergence of genes being most active during the corresponding developmental stage or experiment.
To further investigate the origins of the global __TAI__ or __TDI__ pattern it is useful to visualize the mean gene expression of each Phylostratum or Divergence-Stratum class.
### Mean Expression Levels of a PhyloExpressionSet and DivergenceExpressionSet
Visualizing the mean gene expression of genes corresponding to the same Phylostratum or Divergence Stratum class allows users to detect biological process specific groups of Phylostrata or Divergence Strata that are most expressed during the underlying biological process. This might lead to correlating specific groups of Phylostrata or Divergence Strata sharing similar evolutionary origins with common functions or functional contributions to a specific developmental process.
```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Visualizing the mean gene expression of each Phylostratum class
PlotMeans( ExpressionSet = PhyloExpressionSetExample,
Groups = list(1:12),
legendName = "PS")
```
Here we see that the mean gene expression of Phylostratum group: PS1-3 (genes evolved before the establishment of embryogenesis in plants) are more expressed during _A. thaliana_ embryogenesis than PS4-12 (genes evolved during or after the establishment of embryogenesis in plants).
In different biological processes different Phylostratum groups or combination of groups might resemble the majority of expressed genes.
The `PlotMeans()` function takes an PhyloExpressionSet or DivergenceExpressionSet and visualizes for each Phylostratum the mean expression levels of all genes that correspond to this Phylostratum. The `Groups` argument takes a list storing the Phylostrata (classified into the same group) that shall be visualized on the same plot.
For this example we separate groups of Phylostrata into __evolutionary old Phylostrata__ (PS1-3) in one plot versus __evolutionary younger Phylostrata__ (PS4-12) into another plot:
```{r, fig.height= 5, fig.width=7,eval=FALSE}
# Visualizing the mean gene expression of each Phylostratum class