-
Notifications
You must be signed in to change notification settings - Fork 19
/
multivariate.Rmd
826 lines (649 loc) · 37.3 KB
/
multivariate.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
---
title: "Cluster analysis"
output:
html_document:
fig_caption: no
number_sections: yes
toc: yes
toc_float: false
collapsed: no
---
```{r multivariate-1, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE, out.width = "75%", out.height = "75%")
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2024, but may undergo further edits.</span></p>
# Introduction #
"Cluster analysis" is a broad group of multivariate analyses that have as their objective the organization of individual observations (objects, sites, grid points, individuals), and these analyses are built upon the concept of multivariate distances (expressed either as similarities or dissimilarities) among the objects.
The organization generally takes two forms:
- the arrangement of the objects in a lower-dimensional space than the data were originally observed on;
- the development of "natural" clusters or the classifications of the objects.
These analyses share many concepts and techniques (both numerical and practical) with other procedures such as principal components analysis, numerical taxonomy, discriminant analysis and so on, and cluster analysis is one of the few machine learning techniques that uses its traditional name in multivariate analysis as its modern Data Science/Machine Learning name.
The analyses generally begin with the construction of an *n* x *n* matrix **D** of the distances between objects. For example, in a two-dimensional space, the elements *d<sub>ij</sub>* of **D** could be the Euclidean distances between points,
*d<sub>ij</sub>* = [(*x<sub>i1</sub>*- *x<sub>j1</sub>*)<sup>2</sup> + (*x<sub>i1</sub>*- *x<sub>j1</sub>*)<sup>2</sup>]<sup>½</sup>
The Euclidean distance, and related measures are easily generalized to more than two dimensions.
To begin load some packages:
```{r multivariate-2, message=FALSE, warning=FALSE, }
library(sf)
library(ggplot2)
library(RColorBrewer)
library(lattice)
library(GGally)
library(MASS)
library(ncdf4)
library(terra)
library(tidyterra)
library(stringr)
library(gplots)
library(parallelDist)
```
Some of these packages are familiar, but `{GGally}` provides some enhanced `{ggplot2}` plots, `{stringr}` manages the manipulation of strings, `{parallelDist}` exploits multiple processors/cores to efficiently calculate large dissimilarlty/distance matrices, and '{gplots}` calculates heat maps.
# A simple cluster analysis #
The basic idea of cluster analysis can be illustrated using the Oregon climate-station dataset, with the goal being to identify groups of stations with similar climates, which also regionalizes the climate of the state. Begin by reading a shape file for mapping the data, and the data itself.
```{r multivariate-3, message=FALSE}
# read county outlines
# orotl_sf
shapefile <- "/Users/bartlein/Projects/RESS/data/shp_files/OR/orotl.shp"
orotl_sf <- st_read(shapefile)
```
Add a coordinate reference system and plot:
```{r multivariate-4 }
# add a CRS
st_crs(orotl_sf) <- st_crs("+proj=longlat")
or_otl_sf <- st_geometry(orotl_sf)
plot(or_otl_sf)
```
Now read a *.csv file, and attach it.
```{r multivariate-5 }
# read stations
orstationc_path <- "/Users/bartlein/Projects/RESS/data/csv_files/"
orstationc_name <- "orstationc.csv"
orstationc_file <- paste(orstationc_path, orstationc_name, sep="")
orstationc <- read.csv(orstationc_file)
summary(orstationc)
attach(orstationc)
```
Plot the station locations, and then the three-letter station name abbreviations. Notice that the use of the `coord_sf()` function in the first plot to explicity specify the geographical extent of the map, while in the second case `ggplot2()` determines the extent.
```{r multivariate-6 }
# plot stations
ggplot() +
geom_sf(data = or_otl_sf, fill = NA) +
geom_point(data = orstationc, aes(x=lon, y=lat), shape=21, color="black", size = 3, fill="lightblue") +
coord_sf(xlim = c(-125, -116), ylim = c(41, 47), expand = FALSE) +
labs(title = "Oregon Climate Stations", x = "Longitude", y = "Latitude") +
theme_bw()
```
```{r multivariate-7 }
# plot station names
ggplot() +
geom_sf(data = or_otl_sf, fill = NA) +
geom_text(data = orstationc, aes(x =lon, y = lat, label = as.character(station)),
check_overlap = TRUE, size = 3) +
labs(title = "Oregon Climate Stations", x = "Longitude", y = "Latitude") +
theme_bw()
```
Take a quick look at the correlations among variables.
```{r multivariate-8, message=FALSE}
# matrix scatter plot and correlations
ggpairs(orstationc[2:10])
```
The off-diagonal cells for the climate variables indicate that there are some strong correlations among the climate variables (the asterisks indicate significance in the usual way).
## Distance/Dissimilarity matrices
In this example, the groups of stations (or the climate regions of Oregon, as expressed in the climate-station data) will be defined via an “agglomerative-hierarchical” cluster analysis. In such an analysis, individual objects (observations) are combined according to their (dis)similarity, with the two *most-similar* objects being combined first (to create a new composite object), then the next two most-similar objects (including the composites), and so on until a single object is defined. The sequence of combinations is illustrated using a “dendrogram” (cluster tree), and this can be examined to determine the number of clusters. The function `cutree()` determines the cluster membership of each observation, which can be listed or plotted.
The first step in an agglomerative cluster analysis (although not always expiclitly displayed), is the calculation of a dissimilarity matrix, which is a square symmetrical matrix that illustrates the dissimilarities (typically Euclidean distances) between individual cases or observations, using in this case the six climate variables to define the space that distance is calculated in. The `levelplot()` funcction from the `{lattice}` package is used to make the plot.
```{r multivariate-9, fig.asp=1}
# get dissimilarities (distances) and cluster
X <- as.matrix(orstationc[,5:10])
X_dist <- dist(scale(X))
grid <- expand.grid(obs1 = seq(1:92), obs2 = seq(1:92))
levelplot(as.matrix(X_dist)/max(X_dist) ~ obs1*obs2, data=grid,
col.regions = gray(seq(0,1,by=0.0625)), xlab="observation", ylab="observation")
```
In the code above, for convenience, a matrix **X** was created to omit the variables we don’t want to include in the clustering (i.e. the non-climatic information). The levelplot() function plots the dissimilarity matrix that the cluster analysis works with. The dark diagonal line is formed from the Eucldean distance values of 0.0 from the comparison of each observation with itself.
There is a range of different styles of patterns that may show up
- If the climates at the individual stations were completely unrelated to on another, or if they were identical, the plot would appear light gray (with black pixels along the diagonal).The "plaid" as opposed to completely random appearance of the plot indicates that it is likely that natural groups will exist in the data.
- If stations near to one another had similar climates then one would see a plaid pattern.
The "plaidness" of the image could be enhanced by sorting the observations in some way that makes sense. (Right now, they're in alphabetical order.) The plot can be redone, this time sorting in to longitudinal order, West to East (i.e. `orstationc[order(orstationc$lon), 5:10]`):
```{r multivariate-10, fig.asp=1}
# get dissimilarities (distances) and cluster
X <- as.matrix(orstationc[order(orstationc$lon), 5:10])
X_dist <- dist(scale(X))
grid <- expand.grid(obs1 = seq(1:92), obs2 = seq(1:92))
levelplot(as.matrix(X_dist)/max(X_dist) ~ obs1*obs2, data=grid,
col.regions = gray(seq(0,1,by=0.0625)), xlab="observation", ylab="observation")
```
The "plaid and blocks" pattern suggests that, as we might have expected, there is a longitudinal trend in Oregon climates.
## The cluster analysis
Get the orignal distance matrix again, and do the cluster analysis:
```{r multivariate-11 }
# distance matrix
X <- as.matrix(orstationc[,5:10])
X_dist <- dist(scale(X))
# cluster analysis
X_clusters <- hclust(X_dist, method = "ward.D2")
plot(X_clusters, labels=station, cex=0.4)
```
The function `hclust()` performs a "Wards's single-linkage" analysis, and the `plot()` function plots the dendrogram. Notice that the individual "leaves" of the plot are labeled by the climate-station names.
The dendrogram can be inspected to get an idea of how many distinct clusters there are (and it can also identify unusual points–look for Crater Lake `CLK`). As a first guess, let’s say there were four distinct clusters evident in the dendrogram. The `cuttree()` function can be used to "cut" the dendrogram at a particular level:
```{r multivariate-12 }
# cut dendrogram and plot points in each cluster
nclust <- 4
clusternum <- cutree(X_clusters, k=nclust)
```
And then the cluster membership (i.e. which cluster?) of each station can be mapped.
```{r multivariate-13 }
# plot cluster memberships of each station
ggplot() +
geom_sf(data = or_otl_sf, fill = NA) +
geom_text(data = orstationc, aes(x =lon, y = lat, label = as.character(clusternum)),
check_overlap = TRUE, color = "red", size = 4) +
labs(title = "Oregon Climate Stations -- Cluster Numbers", x = "Longitude", y = "Latitude") +
theme_bw()
```
Cluster 1 picks out the stations in eastern Oregon's "High Desert" region, while Cluster 2 highlights lower elevation stations in eastern Oregon. Cluster 3 includes stations in the valleys of western Oregon, while Cluster 4 indentifies coastal and lower Columbia Valley stations.
(A listing of the cluster number that each station belongs to can be gotten by the following code.)
```{r multivariate-14, echo=TRUE, eval=FALSE}
print(cbind(station, clusternum, as.character(Name)))
```
## Examining the cluster analysis
The efficacy of the analyis in regionalizing or grouping the stations can be gauged using plots and a analysis that looks at the "separation" of the groups of stations.
```{r multivariate-15 }
# examine clusters -- simple plots
tapply(tjan, clusternum, mean)
histogram(~ tjan | factor(clusternum), nint=20, layout=c(2,2))
```
Note how the histograms for some of the clusters seem multimodal multimodal–this is a clear sign of inhomogeneity of the clusters. Let’s see how the clusters differ. This can be done via discriminant analysis [Discriminant Analysis](https://pjbartlein.github.io/REarthSysSci/topics/da.html), which (like PCA) defines some new variables with desirable properties, in this case, new variables that maximally "spread out" the groups along the new variable axes:
```{r multivariate-16 }
# discriminant analysis
cluster_lda1 <- lda(clusternum ~ tjan + tjul + tann + pjan + pjul + pann)
cluster_lda1
plot(cluster_lda1, col = "red")
```
Get the discriminant scores (the values of the new discriminant function):
```{r multivariate-17 }
# discriminant scores
cluster_dscore <- predict(cluster_lda1, dimen=nclust)$x
```
Produce a grouped box plot for each of first three discriminant function:
```{r multivariate-18 }
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(cluster_dscore[, 1] ~ clusternum, ylab = "LD 1", ylim=c(-8,8))
boxplot(cluster_dscore[, 2] ~ clusternum, ylab = "LD 2", ylim=c(-8,8))
boxplot(cluster_dscore[, 3] ~ clusternum, ylab = "LD 2", ylim=c(-8,8))
par(opar)
```
The first discriminant function clearly spreads out the four clusters, while the second could be used to discriminate between clusters 1 and 4 and clusters 2 and 3.
Finally, get the discriminant function loadings, which measure the relationship or correlation between each original variables and the new discriminant function:
```{r multivariate-19 }
# discriminant "loadings"
cor(orstationc[5:10],cluster_dscore)
```
The loadings can be used to interpret or describe the discriminant functions. Here January temperature and January and annual precipitation are most highly correlated with the first discriminant function. Each of these have strong latitudinal gradient across the state: high in the west, low in the east. Note that this gradient corresponds well with a similar longitudinal gradient in cluster number.
The discriminant analysis here is used to answer the question “how do the clusters of stations differ climatically?” In this case, it looks like `pann` and `tann` are the variables most closely correlated with each discriminant function, and because each of these variables are more-or-less averages of the seasonal extreme variables, that might explain why the clusters seem inhomogeneous.
The analysis can be repeated, extracting different numbers of clusters. The easiest way to do this is to copy the three chunks of code above into a text editor, changing the assignment to nclust and including or excluding before executing each chunk, one at a time.
[[Back to top]](multivariate.html)
# Distances: among observations and among variables
As noted above, in addition to being the input to a cluster analysis, distance matrices are interesting in their own right. In a typical rectangular data set, there are two distance matrices that could be calculated, one comparing variables (i.e. the columns in a data frame), and the other comparing observations (or rows). In the examples below the "TraCE 21ka" decadal data are used. There are (96 x 48) 4608 grid points and 2204 observations. Read the raster brick in the netCDF file (`Trace21_TREFHT_anm2.nc`):
```{r multivariate-20 }
# read the TraCE21 temperature data
# read TraCE21-ka transient climate-model simulation decadal data
tr21dec_path <- "/Users/bartlein/Projects/RESS/data/nc_files/"
tr21dec_name <- "Trace21_TREFHT_anm2.nc"
tr21dec_file <- paste(tr21dec_path, tr21dec_name, sep="")
dname <- "tas_anm_ann"
file_label <- "TraCE_decadal_"
```
```{r multivariate-21 }
# read the data using ncdf4
nc_in <- nc_open(tr21dec_file)
print(nc_in)
lon <- ncvar_get(nc_in,"lon")
lat <- ncvar_get(nc_in,"lat")
t <- ncvar_get(nc_in, "time")
var_array <- ncvar_get(nc_in, dname)
dim(var_array)
plt_xvals <- t
```
The object `plt_xvals` is used to easily access time labels for some of the later plots.
Reshape the 3-d array into a 2-d "tidy" matrix with individual variables (grid-point locations) in columns, and observations (time) in rows:
```{r multivariate-22 }
# reshape the 3-d array
nlon <- dim(var_array)[1]; nlat <- dim(var_array)[2]; nt <- dim(var_array)[3]
var_vec_long <- as.vector(var_array)
length(var_vec_long)
X <- t(matrix(var_vec_long, nrow = nlon * nlat, ncol = nt))
dim(X)
```
## Distances among times
Notice the use of the transposition function (`t()`) to get the data in the proper "tidy" arrangement for a data frame (variables in columns, observations in rows). Distance matrix calculation can take a while, but the `{parallelDist}` package can be used to greatly speed things up, by handing off the creation of different blocks of the matrix to different processors in a multi-processor machine.
The `parDist()` function gets the dissimilarity matrix:
```{r multivariate-23 }
# distances among observations (time)
dist_obs <- as.matrix(parDist(X))
dim(dist_obs)
```
The dissimilarity matrix (`dist_obs`) can be conveniently visualized by turning it into a `{terra}` SpatRaster, and then using the `{tidyterra}` `geom_spatraster()` function to visualize it. The rendering of the image is faster if the image is written to a file:
```{r multivariate-24 }
# turn matrix into a SpatRaster
dist_obs_terra <- flip(rast(dist_obs))
dist_obs_terra
```
Use `ggplot2()` to plot the matrix (writing the *.png image to a file):
```{r multivariate-25, echo=TRUE, eval=FALSE }
# plot the dissimilarity matrix
png(paste(file_label, "dist_obs_v1.png", sep=""), width=720, height=720) # open the file
ggplot() +
geom_spatraster(data = t(dist_obs_terra)) +
scale_fill_distiller(type = "seq", palette = "Greens") +
scale_x_continuous(breaks = seq(0, 2200, by=500), minor_breaks = seq(0, 2200, by=100)) +
scale_y_continuous(breaks = seq(0, 2200, by=500), minor_breaks = seq(0, 2200, by=100)) +
labs(title = "Dissimilarity matrix among observations (times)",
x = "Decades since 0 BP", y = "Decades since 0 BP", fill="distance") +
theme(panel.grid.major=element_line(colour="black"), panel.grid.minor=element_line(colour="black"),
axis.text=element_text(size=12), axis.title=element_text(size=14), aspect.ratio = 1,
plot.title = element_text(size=20))
dev.off()
```
Here's the image:
![](images/TraCE_decadal_dist_obs_v1.png)
The observations here are ordered in time, and the blocks in the distance matrix indicate that there are two distinct climate regimes: one from the beginning of the record to around 14.7 ka (14,700 years before 1950 CE, e.g. in decade 1470 before the end of the record at 0 BP). Within each major interval, there are subintervals indicated by the dark squares. For example during the Holocene (the past 11,700 years, there are two blocks, the early Holocene, 11,700 to 8200 yr BP, and the middle and late Holocene, after 8200 yr BP). There are also patterns within the blocks. During the interval from 14,700 to 11,700 yr BP, the warm, but variable, Bølling-Allerød from 14,700 yr BP to 12,900 yr BP, can be seen followed by the colder Younger Dryas interval, 12,900 to 11,700 yr BP.
## Distance among locations
The dissimilarities among locations/variables can be gotten in a similar fashion, but note the use of the transpose function `t()` to get the variables in rows, and observations in columns:
```{r multivariate-26 }
# variables (grid points)
dist_var <- as.matrix(parDist(t(X)))
dim(dist_var)
```
```{r multivariate-27 }
# turn matrix into a SpatRaster
dist_var_terra <- flip(rast(dist_var))
dist_var_terra
```
Plot the matrix:
```{r multivariate-28, echo=TRUE, eval=FALSE }
# plot the dissimilarity matrix
png(paste(file_label, "dist_var_v1.png", sep=""), width=720, height=720) # open the file
ggplot() +
geom_spatraster(data = t(dist_var_terra)) +
scale_fill_distiller(type = "seq", palette = "Greens") +
scale_x_continuous(breaks = seq(0, 4600, by=500), minor_breaks = seq(0, 4600, by=250)) +
scale_y_continuous(breaks = seq(0, 4600, by=500), minor_breaks = seq(0, 4600, by=250)) +
labs(title = "Dissimilarity matrix among variables (grid points/locations)",
x = "Location", y = "Location", fill="distance") +
theme(panel.grid.major=element_line(colour="black"), panel.grid.minor=element_line(colour="black"),
axis.text=element_text(size=12), axis.title=element_text(size=14), aspect.ratio = 1,
plot.title = element_text(size=20))
dev.off()
```
![](images/TraCE_decadal_dist_var_v1.png)
The image is a sort of map: The grid points are arranged in the data frame with longitudes varying rapidly, and the latitudes slowly (forming the small squares in the image), such that the square dark area (low dissimilarities) in the middle is the tropics, where the spatial variability of climate is low, while the more spatially variable higher-latitude regions (at all longitudes) frame the low-variability region. There is also a pronounced Northern Hemisphere/Southern Hemisphere contrast
[[Back to top]](multivariate.html)
# Heatmaps (2-way, time-and-space clustering)
"Heatmaps" are a visualization approach that is implemented in the `{base}` package of R, but is extended by the `{gplots}` package. A heatmap basically consists of the raster-image type plot of the data frame, with the rows and columns of the matrix rearranged using cluster analyses of the varables and observations. (Note that the cluster analyses are carried out sequentially, as opposed to simultaneously, this latter approach known as "biclustering".) Typically, for small data sets, the matrix plot is augmented by dendrograms, and by "side bars" which plot some additional attribute of the data (location or time, for example) using a scatter plot, gradient plot, etc.
## A simple example with Hovmöller-matrix data
A simple heatmap can be generated for a "Hovmöller matrix" calculated using the gridded data. The Hovmöller matrix contains longitudinal averages, and is laid out with one row for each latitude and one column for each time.
### Set up
Create the Hovmöller matrix:
```{r multivariate-29 }
# Hovmöller matrix
Y <- matrix(0.0, nrow=nlat, ncol=nt)
dim(Y)
for (n in (1:nt)) {
for (k in (1:nlat)) {
for (j in (1:nlon)) {
Y[k, n] <- Y[k, n] + var_array[j, k, n]
}
Y[k, n] <- Y[k, n]/nlon
}
}
dim(Y)
ncols <- dim(Y)[2]
```
The Hovmöller matrix could also have been gotten using the `apply()` function:
```{r multivariate-30, echo = TRUE, eval = FALSE}
Y <- matrix(0.0, nrow=nlat, ncol=nt)
dim(Y)
Y <- apply(var_array, c(2,3), mean, na.rm=TRUE)
dim(Y)
ncols <- dim(Y)[2]
```
Set row and column names for the matrix.
```{r multivariate-31 }
# set row and column names
rownames(Y) <- str_pad(sprintf("%.3f", lat), 5, pad="0")
head(rownames(Y));tail(rownames(Y)); length(rownames(Y))
colnames(Y) <- str_pad(sprintf("%.3f", plt_xvals), 5, pad="0")
head(colnames(Y));tail(colnames(Y)); length(colnames(Y))
```
A matrix plot of the Hovmöller matrix will illustrate the data (and is an actual Hovmöller plot). To get high (northern) latitudes at the top of the plot, and high (southern) latitudes at the bottom, flip the data:
```{r multivariate-32 }
# flip Y
Y <- Y[(nlat:1),]
head(rownames(Y)); tail(rownames(Y)); length(rownames(Y))
```
For this example, the data are subsampled, by taking every 20th observation (to allow the dendrograms to be visualized):
```{r multivariate-33 }
# subset
Y2 <- Y[, seq(1, 2201, by=20)]
dim(Y2)
ncols2 <- dim(Y2)[2]
head(rownames(Y2))
head(colnames(Y2))
```
More set up. Generate colors for the matrix pot and the side bars, which here will illustrate the latitude and time of each data point. The rows will be labeled by a diverging color scale, purple for the Northern Hemisphere, green for the Southern Hemisphere, while time will be labeled from past to present by increasingly saturated purples.
```{r multivariate-34 }
# generate colors for side bars
rcol <- colorRampPalette(brewer.pal(10,"PRGn"))(nlat)
ccol <- colorRampPalette(brewer.pal(9,"Purples"))(ncols)
ccol2 <- colorRampPalette(brewer.pal(9,"Purples"))(ncols2)
```
Generate colors for the heatmap of temperature
```{r multivariate-35 }
nclr <- 10
zcol <- (rev(brewer.pal(nclr,"RdBu")))
breaks = c(-20,-10,-5,-2,-1,0,1,2,5,10,20)
length(breaks)
```
### Hovmöller-matrix plot (subsetted decadal data)
Produce an unclustered heatmap for the Hovmöller matrix. Note that the `system.time()` function is being used here to gauge the time it takes to calculate and display the various heatmaps.
```{r multivariate-36, echo=TRUE, eval=FALSE}
# unclustered heatmap of latitude by time Hovmöller matrix
time1 <- proc.time()
png_file <- paste(file_label, "hov_01", ".png", sep="")
png(png_file, width=960, height=960)
Y2_heatmap <- heatmap.2(Y2, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv=NA, Colv=NA, dendrogram = "none", # just plot data, don't do clustering
RowSideColors=rcol, ColSideColors=ccol2, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```
## user system elapsed
## 0.076 0.025 0.264
```
![](images/TraCE_decadal_hov_01.png)
Notice the way the side bars work in illustrating the location and time of each data value. The Hovmöller plot nicely shows the latitudinal variations in climate from the Last Glacial Maximum to present, with the insolation-driven warming in the Northern Hemisphere clearly expressed.
### Hovmöller-matrix heatmap (subsetted decadal data)
The heatmap of the Hovmöller-matrix reorganizes the rows and columns in such a way as to place variables (i.e. latitudes) with similar variations over time, and observations (times) with similar latitudinal patterns, next to one another. The row and column clustering will be done explicitly, in order to "reorder" the dendrograms, which could aid in their interpretability.
```{r multivariate-37, echo=TRUE, eval=FALSE}
# heatmap of Hovmöller matrix
time1 <- proc.time()
png_file <- paste(file_label, "heatmap_01", ".png", sep="")
png(png_file, width=960, height=960)
# row and column clustering
cluster_row <- hclust(parDist(Y2), method = "ward.D2")
cluster_col <- hclust(parDist(t(Y2)), method = "ward.D2")
# heatmap
Y2_heatmap <- heatmap.2(Y2, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv = as.dendrogram(cluster_row),
Colv = as.dendrogram(cluster_col),
RowSideColors=rcol, ColSideColors=ccol2, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```
## user system elapsed
## 0.225 0.058 0.490
```
![](images/TraCE_decadal_heatmap_01.png)
The heatmap has rearranged the data according to the cluster analyses of the observations and variables, placing more similar data points next to or near to one another. Notice how the colored side bars allow one to figure out the location and time of individual points. Here, and not surprisingly, there appear to be four distinct time-space clusters (or "regimes") in the data): glacial vs. interglacial (Holocene) and high latitudes (in both hemispheres) vs. low latitudes. Notice that the rearrangement in time is relatively modest (see the column color bar), while the rearrangement in location is more dramatic. On the row color bar, high latitudes are represented by more saturated colors. Spatial variations of climate are always larger in amplitude than temporal variations.
The purple column color bar suggests that the reorganized columns are almost in chronological order, while the colorbar for the rows shows a distinct high-latitude/low-latitude contrast. Because dendrograms are like a decorative mobile, and can be rotated without destroying the cluster or branch mememberships, it's sometimes useful to try that. This can be done with the `reorder.dendrogram()` (or just `reorder()`), with the cluster information and set of weights as input arguments.
```{r multivariate-38, echo=TRUE, eval=FALSE}
# reordered heatmap of Hovmöller matrix
time1 <- proc.time()
png_file <- paste(file_label, "heatmap_01b", ".png", sep="")
png(png_file, width=960, height=960)
# heatmap
Y2_heatmap <- heatmap.2(Y2, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv = reorder(as.dendrogram(cluster_row), 1:dim(Y2)[1]),
Colv = reorder(as.dendrogram(cluster_col), 1:dim(Y2)[2]),
RowSideColors=rcol, ColSideColors=ccol2, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```{r multivariate-39}
# user system elapsed
# 0.292 0.064 0.516
```
![](images/TraCE_decadal_heatmap_01b.png)
In this instance, the reordering of the dendrograms didn't accomplish much.
[[Back to top]](multivariate.html)
## A second example with decadal data
A second example uses the decadal data directly (as opposed to subsampling)
### Hovmöller-matrix plot
```{r multivariate-40, echo=TRUE, eval=FALSE}
# Hovmöller-matrix plot
time1 <- proc.time()
png_file <- paste(file_label, "hov_02", ".png", sep="")
png(png_file, width=960, height=960)
Y_heatmap <- heatmap.2(Y, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv=NA, Colv=NA, dendrogram = "none", # just plot data, don't do clustering
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks = breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```
## user system elapsed
## 35.070 0.091 35.195
```
![](images/TraCE_decadal_hov_02.png)
### Hovmöller-matrix heatmap
```{r multivariate-41, echo=TRUE, eval=FALSE}
# Hovmöller-matrix heatmap
time1 <- proc.time()
png_file <- paste(file_label, "heatmap_02", ".png", sep="")
png(png_file, width=960, height=960)
cluster_row <- hclust(parDist(Y), method = "ward.D2")
cluster_col <- hclust(parDist(t(Y)), method = "ward.D2")
# heatmap
Y_heatmap <- heatmap.2(Y, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv = as.dendrogram(cluster_row),
Colv = as.dendrogram(cluster_col),
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```
## user system elapsed
## 1.582 0.073 1.657
```
![](images/TraCE_decadal_heatmap_02.png)
The results are similar to those for the subsetted data, but the execution times are longer.
[[Back to top]](multivariate.html)
## Full-grid example
Finally, a heatmap of the actual gridded data (as opposed to the Hovmöller matrix data) can be obtained. The next set of plots will display values for each of the 4608 grid points and 2204 time steps, over ten million points, over ten times the number of pixels in the output *.png files (960 x 960). We'll trust the graphics subsystem in R along with whatever web browser is being used to view the images to get it right.
### Set up
There are several set-up steps necessary. Reshape the input array again, and generate row and column names:
```{r multivariate-42 }
# reshape the 3-d array
var_vec_long <- as.vector(var_array)
length(var_vec_long)
X <- matrix(var_vec_long, nrow = nlon * nlat, ncol = nt) # Don't transpose as usual this time
dim(X)
ncols <- dim(X)[2]
```
```{r multivariate-43 }
# generate row and column names
grid <- expand.grid(lon, lat)
names(grid) <- c("lon", "lat")
# rownames(X) <- paste("E", str_pad(sprintf("%.2f", round(grid$lon, 2)), 5, pad="0"),
# "N", str_pad(sprintf("%.2f", round(grid$lat, 2)), 5, pad="0"), sep="")
rownames(X) <- round(grid$lat, 2)
head(rownames(X), 20); tail(rownames(X), 20); length(rownames(X))
colnames(X) <- str_pad(sprintf("%.3f", t), 5, pad="0")
head(colnames(X)); tail(colnames(X)); length(colnames(X))
```
Flip the data...
```{r multivariate-44 }
# flip X
X <- X[((nlon*nlat):1),]
head(rownames(X)); tail(rownames(X)); length(rownames(X))
```
... and generate colors:
```{r multivariate-45 }
# generate colors for vertical and horizontal side bars
icol <- colorRampPalette(brewer.pal(10,"PRGn"))(nlat)
idx <- sort(rep((1:nlat), nlon))
rcol <- icol[idx]
ccol <- colorRampPalette(brewer.pal(9,"Purples"))(nt)
```
### Matrix plot -- full grid
Here's a matrix plot of the data (there are too many points to make the dendrograms interpretable in any way, and so they are suppressed).
```{r multivariate-46, echo=TRUE, eval=FALSE}
# matrix plot of the full dataset
time1 <- proc.time()
png_file <- paste(file_label, "hov_03", ".png", sep="")
png(png_file, width=960, height=960)
X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv=NA, Colv=NA, dendrogram = "none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations (by Latitude, then Longitude", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```
## user system elapsed
## 802.255 6.391 814.750
```
![](images/TraCE_decadal_hov_03.png)
The dimensions of the input netCDF file (96 longitudes x 48 latitudes x 2204 times) are such that when the 3-d arrays is reshaped, the longitudes vary most rapidly, followed by the latitudes and times. This gives rise to the horizontal banding in the image, with 48 bands organized from top to bottom. Within each band, the longitudes vary from top (west) to bottom (east). The N. Pacific/Beringia warm patch during glacial times (related to the influece of the North American Ice Sheet) is evident as the thin streaks of higher temperatures in the upper left corner of the image.
### Heatmap (Locations (by Latitude, then Longitude))
Now here is the reorganized heatmap of the grid-point data:
```{r multivariate-47, echo=TRUE, eval=FALSE}
# Heatmap (Locations (by Latitude, then Longitude))
time1 <- proc.time()
png_file <- paste(file_label, "heatmap_03", ".png", sep="")
png(png_file, width=960, height=960)
cluster_row <- hclust(parDist(X), method = "ward.D2")
cluster_col <- hclust(parDist(t(X)), method = "ward.D2")
# heatmap
X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none", dendrogram = "none",
Rowv = as.dendrogram(cluster_row),
Colv = as.dendrogram(cluster_col),
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations (by Latitude, then Longitude)", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```
## user system elapsed
## 1086.889 11.852 124.951
```
![](images/TraCE_decadal_heatmap_03.png)
There is a lot going on in the full-grid heatmap.
Let's see if reordering would contribute to the interpretability.
```{r multivariate-48, echo=TRUE, eval=FALSE}
# reordered reorganized heat map
time1 <- proc.time()
png_file <- paste(file_label, "heatmap_03b", ".png", sep="")
png(png_file, width=960, height=960)
# heatmap
X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none", dendrogram = "none",
Rowv = reorder(as.dendrogram(cluster_row), 1:dim(X)[1]),
Colv = reorder(as.dendrogram(cluster_col), 1:dim(X)[2]),
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations (by Latitude, then Longitude)", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```{r multivariate-49}
## user system elapsed
## 23.215 0.441 23.845
```
![](images/TraCE_decadal_heatmap_03b.png)
The reorganization seems to contrast low-variability (top) and high-variability locations and times (bottom).
[[Back to top]](multivariate.html)
### Heatmap (Locations (by Longitude, then Latitude))
```{r multivariate-50}
# reshape the 3-d array
dim(var_array)
var_array2 <- array(NA, dim = c(48, 96, 2204))
for (k in (1:nlat)) {
for (j in (1:nlon)) {
var_array2[k, j, 1:nt] <- var_array[j, k, 1:nt]
}
}
dim(var_array2)
```
```{r multivariate-51}
var_vec_long2 <- as.vector(var_array2)
length(var_vec_long2)
X2 <- matrix(var_vec_long2, nrow = nlat * nlon, ncol = nt) # Don't transpose as usual this time
dim(X2)
ncols <- dim(X2)[2]
```
```{r multivariate-52}
# generate row and column names
grid <- expand.grid(lat, lon)
names(grid) <- c("lat", "lon")
rownames(X2) <- paste("N", str_pad(sprintf("%.2f", round(grid$lat, 2)), 5, pad="0"),
"E", str_pad(sprintf("%.2f", round(grid$lon, 2)), 5, pad="0"), sep="")
rownames(X2) <- round(grid$lon, 2)
head(rownames(X2), 20); tail(rownames(X2), 20); length(rownames(X2))
colnames(X2) <- str_pad(sprintf("%.3f", t), 5, pad="0")
head(colnames(X2)); tail(colnames(X2)); length(colnames(X2))
```
```{r multivariate-53}
# generate colors for vertical and horizontal side bars
icol <- colorRampPalette(brewer.pal(10,"PRGn"))(nlon)
idx <- sort(rep((1:nlon), nlat))
rcol <- icol[idx]
ccol <- colorRampPalette(brewer.pal(9,"Purples"))(nt)
```
```{r multivariate-54, echo=TRUE, eval=FALSE}
time1 <- proc.time()
png_file <- paste(file_label, "hov_04", ".png", sep="")
png(png_file, width=960, height=960)
# heatmap
X2_heatmap <- heatmap.2(X2, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv=NA, Colv=NA, dendrogram = "none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations (by Longitude, then Latitude)", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```{r multivariate-55}
## user system elapsed
## 759.900 5.545 767.357
```
![](images/TraCE_decadal_hov_04.png)
```{r multivariate-56, echo=TRUE, eval=FALSE}
time1 <- proc.time()
png_file <- paste(file_label, "heatmap_04", ".png", sep="")
png(png_file, width=960, height=960)
cluster_row <- hclust(parDist(X2), method = "ward.D2")
cluster_col <- hclust(parDist(t(X2)), method = "ward.D2")
# heatmap
X2_heatmap <- heatmap.2(X2, cexRow = 0.8, cexCol = 0.8, scale="none", dendrogram = "none",
Rowv = as.dendrogram(cluster_row),
Colv = as.dendrogram(cluster_col),
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations (by Longitude, then Latitude)", key = TRUE, key.title = NA)
dev.off()
print (proc.time() - time1)
```
```{r multivariate-57}
## user system elapsed
## 1039.207 9.595 133.119
```
![](images/TraCE_decadal_heatmap_04.png)
[[Back to top]](multivariate.html)