-
Notifications
You must be signed in to change notification settings - Fork 35
/
appendix.Rmd
767 lines (581 loc) · 34 KB
/
appendix.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
# Appendix
```{r appendix_setup, echo=F}
# knitr::opts_chunk$set(cache.rebuild=F)
```
## R packages
The following is a non-exhaustive list of R packages which contain GAM functionality. Each is linked to the CRAN page for the package. Note also that several build upon the <span class="pack">mgcv</span> package used for this document. I haven't really looked much lately, as between <span class="pack">mgcv</span> and <span class="pack">brms</span> there is little you can't do. I can vouch that <span class="pack">gamlss</span> and <span class="pack">VGAM</span> are decent too, but I've not used either in a long time.
[brms](http://cran.r-project.org/web/packages/brms/) Allows for Bayesian GAMs via the Stan modeling language (very new implementation).
[CausalGAM](http://cran.r-project.org/web/packages/CausalGAM/) This package implements various estimators for average treatment effects.
[gam](http://cran.r-project.org/web/packages/gam/) Functions for fitting and working with generalized additive models.
[gamboostLSS](http://cran.r-project.org/web/packages/gamboostLSS/): Boosting models for fitting generalized additive models for location, shape and scale (gamLSS models).
[GAMens](http://cran.r-project.org/web/packages/GAMens/): This package implements the GAMbag, GAMrsm and GAMens ensemble classifiers for binary classification.
[gamlss](http://cran.r-project.org/web/packages/gamlss/): Generalized additive models for location, shape, and scale.
[gamm4](http://cran.r-project.org/web/packages/gamm4/): Fit generalized additive mixed models via a version of mgcv's gamm function.
[gss](http://cran.r-project.org/web/packages/gss/): A comprehensive package for structural multivariate function estimation using smoothing splines.
[mgcv](http://cran.r-project.org/web/packages/mgcv/): Routines for GAMs and other generalized ridge regression with multiple smoothing parameter selection by GCV, REML or UBRE/AIC. Also GAMMs.
[VGAM](http://cran.r-project.org/web/packages/VGAM/): Vector generalized linear and additive models, and associated models.
## A comparison to mixed models
We noted previously that there were ties between generalized additive and mixed models. Aside from the identical matrix representation noted in the [technical section][technical details], one of the key ideas is that the penalty parameter for the smooth coefficients reflects the ratio of the residual variance to the variance components for the random effects (see Fahrmeier et al., 2013, p. 483). Conversely, we can recover the variance components by dividing the scale by the penalty parameter.
To demonstrate this, we can set things up by running what will amount to equivalent models in both <span class="pack">mgcv</span> and <span class="pack">lme4</span> using the <span class="objclass">sleepstudy</span> data set that comes from the latter[^sleepstudy]. I'll run a model with random intercepts and slopes, and for this comparison the two random effects will not be correlated. We will use the standard smoothing approach in <span class="pack">mgcv</span>, just with the basis specification for random effects - `bs='re'`. In addition, we'll use restricted maximum likelihood as is the typical default in mixed models.
[^sleepstudy]: See `?sleepstudy` for details.
```{r mixed_vs_gam}
library(lme4)
mixed_model = lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
data = sleepstudy)
ga_model = gam(
Reaction ~ Days + s(Subject, bs = 're') + s(Days, Subject, bs = 're'),
data = sleepstudy,
method = 'REML'
)
```
<br>
In the following we can see they agree on the fixed/parametric effects, but our output for the GAM is in the usual, albeit, uninterpretable form. So, we'll have to translate the smooth terms from the GAM to variance components as in the mixed model.
<br>
```{r mixed_vs_gam_summaries}
summary(mixed_model)
summary(ga_model)
```
<br>
Conceptually, we can demonstrate the relationship with the following code that divides the scale by the penalty parameters, one for each of the smooth terms. However, there has been some rescaling behind the scenes regarding the Days effect, so we have to rescale it to get what we need.
<br>
```{r mixed_vs_gam_est_vcomp, eval=FALSE}
rescaled_results = c(
ga_model$reml.scale / ga_model$sp[1],
ga_model$reml.scale / (ga_model$sp[2] / ga_model$smooth[[2]]$S.scale),
NA
)
lmer_vcov = VarCorr(mixed_model) %>% data.frame()
gam_vcov = data.frame(var = rescaled_results, gam.vcomp(ga_model))
```
My personal package <span class="pack">mixedup</span> does this for you, and otherwise makes comparing mixed models from different sources easier.
```{r mixed_vs_gam_est_vcomp_pretty, eval=FALSE}
mixedup::extract_variance_components(mixed_model)
mixedup::extract_variance_components(ga_model)
```
```{r mixed_vs_gam_est_vcomp_pretty_show, echo=FALSE}
suppressMessages({
bind_rows(
mixedup::extract_variance_components(mixed_model),
mixedup::extract_variance_components(ga_model),
.id = 'model'
) %>%
mutate(model = factor(model, labels = c('mixed', 'gam'))) %>%
gt()
})
```
Think about it this way. Essentially what is happening behind the scenes is that effect interactions with the grouping variable are added to the model matrix (e.g. `~ ... + Days:Subject - 1`)[^gam_modmat]. The coefficients pertaining to the interaction terms are then penalized in the typical GAM estimation process. A smaller estimated penalty parameter suggests more variability in the random effects. A larger penalty means more shrinkage of the random intercepts and slopes toward the population level (fixed) effects.
Going further, we can think of smooth terms as adding random effects to the linear component[^gpreg]. A large enough penalty and the result is simply the linear part of the model. In this example here, that would be akin to relatively little random effect variance.
## Time and Space
One of the things to know about GAMs is just how flexible they are. Along with all that we have mentioned, they can also be applied to situations where one is interested in temporal trends or the effects of spatial aspects of the data. The penalized regression approach used by GAMs can easily extend such situations, and the <span class="pack">mgcv</span> package in particular has a lot of options here.
### Time
A natural setting for GAMs is where there are observations over time. Perhaps we want to examine the trend over time. The SLiM would posit a linear trend, but we often would doubt that is the case. How would we do this with a GAM? We can incorporate a feature representing the time component and add it as a smooth term. There will be some additional issues though as we will see.
Here I use the data and example at [Gavin Simpon's nifty blog](https://www.fromthebottomoftheheap.net/), though with my own edits, updated data, and different model[^simpblog]. The data regards global temperature anomalies.
[^simpblog]:@simpson_modelling_2018 has offered an article regarding this topic as well.
```{r gamtime_setup, echo = 1:4, eval = FALSE}
## Global temperatures
# Original found at "https://crudata.uea.ac.uk/cru/data/temperature/"
load(url('https://github.com/m-clark/generalized-additive-models/raw/master/data/global_temperatures.RData'))
# current data is in a bad format that mixes two data different sources by row ('coverage' and anomaly). they provide an R function to import this type of data but it apparently doesn't work after trying on different types of the data available. They also don't complete the data in the final year, so standard reads will screw up since it is space delimited. data.table was very useful in fixing the issues.
init = data.table::fread(
'https://crudata.uea.ac.uk/cru/data/crutem4/HadCRUT4-gl.dat',
header = FALSE,
fill = TRUE
)
## Drop the even rows and change names if retreived from source
gtemp = init %>%
slice(seq(1, nrow(.), by = 2))
## Add colnames
colnames(gtemp) = c("Year", month.abb, "Annual")
gtemp = gtemp %>%
mutate(Annual = as.numeric(Annual)) %>%
as_tibble()
```
```{r gamtime_setup_long, echo=FALSE, eval = FALSE}
# Create a long format for later; Also set year to start at 0
gtemp_long = gtemp %>%
mutate(Year0 = Year - 1850) %>%
pivot_longer(-c(Year, Annual), names_to = 'Month', values_to = 'Anomaly')
# not presently used
```
```{r gamtime_setup2, echo=FALSE, eval=-1, fig.asp=.5}
# save(gtemp, file = 'data/global_temperatures.RData')
load('data/global_temperatures.RData')
gtemp %>%
ggplot(aes(x= Year, y = Annual)) +
geom_hline(yintercept = 0) +
geom_line() +
labs(x = '') +
scale_x_continuous(breaks = seq(1850, 2020, by = 10))
```
<br>
Fitting a straight line to this would be disastrous, so let's do a GAM.
```{r gamtime, fig.asp=.5, echo=1:2}
hot_gam = gam(Annual ~ s(Year), data = gtemp)
summary(hot_gam)
plotdat = plot(ggeffects::ggpredict(hot_gam, 'Year'))$data %>%
as_tibble() %>%
rename(Year = x) %>%
left_join(gtemp, by = 'Year')
plotdat %>%
ggplot(aes(x = Year)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
geom_point(aes(y = Annual)) +
geom_line(aes(y = predicted)) +
labs(x = '') +
scale_x_continuous(breaks = seq(1850, 2020, by = 10))
```
<br>
We can see that the trend is generally increasing, and has been more or less since the beginning of the 20th century. We have a remaining issue though. In general, a time series is autocorrelated, i.e. correlated with itself over time. We can see this in the following plot.
```{r ar_raw, eval=FALSE}
acf(gtemp$Annual)
```
```{r ar_raw_pretty, echo=FALSE, fig.asp=.5}
acfdat = acf(
gtemp$Annual,
lag.max = 25,
bty = 'n',
lwd = 2,
lty = 1,
ylim = c(-.2, 1),
xlim = c(0, 25),
col = okabe_ito[1],
ci.col = okabe_ito[2],
main = '',
ylab = 'Autocorrelation',
col.axis = 'gray35',
col.lab = 'gray35',
col.main = 'gray50',
xaxt = 'n',
yaxt = 'n'
)
axis(
side = 1,
col = 'gray75',
col.ticks = 'gray50',
col.axis = 'gray35'
)
axis(
side = 2,
col = 'gray75',
col.ticks = 'gray50',
col.axis = 'gray35'
)
title(main = 'Autocorrelation of Annual Temperature Anomalies', col.main = 'gray25') # bc col.main ignored in acf
```
<br>
What the plot shows is the correlation of the values with themselves at different *lags*, or time spacings. Lag 0 is it's correlation with itself, so the value is 1.0. It's correlation with itself at the previous time point, i.e. lag = 1, is `r round(acfdat$acf[2], 2)`, it's correlation with itself at two time points ago is slightly less, `r round(acfdat$acf[3], 2)`, and the decreasing trend continues slowly. The dotted lines indicate a 95% confidence interval around zero, meaning that the autocorrelation is still significant 25 years apart.
With our model, the issue remains in that there is still autocorrelation among the residuals, at least at lag 1.
```{r ar_res, echo=FALSE, fig.asp=.5}
acfdat = acf(
resid(hot_gam),
lag.max = 25,
bty = 'n',
lwd = 2,
lty = 1,
ylim = c(-.2, 1),
xlim = c(0, 25),
col = okabe_ito[1],
ci.col = okabe_ito[2],
main = '',
ylab = 'Autocorrelation',
col.axis = 'gray35',
col.lab = 'gray35',
col.main = 'gray50',
xaxt = 'n',
yaxt = 'n'
)
axis(
side = 1,
col = 'gray75',
col.ticks = 'gray50',
col.axis = 'gray35'
)
axis(
side = 2,
col = 'gray75',
col.ticks = 'gray50',
col.axis = 'gray35'
)
axis(
side = 1,
col = 'gray75',
col.ticks = 'gray50',
col.axis = 'gray35'
)
axis(
side = 2,
col = 'gray75',
col.ticks = 'gray50',
col.axis = 'gray35'
)
title(main = 'Autocorrelation of Residuals', col.main = 'gray25')
```
The practical implications of autocorrelated residuals is that this positive correlation would result in variance estimates that are too low. However, we can take this into account with a slight tweaking of our model to incorporate such autocorrelation. For our purposes, we'll switch to the <span class="func">gamm</span> function. It adds additional functionality for generalized additive *mixed* models, though we can just use it to incorporate autocorrelation of the residuals. In running this, two sets of output are provided, one in our familiar <span class="objclass">gam</span> model object, and the other as a <span class="objclass">lme</span> object from the <span class="pack">nlme</span> package.
```{r gam_ar}
hot_gam_ar = gamm(Annual ~ s(Year),
data = gtemp,
correlation = corAR1(form = ~ Year))
summary(hot_gam_ar$gam)
summary(hot_gam_ar$lme)
```
In the gam output, we see some slight differences from the original model, but not much (and we wouldn't expect it). From the lme output we can see the estimated autocorrelation value denoted as `Phi`[^Xlme]. Let's see what it does for the uncertainty in our model estimates.
```{r gam_ar_fit, echo=FALSE, fig.asp=.5}
plotdat_noar = predict(
gamm(Annual ~ s(Year), data = gtemp)$gam,
newdata = tibble(Year = plotdat$Year),
type = 'response',
se = TRUE
) %>%
as_tibble() %>%
mutate(
Year = plotdat$Year,
upper = fit + 2 * se.fit,
lower = fit - 2 * se.fit
)
plotdat_ar = predict(
hot_gam_ar$gam,
newdata = tibble(Year = plotdat$Year),
type = 'response',
se = TRUE
) %>%
as_tibble() %>%
mutate(
Year = plotdat$Year,
upper = fit + 2 * se.fit,
lower = fit - 2 * se.fit
)
plotdat_ar %>%
ggplot(aes(Year, fit)) +
geom_ribbon(aes(ymin = lower, ymax = upper),
fill = okabe_ito[1],
alpha = .25) +
geom_ribbon(
aes(ymin = lower, ymax = upper),
fill = okabe_ito[1],
alpha = .5,
data = plotdat_noar
) +
geom_line(size = 1) +
scale_x_continuous(breaks = seq(1850, 2000, by = 25))
```
<br>
We can in fact see that we were a bit optimistic in the previous fit (darker band). Our new fit expreses more uncertainty at every point[^timefits]. So, in using a GAM for time-series data, we have similar issues that we'd have with standard regression settings, and we can deal with them in much the same way to get a better sense of the uncertainty in our estimates.
### Space
Consider a data set with latitude and longitude coordinates to go along with other features used to model some target variable. A spatial regression analysis uses an approach to account for spatial covariance among the observation points. A common technique used is a special case of *Gaussian process* which, as we noted previously, certain types of GAMs can be seen as such also. In addition, some types of spatial models can be seen similar to random effects models, much like GAMs. Such connections mean that we can add spatial models to the sorts of models covered by GAMs too.
When dealing with space, we may have spatial locations of a continuous sort, such as with latitude and longitude, or in a discrete sense, such as regions. In what follows we'll examine both cases.
#### Continuous Spatial Setting
Our example[^spatexample] will use census data from New Zealand and focus on median income. It uses the <span class="pack">nzcensus</span> package[^rforeverything] which includes median income, latitude, longitude and several dozen other variables. The latitude and longitude are actually centroids of the area unit, so this technically could also be used as a discrete example based on the unit.
[^spatexample]: This example is inspired by the post by Peter Ellis, which you can find [here](https://ellisp.github.io/blog/2016/08/04/nzcensus-gam-elastic-lm).
Let's take an initial peek. You can hover over the points to get the location and income information.
```{r nz_income, eval=T, echo=c(5,8), fig.asp=.5}
# install the nzcensus package (note it is part of the nzelect GitHub repository):
# devtools::install_github("ellisp/nzelect/pkg2")
# library(leaflet)
library(nzcensus)
# remove Chatham Islands
nz_census = AreaUnits2013 %>%
filter(WGS84Longitude > 0 & !is.na(MedianIncome2013)) %>%
rename(lon = WGS84Longitude,
lat = WGS84Latitude,
Income = MedianIncome2013) %>%
drop_na()
# create colour palette function
# pal = colorQuantile(viridis::plasma(n=10, begin = 1, end=0), domain = NULL, n=10)
# create labels for popups
nz_census$labs =
paste0(nz_census$AU_NAM, " $", format(nz_census$Income, big.mark = ","))
# draw map:
# leaflet() %>%
# addProviderTiles("CartoDB.Positron") %>%
# addCircles(lng = nz_census$lon, lat = nz_census$lat,
# color = pal(-nz_census$Income),
# popup = labs,
# radius = 500) %>%
# addLegend(
# pal = pal,
# values = -nz_census$Income,
# title = "Top Quantile of median<br>household income",
# position = "topleft")
pal = leaflet::colorQuantile(
scico::scico(
n = 100,
begin = 0,
end = 1,
palette = 'tokyo'
),
domain = NULL,
n = 10
)
g = list(scope = 'new zealand',
showframe = F,
showland = T,
landcolor = toRGB(NA))
g1 = c(
g,
resolution = 50,
showcoastlines = T,
countrycolor = toRGB('gray'),
coastlinecolor = toRGB('gray'),
projection = list(type = 'longlat'),
list(lonaxis = list(range = c(165, 179))),
list(lataxis = list(range = c(-46.75,-34))),
list(domain = list(x = c(0, 1), y = c(0, 1)))
)
```
```{r nz_income_plot, echo=FALSE, cache=FALSE}
map_data('nz') %>%
filter(grepl(region, pattern = 'North|South')) %>%
group_by(group) %>%
plot_geo(x = ~ nz_census$lon, y = ~ nz_census$lat) %>%
add_markers(
color = ~ Income,
colors = pal,
# size=I(5), # plotlys special bug
opacity = .66,
text = ~ labs,
marker = list(name = NULL),
hoverinfo = 'text',
data = as_tibble(nz_census)
) %>%
config(displayModeBar = F) %>%
layout(title = '',
geo = g1) %>%
theme_plotly()
```
<br>
So we can go ahead and run a model predicting median income solely by geography. We'll use a Gaussian process basis, and allowing latitude and longitude to interact (bumping up the default wiggliness possible to allow for a little more nuance). What the GAM will allow us to do is smooth our predictions beyond the points we have in the data to get a more complete picture of income distribution across the whole area[^marg][^spatvis].
[^marg]: The `m` argument allows one to specify different types of covariance functions.
[^spatvis]: This visualization was created with <span class="pack">plotly</span>. In case you're wondering, it was a notable ordeal to figure out how to make it, and nowadays, presumably there would be more efficient ways using sf/spatial features with ggplot.
```{r nz_income_gam, echo=1:2}
nz_gam = gam(Income ~ s(lon, lat, bs = 'gp', k = 100, m = 2), data = nz_census)
summary(nz_gam)
# convert map object to sp if needed
# nz_map = map('world', 'new zealand') %>%
# maptools::map2SpatialPolygons(IDs=sapply(strsplit(.$names, ":"), "[", 1L),
# proj4string=sp::CRS("+proj=longlat +datum=WGS84"))
nz_map_data = map_data("world") %>%
filter(region == 'New Zealand') %>%
filter(subregion %in% c('North Island', 'South Island')) %>% #adding , 'Stewart Island' will show plotly bug
group_by(group) %>%
rename(lon = long)
# might take a while
heatvals = tibble(
lon = rep(seq(
min(nz_census$lon) - 4, max(nz_census$lon) + 4, length.out = 500
), e = 500),
lat = rep(seq(
min(nz_census$lat) - 4, max(nz_census$lat) + 4, length.out = 500
), t = 500),
Income = predict(nz_gam, newdata = tibble(lon, lat))
) %>%
filter(
sp::point.in.polygon(.$lon, .$lat, pol.x = nz_map_data$lon, pol.y = nz_map_data$lat) == 1
)
blank_layer = list(
title = "",
showgrid = FALSE,
showticklabels = FALSE,
zeroline = FALSE
)
g = list(
scope = 'new zealand',
showframe = FALSE,
showland = TRUE,
landcolor = toRGB(NA)
)
g1 = c(g,
resolution = 50,
showcoastlines = TRUE,
countrycolor = toRGB('gray'),
coastlinecolor = toRGB('gray'),
projection = list(type = 'longlat'),
list(lonaxis = list(range = c(min(heatvals$lon), max(heatvals$lon)))),
list(lataxis = list(range = c(min(heatvals$lat), max(heatvals$lat)))),
list(domain = list(x = c(0, 1), y = c(0, 1)))
)
heatvals = heatvals %>% mutate(Size = .5)
```
```{r nz_income_gam_plot, echo = FALSE, cache=FALSE}
nz_map_data %>%
mutate(Size = .3) %>%
plot_geo(
x = ~ lon,
y = ~ lat,
mode = 'markers',
marker = list(size = ~ Size)
) %>%
layout(title = 'Expected Income',
geo = g1) %>%
add_markers(
x = ~ lon,
y = ~ lat,
color = ~ Income,
opacity = .5,
# size=~Size,
data = heatvals,
colors = scico::scico(100, palette = 'tokyo'),
inherit = F
)
```
<br>
Using the Gaussian process smooth produces a result that is akin to a traditional spatial modeling technique called *kriging*. There are many other features to play with, as well as other bases that would be applicable, so you should feel free to play around with models that include those.
Alternatively, as we did with the time series, we could instead deal with *spatial autocorrelation* by specifying a model for the residual structure. First, we can simply test for spatial autocorrelation in the income variable via the well-worn *Moran's I* statistic. Given some weight matrix that specifies the neighborhood structure, such that larger values mean points are closer to one another, we can derive an estimate. The following demonstrates this via the <span class="pack">ape</span> package[^spatscaledt].
[^spatscaledt]: Using `scaled = TRUE` results in a correlation metric that goes from -1 to 1.
```{r morans, eval=-3, echo=-4}
inv_dist = with(nz_census, 1/dist(cbind(lat, lon), diag = TRUE, upper = TRUE))
inv_dist = as.matrix(inv_dist)
ape::Moran.I(nz_census$Income, weight = inv_dist, scaled = TRUE)
ape::Moran.I(nz_census$Income, weight = inv_dist, scaled = TRUE) %>%
data.frame() %>%
gt()
```
<br>
While statistically significant, there actually isn't too much going on, though it may be enough to warrant dealing with in some fashion. As with the time series, we'll have to use the functionality with <span class="func">gamm</span>, where the underlying nlme package provides functions for spatial correlation structures. The following shows how this might be done. If you run this be prepared to wait for a few minutes.
```{r gamm_spatial, eval=FALSE}
gamm_spat = gamm(
Income ~ s(x) + s(y) + z, # choose your own features here
data = nz_census,
correlation = corSpatial(form = ~ lon + lat, type = 'gaussian')
)
plot(gamm_spat)
```
So whether you choose to deal with the spatial autocorrelation explicitly by using something like coordinates as features in the model itself, or via the residual correlation structure, or perhaps both, is up to you[^spatcorvar].
[^spatcorvar]: For a nice discussion of this, see the Q & A at [Stack Exchange](https://stats.stackexchange.com/questions/35510/why-does-including-latitude-and-longitude-in-a-gam-account-for-spatial-autocorre), and note the top *two* answers to the question "Why does including latitude and longitude in a GAM account for spatial autocorrelation?".
#### Discrete
What about the discrete case, where the spatial *random effect* is based on geographical regions? This involves a penalty that is based on the adjacency matrix of the regions, where if there are $g$ regions, the adjacency matrix is a $g \times g$ indicator matrix where there is some non-zero value when region i is connected to region j, and 0 otherwise. In addition, an approach similar to that for a random effect is used to incorporate observations belonging to specific regions. These are sometimes referred to as geoadditive models.
You'll be shocked to know that <span class="pack">mgcv</span> has a smooth construct for this situation as well, `bs = 'mrf'`, where `mrf` stands for *Markov random field*, which is an undirected graph.
The following[^spatdiscex] will model the percentage of adults with only a high school education. Unfortunately, when dealing with spatial data, getting it into a format amenable to modeling will often take some work. Specifically, <span class="pack">mgcv</span> will need a neighborhood list to tell it how the different areas are linked[^nosp]. Furthermore, the data we want to use will need to be linked to the data used for mapping.
[^spatdiscex]: This example comes from [Gavin Simpson's blog](https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/), which itself is based on an article at [The Pudding](https://pudding.cool/process/regional_smoothing/).
The first step is to read a shapefile that has some county level information. You could get this from census data as well[^shapefiledat].
[^shapefiledat]: Data with this high school graduation rate found on [GitHub](https://github.com/polygraph-cool/smoothing_tutorial/blob/master/us_county_hs_only.zip). But you can also find this sort of thing with the <span class="pack">tigris</span> and <span class="pack">tidycensus</span> packages.
```{r gamspatdiscrete}
# contiguous states c(1,4:6, 8:13, 16:42, 44:51, 53:56)
library(sp)
shp = rgdal::readOGR('data/us_county_hs_only')
## select michigan, and convert % to proportion
mich_df = shp[shp$STATEFP %in% c(26),] %>% # add other FIPS codes as desired
as_tibble() %>%
droplevels() %>%
mutate(
hsd = hs_pct / 100,
county = stringr::str_replace(tolower(NAME), pattern = '\\.', ''),
county = factor(county)
)
```
The following creates a neighborhood list[^usingmaps]. We also need names to match the values in the data, as well as the plotting data to be used later. I just made them lower case and remove punctuation. If you use more than one state, you will have to deal with duplicated names in some fashion.
```{r bind_map_data, echo=1:2}
nb = spdep::poly2nb(shp[shp$STATEFP %in% c(26), ], row.names = mich_df$county)
names(nb) = attr(nb, "region.id")
# mich_map = maps::map('county', 'michigan', fill=T) # fill is required
# mich_poly = maptools::map2SpatialPolygons(mich_map,
# IDs=stringr::str_sub(mich_map$names, start = 10),
# proj4string=sp::CRS("+proj=longlat +datum=WGS84"))
# mich_nb = spdep::poly2nb(mich_poly)
# mich_df$county = factor(stringr::str_replace(tolower(mich_df$NAME), pattern='\\.', ''))
# names(mich_nb) = attr(mich_nb, "region.id")
```
With neighborhood in place, we can now finally run the model. Note that the ID used for the smooth, in this case `county`, needs to be a factor variable. If not, you will get an uninformative error message that doesn't tell you that's the problem. For this demonstration we'll not include any other features in the model, but normally you would include any relevant ones.
```{r gam_mrf}
gam_mrf = gam(
# define MRF smooth
hsd ~ s(county, bs = 'mrf', xt = list(nb = nb)),
data = mich_df,
method = 'REML',
# fit a beta regression
family = betar
)
summary(gam_mrf)
mich_df = mich_df %>%
mutate(fit = predict(gam_mrf, type = 'response'))
```
Now we can plot it. <span class="pack">Plotly</span> works with <span class="pack">maps</span> package objects that have been converted via <span class="pack">ggplot2's</span> <span class="func">map_data</span> function. So, we create some plot-specific data, and then add our fitted values to it. We then add our own coloring based on the fitted values, and a custom clean theme.
```{r plot_mich, echo=-4, fig.asp=1}
plotdat = map_data("county", 'michigan') %>%
left_join(mich_df, by = c('subregion' = 'county')) %>%
mutate(fillcol = cut(fit, breaks = seq(.25, .45, by = .025)))
p = plotdat %>%
group_by(subregion) %>%
plot_ly(
x = ~ long,
y = ~ lat,
color = ~ fillcol,
colors = scico::scico(100, palette = 'tokyo'),
text = ~ subregion,
hoverinfo = 'text'
) %>%
add_polygons(line = list(width = 0.4)) %>%
layout(title = "% with Maximum of HS Education in Michigan") %>%
theme_blank()
p
```
<br>
Be prepared, as this potentially will be a notable undertaking to sort out for your given situation, depending on the map objects and structure you're dealing with.
#### A Discrete Alternative
One of the things that has puzzled me is just how often people deal with geography while ignoring what would almost always be inherent correlation in discrete geographical or other units. In the social sciences for example, one will see a standard random effects approach, i.e. [a mixed model](https://m-clark.github.io/mixed-models-with-R/), applied in the vast majority of situations where the data comes from multiple regions. This will allow for region specific effects, which is very useful, but it won't take advantage of the fact that the regions may be highly correlated with one another with regard to the target variable of interest, even after controlling for other aspects.
We've already been using <span class="func">gamm</span>, but haven't been using the typical random effects approach with it. We could do so here, but we can also just stick to the usual <span class="func">gam</span> function, as it has a basis option for random effects. One thing that distinguishes the mixed model setting is that observations will be clustered within the geographical units. So for our example, we'll use the Munich rent data available from the <span class="pack">gamlss</span> family of packages, which contains objects for the Munich rent data and boundaries files of the corresponding districts from the 1999 survey. The <span class="objclass">rent99</span> data contains information about rent, year of construction, weather it has central heating, etc. Important for our purposes is the district identifier. The following shows the data structure for some observations.
```{r basic_re, echo=FALSE}
# attempted to get the shapefile, but have no idea what these districts correspond to, and it definitely is not zip
# see https://gis.stackexchange.com/questions/25060/postal-areas-for-germany
# g_shp = rgdal::readOGR('data/plz')
# munich_df = g_shp[g_shp@data$PLZ99_N %in% 80000:81999, ]
library(gamlss.data)
rent99 %>%
mutate_if(is.numeric, round, digits=2) %>%
head(15) %>%
gt()
```
<br>
Here again we'll use a Markov random field smooth, and for comparison a mixed model with a random effect for district. The plots show that, while close, they don't exactly come to the same conclusions for the district fitted values.
```{r re_vs_mrf, echo=1:11}
library(gamlss.data)
# prep data
rent99 = rent99 %>%
mutate(district = factor(district))
rent99.polys[!names(rent99.polys) %in% levels(rent99$district)] = NULL
# run mrf and re models
gam_rent_mrf = gam(rent ~ s(district, bs = 'mrf', xt = list(polys = rent99.polys)),
data = rent99,
method = 'REML')
gam_rent_re = gam(rent ~ s(district, bs = 're'),
data = rent99,
method = 'REML')
```
```{r re_vs_mrf_plot, echo=FALSE, fig.asp=.66, cache=FALSE}
tibble(rent99,
mrf_fit = fitted(gam_rent_mrf),
mm_fit = fitted(gam_rent_re)) %>%
ggplot(aes(x = mm_fit, y = mrf_fit)) +
geom_point(color = '#D55E00', alpha =.25, size = 3, group = 1)
```
<br>
Next we show the plot of the estimated random effects of both models on the map of districts[^munichplot]. Gaps appear because there isn't data for every district available, as some are districts without houses like parks, industrial areas, etc.
```{r re_vs_mrf2, echo=FALSE, fig.asp=.66}
# Basically the procedures is as follows:
# - reorder data and polys according to district to remove any ordering ambiguity
# - run models
# - debug plotting the mrf until the very moment of plotting
# - save out objects/load objects
# - switch fits in the mrf pd object with re fits
# In the end, polys.plot is called on the polys and fitted values, see below
load('data/mrf_re_plot.RData')
par(mfrow = c(1, 2))
plot(gam_rent_mrf_forplot)
mgcv:::plot.mrf.smooth(x$smooth[[1]], P = pd2[[1]], scheme = 0)
graphics::layout(1)
# par(mfrow=c(1,2))
# fv = fitted(gam_rent_mrf)
# polys.plot(rent99.polys, fv)
# fv = fitted(gam_rent_re)
# polys.plot(rent99.polys, fv)
# graphics::layout(1)
```
As we might have expected, there appears to be more color coordination with the MRF result (left), since neighbors are more likely to be similar. Meanwhile, the mixed model approach, while showing similar patterning, does nothing inherently to correlate one district with the ones it's next to, but may allow for more regularization.
Either of these models is appropriate, but they ask different questions. The MRF approach may produce better results since it takes into account the potential similarity among neighbors, but also may not be necessary if there isn't much similarity among geographical units. One should also think about whether the other features in the model may account spatial autocorrelation or not, or unspecified unit effects, and proceed accordingly. In addition, there are approaches that would allow for a mix of both the unstructured and spatial random effects. See @riebler2016intuitive and the [brms package](https://paul-buerkner.github.io/brms/reference/car.html) and Mitzi Morris' note [here](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) .
[^Xlme]: All the same variables in the lme output start with X. This is more to avoid confusion in the functions behind the scenes.
[^timefits]: I don't show it to keep the plot clean, but the fitted values are essentially the same.
[^rforeverything]: Because of course there is an R package just for New Zealand census data.
[^usingmaps]: If you look at the markdown document for this on GitHub you'll see the code for how to create this using an object from the maps packages rather than needing a shapefile.
[^nosp]: There are actually three different types of objects one could supply here, but unfortunately the one thing <span class="pack">mgcv</span> doesn't do is work with any spatial objects one would already have from the popular R packages used for spatial modeling and visualization. The fact that this list is actually a special class object is of no importance here, it is read simply as a standard list object.
[^munichplot]: Unfortunately, I have neither the data nor the desire to try and make this a pretty plot. It just uses the basic <span class="pack">mgcv</span> plot and an attempted trick (which may not be entirely accurate) to superimpose the mixed model results onto the MRF data.
[^gam_modmat]: You can verify this by running `model.matrix(ga_model)`.
[^gpreg]: Much like with Gaussian process regression, where it's perhaps a bit more explicit.