-
Notifications
You must be signed in to change notification settings - Fork 40
/
simplify.qmd
247 lines (191 loc) · 18.1 KB
/
simplify.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
---
title: "choosing random effects terms in mixed models"
author: Ben Bolker
date: today
bibliography: ./glmm.bib
format:
html:
embed-resources: true
---
Source file is [here](https://github.com/bbolker/mixedmodels-misc/blob/master/simplify.qmd). Also see [Emi Tanaka's approach to the example below](https://rpubs.com/emitanaka/simplify-revarstr), mostly with AS-REML.
An R-centric discussion of what to do to manage the complexity of mixed models, specifically their random effect components.
# General framework
## maximal model
The *maximal model* [@barr_random_2013] is the model that includes all random effect terms that would be identifiable, given the experimental (observational) experimental design, *given arbitrarily large samples* (both of groups and of observations within groups), what I would call satisfying ``strong identifiability''.
* all categorical variables in the data with exchangeable levels are treated as random effects [@yarkonigeneralizability2022]
* all terms in the fixed-effect model are included as varying terms where possible, i.e. within every grouping variable where the values of the fixed-effect term vary within (and not just between) groups
* for models where the conditional distribution has an estimated scale parameter (e.g. Gaussian, Gamma) and a varying term with $n$ parameters (e.g. $n=2$ for a random-slopes model), there must be at least $n+1$ observations per group in order for the scale parameter (residual variance in the case of a Gaussian) to be identifiable, e.g. see "starlings" example [here](https://bbolker.github.io/mm_iowa2024/notes/mixed_lab.html). For models with fixed scale/dispersion (e.g. Poisson, Gamma) the limitation is $n$ observations per group (unless an observation-level random effect is added, in which case we again need $n+1$). For models with estimated dispersion rather than scale (e.g. negative binomial), the model is *theoretically* identifiable with only $n$ observations per group, but it probably won't work.
## model constraint strategies
When we constrain the covariance matrix (i.e. any model for specific than "general positive semi-definite"), the model results and definition depend on the contrasts used. For factor terms, it might make the most sense to specify sum-to-zero contrasts unless there is a clear reference/control level against which all other levels should be compared; for continuous terms, the predictors should probably be zero centered (or at least zeroed at some sensible reference level; see the end of section 2.2 in @bates_fitting_2015).
### compound-symmetric models
For factor models with many levels, compound-symmetric models are a good simplification approach. This restricts the correlations among levels to be identical for all pairs of levels (i.e., a correlation matrix with all off-diagonal elements equal to the same constant $\rho$).
For example, working with the built-in `cbpp` data set where `period` is a four-level factor, the covariance matrices implied by `(period|herd)` vs. `(1|period/herd)` are:
$$
(\textrm{intercept}, \textrm{slope}) =
\textrm{MVN}\left(\boldsymbol 0,
\left[
\begin{array}{cccc}
\sigma^2_{\{h|1\}} & . & . & . \\
\sigma_{\{h|1\},\{h|p_{21}\}} &
\sigma^2_{\{h|p_{21}\}} & . & . \\
\sigma_{\{h|1\}, \{h|p_{31}\}} &
\sigma_{\{h|p_{21}\},\{h|p_{31}\}} &
\sigma^2_{\{h|p_{31}\}} & . \\
\sigma_{\{h|1\} ,\{h|p_{41}\}} &
\sigma_{\{h|p_{21}\},\{h|p_{41}\}} &
\sigma_{\{h|p_{31}\},\{h|p_{41}\}} &
\sigma^2_{\{h|p_{41}\}}
\end{array}
\right]
\right)
$$
(=$(n(n+1))/2 = (4\times 5)/2 = 10$ parameters)
vs.
$$
\left[
\begin{array}{cccc}
\sigma^2 & . & . & . \\
\rho \sigma^2 & \sigma^2 & . & . \\
\rho \sigma^2 & \rho \sigma^2 & \sigma^2 & . \\
\rho \sigma^2 & \rho \sigma^2 & \rho \sigma^2 & \sigma^2 \\
\end{array}
\right]
$$
where $\sigma^2 = \sigma^2_{\{b|1\}}+\sigma^2_{\{herd:period|1\}}$,
$\rho = \sigma^2_{\{b|1\}}/\sigma^2$ (=2 parameters;
$\rho$ must be >0)
- The shortcut of using a nested model works in most mixed model frameworks (almost any platform that allows for multiple random effects will allow them to be nested), but is restricted to positive compound-symmetric models (i.e. $\rho>0$). Some packages allow general compound symmetric matrices (i.e. $-1 < \rho < 1$), e.g. using `pdCompSymm` in `nlme`, `cs()` in `glmmTMB`, `cosy` in `brms` (apparently `MCMCglmm` doesn't have a CS struture?) ...
- We also need to distinguish between *heterogeneous* compound symmetry (i.e., variances are the same for every level; $n+1$ parameters for an $n \times n$ covariance matrix) and *homogeneous* compound symmetry (all variances are the same, 2 parameters regardless of the size of the cov matrix). The nesting trick described above only handles homogeneous CS. `cs()` in `glmmTMB` does heterogeneous CS, homogeneous CS could be handled by using the `map` argument to fix all of the standard deviations to be the same (or a `homcs()` structure could be added if there's enough demand for it).
### independence models
This is probably the best known approach to model simplification (e.g. as recommended by @barr_random_2013). It's actually a special case of CS, with $\rho=0$. Also called "diagonal" for obvious (?) reasons.
- in `lme4` (or `glmmTMB` or `brms`), use `||` in place of `|`; `glmmTMB` allows `diag()` for heterogenous diagonal/independence models (and `homdiag()` for homogeneous); `MCMCglmm` uses `idv()` (homogeneous) and `idh()` (heterogeneous).
- `||` (still) doesn't work in `lme4` for factor-valued terms; use `afex::mixed` or `glmmTMB`, or expand the model in dummies (`(0 + dummy(f, level1)|g) + (0 + dummy(f, level2)|g) + ...`; kind of a pain for large numbers of factors, although you could construct it via string processing if you wanted)
- see general comments above about the dependence of the constrained model on the zero point of the numeric terms
### dropping terms
Especially for vector-valued random effects consisting of variation in multiple numeric predictors, it may make sense to drop the random effect term for terms with very small variance, or terms that are of less interest. For example, you might reduce the random-effect term `(1 + fire + NPP | biome)` to `(1 + fire | biome)` by dropping the variation in `NPP` across biomes.
### reduced-rank models
*Factor-analytic* or *reduced-rank* models are a more flexible approach to simplifying complex covariance functions, by doing something analogous to computing a principal components analysis on the covariance matrix and keeping a subset of the terms. In `glmmTMB` you can do this with the `rr()` covariance structure; see the relevant section of the [covariance structure vignette](), or [this draft](https://www.math.mcmaster.ca/bolker/misc/mcgillycuddy2024.pdf) by McGillycuddy et al., in press in *J. Stat. Software*.
## parameter constraint strategies
Instead of changing the covariance model to reduce its dimensionality, you can add a Bayesian prior (or a regularization term, if you're not Bayesian) to make the model better behaved and/or keep it from being singular.
- If you want to apply a minimally informative prior that will prevent singularity, @chung_nondegenerate_2013 show that an improper Gamma(shape=2, rate=0) is the weakest Gamma prior for the standard deviation that will prevent a posterior mode at zero. (The mode will be "approximately one standard error away from zero when the maximum likelihood estimate is at zero".) In practice this might be implemented using a very small rate parameter or very large shape parameter. A Wishart prior with $\nu = 1 + d$ and infinite scale might (?) have similar properties for vector-valued random effects/covariance matrices larger than 1×1. The `blme` package implements this approach (although with a default shape of 2.5 rather than 2); it can also be implemented in `glmmTMB` (see the [priors vignette](https://cran.r-universe.dev/glmmTMB/doc/priors.html)), and of course in any of the full-Bayesian packages (`brms`, `rstanarm`, `MCMCglmm`, etc.)
- If you're a real Bayesian or are otherwise comfortable with stronger priors, you can choose from a variety of other priors (e.g. half-Normal or half-$t$ or Gamma or log-Normal priors for standard deviations, Wishart or inverse-Wishart distributions for full covariance matrices, [LKJ priors](https://en.wikipedia.org/wiki/Lewandowski-Kurowicka-Joe_distribution) for correlations) to help out your model. (LKJ priors are a nice generalization of the diagonal covariance structures above, as an LKJ prior with $\eta>1$ pushes the correlation matrix to be closer to diagonal without constraining it to be exactly diagonal ...)
- a real Bayesian would also say you should **not** choose priors just to make your model behave better (e.g. eliminate divergences in a Hamiltonian Monte Carlo run). Instead, you should use prior predictive simulations to tune the priors to limit the random-effects structures to those that produce reasonable output (and then cross your fingers that this also resolves your computational problems).
## multiple RE terms
Given all these options, any combination of which might apply to different RE terms in a model with more than one random effect, how should one choose among them?
- if a model seemed to be sampling or converging to an optimum reasonably well but the fit was singular (i.e. on the boundary of the space of allowed covariance estimates), and the results were sufficiently robust that you could get estimates and standard errors/CIs for all of the values of interest), you might choose to proceed with the singular model. Singular fits are theoretically valid; they just make a lot of downstream procedures, such as those based on calculating curvature of the log-likelihood surface, harder. It might also be reasonable to suppose that numerical procedures could be less reliable in this case.
- @barr_random_2013 suggest "keep[ing] it maximal", i.e. reducing the model only where necessary to deal convergence issues (they don't actually mention singular fits, nor say much about the [issues with convergence checks in lme4](https://lme4.r-universe.dev/lme4/doc/manual.html#convergence):
> In cases of nonconvergence, simplification of the random effects structure proceeded as follows. For between-items designs, the by-subjects random slope was dropped. For within-items designs, statistics from the partially converged model were inspected, and the slope associated with smaller variance was dropped
It doesn't seem that they considered models with multiple random effects terms (where one wouldn't necessarily know which term to try simplifying first ...)
- @matuschek_balancing_2017 prefer a model selection approach; they lay out a stepwise approach based on likelihood ratio tests with a relaxed threshold ($\alpha = 0.2$) for retaining terms. Based on a model with two random-slopes terms, they suggest a sequence^[I don't understand how we choose between models 3 and 4 in this list; they are both nested in model 2 and contain model 5, but neither is nested in the other ...]
1. full model
2. independent slopes and intercepts for both terms [$\rho = 0$]
3. drop the random slope for one term [item-specific random slopes]
4. drop the random slope for the other term [subject-specific random slopes]
5. drop both random slopes
They also consider AIC-based selection.
- @moritzrole2023a used a more complex model than either of the references above: the among-group variation of an intercept and four numeric predictors (i.e., a 5×5 covariance matrix), applied at three different grouping levels. They consider the possibilities of (1) a full (unconstrained) covariance matrix, (2) a diagonal/independent covariance, or (3) an intercept-only random effect, for each grouping variable, i.e. a total of $3^3 = 81$ possible models. They fitted all 81 models and chose the model with the best AIC among only the models with non-singular fits
- @singmannIntroduction2019 doesn't present any methods beyond those suggested above, but the section on "Specifying the Random Effects Structure" has a clear description
- several R packages have implemented model selection machinery for the random effects (most of the options, e.g. `MuMIn`, `glmulti`, `glmmLasso`, do model selection only on the fixed effects component)
- The [buildmer](https://cran.r-project.org/package=buildmer) package is probably the fanciest: it
> [f]inds the largest possible regression model that will still converge for various types of regression analyses ... and then optionally performs stepwise elimination similar to the forward and backward effect-selection methods in SAS, based on the change in log-likelihood or its significance, Akaike's Information Criterion, the Bayesian Information Criterion, the explained deviance, or the F-test of the change in R².
Its allowed steps appear to include both including/excluding particular terms from the RE term for a particular grouping variable, as well as including or excluding complete terms (i.e. adding or dropping a grouping variable)
- `step.lmerModLmerTest()` from [lmerTest](https://CRAN.R-project.org/package=lmerTest) does backward stepwise selection (*among* random-effects terms, i.e. not considering simplification of individual terms as discussed above) based on likelihood ratio tests
- `ffRanefLMER.fnc()` from [LMERConvenienceFunctions](https://CRAN.R-project.org/package=LMERConvenienceFunctions) does forward selection (the description of the `ran.effects` argument that specifies the possible random effects is complicated ...)
- the [asremlPlus](https://cran.r-project.org/package=asremlPlus) package has a lot of machinery, and some detailed examples, for model building and selection using AS-REML
# Example
## setup
```{r pkgs, message=FALSE}
library(lme4)
library(Matrix)
library(glmmTMB)
## For rlkj_corr_cholesky; LKJ simulation also available in `rethinking` package
## (GitHub only: remotes::install_github("rmcelreath/rethinking"))
library(nimble)
library(blme)
conflicted::conflict_prefer("simulate", "stats")
```
For this example we'll think about a forest restoration experiment with four plots; 5 restoration treatments that are applied in each plot (a randomized complete block design); and 6 replicates within each block/treatment.
```{r sim}
library(lme4)
set.seed(101)
dd <- expand.grid(plot = factor(1:4),
ttt = factor(LETTERS[1:5]),
rep = 1:6)
## sample a random 5x5 correlation matrix
cchol <- nimble::rlkj_corr_cholesky(n = 1, eta = 1.2, p = 5) *
## scale by random SDs
rlnorm(5, meanlog = 0, sdlog = 0.2)
theta <- t(cchol)[lower.tri(cchol, diag = TRUE)]
dd$y <- suppressMessages(
simulate( ~ ttt + (ttt|plot),
family = gaussian,
newdata = dd,
newparams = list(beta = seq(1, by = -0.5, length.out = 5),
theta = theta,
sigma = 1)))[[1]]
```
## maximal model
We would like to be able to fit `~ ttt + (ttt|plot)`, but that's very unlikely to actually work:
```{r fit-full}
m1 <- lmer(y ~ ttt + (ttt|plot), data = dd)
isSingular(m1)
getME(m1, "theta")[getME(m1, "lower") ==0]
```
Seems pretty clear here that the real rank of the estimate is 3 ...
## diagnostics
Investigate eigenvectors of the estimated $\Sigma$; we can see again that the covariance matrix has 2 trivial ($\approx 0$) eigenvalues, but not much else; the rotation matrix doesn't give us any particularly useful clues about how to simplify the model (maybe a heatmap with column (eigenvector) order held fixed and reordering and hierarchical clustering done on the rows would be helpful for identifying clusters? Ideally, the clustering would be done based on shared similarity based on *absolute values* (i.e., we might want to identify linear combinations/contrasts among terms that would allow us to simplify the model ... surely someone has done something like this?) [`rePCA()` definitely could use some additions to make it more useful, e.g. a plot method ... even the original paper [@bates_parsimonious_2015] doesn't offer much guidance ...]
```{r repca}
(r <- rePCA(m1))
image(Matrix(r$plot$rotation))
```
## constraining the model
```{r cs-fits}
m1_homcs_lmer <- lmer(y ~ ttt + (1|plot/ttt), data = dd)
m1_homcs_glmmTMB <- glmmTMB(y ~ ttt + (1|plot/ttt), data = dd,
REML = TRUE)
m1_hetcs_glmmTMB <- glmmTMB(y ~ ttt + cs(ttt|plot), data = dd,
REML = TRUE)
```
### independence models
```{r diag-fits}
m1_diag_glmmTMB <- glmmTMB(y ~ ttt + diag(ttt|plot), data = dd,
REML = TRUE)
## also try afex::mixed?
```
### reduced-rank models
```{r rr-fits}
m1_rr1_glmmTMB <- glmmTMB(y ~ ttt + rr(ttt|plot, d=1), data = dd,
REML = TRUE)
m1_diag_rr1_glmmTMB <- glmmTMB(y ~ ttt +
diag(ttt|plot) +
rr(ttt|plot, d=1), data = dd,
REML = TRUE)
m1_diag_rr2_glmmTMB <- glmmTMB(y ~ ttt +
rr(ttt|plot, d=2), data = dd,
REML = TRUE)
```
## parameter constraints/regularization
Constraining just the standard deviations to be positive doesn't help us: the 5×5 correlation matrix is only rank-4 ... a singularity-avoiding prior for the correlation matrix isn't available yet.
```{r prior}
gprior <- data.frame(prior = "gamma(1e8, 2.5)",
class = "ranef",
coef = "")
m1_us_prior_glmmTMB <- glmmTMB(y ~ ttt +
(ttt|plot),
data = dd,
prior = gprior,
REML = TRUE)
cormat <- cov2cor(VarCorr(m1_us_prior_glmmTMB)$cond[[1]])
print(eigen(cormat)$value, digits = 3)
```
### AIC comparison
The more complex models really aren't doing well in terms of AIC in this case ...
```{r aictab}
all_fits <- ls(pattern = "^m1_")
all_fit_list <- mget(all_fits)
bbmle::AICtab(all_fit_list)
```
## to do
- provide examples of all (!?) of these approaches. In addition to the simulated example I started with (a single RE with a multi-level factor), might need to include examples with (1) multiple continuous predictors and (2) multiple RE terms in order to illustrate everything ...
- fancy ideas; RJMCMC for Bayesian models? Hierarchical clustering to lump/reduce factor levels? Priors on distribution of RR variances (e.g. geometric)?
## references