-
Notifications
You must be signed in to change notification settings - Fork 23
/
07-crossed-random-factors.Rmd
601 lines (391 loc) · 30 KB
/
07-crossed-random-factors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
# Linear mixed-effects models with crossed random factors
```{r chapter-status, echo=FALSE, results='asis'}
chapter_status("complete")
```
## Generalizing over encounters between subjects and stimuli
A common goal of experiments in psychology is to test claims about behavior that arises in response to certain types of stimuli (or sometimes, the neural underpinning of that behavior). The stimuli might be, for instance, words, images, sounds, videos, or stories. Some examples of claims you might want to test are:
* when listening to words in a second language, do bilinguals experience interference from words in their native language?
* do people rate the attractiveness of faces differently when they are in a good mood than when they are in a bad mood?
* does viewing soothing images help reduce stress relative to more neutral images?
* when reading a scenario ambiguously describing a target individual, are people more likely to make assumptions about what social group the target belongs to after being subliminally primed?
One thing to note about all these claims is that the are of the type, "what happens to our measurements when an individual of type X encounters a stimulus of type Y", where X is drawn from a target population of subjects and Y is drawn from a target population of stimuli. In other words, we are attempt to make generalizable claims about a particular class of **events** involving **encounters** between sampling units of subjects and stimuli [@Barr_2017]. But just like we can't sample all possible subjects from the target population of subjects, we also cannot sample all possible stimuli from the target population of stimuli. Thus, when drawing inferences, we need to account for the uncertainty introduced in our estimates by *both* sampling processes [@Coleman_1964; @Clark_1973; @Judd_Westfall_Kenny_2012; @yarkoni_2019]. Linear mixed-effects models make it particularly easy to do this by allowing more than one random factor in our model formula [@Baayen_Davidson_Bates_2008].
Here is a simple example of a study where you are interested in testing whether people rate pictures of cats, dogs, or sunsets as more soothing images to look at. You want to say something general about the category of cats, dogs, and sunsets and not something about the specific pictures that you happened to sample. Let's say you randomly select four images from each of the three categories from Google Images (you would absolutely need to have more to be able to say something generalizable, but we chose a small number to keep the example simple). So your table of stimuli might look like the following:
```{r stim-tbl, echo=FALSE}
stimuli <- tibble(stimulus_id = seq_len(12),
category = rep(c("cat", "dog", "sunset"), each = 4)) %>%
group_by(category) %>%
mutate(file = paste0(category, row_number(), ".jpg")) %>%
ungroup()
stimuli %>% knitr::kable()
```
Then you sample a set of four participants to perform the soothing ratings. Again, four would be too few for a real study, but we're keeping it small just for expository purposes.
```{r subj-tbl, echo=FALSE}
.dates <- sort(lubridate::ymd("2020-04-30") + sample(0:30, 4, TRUE))
subjects <- tibble(subject_id = seq_len(4),
age = sample(18:71, 4, TRUE),
date = .dates)
subjects %>% knitr::kable()
```
Now, because each subject has given a "soothingness" rating for each picture, you'd have a full dataset consisting of all of the levels of `subject_id` crossed with all of the levels of `stimulus_id`. This is what we mean when we talk about "crossed random factors." You can create the table containing all these combinations with the `crossing()` function from `tidyr` (which is loaded when you load in `tidyverse`).
```{r crossed-tbl, eval=FALSE}
crossing(subjects %>% select(subject_id),
stimuli %>% select(-category))
```
```{r crossed-tbl-knitr, echo=FALSE}
crossing(subjects %>% select(subject_id),
stimuli %>% select(-category)) %>%
knitr::kable()
```
Because you have 4 subjects responding to 12 stimuli, the resulting table will have 48 rows.
## lme4 syntax for crossed random factors
How should we analyze such data? Recall from the last chapter that the lme4 formula syntax for a model with by-subject random intercepts and slopes for predictor `x` would be given by `y ~ x + (1 + x | subject_id)` where the term in brackets with the vertical bar `|` provides the random effects specification. The variable to the right of the bar, `subject_id`, specifies the variable that identifies the levels of the random factor. The formula to the left of the bar within the brackets, `1 + x`, specifies the random effects associated with this factor, which in this case is a random intercept and random slope for `x`. The best way to think about this bracketed part of the formula `(1 + x | subject_id)` is **as an instruction to `lme4::lmer()` about how to build a covariance matrix capturing the variance introduced by the random factor of subjects.** By now you should realize that this instruction would result in the estimation of a two-dimensional covariance matrix, with one dimension for intercept variance and one for slope variance.
But we are not limited to the estimation of random effects for subjects; we can also specify the estimation of random effects for stimuli by simply adding another term to the formula. For example,
`y ~ x + (1 + x | subject_id) + (1 + x | stimulus_id)`
regresses `y` on `x` with by-subject random intercepts and slopes and by-stimulus random intercepts and slopes. In this way, the fitted model will capture two sources of uncertainty about our estimates---the uncertainty introduced by sampling subjects as well as the uncertainty introduced by sampling items. Now we are estimating **two independent covariance matrices**, one for subjects and one for items. In the above example, both of these matrices will have the same 2x2 structure, but this need not be the case. We can flexibly change the random effects structure by changing the formula specification on the left side of each bar `|` symbol. For instance, if we have another predictor `w`, we might have:
`y ~ x + (x | subject_id) + (x + w | stimulus_id)`
which would estimate the same 2x2 matrix for subjects, but the covariance matrix for stimuli would now be a 3x3 matrix (intercepts, slope of x, and slope of w). Although this enables great flexibility, as the random effects structure becomes more complex, the estimation process becomes more difficult and less likely to converge upon a result.
## Specifying random effects
The choice of random effects structure is not a new problem that appeared with linear mixed-effects models. In traditional approaches using t-test and ANOVA, you choose the random effects structure implicitly when you choose what procedure to use. As discussed in the last chapter, if you choose a paired samples t-test over an independent samples t-test, that is analogous to choosing to fit `lme4::lmer(y ~ x + (1 | subject))` over `lm(y ~ x)`. Likewise, you can run a mixed model ANOVA as either `aov(y ~ x + Error(subject_id))`, which is equivalent to a random intercepts model, or `aov(y ~ x + Error(x / subject_id))`, which is equivalent to a random slopes model. The tradition in psychology when performing confirmatory analyses is to use the *maximal random effects structure justified by the design of your study*. So, if you have one-factor data with pseudoreplications, you would use `aov(y ~ x + Error(x / subject_id))` rather than `aov(y ~ x + Error(subject_id))`. Analogously, if you were to analyze the same data with pseudoreplications using a linear mixed effects model, you should use `lme4::lmer(y ~ x + (1 + x | subject_id))` rather than `lme4::lmer(y ~ x + (1 | subject_id))`. In other words, you should account for all sources of non-independence introduced by repeated sampling from the same subjects or stimuli. This approach is known as the **maximal random effects** approach or the **design-driven approach** to specifying random effects structure [@Barr_et_al_2013]. Failing to account for dependencies introduced by the design is likely to lead to standard errors that are too small, which in turn, lead to p-values that are smaller than they should be, and thus, higher false positive (Type I error) rates. In some cases, it can lead to lower power, and thus a higher false negative (Type II error) rate. It is thus of critical importance to pay close attention to the random effects structure when performing analyses.
Linear mixed-effects models almost inevitably include random intercepts for any random factor included in the design. So if your random factors are subjects and stimuli identified by `subject_id` and `stimulus_id` respectively, then at the very least, your model syntax will include `(1 | subject_id) + (1 | stimulus_id)`. But you will have various predictors in the model, so key question becomes: what predictors should I allow to vary over what sampling units? For instance, if the fixed-effects part of your model is a 2x2 factorial design with factors A and B, `y ~ a * b + ...`, you could have a large variety of random effects structures, including (but not limited to):
1. random intercepts only: `y ~ a * b + (1 | subject_id) + (1 | stimulus_id)`
2. by-subjects random intercepts for a and by-stimulus random intercepts: `y ~ a * b + (a | subject_id) + (1 | stimulus_id)`
3. by-subjects random intercepts and slopes for a and b and the ab interaction, with by-stimulus random intercepts: `y ~ a * b + (a * b | subject_id) + (1 | stimulus_id)`
4. by-subjects random intercepts and slopes for a and b and the ab interaction, by-stimulus random intercepts and slopes for a and b and the ab interaction, `y ~ a * b + (a * b | subject_id) + (a * b | stimulus_id)`.
It is important to be clear about one thing.
<div class="info">
The "maximal random effects structure justified by the design" is not the same as the "maximum possible random effects structure"; that is, it does not entail automatically putting in all random slopes for all random factors for all predictors in your model. You have to follow the guidelines for random effects in the next section to decide whether inclusion of a particular random slope is, in fact, "justified by the design."
</div>
Some authors suggest a "data-driven" alternative to design-driven random effects, suggesting that researchers should only include random slopes justified by the design if they are further justified by the data [@Matuschek_et_al_2017]. For example, you might use a null-hypothesis test to determine whether including a by-subject random slope for `x` significantly improves the model fit, and only include that effect if it does. Although this could potentially improve power for the test of theoretical interest when random slopes are very small, it also exposes you to additional unknown risk of false positives, so it is questionable whether this the right approach in a confirmatory context. Thus, we do not recommend a data-driven approach.
### Rules for choosing random effects for categorical factors
The random effects structure for a linear mixed-effects model---in other words, your assumptions about what effects vary over what sampling units---is absolutely critical for ensuring that your parameters reflect the uncertainty introduced by sampling [@Barr_et_al_2013]. First off, note that we are focused on predictors representing **design variables** that are of theoretical interest and on which you will perform inferential tests. If you have predictors that represent **control variables**, over which you do not intend to perform statistical tests, it is unlikely that random slopes are needed.
The following rules are derived from @Barr_et_al_2013 and @Barr_2013. Consult these papers if you wish to find out more about these guidelines. Keep in mind that you can only use a mixed effects model if you have repeated measures data, either because of pseudoreplications and/or the presence of within-subject (or within-stimulus) factors. With crossed random factors, you inevitably have pseudoreplications---multiple observations per subject due to multiple stimuli, multiple observations per stimulus due to multiple subjects. The key to determining random effects structure is figuring out which factors are within-subjects or within-stimuli, and where any pseudoreplications are located in the design. You apply the rules once for subjects to determine the form of the `(1 + ... | subject_id)` part of the formula, and once for stimuli to determine the form of the `(1 + ... | stimulus_id)` part of the formula. Where you see the word "unit" or "sampling unit" below, substitute "subject" or "stimuli" as needed.
Here are the rules:
1. If there are repeated measures on sampling units, you need a random intercept for that random factor: `(1 | unit_id)`;
2. If a factor `x` is between-unit, you do not need a random slope for that factor;
3. Determine the highest order interaction of within-subject factors for the unit under consideration. If you have pseudoreplications within each cell defined by those combinations (i.e., multiple observations per cell), then for that unit you will need a slope for that interaction as well as for all lower order effects. If there are no pseudoreplications, then you do not need any random slopes.
The first two rules are straightforward, but the third requires some explanation. Let's first ask: how do we know whether some factor is between or within unit?
A simple way to determine whether a factor is between or within is to use the `dplyr::count()` function, which gives frequency counts, and which is loaded when you load tidyverse. Let's say you are interested in whether factor $A$ is within or between subjects, for the imaginary 2x2x2 factorial data `abc_data` below where $A$, $B$, and $C$ name the factors of your design.
```{r count-example}
library("tidyverse")
## run this code to create the table "abc_data"
abc_subj <- tibble(subject_id = 1:4,
B = rep(c("B1", "B2"), times = 2))
abc_item <- tibble(stimulus_id = 1:4,
A = rep(c("A1", "A2"), each = 2),
C = rep(c("C1", "C2"), times = 2))
abc_data <- crossing(abc_subj, abc_item) %>%
select(subject_id, stimulus_id, everything())
```
To see whether $A$ is within or between subjects, use:
```{r a-count}
abc_data %>%
count(subject_id, A)
```
In the resulting table, you can see that each subject gets both levels of $A$, making it a within-subject factor. What about $B$ and $C$?
```{r b-count}
abc_data %>%
count(subject_id, B)
```
```{r c-count}
abc_data %>%
count(subject_id, C)
```
OK $B$ is between subjects (each subject gets only one level), and $C$ is within (each subject gets all levels).
<div class="try">
*Exercise*
Answer these question about `abc_data`.
* Are the levels of factor $A$ administered between or within stimuli? `r mcq(c(answer = "between", "within"))`
`r hide("Solution")`
```{r a-stim}
abc_data %>%
count(stimulus_id, A)
```
`r unhide()`
* Are the levels of factor $B$ administered between or within stimuli? `r mcq(c("between", answer = "within"))`
`r hide("Solution")`
```{r b-stim}
abc_data %>%
count(stimulus_id, B)
```
`r unhide()`
* Are the levels of factor $C$ administered between or within stimuli? `r mcq(c(answer = "between", "within"))`
`r hide("Solution")`
```{r c-stim, webex.hide="Solution"}
abc_data %>%
count(stimulus_id, C)
```
`r unhide()`
</div>
OK, we've identified which factors are within and between subject, and which factors are within and between stimulus.
The second rule tells us that if a factor is between-unit, you do not need a random slope for that factor. Indeed, it is not possible to estimate a random slope for a between unit factor. If you think about the fact that random slopes capture variation in the effect over units, then it makes sense that you have to measure your response variable across all levels of that factor to be able to estimate that variation. For instance, if you have a two-level factor called treatment group (experimental, control), you cannot estimate the effect of "treatment" for a particular subject unless the subject has experienced both levels of the factor (i.e., it would have to be within-subject).
How do we now apply the third rule above to determine what random slopes are needed for our within-unit factors?
Consider that $A$ and $C$ were within-subjects, and $B$ was between. So the highest-order interaction of within-subject factors is $AC$. We will need random slopes for the $AC$ interaction as well as for the main effects $A$ and $C$ if we have pseudoreplications for each subject in each combination of $AC$. How can we find out?
```{r ac-reps}
abc_data %>%
count(subject_id, A, C)
```
This shows us that we have *one* observation per combination of $AC$, so we do not need random slopes for $AC$, nor for $A$ or $C$. The random effects part of the formula for subjects would just be `(1 | subject_id)`.
<div class="try">
What random slopes do you need for the random factor of stimulus?
`r hide()`
You have one within-stimulus factor, $B$, which has pseudoreplications.
```{r b-stim-reps}
abc_data %>%
count(stimulus_id, B)
```
Therefore the formula you need for stimuli is `(B | stimulus_id)`, making the full `lme4` formula:
`y ~ A * B * C + (1 | subject_id) + (B | stimulus_id)`.
`r unhide()`
</div>
### Troubleshooting non-convergence and 'singular fits'
When you attempt to fit models with maximal random effects, you can run into a couple of different problems. Recall that the estimation algorithm for linear mixed-effects model is *iterative*⸻that is, in a step-by-step manner, the fitting algorithm searches for parameter values that make the data most likely. Sometimes it looks and looks and cannot find them, and will give up, in which case you will get a 'convergence warning.' When this happens, it is probably not a good idea to trust any of the estimates from the non-converged model, and you'll need to simplify the model structure before trying again.
Another thing that can happen is that you'll get a message about a 'singular fit'. This latter message will appear when the estimation procedure yields a variance-covariance matrix for one or more random factors that has either (1) perfect or near-perfect (1.00, -1.00) positive or negative correlations, (2) one or more variances close to zero, or (3) both. It is possibly ok to ignore this message, but it is also reasonable to simplify the model structure until the message goes away.
How do you simplify a model to deal with convergence problems or singular fit messages? This should be done with care. I suggest the following strategy:
1. Constrain all covariance parameters to zero. This is accomplished using the double-bar `||` syntax, e.g., changing `(a * b | subject)` to `(a * b || subject)`. If you still run into estimation problems, then:
2. Inspect the parameter estimates from the non-converging or singular model. Are any of the slope variables zero or near to zero? Remove these and re-fit the model, repeating this step until the convergence warnings / singular fit messages go away.
For more technical details about convergence problems and what to do, see `?lme4::convergence` and `?lme4::isSingular`.
## Simulating data with crossed random factors
For these exercises, we will generate simulated data corresponding to an experiment with a single, two-level factor (independent variable) that is within-subjects and between-items. Let's imagine that the experiment involves lexical decisions to a set of words (e.g., is "PINT" a word or nonword?), and the dependent variable is response time (in milliseconds), and the independent variable is word type (noun vs verb). We want to treat both subjects and words as random factors (so that we can generalize to the population of events where subjects encounter words). You can play around with the web app (or [click here to open it in a new window](https://talklab.psy.gla.ac.uk/app/crossed-site/){target="_blank"}), which allows you to manipulate the data-generating parameters and see their effect on the data.
By now, you should have all the pieces of the puzzle that you need to simulate data from a study with crossed random effects. @Debruine_Barr_2020 provides a more detailed, step-by-step walkthrough of the exercise below.
Here is the DGP for response time $Y_{si}$ for subject $s$ and item $i$:
*Level 1:*
\begin{equation}
Y_{si} = \beta_{0s} + \beta_{1} X_{i} + e_{si}
\end{equation}
*Level 2:*
\begin{equation}
\beta_{0s} = \gamma_{00} + S_{0s} + I_{0i}
\end{equation}
\begin{equation}
\beta_{1} = \gamma_{10} + S_{1s}
\end{equation}
*Variance Components:*
\begin{equation}
\langle S_{0s}, S_{1s} \rangle \sim N\left(\langle 0, 0 \rangle, \mathbf{\Sigma}\right)
\end{equation}
\begin{equation}
\mathbf{\Sigma} = \left(\begin{array}{cc}{\tau_{00}}^2 & \rho\tau_{00}\tau_{11} \\
\rho\tau_{00}\tau_{11} & {\tau_{11}}^2 \\
\end{array}\right)
\end{equation}
\begin{equation}
I_{0s} \sim N\left(0, {\omega_{00}}^2\right)
\end{equation}
\begin{equation}
e_{si} \sim N\left(0, \sigma^2\right)
\end{equation}
In the above equation, $X_i$ is a numerical predictor coding which condition the item $i$ is in; e.g., -.5 for noun, .5 for verb.
We could just reduce levels 1 and 2 to
$$Y_{si} = \beta_0 + S_{0s} + I_{0i} + (\beta_1 + S_{1s})X_{i} + e_{si}$$
where:
|Parameter | Symbol| Description |
|:------------|:------|:--------------------------------------------------|
| \(Y_{si}\) | `Y` | RT for subject \(s\) responding to item \(i\); |
| \(\beta_0\) | `b0` | grand mean; |
| \(S_{0s}\) | `S_0s` | random intercept for subject $s$ \(s\); |
| \(I_{0i}\) | `I_0i` | random intercept for item $i$ \(i\); |
| \(\beta_1\) | `b1` | fixed effect of word type (slope); |
| \(S_{1s}\) | `S_1s` | by-subject random slope; |
| \(X_{i}\) | `cond` | deviation-coded predictor variable for word type; |
| \(\tau_{00}\) | `tau_00` | by-subject random intercept standard deviation |
| \(\tau_{11}\) | `tau_11` | by-subject random slope standard deviation |
| \(\rho\) | `rho` | correlation between random intercept and slope |
| \(\omega_{00}\) | `omega_00` | by-item random intercept standard deviation |
| \(e_{si}\) | `err` | residual for subject $s$ item $i$ error |
| \(\sigma\) | `sig` | residual error standard deviation |
### Set up the environment and define the parameters for the DGP
If you want to get the same results as everyone else for this exercise, then we all should seed the random number generator with the same value. While we're at it, let's load in the packages we need.
```{r sim-setup, message=FALSE}
library("lme4")
library("tidyverse")
set.seed(11709)
```
Now let's define the parameters for the DGP (data generating process).
```{r dgp-params}
nsubj <- 100 # number of subjects
nitem <- 50 # must be an even number
b0 <- 800 # grand mean
b1 <- 80 # 80 ms difference
effc <- c(-.5, .5) # deviation codes
omega_00 <- 80 # by-item random intercept sd (omega_00)
## for the by-subjects variance-covariance matrix
tau_00 <- 100 # by-subject random intercept sd
tau_11 <- 40 # by-subject random slope sd
rho <- .2 # correlation between intercept and slope
sig <- 200 # residual (standard deviation)
```
You'll create three tables:
| Name | Description |
|:-----------|:---------------------------------------------------------------------|
| `subjects` | table of subject data including `subj_id` and subject random effects |
| `items` | table of stimulus data including `item_id` and item random effect |
| `trials` | table of trials enumerating encounters between subjects/stimuli |
Then you will merge together the information in the three tables, and calculate the response variable according to the model formula above.
### Generate a sample of stimuli
Let's randomly generate our `r nitem` items. Create a tibble called `item` like the one below, where `iri` are the by-item random intercepts (drawn from a normal distribution with variance \(\omega_{00}^2\) = `r omega_00^2`). Half of the words are of type NOUN (`cond = -.5`) and half of type VERB (`cond = .5`).
```{r mk-items, echo=FALSE}
items <- tibble(item_id = 1:nitem,
cond = rep(c(-.5, .5), times = nitem / 2),
I_0i = rnorm(nitem, 0, sd = omega_00))
```
`r hide("Click to reveal full table")`
```{r mk-items2, echo=FALSE}
items %>% print(n = +Inf)
```
`r unhide()`
`r hide("Hint for making cond")`
`rep()`
`r unhide()`
`r hide("Hint for making item random effects")`
`rnorm()`
`r unhide()`
`r hide("Solution")`
```{r item-tib-sol, ref.label="mk-items", eval=FALSE}
```
`r unhide()`
### Generate a sample of subjects
To generate the by-subject random effects, you will need to generate data from a *bivariate normal distribution*. To do this, we will use the function `MASS::mvrnorm`.
<div class="warning">
REMEMBER: do not run `library("MASS")` just to get this one function, because `MASS` has a function `select()` that will overwrite the tidyverse version. Since all we want from MASS is the `mvrnorm()` function, we can just access it directly by the `pkgname::function` syntax, i.e., `MASS::mvrnorm()`.
</div>
Your subjects table should look like this:
`r hide("Click to reveal full table")`
```{r subj-tbl1, echo = FALSE}
cov <- rho * tau_00 * tau_11
mx <- matrix(c(tau_00^2, cov,
cov, tau_11^2),
nrow = 2)
by_subj_rfx <- MASS::mvrnorm(nsubj, mu = c(S_0s = 0, S_1s = 0), Sigma = mx)
subjects <- as_tibble(by_subj_rfx) %>%
mutate(subj_id = row_number()) %>%
select(subj_id, everything())
```
```{r subj-print, echo=FALSE}
subjects %>% print(n = +Inf)
```
`r unhide()`
`r hide("Hint 1")`
recall that:
* *`tau_00`*: by-subject random intercept standard deviation
* *`tau_11`*: by-subject random slope standard deviation
* *`rho`* : correlation between intercept and slope
`r unhide()`
`r hide("Hint 2")`
`covariance = rho * tau_00 * tau_11`
`r unhide()`
`r hide("Hint 3")`
```{r hint3, eval=FALSE}
matrix( tau_00^2, rho * tau_00 * tau_11,
rho * tau_00 * tau_11, tau_11^2, ...)
```
`r unhide()`
`r hide("Hint 4")`
```{r hint4, eval=FALSE}
as_tibble(mx) %>%
mutate(subj_id = ...)
```
`r unhide()`
`r hide("Solution")`
```{r code-subj-sol, ref.label="subj-tbl1", eval=FALSE}
```
`r unhide()`
### Generate a sample of encounters (trials)
Each trial is an *encounter* between a particular subject and stimulus. In this experiment, each subject will see each stimulus. Generate a table `trials` that lists the encounters in the experiments. Note: each participant encounters each stimulus item once. Use the `crossing()` function to create all possible encounters.
Now apply this example to generate the table below, where `err` is the residual term, drawn from \(N \sim \left(0, \sigma^2\right)\), where \(\sigma\) is `err_sd`.
```{r gen-encounters, echo=FALSE}
trials <- crossing(subj_id = subjects %>% pull(subj_id),
item_id = items %>% pull(item_id)) %>%
mutate(err = rnorm(nrow(.), mean = 0, sd = sig))
```
```{r gen-print, echo=FALSE}
trials
```
`r hide()`
```{r trials2, eval=FALSE, ref.label="gen-encounters"}
```
`r unhide()`
### Join `subjects`, `items`, and `trials`
Merge the information in `subjects`, `items`, and `trials` to create the full dataset `dat_sim`, which looks like this:
```{r make-dat, echo=FALSE}
dat_sim <- subjects %>%
inner_join(trials, "subj_id") %>%
inner_join(items, "item_id") %>%
arrange(subj_id, item_id) %>%
select(subj_id, item_id, S_0s, I_0i, S_1s, cond, err)
```
```{r make-dat-print, echo=FALSE}
dat_sim
```
`r hide("Hint")`
`inner_join()`
`r unhide()`
`r hide()`
```{r dat-sim-sol, ref.label="make-dat", eval=FALSE}
```
`r unhide()`
### Create the response variable
Add the response variable `Y` to dat according to the model formula:
$$Y_{si} = \beta_0 + S_{0s} + I_{0i} + (\beta_1 + S_{1s})X_{i} + e_{si}$$
so that the resulting table (`dat_sim2`) looks like this:
```{r dat-sim2, echo=FALSE}
dat_sim2 <- dat_sim %>%
mutate(Y = b0 + S_0s + I_0i + (S_1s + b1) * cond + err) %>%
select(subj_id, item_id, Y, everything())
```
```{r dat-sim3, echo=FALSE}
dat_sim2
```
Note: this is the full **decomposition table** for this model.
`r hide("Hint")`
```
... %>%
mutate(Y = ...) %>%
select(...)
```
`r unhide()`
`r hide()`
```{r dat-sim2-sol, ref.label="dat-sim2", eval=FALSE}
```
`r unhide()`
### Fitting the model
Now that you have created simulated data, estimate the model using `lme4::lmer()`, and run `summary()`.
`r hide()`
```{r fit-model}
mod_sim <- lmer(Y ~ cond + (1 + cond | subj_id) + (1 | item_id),
dat_sim2, REML = FALSE)
summary(mod_sim, corr = FALSE)
```
`r unhide()`
Now see if you can identify the data generating parameters in the output of `summary()`.
```{r dgp1, include=FALSE}
srfx <- attr(VarCorr(mod_sim)$subj_id, "stddev")
irfx <- attr(VarCorr(mod_sim)$item_id, "stddev")
rc <- attr(VarCorr(mod_sim)$subj_id, "correlation")[1, 2]
res <- attr(VarCorr(mod_sim), "sc")
ffx <- fixef(mod_sim)
```
First, try to find \(\beta_0\) and \(\beta_1\).
`r hide("Solution: Fixed effects")`
```{r fef-sol, echo = FALSE}
tribble(~parameter, ~variable, ~input, ~estimate,
"\\(\\hat{\\beta}_0\\)", "`b0`", b0, as.numeric(round(ffx[1], 3)),
"\\(\\hat{\\beta}_1\\)", "`b1`", b1, as.numeric(round(ffx[2], 3))) %>%
knitr::kable()
```
`r unhide()`
Now try to find estimates of random effects parameters \(\tau_{00}\), \(\tau_{11}\), \(\rho\), \(\omega_{00}\), and \(\sigma\).
`r hide("Solution: Random effects parameters")`
```{r rfx-sol, echo = FALSE}
tribble(~parameter, ~variable, ~input, ~estimate,
"\\(\\hat{\\tau}_{00}\\)", "`tau_00`", tau_00,
as.numeric(round(srfx[1], 3)),
"\\(\\hat{\\tau}_{11}\\)", "`tau_11`", tau_11,
as.numeric(round(srfx[2], 3)),
"\\(\\hat{\\rho}\\)", "`rho`", rho, as.numeric(round(rc, 3)),
"\\(\\hat{\\omega}_{00}\\)", "`omega_00`", omega_00,
as.numeric(round(irfx[1], 3)),
"\\(\\hat{\\sigma}\\)", "`sig`", sig,
as.numeric(round(res, 3))) %>%
knitr::kable()
```
`r unhide()`