# An introduction to Bayesian multilevel models using brms {#appendix-brms}
```{r setupAPPENDIXbrms, include = FALSE, message = FALSE, warning = FALSE, results = "hide"}
library(tidyverse)
library(shinystan)
library(emojifont)
library(patchwork)
library(parallel)
library(ggridges)
library(viridis)
library(ellipse)
library(papaja)
library(phonR)
library(broom)
library(knitr)
library(here)
library(BEST)
library(lme4)
library(brms)
# set rstan options
rstan::rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores() )
# setting up knitr options
knitr::opts_chunk$set(
cache = TRUE, echo = FALSE, warning = FALSE, message = FALSE,
out.width = "100%", fig.align = "center"
)
```
<!-- NB: You can add comments using these tags -->
Bayesian multilevel models are increasingly used to overcome the limitations of frequentist approaches in the analysis of complex structured data. This paper introduces Bayesian multilevel modelling for the specific analysis of speech data, using the brms package developed in `R`. In this tutorial, we provide a practical introduction to Bayesian multilevel modelling, by reanalysing a phonetic dataset containing formant (F1 and F2) values for five vowels of Standard Indonesian (ISO 639-3:ind), as spoken by eight speakers (four females), with several repetitions of each vowel. We first give an introductory overview of the Bayesian framework and multilevel modelling. We then show how Bayesian multilevel models can be fitted using the probabilistic programming language `Stan` and the `R` package `brms`, which provides an intuitive formula syntax. Through this tutorial, we demonstrate some of the advantages of the Bayesian framework for statistical modelling and provide a detailed case study, with complete source code for full reproducibility of the analyses (https://osf.io/dpzcb/).^[This chapter is a published paper reformatted for the need of this thesis. Source: Nalborczyk, L., Batailler, C., L\oe venbruck, H., Vilain, A., & Bürkner, P.-C. (2019). An introduction to Bayesian multilevel models using brms: A case study of gender effects on vowel variability in standard Indonesian. *Journal of Speech, Language, and Hearing Research, 62*(5), 1225-1242. https://doi.org/10.1044/2018_JSLHR-S-18-0006.]
## Introduction
The last decade has witnessed noticeable changes in the way experimental data are analysed in phonetics, psycholinguistics, and speech sciences in general. In particular, there has been a shift from analysis of variance (ANOVA) to *linear mixed models*, also known as *hierarchical models* or *multilevel models* (MLMs), spurred by the spreading use of data-oriented programming languages such as `R` [@R-base], and by the enthusiasm of its active and ever growing community. This shift has been further sustained by the current transition in data analysis in social sciences, with researchers evolving from a widely criticised point-hypothesis mechanical testing [e.g.,@bakan_test_1966;@Gigerenzer2004;@Kline2004;@Lambdin2012;@trafimow_manipulating_2018] to an approach that emphasises parameter estimation, model comparison, and continuous model expansion [e.g.,@Cumming2012;@cumming_new_2014;@gelman_data_2006;@gelman_bayesian_2013;@kruschke_doing_2015;@kruschke_bayesian_2018-1;@kruschke_bayesian_2018;@mcelreath_statistical_2016].
MLMs offer great flexibility in the sense that they can model statistical phenomena that occur on different levels. This is done by fitting models that include both constant and varying effects (sometimes referred to as *fixed* and *random* effects). Among other advantages, this makes it possible to generalise the results to unobserved levels of the *groups* existing in the data [e.g., stimulus or participant, @janssen_twice_2012]. The multilevel strategy can be especially useful when dealing with repeated measurements (e.g., when measurements are nested into participants) or with unequal sample sizes, and more generally, when handling complex dependency structures in the data. Such complexities are frequently found in the kind of experimental designs used in speech science studies, for which MLMs are therefore particularly well suited.
The standard MLM is usually fitted in a frequentist framework, with the `lme4` package [@R-lme4] in `R` [@R-base]. However, when one tries to include the maximal varying effect structure, this kind of model tends either not to converge, or to give aberrant estimations of the correlation between varying effects [e.g.,@bates_parsimonious_2015].^[In this context, the *maximal varying effect structure* means that any potential source of systematic influence should be explicitly modelled, by adding appropriate varying effects.] Yet, fitting the maximal varying effect structure has been explicitly recommended [e.g.,@barr_random_2013-1]. In contrast, the maximal varying effect structure can generally be fitted in a Bayesian framework [@bates_parsimonious_2015;@sorensen_bayesian_2016;@nicenboim_statistical_2016;@eager_mixed_2017].
Another advantage of Bayesian statistical modelling is that it fits the way researchers intuitively understand statistical results. Widespread misinterpretations of frequentist statistics (like p-values and confidence intervals) are often attributable to the wrong interpretation of these statistics as resulting from a Bayesian analysis [e.g.,@dienes_bayesian_2011;@Hoekstra2014;@Gigerenzer2004;@kruschke_bayesian_2018;@morey_fallacy_2015]. However, the intuitive nature of the Bayesian approach might arguably be hidden by the predominance of frequentist teaching in undergraduate statistical courses.
Moreover, the Bayesian approach offers a natural solution to the problem of multiple comparisons, when the situation is adequately modelled in a multilevel framework [@scott_bayes_2010;@gelman_why_2012], and allows *a priori* knowledge to be incorporated in data analysis via the prior distribution. The latter feature is particularly relevant when dealing with constrained parameters or for the purpose of incorporating expert knowledge.
The aim of the current paper is to introduce Bayesian multilevel models, and to provide an accessible and illustrated hands-on tutorial for analysing typical phonetic data. This paper will be structured in two main parts. First, we will briefly introduce the Bayesian approach to data analysis and the multilevel modelling strategy. Second, we will illustrate how Bayesian MLMs can be implemented in `R` by using the `brms` package [@R-brms] to reanalyse a dataset from @mccloy_phonetic_2014 available in the `phonR` package [@R-phonR]. We will fit Bayesian MLMs of increasing complexity, going step by step, providing explanatory figures and making use of the tools available in the `brms` package for model checking and model comparison. We will then compare the results obtained in a Bayesian framework using `brms` with the results obtained using frequentist MLMs fitted with `lme4`. Throughout the paper, we will also provide comments and recommendations about the feasibility and the relevance of such analyses for the researcher in speech sciences.
### Bayesian data analysis
The Bayesian approach to data analysis differs from the frequentist one in that each parameter of the model is considered as a random variable (contrary to the frequentist approach which considers parameter values as unknown and fixed quantities), and by the explicit use of probability to model the uncertainty [@gelman_bayesian_2013]. The two approaches also differ in their conception of what *probability* is. In the Bayesian framework, probability refers to the experience of uncertainty, while in the frequentist framework it refers to the limit of a relative frequency (i.e., the relative frequency of an event when the number of trials approaches infinity). A direct consequence of these two differences is that Bayesian data analysis allows researchers to discuss the probability of a parameter (or a vector of parameters) $\theta$, given a set of data $y$:
$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$$
\vspace{5mm}
Using this equation (known as Bayes' theorem), a probability distribution $p(\theta|y)$ can be derived (called the *posterior distribution*), that reflects knowledge about the parameter, given the data and the prior information. This distribution is the goal of any Bayesian analysis and contains all the information needed for inference.
The term $p(\theta)$ corresponds to the *prior distribution*, which specifies the prior information about the parameters (i.e., what is known about $\theta$ before observing the data) as a probability distribution. The other term in the numerator, $p(y|\theta)$, represents the *likelihood*, also called the *sampling distribution* or *generative model*, and is the function through which the data affect the posterior distribution. The likelihood function indicates how likely the data are to appear, for each possible value of $\theta$.
Finally, $p(y)$ is called the *marginal likelihood*. It is meant to normalise the posterior distribution, that is, to scale it in the "probability world". It gives the "probability of the data", summing over all values of $\theta$ and is described by $p(y) = \sum_{\theta} p(\theta) p(y|\theta)$ for discrete parameters, and by $p(y) = \int p(\theta) p(y|\theta) d\theta$ in the case of continuous parameters.
All this pieced together shows that the result of a Bayesian analysis, namely the posterior distribution $p(\theta|y)$, is given by the product of the information contained in the data (i.e., the likelihood) and the information available before observing the data (i.e., the prior). This constitutes the crucial principle of Bayesian inference, which can be seen as an updating mechanism [as detailed for instance in @kruschke_bayesian_2018-1]. To sum up, Bayes' theorem allows a prior state of knowledge to be updated to a posterior state of knowledge, which represents a compromise between the prior knowledge and the empirical evidence.
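As a concrete illustration of this updating mechanism, consider a toy example (unrelated to the vowel data analysed below): estimating the probability $\theta$ that a listener correctly identifies a phoneme, after observing $k$ successes in $n$ trials. With a $\mathrm{Beta}(a, b)$ prior and a binomial likelihood, Bayes' theorem yields a closed-form posterior:

$$
\begin{aligned}
\theta &\sim \mathrm{Beta}(a, b) \\
k &\sim \mathrm{Binomial}(n, \theta) \\
p(\theta|k) &\propto \theta^{k}(1 - \theta)^{n - k} \times \theta^{a - 1}(1 - \theta)^{b - 1} \\
&\propto \theta^{a + k - 1}(1 - \theta)^{b + n - k - 1} \\
\theta|k &\sim \mathrm{Beta}(a + k, b + n - k) \\
\end{aligned}
$$

\vspace{5mm}

The prior counts $a$ and $b$ are simply augmented by the observed successes and failures, which makes the compromise between prior knowledge and empirical evidence explicit. For most models, however, no such closed form exists, and the posterior distribution has to be approximated numerically, as done by the sampling algorithms discussed later in this tutorial.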
The process of Bayesian analysis usually involves three steps that begin with setting up a probability model for all the entities at hand, then computing the posterior distribution, and finally evaluating the fit and the relevance of the model [@gelman_bayesian_2013]. In the context of linear regression, for instance, the first step would require specifying a likelihood function for the data and a prior distribution for each parameter of interest (e.g., the intercept or the slope). We will go through these three steps in more detail in the application section, but we will first give a brief overview of the multilevel modelling strategy.
### Multilevel modelling {#MLM}
MLMs can be considered as "multilevel" for at least two reasons. First, an MLM can generally be conceived as a regression model in which the parameters are themselves modelled as outcomes of another regression model. The parameters of this second-level regression are known as *hyperparameters*, and are also estimated from the data [@gelman_data_2006]. Second, the multilevel structure can arise from the data itself, for instance when one tries to model the second-language speech intelligibility of a child, who is considered within a particular class, itself considered within a particular school. In such cases, the hierarchical structure of the data itself calls for hierarchical modelling. In both conceptions, the number of levels that can be handled by MLMs is virtually unlimited [@mcelreath_statistical_2016]. When we use the term *multilevel* in the following, we will refer to the structure of the model, rather than to the structure of the data, as non-nested data can also be modelled in a multilevel framework.
As briefly mentioned earlier, MLMs offer several advantages compared to single-level regression models, as they can handle the dependency between units of analysis from the same group (e.g., several observations from the same participant). In other words, they can account for the fact that, for instance, several observations are not independent, as they relate to the same participant. This is achieved by partitioning the total variance into variation due to the groups (level-2) and to the individual (level-1). As a result, such models provide an estimation of the variance component for the second level (i.e., the variability of the participant-specific estimates) or higher levels, which can inform us about the generalisability of the findings [@janssen_twice_2012;@mcelreath_statistical_2016].
Multilevel modelling allows both *fixed* and *random* effects to be incorporated. However, as pointed out by @gelman_analysis_2005, we can find at least five different (and sometimes contradictory) ways of defining the meaning of the terms *fixed* and *random* effects. Moreover, @gelman_data_2006 remarked that what is usually called a *fixed* effect can generally be conceived as a *random* effect with a null variance. In order to use a consistent vocabulary, we follow the recommendations of @gelman_data_2006 and avoid these terms. We instead use the more explicit terms *constant* and *varying* to designate effects that are constant, or that vary by groups.^[Note that MLMs are sometimes called *mixed models*, as models that comprise both *fixed* and *random* effects.]
A question one is frequently faced with in multilevel modelling is to know which parameters should be considered as varying, and which parameters should be considered as constant. A practical answer is provided by @mcelreath_statistical_2016, who states that "any batch of parameters with *exchangeable* index values can be and probably should be pooled". For instance, if we are interested in the categorisation of native versus non-native phonemes and if for each phoneme in each category there are multiple audio stimuli (e.g., multiple repetitions of the same phoneme), and if we do not have any reason to think that, for each phoneme, audio stimuli may differ in intelligibility in any systematic way, then repetitions of the same phoneme should be pooled together. The essential feature of this strategy is that *exchangeability* of the lower units (i.e., the multiple repetitions of the same phoneme) is achieved by conditioning on indicator variables (i.e., the phonemes) that represent groupings in the population [@gelman_bayesian_2013].
To sum up, multilevel models are useful as soon as there are predictors at different levels of variation [@gelman_bayesian_2013]. One important aspect is that this varying-coefficients approach allows each subgroup to have a different mean outcome level, while still estimating the global mean outcome level. In an MLM, these two estimations inform each other in a way that leads to the phenomenon of *shrinkage*, that will be discussed in more detail below (see section \@ref(shrink)).
As an illustration, we will build an MLM starting from the ordinary linear regression model, and trying to predict an outcome $y_{i}$ (e.g., second-language (L2) speech-intelligibility) by a linear combination of an intercept $\alpha$ and a slope $\beta$ that quantifies the influence of a predictor $x_{i}$ (e.g., the number of lessons received in this second language):
$$
\begin{aligned}
y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \beta x_{i} \\
\end{aligned}
$$
\vspace{5mm}
This notation is strictly equivalent to the (maybe more usual) following notation:
$$
\begin{aligned}
y_{i} &= \alpha + \beta x_{i} + \epsilon_{i} \\
\epsilon_{i} &\sim \mathrm{Normal}(0,\sigma_e)
\end{aligned}
$$
\vspace{5mm}
We prefer to use the first notation as it generalises better to more complex models, as we will see later. In Bayesian terms, these two lines describe the *likelihood* of the model, which is the assumption made about the generative process from which the data is issued. We make the assumption that the outcomes $y_{i}$ are normally distributed around a mean $\mu_{i}$ with some error $\sigma_{e}$. This is equivalent to saying that the errors are normally distributed around $0$, as illustrated by the above equivalence. Then, we can extend this model to the following multilevel model, adding a varying intercept:
$$
\begin{aligned}
y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha_{j[i]} + \beta x_{i} \\
\alpha_{j} &\sim \mathrm{Normal}(\alpha, \sigma_{\alpha}) \\
\end{aligned}
$$
\vspace{5mm}
where we use the notation $\alpha_{j[i]}$ to indicate that each group $j$ (e.g., class) is given a unique intercept, issued from a Gaussian distribution centered on $\alpha$, the grand intercept,^[Acknowledging that these individual intercepts can also be seen as adjustments to the grand intercept $\alpha$, that are specific to group $j$.] meaning that there might be different mean scores for each class. From this notation we can see that in addition to the residual standard deviation $\sigma_{e}$, we are now estimating one more variance component $\sigma_{\alpha}$, which is the standard deviation of the distribution of varying intercepts. We can interpret the variation of the parameter $\alpha$ between groups $j$ by considering the *intra-class correlation* (ICC) $\sigma_{\alpha}^{2} / (\sigma_{\alpha}^{2} + \sigma_{e}^{2})$, which goes to $0$, if the grouping conveys no information, and to $1$, if all observations in a group are identical [@gelman_data_2006, page 258].
The third line is called a *prior* distribution in the Bayesian framework. This prior distribution describes the population of intercepts, thus modelling the dependency between these parameters.
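The ICC mentioned above is straightforward to compute from the two variance components. Below is a minimal sketch in `R`, using hypothetical standard deviations rather than estimates from any model in this paper:

```{r iccsketch, echo = TRUE}
# intra-class correlation from hypothetical standard deviations
sigma_alpha <- 0.5 # between-group SD (varying intercepts)
sigma_e <- 1 # residual (within-group) SD
sigma_alpha^2 / (sigma_alpha^2 + sigma_e^2) # here, 0.2
```

In this hypothetical case, 20% of the total variance in the outcome would be attributable to the grouping structure.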
Following the same strategy, we can add a varying slope, allowed to vary according to the group $j$:
$$
\begin{aligned}
y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha_{j[i]} + \beta_{j[i]} x_{i} \\
\alpha_{j} &\sim \mathrm{Normal}(\alpha, \sigma_{\alpha}) \\
\beta_{j} &\sim \mathrm{Normal}(\beta, \sigma_{\beta}) \\
\end{aligned}
$$
\vspace{5mm}
Indicating that the effect of the number of lessons on L2 speech intelligibility is allowed to differ from one class to another (i.e., the effect of the number of lessons might be more beneficial to some classes than others). These varying slopes are assigned a prior distribution centered on the grand slope $\beta$, and with standard deviation $\sigma_{\beta}$.
In this introductory section, we have presented the foundations of Bayesian analysis and multilevel modelling. Bayes' theorem allows prior knowledge about parameters to be updated according to the information conveyed by the data, while MLMs allow complex dependency structures to be modelled. We now move to a detailed case study in order to illustrate these concepts.
\vspace{2mm}
\begin{mybox}[label = random]{Where are my random effects ?}
In the Bayesian framework, every unknown quantity is considered as a random variable that we can describe using probability distributions. As a consequence, there is no such thing as a "fixed effect" or a "random effects distribution" in a Bayesian framework. However, these semantic quarrels disappear when we write down the model.
Suppose we have a dependent continuous variable $y$ and a dichotomous categorical predictor $x$ (assumed to be contrast-coded). Let $y_{ij}$ denote the score of the $i^{th}$ participant in the $j^{th}$ condition. We can write a "mixed effects" model (as containing both fixed and random effects) as follows:
$$
y_{ij} = \alpha + \alpha_{i} + \beta x_{j} + e_{ij},\ e_{ij} \sim \mathrm{Normal}(0, \sigma_{e}^{2}),\ \alpha_{i} \sim \mathrm{Normal}(0, \sigma_{a}^{2})
$$
\vspace{5mm}
Where the terms $\alpha$ and $\beta$ represent the "fixed effects" and denote the overall mean response and the condition difference in response, respectively. In addition, $e_{ij}$ are random errors assumed to be normally distributed with unknown variance $\sigma_{e}^{2}$, and $\alpha_{i}$’s are individual specific random effects normally distributed in the population with unknown variance $\sigma_{a}^{2}$.
We can rewrite this model to make apparent that the so-called "random effects distribution" can actually be considered a prior distribution (from a Bayesian standpoint), since by definition, distributions on unknown quantities are considered as priors:
$$
\begin{aligned}
y_{ij} &\sim \mathrm{Normal}(\mu_{ij}, \sigma_{e}^{2}) \\
\mu_{ij} &= \alpha_{i} + \beta x_{j} \\
\alpha_{i} &\sim \mathrm{Normal}(\alpha, \sigma_{\alpha}^{2}) \\
\end{aligned}
$$
\vspace{5mm}
where the parameters of this prior are learned from the data. As we have seen, the same mathematical entity can be conceived either as a "random effects distribution" or as a prior distribution, depending on the framework.
\end{mybox}
### Software programs
@sorensen_bayesian_2016 provided a detailed and accessible introduction to Bayesian MLMs (BMLMs) applied to linguistics, using the probabilistic language `Stan` [@carpenter_stan_2017]. However, discovering BMLMs and the `Stan` language all at once might seem a little overwhelming, as `Stan` can be difficult to learn for users that are not experienced with programming languages. As an alternative, we introduce the `brms` package [@R-brms], that implements BMLMs in `R`, using `Stan` under the hood, with an `lme4`-like syntax. Hence, the syntax required by `brms` will not surprise the researcher familiar with `lme4`, as models of the following form:
$$
\begin{aligned}
y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \alpha_{subject[i]} + \beta x_{i} \\
\end{aligned}
$$
\vspace{5mm}
are specified in `brms` (as in `lme4`) with: `y ~ 1 + x + (1|subject)`. In addition to linear regression models, `brms` allows generalised linear and non-linear multilevel models to be fitted, and comes with a great variety of distribution and link functions. For instance, `brms` allows fitting robust linear regression models, or modelling dichotomous and categorical outcomes using logistic and ordinal regression models. The flexibility of `brms` also allows for distributional models (i.e., models that include simultaneous predictions of all response parameters), Gaussian processes or non-linear models to be fitted, among others. More information about the diversity of models that can be fitted with `brms` and their implementation is provided in @R-brms and @burkner_advanced_2018.
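The same formula syntax extends to the varying-slope models introduced above. A few schematic examples follow, using placeholder names `y`, `x`, and `subject` (hence not evaluated):

```{r formulasketch, echo = TRUE, eval = FALSE}
y ~ 1 + x + (1 | subject) # varying intercept by subject
y ~ 1 + x + (1 + x | subject) # varying intercept and slope, correlated
y ~ 1 + x + (1 + x || subject) # varying intercept and slope, uncorrelated
```

The double-bar notation `||` suppresses the estimation of the correlation between the varying intercepts and varying slopes, as in `lme4`.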
## Application example
```{r data, echo = FALSE}
data(indoVowels)
```
To illustrate the use of BMLMs, we reanalysed a dataset from @mccloy_phonetic_2014, available in the `phonR` package [@R-phonR]. This dataset contains formant (F1 and F2) values for five vowels of Standard Indonesian (ISO 639-3:ind), as spoken by eight speakers (four females), with approximately 45 repetitions of each vowel. The research question we investigated here is the effect of gender on vowel production variability.
### Data pre-processing
Our research question was about the different amounts of variability in the respective vowel productions of male and female speakers, due to cognitive or social differences. To answer this question, we first needed to get rid of the differences in vowel production that are due to physiological differences between males and females (e.g., shorter vocal tract length for females). More generally, we needed to eliminate the inter-individual differences due to physiological characteristics in our groups of participants. For that purpose, we first applied the Watt & Fabricius formant normalisation technique [@watt_evaluation_2002]. The principle of this method is to calculate for each speaker a "centre of gravity" $S$ in the F1/F2 plane, from the formant values of the point vowels [i, a, u], and to express the formant values of each observation as ratios of the value of $S$ for that formant.
```{r vowelplot-ref, dev = "pdf", dpi = 300, warning = FALSE, fig.pos = "H", fig.align = "center", fig.showtext = TRUE, fig.cap = "Euclidean distances between each observation and the centres of gravity corresponding to each vowel across all participants, by gender (top row: female, bottom row: male) and by vowel (in column), in the normalised F1-F2 plane. The grey background plots represent the individual data collapsed for all individuals (male and female) and all vowels. Note that, for the sake of clarity, this figure represents a unique center of gravity for each vowel for all participants, whereas in the analysis, one center of gravity was used for each vowel and each participant."}
##############################################################
# vowel normalisation and distance computation
###################################################
load.fontawesome()
load.emojifont("OpenSansEmoji.ttf")
indo <-
indo %>%
mutate(
f1norm =
normVowels("wattfabricius",
f1 = f1,
f2 = f2,
vowel = vowel,
group = subj)[, 1],
f2norm =
normVowels("wattfabricius",
f1 = f1,
f2 = f2,
vowel = vowel,
group = subj)[, 2]
) %>%
group_by(vowel, subj) %>%
mutate(
distance = sqrt((f1norm - mean(f1norm) )^2 + (f2norm - mean(f2norm) )^2),
repetition = row_number()
) %>%
ungroup() %>%
mutate(vowel = paste0("/", vowel, "/") )
vowel_order <- c("/i/", "/e/", "/a/", "/o/", "/u/")
########################################################################
# plotting data
##################################################
indo %>%
# selecting a subset of observations for readability
sample_frac(0.25) %>%
# recoding variable for plotting
group_by(gender, vowel) %>%
mutate(
meanx = mean(f2norm),
meany = mean(f1norm)
) %>%
ungroup() %>%
mutate(
gender = case_when(
.$gender == "m" ~ "fa-mars" %>% fontawesome(),
.$gender == "f" ~ "fa-venus" %>% fontawesome()
)
) %>%
mutate(vowel = factor(vowel, levels = vowel_order) ) %>%
arrange(vowel) %>%
# setting up axis
ggplot(aes(x = f2norm, y = f1norm) ) +
# setting up grid for plotting
facet_grid(gender ~ vowel, switch = "both") +
# drawing each observation for each plot and overall observations
geom_point(
data = indo %>% select(-gender, - vowel),
alpha = 0.5, shape = 16, size = 0.5, colour = "snow3"
) +
geom_segment(aes(xend = meanx, yend = meany) ) +
# aesthetics
theme_bw(base_size = 10) +
guides(color = "none") +
scale_y_reverse(position = "right") +
scale_x_reverse(position = "top") +
theme(
strip.text.x = element_text(size = 10),
strip.text.y = element_text(family = "fontawesome-webfont", size = 10, angle = 180),
axis.title.x = element_text(hjust = 0.5, angle = 0),
axis.title.y = element_text(hjust = 0.5, angle = 180)
) +
labs(x = "F2", y = "F1")
```
Then, for each vowel and participant, we computed the Euclidean distance between each observation and the centre of gravity of the whole set of observations in the F1-F2 plane for that participant and that vowel. The data obtained by this process are illustrated in Figure \@ref(fig:vowelplot-ref), and a sample of the final dataset can be found in Table \@ref(tab:datavis).
```{r datavis, results = "asis"}
sample_data <- head(data.frame(sample_n(indo, 10) ), 10)
#write.csv(sample_data, file = "Table_1.csv", row.names = FALSE)
apa_table(
sample_data,
placement = "H",
align = c("c", "c", "c", "c", "c", "c", "c", "c", "c"),
caption = "Ten randomly picked rows from the data.",
note = NULL,
small = TRUE,
format.args = list(
digits = c(0, 0, 3, 3, 3, 0),
margin = 2,
decimal.mark = ".", big.mark = ""
)
)
```
### Constant effect of gender on vowel production variability
We then built a first model with constant effects only and vague priors on $\alpha$ and $\beta$, the intercept and the slope. We contrast-coded `gender` (f = -0.5, m = 0.5). Our dependent variable was therefore the distance from each individual vowel centre of gravity, which we will refer to as *formant distance* in the following. The formal model can be expressed as:
$$
\begin{aligned}
\text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \beta \times \text{gender}_{i} \\
\alpha &\sim \mathrm{Normal}(0, 10) \\
\beta &\sim \mathrm{Normal}(0, 10) \\
\sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\
\end{aligned}
$$
\vspace{5mm}
where the first two lines of the model describe the likelihood and the linear model.^[Note that --for the sake of simplicity-- throughout this tutorial we use a Normal likelihood, but other (better) alternatives would include using skew-normal or log-normal models, which are implemented in `brms` with the `skew_normal` and `lognormal` families. We provide examples in the [supplementary materials](#suppApp).] The next three lines define the prior distribution for each parameter of the model, where $\alpha$ and $\beta$ are given a vague (weakly informative) Gaussian prior centered on $0$, and the residual variation is given a Half-Cauchy prior [@gelman_prior_2006;@polson_half-cauchy_2012], thus restricting the range of possible values to positive ones. As depicted in Figure \@ref(fig:priorsbmod1), the $\mathrm{Normal}(0,10)$ prior is weakly informative in the sense that it grants a relatively high weight to $\alpha$ and $\beta$ values between -25 and 25. This corresponds to very large (given the scale of our data) values for, respectively, the mean distance value $\alpha$, and the mean difference between males and females $\beta$. The $\mathrm{HalfCauchy}(10)$ prior placed on $\sigma_{e}$ also allows very large values of $\sigma_{e}$, as represented in the right panel of Figure \@ref(fig:priorsbmod1).
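As a quick sanity check of these statements about the priors (not part of the original analysis), the corresponding probability masses can be computed directly in `R`:

```{r priormasssketch, echo = TRUE}
# proportion of the Normal(0, 10) prior mass between -25 and 25
pnorm(25, mean = 0, sd = 10) - pnorm(-25, mean = 0, sd = 10) # roughly 0.99
# proportion of the HalfCauchy(10) prior mass below 50 (note the heavy right tail)
2 * (pcauchy(50, location = 0, scale = 10) - 0.5) # roughly 0.87
```

The heavy tail of the Half-Cauchy distribution is what makes it a convenient weakly informative prior for variance components: it concentrates mass on plausible values while still allowing very large ones.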
```{r priorsbmod1, dev = "pdf", dpi = 300, echo = FALSE, fig.pos = "H", fig.width = 6, fig.height = 3, fig.cap = "Prior distributions used in the first model, for $\\alpha$ and $\\beta$ (left panel) and for the residual variation $\\sigma_{e}$ (right panel)."}
p1 <-
data.frame(x = c(-50, 50) ) %>%
ggplot(aes(x = x) ) +
stat_function(
fun = dnorm, args = list(mean = 0, sd = 10),
geom = "area", alpha = 1,
fill = "#b3cde0"
) +
stat_function(
fun = dnorm, args = list(mean = 0, sd = 10),
geom = "line", alpha = 1
) +
xlab(expression(paste(alpha, ", ", beta) ) ) +
ylab("density") +
ggtitle("Normal(0, 10)") +
theme_bw(base_size = 10)
p2 <-
data.frame(x = c(0, 100) ) %>%
ggplot(aes(x = x) ) +
stat_function(
fun = dcauchy, args = list(scale = 10),
geom = "area", alpha = 1,
fill = "#b3cde0"
) +
stat_function(
fun = dcauchy, args = list(scale = 10),
geom = "line", alpha = 1
) +
xlab(expression(sigma[e]) ) +
ylab("") +
ggtitle("HalfCauchy(10)") +
theme_bw(base_size = 10)
p1 + p2
```
These priors can be specified in numerous ways (see `?set_prior` for more details); one possibility is the following:
```{r priorbmod1, echo = TRUE}
prior1 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sigma)
)
```
where a prior can be defined over a class of parameters (e.g., for all variance components, using the `sd` class) or for a specific one, for instance as above by specifying the coefficient (`coef`) to which the prior corresponds (here the slope of the constant effect of gender).
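Before setting priors by hand, it can be helpful to list every parameter class and coefficient that the model will contain, together with the `brms` defaults. The `get_prior()` function does exactly this (shown here for the model fitted below; it only inspects the formula and data, without fitting anything):

```{r getpriorbmod1, echo = TRUE, eval = FALSE}
library(brms)
# List all parameters accepting a prior for this formula and data,
# together with the default priors that brms would use
get_prior(distance ~ gender, data = indo, family = gaussian() )
```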
The model can be fitted with `brms` with the following command:
```{r bmod1vis, eval = FALSE, echo = TRUE}
library(brms)
bmod1 <- brm(
distance ~ gender,
data = indo, family = gaussian(),
prior = prior1,
warmup = 2000, iter = 5000
)
```
where `distance` is the distance from the centre of gravity. The `iter` argument serves to specify the total number of iterations of the Markov Chain Monte Carlo (MCMC) algorithm, and the `warmup` argument specifies the number of iterations that are run at the beginning of the process to "calibrate" the MCMC, so that only `iter - warmup` iterations are retained in the end to approximate the shape of the posterior distribution [for more details, see @mcelreath_statistical_2016].
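With these settings, each chain discards its first 2000 iterations and keeps the remaining ones; the retained draws from all chains are then pooled. A quick sanity check of this arithmetic (using the two chains that we use to fit the models below):

```{r ndraws, echo = TRUE}
# Number of post-warmup draws retained to approximate the posterior
iter <- 5000
warmup <- 2000
chains <- 2
total_draws <- (iter - warmup) * chains
total_draws # 6000 draws in total
```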
```{r dataform}
indo <-
indo %>%
# contrast coding
mutate(gender = ifelse(gender == "f", -0.5, 0.5) )
```
```{r bmod1}
bmod1 <- brm(
distance ~ gender,
data = indo, family = gaussian(),
prior = prior1,
warmup = 2000, iter = 5000,
chains = 2, cores = parallel::detectCores(),
control = list(adapt_delta = 0.95)
)
```
Figure \@ref(fig:plotbmod1) depicts the estimations of this first model for the intercept $\alpha$, the slope $\beta$, and the residual standard deviation $\sigma_{e}$. The left part of the plot shows histograms of draws taken from the posterior distribution, and from which several summaries can be computed (e.g., mean, mode, quantiles). The right part of Figure \@ref(fig:plotbmod1) shows the behaviour of the two simulations (i.e., the two chains) used to approximate the posterior distribution, where the x-axis represents the number of iterations and the y-axis the value of the parameter. This plot reveals one important aspect of the simulations that should be checked, known as *mixing*. A chain is considered well mixed if it explores many different values for the target parameters and does not stay in the same region of the parameter space. This feature can be evaluated by checking that these plots, usually referred to as *trace plots*, show random scatter around a mean value (they look like a "fat hairy caterpillar").
```{r plotbmod1, dev = "pdf", dpi = 300, echo = TRUE, fig.pos = "H", fig.cap = "Histograms of posterior samples and trace plots of the intercept, the slope for gender and the standard deviation of the residuals of the constant effects model."}
library(tidyverse)
bmod1 %>%
plot(
combo = c("hist", "trace"), widths = c(1, 1.5),
theme = theme_bw(base_size = 10)
)
```
The estimates obtained for this first model are summarised in Table \@ref(tab:sumbmod1), which includes the mean, the standard error (SE), and the lower and upper bounds of the 95% credible interval (CrI)^[Where a credible interval is the Bayesian analogue of a classical confidence interval, except that probability statements can be made based upon it (e.g., "given the data and our prior assumptions, there is a 0.95 probability that this interval encompasses the population value $\theta$").] of the posterior distribution for each parameter. As `gender` was contrast-coded before the analysis (f = -0.5, m = 0.5), the intercept $\alpha$ corresponds to the grand mean of the formant distance over all participants and has its mean around `r round(fixef(bmod1)[1, 1], 3)`. The estimate of the slope ($\beta =$ `r round(fixef(bmod1)[2, 1], 3)`) suggests that females are more variable than males in the way they pronounce vowels, while the 95% CrI indicates that, given the data and the priors, there is a $0.95$ probability that the value of the slope lies in the [`r round(fixef(bmod1)[2, 3], 3)`, `r round(fixef(bmod1)[2, 4], 3)`] interval.
```{r sumbmod1, results = "asis"}
a <- summary(bmod1)
summary_mod1 <- rbind(data.frame(a$fixed), data.frame(a$spec_pars) )
rownames(summary_mod1) <- c("$\\alpha$", "$\\beta$", "$\\sigma_{e}$")
colnames(summary_mod1) <- c("mean","SE", "lower bound", "upper bound", "ESS", "Rhat")
summary_mod1 %<>%
select(-ESS) %>% # removing ESS
rownames_to_column(var = "parameter")
#write.csv(summary_mod1, file = "Table_2.csv", row.names = FALSE)
apa_table(
summary_mod1,
placement = "H",
align = c("c", "c", "c", "c", "c", "c"),
caption = "Posterior mean, standard error, 95\\% credible interval and $\\hat{R}$
statistic for each parameter of the constant effect model bmod1.",
note = NULL,
small = TRUE,
digits = 3,
escape = FALSE
)
```
The `Rhat` value corresponds to the *potential scale reduction factor* $\hat{R}$ [@gelman_inference_1992], which provides information about the convergence of the algorithm. This index can be conceived of as analogous to the F-ratio in ANOVA. It compares the between-chains variability (i.e., the extent to which different chains differ one from another) to the within-chain variability (i.e., how widely a chain explores the parameter space) and, as such, gives an index of the convergence of the chains. An overly large between-chains variance (as compared to the within-chain variability) would be a sign that chain-specific characteristics, like the starting value of the algorithm, have a strong influence on the final result. Ideally, the value of `Rhat` should be close to 1, and should not exceed 1.1. Otherwise, one might consider running more iterations or defining stronger priors [@R-brms;@gelman_bayesian_2013].
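The logic behind $\hat{R}$ can be illustrated with a minimal sketch on simulated chains. This is a simplified version of the original Gelman-Rubin statistic (`brms` uses a more refined split-chain version):

```{r rhatsketch, echo = TRUE}
# Two well-mixed "chains" sampled from the same distribution
set.seed(1)
n <- 1000
chain1 <- rnorm(n, mean = 0, sd = 1)
chain2 <- rnorm(n, mean = 0, sd = 1)
# between-chain variance (B) and within-chain variance (W)
B <- n * var(c(mean(chain1), mean(chain2) ) )
W <- mean(c(var(chain1), var(chain2) ) )
# pooled variance estimate and potential scale reduction factor
var_hat <- (n - 1) / n * W + B / n
Rhat <- sqrt(var_hat / W)
Rhat # close to 1 when the chains agree
```

If the two chains had been sampled from distributions with different locations, $B$ would dominate $W$ and $\hat{R}$ would exceed 1.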
```{r lmer_models, echo = FALSE}
mod1 <- lme4::lmer(distance ~ gender + (1|subj), REML = FALSE, data = indo)
mod2 <- lme4::lmer(distance ~ gender + (1|subj) + (1|vowel), REML = FALSE, data = indo)
mod3 <- lme4::lmer(distance ~ gender + (1|subj) + (1+gender|vowel), REML = FALSE, data = indo)
mod4 <- lme4::lmer(distance ~ gender + (1|subj) + (1+gender|vowel) + (1|subj:vowel), REML = FALSE, data = indo)
```
### Varying intercept model {#shrink}
The first model can be improved by taking into account the dependency between vowel formant measures for each participant. This is handled in MLMs by specifying unique intercepts $\alpha_{subject[i]}$ and by assigning them a common prior distribution. This strategy corresponds to the following by-subject varying-intercept model, `bmod2`:
$$
\begin{aligned}
\text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \alpha_{subject[i]} + \beta \times \text{gender}_{i} \\
\alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\
\alpha &\sim \mathrm{Normal}(0, 10) \\
\beta &\sim \mathrm{Normal}(0, 10) \\
\sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\
\end{aligned}
$$
\vspace{5mm}
This model can be fitted with `brms` with the following command (where we specify the HalfCauchy prior on $\sigma_{subject}$ by applying it on parameters of class `sd`):
```{r brms_mod2_disp, eval = FALSE, echo = TRUE}
prior2 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma)
)
bmod2 <- brm(
distance ~ gender + (1|subj),
data = indo, family = gaussian(),
prior = prior2,
warmup = 2000, iter = 10000
)
```
```{r brms_mod2, echo = FALSE}
prior2 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma)
)
bmod2 <- brm(
distance ~ gender + (1|subj),
data = indo, family = gaussian(),
prior = prior2,
warmup = 2000, iter = 10000,
chains = 2, cores = parallel::detectCores(),
control = list(adapt_delta = 0.99)
)
```
As described in the first part of the present paper, we now have two sources of variation in the model: the standard deviation of the residuals $\sigma_{e}$ and the standard deviation of the by-subject varying intercepts $\sigma_{subject}$. The latter represents the standard deviation of the population of varying intercepts, and is also learned from the data. It means that the estimation of each unique intercept will inform the estimation of the population of intercepts, which, in return, will inform the estimation of the other intercepts. We call this sharing of information between groups the *partial pooling* strategy, in comparison with the *no pooling* strategy, where each intercept is estimated independently, and with the *complete pooling* strategy, in which all intercepts are given the same value [@gelman_data_2006;@gelman_bayesian_2013;@mcelreath_statistical_2016]. This is one of the most essential features of MLMs, and what leads to better estimations than single-level regression models for repeated measurements or unbalanced sample sizes. This pooling of information is made apparent through the phenomenon of *shrinkage*, which is illustrated in Figure \@ref(fig:ranefplotbmod2), and later on, in Figure \@ref(fig:shrinkageplot).
Figure \@ref(fig:ranefplotbmod2) shows the posterior distribution as estimated by this second model for each participant, in relation to the raw mean of its category (i.e., females or males), represented by the vertical dashed lines. We can see for instance that participants `M02` and `F09` have smaller average distances than the means of their groups, while participants `M03` and `F08` have larger ones. The arrows represent the amount of *shrinkage*, that is, the deviation between the mean in the raw data (represented by a cross underneath each density) and the estimated mean of the posterior distribution (represented by the tip of the arrow). As shown in Figure \@ref(fig:ranefplotbmod2), this *shrinkage* is always directed toward the mean of the considered group (i.e., females or males) and the amount of *shrinkage* is determined by the deviation of the individual mean from its group mean. This mechanism acts like a safeguard against overfitting, preventing the model from overly trusting each individual datum.
```{r ranefplotbmod2, message = FALSE, dev = "pdf", dpi = 300, fig.pos = "H", fig.cap = "Posterior distributions by subject, as estimated by the `bmod2` model. The vertical dashed lines represent the means of the formant distances for the female and male groups. Crosses represent the mean of the raw data, for each participant. Arrows represent the amount of shrinkage, between the raw mean and the estimation of the model (the mean of the posterior distribution)."}
post <- posterior_samples(bmod2, pars = "^r")
colnames(post) <- substr(colnames(post), 8, 10)
for (i in 1:ncol(post) ) {
post[, i] <- post[, i] + posterior_samples(bmod2, pars = "b_Intercept")
}
for (i in 1:4) {
post[, i] <- post[, i] - posterior_samples(bmod2, pars = "b_gender") / 2
}
for (i in 5:8) {
post[, i] <- post[, i] + posterior_samples(bmod2, pars = "b_gender") / 2
}
no_pool <- aggregate(distance ~ subj, indo, mean)
gender_mean <- aggregate(distance ~ gender, indo, mean)$distance
posterior_mean <-
post %>%
gather() %>%
aggregate(value ~ key, data = ., mean) %>%
add_column(model = "posterior")
no_pool2 <-
no_pool %>%
rename(key = subj, value = distance) %>%
add_column(model = "nopool")
posterior_data <- bind_rows(posterior_mean, no_pool2)
library(magrittr)
post %>%
gather %>%
group_by(key) %>%
do(density(.$value) %$% data_frame(value = x, dens = y) ) %>%
ggplot(aes(x = value, y = key, fill = key) ) +
geom_density_ridges(
aes(height = dens),
fill = "#b3cde0",
alpha = 0.5,
scale = 1.5,
size = 0.4, stat = "identity", show.legend = FALSE
) +
geom_segment(x = as.numeric(gender_mean[1]), y = 0.6,
xend = as.numeric(gender_mean[1]), yend = 5.6, lty = 2) +
geom_segment(x = as.numeric(gender_mean[2]), y = 4.6,
xend = as.numeric(gender_mean[2]), yend = 10.8, lty = 2) +
geom_point(data = no_pool, aes(x = distance, y = subj), inherit.aes = FALSE,
shape = 4, size = 2, stroke = 0.8, show.legend = FALSE) +
geom_path(data = posterior_data, aes(group = key, color = NULL),
arrow = arrow(length = unit(.015, "npc"), ends = "first", type = "closed" ) ) +
labs(x = "distance", y = "participant") +
theme_bw(base_size = 10)
```
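The direction and amount of shrinkage follow a precision-weighting logic: the more observations a subject provides, and the larger $\sigma_{subject}$ relative to $\sigma_{e}$, the closer the estimate stays to the subject's raw mean. A minimal sketch of the classical shrinkage formula, with illustrative numbers (not the estimates of `bmod2`):

```{r shrinksketch, echo = TRUE}
# Precision-weighted (partial pooling) estimate for one subject
group_mean <- 0.5     # mean of this subject's group (illustrative)
subject_mean <- 0.8   # this subject's raw (no pooling) mean
n_obs <- 45           # number of observations for this subject
sigma_e <- 0.3        # residual standard deviation
sigma_subject <- 0.1  # standard deviation of the varying intercepts
# weight granted to the subject's own data
w <- (n_obs / sigma_e^2) / (n_obs / sigma_e^2 + 1 / sigma_subject^2)
shrunk <- w * subject_mean + (1 - w) * group_mean
shrunk # 0.75, pulled from 0.8 toward the group mean of 0.5
```

Increasing `sigma_subject` (i.e., allowing subjects to differ more from each other) moves the estimate back toward the raw subject mean.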
The marginal posterior distribution of each parameter obtained with `bmod2` is summarised in Table \@ref(tab:sumbmod2), where the `Rhat` values close to $1$ suggest that the model has converged. We see that the estimates of $\alpha$ and $\beta$ are similar to the estimates of the first model, except that the SE is now slightly larger. This result might seem surprising at first sight, as we expected to improve the first model by adding a by-subject varying intercept. In fact, it reveals an underestimation of the SE when using the first model. Indeed, the first model assumes independence of observations, which is violated in our case. This highlights the general need for careful consideration of the model's assumptions when interpreting its estimates. The first model seemingly gives highly certain estimates, but these estimates are only valid in the "independence of observations" world [see also the distinction between the *small world* and the *large world* in @mcelreath_statistical_2016]. Moreover, estimating an intercept by subject (as in the second model) increases the precision of the by-subject estimates, but it also makes the estimation of the population average less certain, thus resulting in a larger SE.
```{r sumbmod2, results = "asis"}
a <- summary(bmod2)
summary_mod2 <- rbind(data.frame(a$fixed), data.frame(a$random$subj), data.frame(a$spec_pars) )
rownames(summary_mod2) <- c("$\\alpha$", "$\\beta$", "$\\sigma_{subject}$", "$\\sigma_{e}$")
colnames(summary_mod2) <- c("mean","SE", "lower bound", "upper bound", "ESS", "Rhat")
summary_mod2 %<>%
select(-ESS) %>% # removing ESS
rownames_to_column(var = "parameter")
#write.csv(summary_mod2, file = "Table_3.csv", row.names = FALSE)
apa_table(
summary_mod2,
placement = "H",
align = c("c", "c", "c", "c", "c", "c"),
caption = "Posterior mean, standard error, 95\\% credible interval and $\\hat{R}$
statistic for each parameter of model bmod2 with a varying intercept by subject.",
note = NULL,
small = TRUE,
digits = 3,
escape = FALSE
)
```
This model (`bmod2`), however, is still not adequate to describe the data, as the dependency between repetitions of each vowel is not taken into account. In `bmod3`, we added a by-vowel varying intercept, thus also allowing each vowel to have a different general level of variability.
$$
\begin{aligned}
\text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{vowel[i]} + \beta \times \text{gender}_{i} \\
\alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\
\alpha_{vowel} &\sim \mathrm{Normal}(0, \sigma_{vowel}) \\
\alpha &\sim \mathrm{Normal}(0, 10) \\
\beta &\sim \mathrm{Normal}(0, 10) \\
\sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{vowel} &\sim \mathrm{HalfCauchy}(10) \\
\end{aligned}
$$
\vspace{5mm}
This model can be fitted with `brms` with the following command:
```{r brms_mod3_disp, eval = FALSE, echo = TRUE}
prior3 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma)
)
bmod3 <- brm(
distance ~ gender + (1|subj) + (1|vowel),
data = indo, family = gaussian(),
prior = prior3,
warmup = 2000, iter = 10000
)
```
```{r priorbmod3, echo = FALSE}
prior3 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma)
)
bmod3 <- brm(
distance ~ gender + (1|subj) + (1|vowel),
data = indo, family = gaussian(),
prior = prior3,
warmup = 2000, iter = 10000,
chains = 2, cores = parallel::detectCores(),
control = list(adapt_delta = 0.99)
)
```
where the same Half-Cauchy is specified for the two varying intercepts, by applying it directly to the `sd` class.
```{r sumbmod3, results = "asis"}
a <- summary(bmod3)
summary_mod3 <- rbind(
data.frame(a$fixed), data.frame(do.call(rbind, a$random) ), data.frame(a$spec_pars)
)
rownames(summary_mod3) <- c(
"$\\alpha$", "$\\beta$", "$\\sigma_{subject}$", "$\\sigma_{vowel}$", "$\\sigma_{e}$"
)
colnames(summary_mod3) <- c("mean","SE", "lower bound", "upper bound", "ESS", "Rhat")
summary_mod3 %<>%
select(-ESS) %>% # removing ESS
rownames_to_column(var = "parameter")
#write.csv(summary_mod3, file = "Table_4.csv", row.names = FALSE)
ICCsubject <- summary_mod3[3, 2]^2 / (summary_mod3[3, 2]^2 + summary_mod3[5, 2]^2)
ICCvowel <- summary_mod3[4, 2]^2 / (summary_mod3[4, 2]^2 + summary_mod3[5, 2]^2)
apa_table(
summary_mod3,
placement = "H",
align = c("c", "c", "c", "c", "c", "c"),
caption = "Posterior mean, standard error, 95\\% credible interval and $\\hat{R}$
statistic for each parameter of model bmod3 with a varying intercept by subject and by vowel.",
note = NULL,
small = TRUE,
digits = 3,
escape = FALSE
)
```
The marginal posterior distribution of each parameter is summarised in Table \@ref(tab:sumbmod3). We can compute the intra-class correlation (ICC, see section \@ref(MLM)) to estimate the relative variability associated with each varying effect: $ICC_{subject}$ is equal to `r round(ICCsubject, 3)` and $ICC_{vowel}$ is equal to `r round(ICCvowel, 3)`. The rather high ICC for vowels suggests that observations are highly correlated within each vowel, thus stressing the relevance of allocating a unique intercept by vowel.^[But please note that we do not mean to suggest that the varying intercept for subjects should be removed because its ICC is low.]
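Rather than plugging in posterior means, the ICC can also be computed draw by draw, which propagates the posterior uncertainty through to the ICC itself. A sketch, assuming `bmod3` from above and the usual `brms` naming conventions for group-level standard deviations:

```{r iccdraws, echo = TRUE, eval = FALSE}
# Posterior distribution of the by-vowel ICC, computed from the draws
post <- posterior_samples(
bmod3, pars = c("sd_subj__Intercept", "sd_vowel__Intercept", "sigma")
)
icc_vowel <- post$sd_vowel__Intercept^2 /
(post$sd_vowel__Intercept^2 + post$sigma^2)
quantile(icc_vowel, probs = c(0.025, 0.5, 0.975) )
```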
### Including a correlation between varying intercept and varying slope
One can legitimately question the assumption that the differences between male and female productions are identical for each vowel. To explore this issue, we thus added a varying slope for the effect of gender, allowing it to vary by vowel. Moreover, we can examine the correlation between the baseline level of variability by vowel and the amplitude of the difference between males and females in pronouncing them. For instance, we can observe that the pronunciation of /a/ is more variable in general. We might want to know whether females tend to pronounce vowels that are situated at a specific location in the F1-F2 plane with less variability than males. In other words, we might be interested in knowing whether the effect of `gender` is correlated with the baseline level of variability. This is equivalent to investigating the *dependency* (i.e., the correlation) between the varying intercepts and the varying slopes. We thus estimated this correlation by modelling $\alpha_{vowel}$ and $\beta_{vowel}$ as issued from the same multivariate normal distribution (a generalisation of the usual normal distribution to more than one dimension), centered on $0$ and with some covariance matrix $\textbf{S}$, as specified on the third line of the following model:
$$
\begin{aligned}
\text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{vowel[i]} + (\beta + \beta_{vowel[i]}) \times \text{gender}_{i} \\
\begin{bmatrix}
\alpha_{\text{vowel}} \\
\beta_{\text{vowel}} \\
\end{bmatrix}
&\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \textbf{S}\bigg) \\
\textbf{S} &=
\begin{pmatrix}
\sigma_{\alpha_{vowel}}^{2} & \sigma_{\alpha_{vowel}}\sigma_{\beta_{vowel}} \rho \\
\sigma_{\alpha_{vowel}}\sigma_{\beta_{vowel}} \rho & \sigma_{\beta_{vowel}}^{2} \\
\end{pmatrix} \\
\alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\
\alpha &\sim \mathrm{Normal}(0, 10) \\
\beta &\sim \mathrm{Normal}(0, 10) \\
\sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{\alpha_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{\beta_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\
\textbf{R} &\sim \mathrm{LKJcorr}(2) \\
\end{aligned}
$$
\vspace{5mm}
where $\textbf{R}$ is the correlation matrix $\textbf{R} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $\rho$ is the correlation between intercepts and slopes, used in the computation of $\textbf{S}$. This matrix is given the LKJ-Correlation prior [@lewandowski_generating_2009] with a parameter $\zeta$ (zeta) that controls the strength of the correlation.^[The LKJ prior is the default prior for correlation matrices in `brms`.] When $\zeta = 1$, the prior distribution on the correlation is uniform between $-1$ and $1$. When $\zeta > 1$, the prior distribution is peaked around a zero correlation, while lower values of $\zeta$ ($0 < \zeta < 1$) allocate more weight to extreme values (i.e., close to -1 and 1) of $\rho$ (see Figure \@ref(fig:lkj)).
```{r lkj, echo = FALSE, dev = "pdf", dpi = 300, fig.pos = "H", fig.width = 4, fig.height = 4, fig.cap = "Visualisation of the LKJ prior for different values of the shape parameter $\\zeta$."}
nsims <- 1e6
data.frame(
zeta05 = rethinking::rlkjcorr(nsims, K = 2, eta = 0.5)[, 1, 2],
zeta1 = rethinking::rlkjcorr(nsims, K = 2, eta = 1)[, 1, 2],
zeta5 = rethinking::rlkjcorr(nsims, K = 2, eta = 5)[, 1, 2],
zeta50 = rethinking::rlkjcorr(nsims, K = 2, eta = 50)[, 1, 2]
) %>%
gather(shape, y, zeta05:zeta50) %>%
ggplot(aes(x = y, linetype = shape) ) +
geom_line(stat = "density", position = "identity", size = 0.8, alpha = 1) +
xlab(expression(rho) ) +
ylab("density") +
theme_bw(base_size = 10) +
scale_linetype_manual(
values = c("dotted", "dotdash", "dashed", "solid"),
labels = c(
expression(paste(zeta, " = ", "0.5") ),
expression(paste(zeta, " = ", "1") ),
expression(paste(zeta, " = ", "5") ),
expression(paste(zeta, " = ", "50") )
)
) +
theme(
legend.text.align = 0,
legend.position = c(0.75, 0.8),
legend.background = element_rect(size = 0.5, colour = "black")
)
```
```{r brms_mod4_disp, eval = FALSE, echo = TRUE}
prior4 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma),
prior(lkj(2), class = cor)
)
bmod4 <- brm(
distance ~ gender + (1|subj) + (1 + gender|vowel),
data = indo, family = gaussian(),
prior = prior4,
warmup = 2000, iter = 10000
)
```
```{r brms_mod4, echo = FALSE}
prior4 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma),
prior(lkj(2), class = cor)
)
bmod4 <- brm(
distance ~ gender + (1|subj) + (1+gender|vowel),
data = indo, family = gaussian(),
prior = prior4,
warmup = 2000, iter = 10000,
chains = 2, cores = parallel::detectCores(),
control = list(adapt_delta = 0.99)
)
```
Estimates of this model are summarised in Table \@ref(tab:sumbmod4). This summary reveals a negative correlation between the intercepts and slopes for vowels, meaning that vowels with a large "baseline level of variability" (i.e., with a large average `distance` value) tend to be pronounced with more variability by females than by males. However, we notice that this model's estimation of $\beta$ is even more uncertain than that of the previous models, as shown by the associated standard error and the width of the credible interval.
```{r sumbmod4, results = "asis"}
a <- summary(bmod4)
summary_mod4 <- rbind(
data.frame(a$fixed), data.frame(do.call(rbind, a$random) ), data.frame(a$spec_pars)
)
rownames(summary_mod4) <- c(
"$\\alpha$","$\\beta$","$\\sigma_{subject}$",
"$\\sigma_{\\alpha_{vowel}}$","$\\sigma_{\\beta_{vowel}}$","$\\rho$","$\\sigma_{e}$"
)
colnames(summary_mod4) <- c("mean", "SE", "lower bound", "upper bound", "ESS", "Rhat")
summary_mod4 %<>%
select(-ESS) %>% # removing ESS
rownames_to_column(var = "parameter")
#write.csv(summary_mod4, file = "Table_5.csv", row.names = FALSE)
apa_table(
summary_mod4,
placement = "H",
align = c("c", "c", "c", "c", "c", "c"),
caption = "Posterior mean, standard error, 95\\% credible interval and $\\hat{R}$
statistic for each parameter of model bmod4 with a varying intercept and varying slope by vowel.",
note = NULL,
small = TRUE,
digits = 3,
escape = FALSE
)
```
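Because we have the full posterior, directional questions about $\rho$ can be answered directly from the draws, for example the posterior probability that the intercept-slope correlation is negative (a sketch, using the `brms` name of the correlation parameter in `bmod4`):

```{r rhoprob, echo = TRUE, eval = FALSE}
# Proportion of posterior draws in which the correlation is negative
rho <- posterior_samples(bmod4, pars = "cor_vowel__Intercept__gender")[, 1]
mean(rho < 0)
```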
Figure \@ref(fig:shrinkageplot) illustrates the negative correlation between the by-vowel intercepts and the by-vowel slopes, meaning that vowels that tend to have higher "baseline variability" (i.e., /e/, /o/, /a/) tend to show a stronger effect of `gender`. This figure also illustrates the amount of shrinkage, here in the parameter space. We can see that the *partial pooling* estimate is shrunk somewhere between the *no pooling* estimate and the *complete pooling* estimate (i.e., the grand mean). This illustrates again the mechanism by which MLMs balance the risk of overfitting and underfitting [@mcelreath_statistical_2016].
```{r shrinkage, eval = TRUE, echo = FALSE}
bmod1_pooled <-
fixef(bmod1)[, 1] %>%
as.matrix %>%
t
complete_pooling <-
data.frame(
vowel = unique(indo$vowel),
Intercept = bmod1_pooled[1],
gender = bmod1_pooled[2],
model = "complete_pooling")
no_pooling <-
lme4::lmList(distance ~ gender|vowel, data = indo) %>%
coef() %>%
rownames_to_column("vowel") %>%
mutate(model = "no pooling") %>%
rename(Intercept = `(Intercept)`)
partial_pooling <-
coef(bmod4)$vowel[, 1, ] %>%
data.frame %>%
rownames_to_column("vowel") %>%
mutate(model = "partial pooling")
shrinkage <- bind_rows(no_pooling, partial_pooling)
# Extracting posterior samples
post <- posterior_samples(bmod4, pars = c("^b_", "^sd", "^cor") )
# Computing posterior mean bivariate Gaussian
mu <- c(mean(post$b_Intercept), mean(post$b_gender) )
rho <- mean(post$cor_vowel__Intercept__gender)
sda <- mean(post$sd_vowel__Intercept)
sdb <- mean(post$sd_vowel__gender)
cov_ab <- sda * sdb * rho
sigma <- matrix(c(sda^2, cov_ab, cov_ab, sdb^2), ncol = 2)
###########################################################################################
# Helper function to make ellipse, credits to Tristan Mahr
# https://tjmahr.github.io/plotting-partial-pooling-in-mixed-effects-models/
##################################################################################
make_ellipse <- function(cov_mat, center, level) {
ellipse(cov_mat, centre = center, level = level) %>%
as.data.frame() %>%
add_column(level = level)
}
levels <- c(.1, .3, .5, .7)
df_ellipse <-
levels %>%
purrr::map_df(~ make_ellipse(sigma, mu, level = .x) ) %>%
rename(Intercept = x, gender = y)
```
```{r shrinkageplot, dev = "pdf", dpi = 300, fig.pos = "H", fig.cap = "Shrinkage of estimates in the parameter space, due to the pooling of information between clusters (based on the `bmod4` model). The ellipses represent the contours of the bivariate distribution, at different degrees of confidence 0.1, 0.3, 0.5 and 0.7."}
shrinkage %>%
ggplot(aes(x = Intercept, y = gender, color = model) ) +
scale_color_grey() +
geom_point(size = 2, show.legend = TRUE) +
geom_point(
data = complete_pooling,
aes(x = Intercept, y = gender),
size = 2, color = "snow4",
show.legend = FALSE, inherit.aes = FALSE
) +
geom_path(aes(group = vowel, color = NULL), show.legend = FALSE ) +
ggrepel::geom_text_repel(
aes(label = vowel, color = NULL),
data = no_pooling, show.legend = FALSE
) +
geom_path(
aes(group = level, color = NULL),
data = df_ellipse,
linetype = "dashed", color = "grey40", alpha = 0.8
) +
labs(x = "intercept", y = "slope") +
coord_cartesian(
xlim = c(min(shrinkage$Intercept), max(shrinkage$Intercept) ),
ylim = c(min(shrinkage$gender), max(shrinkage$gender) ),
expand = TRUE
) +
theme_bw(base_size = 10)
```
### Varying intercept and varying slope model, interaction between subject and vowel
So far, we modelled varying effects of subjects and vowels. In this study, these varying factors were crossed, meaning that every subject had to pronounce every vowel. Let us now imagine a situation in which Subject 4 systematically mispronounced the /i/ vowel. This would be a source of systematic variation over replicates which is not considered in the model (`bmod4`), because this model can only adjust parameters for either vowel or participant, but not for a specific vowel for a specific participant.
In building the next model, we added a varying intercept for the interaction between subject and vowel, that is, we created an index variable that allocates a unique value to each crossing of the two variables (e.g., Subject1-vowel/a/, Subject1-vowel/i/, etc.), resulting in $8 \times 5 = 40$ intercepts to be estimated [for a review of multilevel modeling in various experimental designs, see @judd_experiments_2017]. This varying intercept for the interaction between subject and vowel represents the systematic variation associated with a specific subject pronouncing a specific vowel. This model can be written as follows, for any observation $i$:
$$
\begin{aligned}
\text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\
\mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{vowel[i]} + \alpha_{subject:vowel[i]} + (\beta + \beta_{vowel[i]}) \times \text{gender}_{i} \\
\begin{bmatrix}
\alpha_{\text{vowel}} \\
\beta_{\text{vowel}} \\
\end{bmatrix}
&\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \textbf{S}\bigg) \\
\textbf{S} &=
\begin{pmatrix}
\sigma_{\alpha_{vowel}}^{2} & \sigma_{\alpha_{vowel}}\sigma_{\beta_{vowel}} \rho \\
\sigma_{\alpha_{vowel}}\sigma_{\beta_{vowel}} \rho & \sigma_{\beta_{vowel}}^{2} \\
\end{pmatrix} \\
\alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\
\alpha_{subject:vowel} &\sim \mathrm{Normal}(0, \sigma_{subject:vowel}) \\
\alpha &\sim \mathrm{Normal}(0, 10) \\
\beta &\sim \mathrm{Normal}(0, 10) \\
\sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{subject:vowel} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{\alpha_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\
\sigma_{\beta_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\
\textbf{R} &\sim \mathrm{LKJcorr}(2) \\
\end{aligned}
$$
\vspace{5mm}
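The index variable implied by the `subj:vowel` term corresponds to what `interaction()` builds in R. A toy sketch mirroring the design (8 subjects crossed with 5 vowels; the subject and vowel labels here are illustrative):

```{r interidx, echo = TRUE}
# Explicitly building the subject-by-vowel index variable
subj <- factor(rep(paste0("S", 1:8), each = 5) )
vowel <- factor(rep(c("a", "e", "i", "o", "u"), times = 8) )
subj_vowel <- interaction(subj, vowel, drop = TRUE)
nlevels(subj_vowel) # 40 unique subject-vowel combinations
```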
This model can be fitted with the following command:
```{r brms_mod5_disp, eval = FALSE, echo = TRUE}
prior5 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma),
prior(lkj(2), class = cor)
)
bmod5 <- brm(
distance ~ gender + (1|subj) + (1 + gender|vowel) + (1|subj:vowel),
data = indo, family = gaussian(),
prior = prior5,
warmup = 2000, iter = 10000
)
```
```{r brms_mod5, echo = FALSE}
prior5 <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = gender),
prior(cauchy(0, 10), class = sd),
prior(cauchy(0, 10), class = sigma),
prior(lkj(2), class = cor)
)
bmod5 <- brm(
distance ~ gender + (1|subj) + (1+gender|vowel) + (1|subj:vowel),
data = indo, family = gaussian(),
prior = prior5,
warmup = 2000, iter = 10000,
chains = 2, cores = parallel::detectCores(),
control = list(adapt_delta = 0.99)
)
```
Estimates of this model are summarised in Table \@ref(tab:sumbmod5). From this table, we first notice that the more varying effects we add, the more uncertain the model becomes about the estimation of $\alpha$ and $\beta$, which can be explained in the same way as in section 2.2. Second, we see the opposite pattern for $\sigma_{e}$, the residual standard deviation, which has decreased by a considerable amount compared to the first model, indicating a better fit.
```{r sumbmod5, results = "asis"}
a <- summary(bmod5)
summary_mod5 <- rbind(data.frame(a$fixed), data.frame(do.call(rbind, a$random) ), data.frame(a$spec_pars) )
summary_mod5 <- unique(summary_mod5)
rownames(summary_mod5) <- c(
"$\\alpha$","$\\beta$","$\\sigma_{subject}$",
"$\\sigma_{subject:vowel}$","$\\sigma_{\\alpha_{vowel}}$",
"$\\sigma_{\\beta_{vowel}}$","$\\rho$","$\\sigma_{e}$"
)
colnames(summary_mod5) <- c("mean","SE","lower bound","upper bound","ESS","Rhat")
summary_mod5 %<>%
select(-ESS) %>% # removing ESS
rownames_to_column(var = "parameter")
#write.csv(summary_mod5, file = "Table_6.csv", row.names = FALSE)