# (PART) Analysis {-}
# Overview {-#analysis-overview}
In this section we turn to the analysis of datasets, the evaluation of results, and the interpretation of the findings. We will outline the three main types of statistical analyses: Inferential Data Analysis (IDA), Predictive Data Analysis (PDA), and Exploratory Data Analysis (EDA). Each of these analysis types has distinct, non-overlapping aims, and therefore the type of analysis should be determined at the outset of the research project and included as part of the research blueprint. The aim of this section is to establish a clearer picture of the goals, methods, and value of each of these approaches.
```{r, child="_common.Rmd"}
```
# Inference {#inference-chapter}
<p style="font-weight:bold; color:red;">INCOMPLETE DRAFT</p>
> People generally see what they look for, and hear what they listen for.
>
> -- Harper Lee, To Kill a Mockingbird
```{block, type="rmdkey"}
The essential questions for this chapter are:
- what are the three main types of inferential analysis approaches?
- how does the informational value of the dependent variable relate to the statistical approach adopted?
- how do descriptive, statistical, and evaluative steps work together to produce reliable results?
```
<!-- COURSE STRUCTURE
TUTORIALS:
- ...
SWIRL:
- ...
WORKED/ RECIPE:
- ...
PROJECT:
- ...
GOALS:
...
-->
```{r inference-data-packages, echo=FALSE, message=FALSE}
# Packages
pacman::p_load(tidyverse, patchwork, knitr, broom, effectsize, report, janitor, skimr, languageR)
```
In this chapter we consider approaches to deriving knowledge from information which can be generalized to the population from which the data is sampled. This process is known as statistical inference. The discussion here builds on concepts developed in [Chapter 3 "Approaching analysis"](#approaching-analysis) and implements descriptive assessments, statistical tests, and evaluation procedures for a series of contexts which are common in the analysis of corpus-based data. The chapter is structured into three main sections which correspond to the number of variables included in the statistical procedure. Each of these sections includes a subsection dedicated to the informational value of the dependent variable: the variable whose variation is to be explained.
For this discussion two datasets will be used as the basis for posing various questions to submit for statistical interrogation. It is of note that the questions in the subsequent sections are posed to highlight various descriptive, statistical, and evaluation procedures and do not reflect the standard approach to hypothesis testing, which assumes that the null and alternative hypotheses are developed at the outset of the research project.
The process for each inferential data analysis in this section will include three steps: (1) descriptive assessment, (2) statistical interrogation, and (3) evaluation of the results.
<!-- TODO:
- Add Bresnan citation for questions to interrogate `dative`?
- Add Tottie citation for questions to interrogate `sdac_disfluencies`?
-->
## Preparation
Let's now get familiar with the datasets and prepare them for analysis. The first dataset to consider is the `dative` dataset. This dataset can be loaded from the languageR package [@R-languageR].
```{r i-dative-load}
dative <-
languageR::dative %>% # load the `dative` dataset
as_tibble() # convert the data frame to a tibble object
glimpse(dative) # preview structure
```
From `glimpse()` we can see that this dataset contains 3,263 observations and 15 columns.
The R Documentation can be consulted using `?dative` in the R Console. The description states:
> Data describing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection.
For a bit more context, a dative is the phrase which reflects the entity that takes the recipient role in a ditransitive clause. In English, the recipient (dative) can be realized as either a noun phrase (NP) as seen in (1) or as a prepositional phrase (PP) as seen in (2) below.
1. They give [you ~NP~] a drug test.
2. They give a drug test [to you ~PP~].
Together these two syntactic options are known as the Dative Alternation.
The variable of interest for this dataset is `RealizationOfRecipient`, which is either 'NP' or 'PP'. For the purposes of this chapter I will select a subset of the key variables we will use in the upcoming analyses and drop the others.
```{r i-dative-simplify}
dative <-
dative %>% # dataset
select(RealizationOfRecipient, Modality, LengthOfRecipient, LengthOfTheme) %>% # select key variables
janitor::clean_names() # normalize variable names
```
```{r i-dative-preview, echo=FALSE}
dative %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of simplified `dative` dataset.")
```
In Table \@ref(tab:i-dative-dictionary) I've created a data dictionary describing the variables in our new `dative` dataset based on the variable descriptions in the `languageR::dative` documentation.
```{r i-dative-dictionary, echo=FALSE}
tribble(
~variable_name, ~name, ~description,
"realization_of_recipient", "Realization of Recipient", "A factor with levels NP and PP coding the realization of the dative.",
"modality", "Language Modality", "A factor with levels *spoken*, *written*.",
"length_of_recipient", "Length of Recipient", "A numeric vector coding the number of words comprising the recipient.",
"length_of_theme", "Length of Theme", "A numeric vector coding the number of words comprising the theme."
) %>%
kable(booktabs = TRUE,
caption = "Data dictionary for the `dative` dataset.")
```
The second dataset that we will use in this chapter is the `sdac_disfluencies` dataset that we derived in the previous chapter. Let's read in the dataset and preview the structure.
```{r i-sdac-disfluencies-read-show, eval=FALSE}
sdac_disfluencies <-
read_csv(file = "../data/derived/sdac/sdac_disfluencies.csv") # read transformed dataset
glimpse(sdac_disfluencies) # preview structure
```
```{r i-sdac-disfluencies-read-run, echo=FALSE, message=FALSE}
sdac_disfluencies <-
read_csv(file = "resources/09-inference/data/derived/sdac/sdac_disfluencies.csv") # read transformed dataset
glimpse(sdac_disfluencies) # preview structure
```
We prepared a data dictionary that reflects this transformed dataset. Let's read that file and then view it in Table \@ref(tab:i-sdac-disfluencies-dictionary).
```{r i-sdac-disfluencies-diciontary-read-show, eval=FALSE}
sdac_disfluencies_dictionary <- read_csv(file = "../data/derived/sdac/sdac_disfluencies_data_dictionary.csv") # read data dictionary
```
```{r i-sdac-disfluencies-diciontary-read-run, echo=FALSE, message=FALSE}
sdac_disfluencies_dictionary <- read_csv(file = "resources/09-inference/data/derived/sdac/sdac_disfluencies_data_dictionary.csv")
```
```{r i-sdac-disfluencies-dictionary, echo=FALSE}
sdac_disfluencies_dictionary %>%
kable(booktabs = TRUE,
caption = "Data dictionary for the `sdac_disfluencies` dataset.")
```
For our analysis purposes we will reduce this dataset, as we did for the `dative` dataset, retaining only the variables of interest for the upcoming analyses.
```{r i-sdac-disfluencies-simplify}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
select(speaker_id, filler, count, sex, birth_year, education) # select key variables
```
Let's preview this simplified `sdac_disfluencies` dataset.
```{r i-sdac-disfluencies-preview, echo=FALSE}
sdac_disfluencies %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of simplified `sdac_disfluencies` dataset.")
```
Now the `sdac_disfluencies` dataset needs some extra transformation to better prepare it for statistical interrogation. First, the variables `birth_year` and `education` are not maximally informative: it would be more useful if `birth_year` reflected the age of the speaker at the time of the conversation(s), and the coded values of `education` are not explicit about what the numeric values refer to.
The second issue has to do with preparing the `sdac_disfluencies` dataset for statistical analysis. This involves converting our columns to the correct vector types for statistical methods. Specifically, we need to convert our categorical variables to the R type 'factor' (fct). This includes our current variables which are character vectors, but also `speaker_id` and `education`, which appear numeric but do not reflect continuous variables: one is merely a code which uniquely labels each speaker and the other is an ordinal list of educational levels.
This will be a three-step process: first we will normalize `birth_year` to reflect the age of the speaker, second we will convert all the relevant categorical variables to factors, and third we will convert the `education` variable to a factor, adding meaningful labels for the levels of this factor.
Consulting the [online manual for this corpus](https://catalog.ldc.upenn.edu/docs/LDC97S62/swb1_manual.txt), we see that the recordings of these conversations took place in 1992, so we can simply subtract the `birth_year` from 1992 to get each participant's age. We'll name this new column `age` and drop the `birth_year` column.
```{r i-sdac-disfluencies-age-transform}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
mutate(age = (1992 - birth_year)) %>% # calculate age
select(-birth_year) # drop `birth_year` column
```
Now let's convert all the variables which are character vectors. We can do this using the `factor()` function; first on `speaker_id` and then, with the help of `mutate_if()`, on all the other variables which are character vectors.
```{r i-sdac-disfluencies-to-factors}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
mutate(speaker_id = factor(speaker_id)) %>% # convert numeric to factor
mutate_if(is.character, factor) # convert all character to factor
```
We know from the data dictionary that the `education` column contains five values (0, 1, 2, 3, and 9). Again, consulting the corpus manual we can see what these values mean.
```
EDUCATION COUNT
--------------------
0 14 less than high school
1 39 less than college
2 309 college
3 176 more than college
9 4 unknown
```
So let's convert `education` to a factor, adding these descriptions as factor level labels. The function `factor()` can take an argument `labels = ` to which we can manually assign the label names for the factor levels, in the order of the levels. Since the original values were numeric, the factor level ordering defaults to ascending order.
```{r i-sdac-disfluencies-education-transform}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
mutate(education = factor(education,
labels = c("less than high school", # value 0
"less than college", # value 1
"college", # value 2
"more than college", # value 3
"unknown"))) # value 9
```
So let's take a look at the `sdac_disfluencies` dataset we've prepared for analysis.
```{r i-sdac-disfluencies-transformed-preview}
glimpse(sdac_disfluencies)
```
Now the datasets `dative` and `sdac_disfluencies` are ready to be statistically interrogated.
## Univariate analysis
In what follows I will provide a description of inferential data analysis when only one variable is to be interrogated. This is known as a univariate analysis, or one-variable analysis. We will consider one case in which the variable is categorical and another in which it is continuous.
### Categorical
As an example of a univariate analysis where the variable used in the analysis is categorical, we will look at the `dative` dataset. In this analysis we may be interested in knowing whether the recipient role in a ditransitive construction is realized more often as an 'NP' or a 'PP'.
**Descriptive assessment**
The `realization_of_recipient` variable contains the relevant information. Let's take a first look using the skimr package.
```{r i-uni-cat-description-dative}
dative %>% # dataset
select(realization_of_recipient) %>% # select the variable
skim() %>% # get data summary
yank("factor") # only show factor-oriented information
```
The output from `skim()` produces various pieces of information that can be helpful. We get diagnostics that tell us if there are missing cases (`NA` values), what the proportion of complete cases is, whether the factor is ordered, how many distinct levels the factor has, as well as the level counts.
Looking at the `top_counts` we can see that of the 3,263 observations, in 2,414 the dative is expressed as an 'NP' and in 849 as a 'PP'. Numerically we can see that there is a difference between the use of the alternation types. A visualization is often helpful for descriptive purposes in statistical analysis. In this particular case, however, we are considering a single categorical variable with only two levels (values), so a visualization is not likely to be more informative than the numeric values we have already obtained. But for demonstration purposes and to get more familiar with building plots, let's create a visualization.
```{r i-uni-cat-visual-dative}
dative %>% # dataset
ggplot(aes(x = realization_of_recipient)) + # mapping
geom_bar() + # geometry
labs(x = "Realization of recipient", y = "Count") # labels
```
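If we want the numerical difference expressed as proportions rather than raw counts, a quick table can complement the bar plot. This is a minimal sketch; the column name `prop` is just illustrative.
```{r i-uni-cat-proportions-dative, eval=FALSE}
dative %>%                              # dataset
  count(realization_of_recipient) %>%   # counts for each level
  mutate(prop = n / sum(n))             # convert counts to proportions
```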
The question we want to address, however, is whether this numerical difference is in fact a statistical difference.
**Statistical interrogation**
<!-- TODO:
- consider binom.test() vs. chisq.test() for two outcome univariate
-->
To statistically assess the distribution of a categorical variable, we will turn to the Chi-squared test. This test aims to gauge whether the numerical difference between the 'NP' and 'PP' counts observed in the data is greater than what would be expected by chance. Chance, in the case where there are only two possible outcome levels, is a 50/50 split. For our particular data, where there are 3,263 observations, half would be 'NP' and the other half 'PP' --specifically 1,631.5 for each.
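As a quick sanity check on that chance expectation, we can compute the expected count per level directly and compare it to the observed counts. A minimal sketch using base R:
```{r i-uni-cat-expected-check, eval=FALSE}
nrow(dative) / 2                         # expected count per level under a 50/50 split
table(dative$realization_of_recipient)   # observed counts for comparison
```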
To run this test we first will need to create a cross-tabulation of the variable. We will use the `xtabs()` function to create the table.
```{r i-uni-cat-table-dative}
ror_table <-
xtabs(formula = ~ realization_of_recipient, # formula selecting the variable
data = dative) # dataset
ror_table # preview
```
No new information here, but the format (i.e. an object of class 'table') is what matters as the input argument for the `chisq.test()` function we will use to run the test.
```{r i-uni-cat-chi-squared-dative}
c1 <- chisq.test(x = ror_table) # apply the chi-squared test to `ror_table`
c1 # preview the test results
```
The preview of the `c1` object reveals the main information of interest, including the Chi-squared statistic, the degrees of freedom, and the $p$-value (albeit in scientific notation). However, `c1` is an 'htest' object and includes a number of other pieces of information about the test.
```{r i-uni-cat-htest-dative}
names(c1) # preview the names of the information stored in `c1`
```
For our purposes let's simply confirm that the $p$-value is lower than the standard .05 threshold for statistical significance.
```{r i-uni-cat-p-value-confirm}
c1$p.value < .05 # confirm p-value below .05
```
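If we want to report the exact $p$-value without scientific notation, one option is base R's `format()`. A minimal sketch:
```{r i-uni-cat-p-value-format, eval=FALSE}
format(c1$p.value, scientific = FALSE, digits = 3) # display the p-value in fixed notation
```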
Other information can be organized in a more readable format using the broom package's `augment()` function.
```{r i-uni-cat-tidy-results}
c1 %>% # statistical result
augment() # view detailed statistical test information
```
Here we can see the observed and expected counts and the proportions for each level of `realization_of_recipient`. We also get additional information concerning residuals, but we will leave these aside.
**Evaluation**
At this point we may think we are done. We have statistically interrogated the `realization_of_recipient` variable and found that the difference between 'NP' and 'PP' realization in the datives in this dataset is statistically significant. However, we need to evaluate the size ('effect size') and the reliability of the effect ('confidence interval'). The effectsize package provides a function `effectsize()` that can provide us both the effect size and confidence interval.
```{r i-uni-cat-eval-dative}
effects <-
effectsize(c1, type = "phi") # evaluate effect size and generate a confidence interval (phi type given 2x1 contingency table)
effects # preview effect size and confidence interval
```
`effectsize()` recognizes the type of test results in `c1`, calculates the appropriate effect size measure, and generates a confidence interval. Since the effect statistic ('Phi') falls within the 95\% confidence interval, this suggests the results are reliably interpreted (the chances of a Type I (false positive) or Type II (false negative) error are low).
Now, the remaining question is to evaluate whether the significant result here is a strong effect or not. To do this we can pass the effect size measure to the `interpret_r()` function.
```{r i-uni-cat-eval-dative-effect}
interpret_r(effects$phi) # interpret the effect size
```
It turns out we have a strong effect: the realization of the dative alternation heavily favors the 'NP' form in our data. The potential reasons why are not considered in this univariate analysis, but we will return to this question later as we add independent variables to the statistical analysis.
### Continuous
Now let's turn to a case when the variable we aim to interrogate is non-categorical. For this case we will turn to the `sdac_disfluencies` dataset. Specifically we will aim to test whether the use of fillers is normally distributed across speakers.
```{block, type="rmdtip"}
This is an important step when working with numeric dependent variables, as the type of distribution will dictate whether we use parametric or non-parametric tests when we consider the extent to which an independent variable (or variables) can explain the variation of the dependent variable.
```
Since the dataset is currently organized around fillers as the observational unit, I will first transform this dataset to sum the use of fillers for each speaker in the dataset.
```{r i-uni-cont-sdac-transform}
sdac_speaker_fillers <-
sdac_disfluencies %>% # dataset
group_by(speaker_id) %>% # group by each speaker
summarise(sum = sum(count)) %>% # add up all fillers used
ungroup() # remove grouping parameter
```
```{r i-uni-cont-sdac-transform-preview, echo=FALSE}
sdac_speaker_fillers %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of `sdac_speaker_fillers` dataset.")
```
**Descriptive assessment**
Let's perform some descriptive assessment of the variable of interest, `sum`. First let's apply the `skim()` function and retrieve just the relevant numeric descriptors with `yank()`. One twist here, however, is that I've customized the `skim()` function using `skim_with()` to remove the default histogram and add the Interquartile Range (IQR) to the output. This new skim function, `num_skim()`, will take the place of `skim()`.
```{r i-uni-cont-sdac-description}
num_skim <-
skim_with(numeric = sfl(hist = NULL, # remove hist skim
iqr = IQR)) # add IQR to skim
sdac_speaker_fillers %>% # dataset
select(sum) %>% # variable of interest
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
We see here that the mean use of fillers is 87.1 across speakers. However, the standard deviation and IQR are large relative to this mean, which indicates that the dispersion is quite large; in other words, there are large differences between speakers. Furthermore, since the median (p50) is smaller than the mean, the distribution is right skewed.
Let's look at a couple of visualizations of this distribution to appreciate these descriptives. A histogram will provide us a view of the distribution using the counts of the values of `sum`, and a density plot will provide a smooth curve which represents the scaled distribution of the observed data.
<!-- TODO:
- consider generating a normal distribution to overlay on the density plot
https://homepage.divms.uiowa.edu/~luke/classes/STAT4580/histdens.html
- consider log-transformed distribution?
- # View `sum` as order of magnitudes
-->
```{r i-uni-cont-sdac-visual}
p1 <-
sdac_speaker_fillers %>% # dataset
ggplot(aes(sum)) + # mapping
geom_histogram() + # geometry
labs(x = "Fillers", y = "Count")
p2 <-
sdac_speaker_fillers %>% # dataset
ggplot(aes(sum)) + # mapping
geom_density() + # geometry
geom_rug() + # visualize individual observations
labs(x = "Fillers", y = "Density")
p1 + p2 + plot_annotation("Filler count distributions.")
```
From these plots we can see that our initial intuitions about the distribution of `sum` are correct. There is large dispersion between speakers and the data distribution is right skewed.
```{block, type="rmdtip"}
Note that I've used the patchwork package for organizing the display of plots and including a plot annotation label.
```
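Because the counts span a wide range, it can also help to view the distribution on a log scale, which spreads out the long right tail. This is an optional sketch, not part of the main analysis; `scale_x_log10()` is a ggplot2 scale.
```{r i-uni-cont-sdac-visual-log, eval=FALSE}
sdac_speaker_fillers %>%                         # dataset
  ggplot(aes(sum)) +                             # mapping
  geom_histogram() +                             # geometry
  scale_x_log10() +                              # log-scale the x axis
  labs(x = "Fillers (log scale)", y = "Count")   # labels
```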
Since our aim is to test for normality, we can generate a Quantile-Quantile plot (QQ plot).
```{r i-uni-cont-sdac-qq-plot}
sdac_speaker_fillers %>% # dataset
ggplot(aes(sample = sum)) + # mapping
stat_qq() + # calculate expected quantile-quantile distribution
stat_qq_line() # plot the qq-line
```
Since many points do not fall on the expected normal distribution line we have even more evidence to support the notion that the distribution of `sum` is non-normal.
**Statistical interrogation**
Although the descriptives and visualizations strongly suggest that we do not have normally distributed data, let's run a formal normality test. For this we turn to the `shapiro.test()` function, which performs the Shapiro-Wilk test of normality. We pass the `sum` variable to this function to run the test.
```{r i-uni-cont-sdac-test}
s1 <- shapiro.test(sdac_speaker_fillers$sum) # apply the normality test to `sum`
s1 # preview the test results
```
As we saw with the results from the `chisq.test()` function, the `shapiro.test()` function produces an object with information about the test including the $p$-value. Let's run our logical test to see if the test is statistically significant.
```{r i-uni-cont-sdac-test-confirm}
s1$p.value < .05 # confirm p-value below .05
```
**Evaluation**
The results from the Shapiro-Wilk Normality Test tell us that the distribution of `sum` is statistically found to differ from the normal distribution. So in this case, statistical significance suggests that `sum` cannot be used as a parametric dependent variable. For our aims this is all the evaluation required. Effect size and confidence intervals are not applicable.
It is of note, however, that the expectation that the variable `sum` would conform to the normal distribution was low from the outset, as we are working with count data. Count data, or frequencies, are in a strict sense not continuous, but rather discrete --meaning that they are whole numbers which are never negative. This is a common informational type to encounter in text analysis.
## Bivariate analysis
A more common scenario in statistical analysis is the consideration of the relationship between two variables, known as bivariate analysis.
### Categorical
Let's build on our univariate analysis of `realization_of_recipient` and include an explanatory, or independent, variable to test whether it can explain our earlier finding that 'NP' datives are more common than 'PP' datives. The question to test, then, is whether modality explains the distribution of `realization_of_recipient`.
**Descriptive assessment**
Both the `realization_of_recipient` and `modality` variables are categorical, specifically nominal as we can see by using `skim()`.
```{r i-bi-cat-description}
dative %>%
select(realization_of_recipient, modality) %>% # select key variables
skim() %>% # get custom data summary
yank("factor") # only show factor-oriented information
```
For this reason measures of central tendency are not applicable and we will turn to a contingency table to summarize the relationship. The janitor package provides a set of functions for building and formatting contingency tables, the primary one being `tabyl()`. The other functions used here adorn the contingency table with totals and percentages and format the output for readability.
```{r i-bi-cat-contingency-table}
dative %>%
tabyl(realization_of_recipient, modality) %>% # cross-tabulate
adorn_totals(c("row", "col")) %>% # provide row and column totals
adorn_percentages("col") %>% # add percentages to the columns
adorn_pct_formatting(rounding = "half up", digits = 0) %>% # round the digits
adorn_ns() %>% # add observation number
adorn_title("combined") %>% # add a header title
kable(booktabs = TRUE, # pretty table
caption = "Contingency table for `realization_of_recipient` and `modality`.") # caption
```
To gain a better appreciation for this relationship let's generate a couple of plots: one which shows cross-tabulated counts and the other calculated proportions.
```{r i-bi-cat-visual}
p1 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, fill = modality)) + # mappings
geom_bar() + # geometry
labs(y = "Count", x = "Realization of recipient") # labels
p2 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, fill = modality)) + # mappings
geom_bar(position = "fill") + # geometry, with fill for proportion plot
labs(y = "Proportion", x = "Realization of recipient", fill = "Modality") # labels
p1 <- p1 + theme(legend.position = "none") # remove legend from left plot
p1 + p2 + plot_annotation("Relationship between Realization of recipient and Modality.")
```
Looking at the count plot (in the left pane) we see that the large difference between the realization of the dative as an 'NP' or 'PP' obscures our ability to see to what degree modality is related to the realization of the dative. The proportion plot (in the right pane) therefore standardizes each level of `realization_of_recipient` to provide a more comparable view. From the proportion plot we see that there appears to be a trend towards relatively more use of 'PP' in the written modality.
**Statistical interrogation**
Although the proportion plot is visually helpful, we use the raw counts to statistically analyze this relationship. Again, as we are working with categorical variables, now for a dependent and independent variable, we use the Chi-squared test. And as before we need to create the cross-tabulation table to pass to the `chisq.test()` to perform the test.
```{r i-bi-cat-test}
ror_mod_table <-
xtabs(formula = ~ realization_of_recipient + modality, # formula
data = dative) # dataset
c2 <- chisq.test(ror_mod_table) # apply the chi-squared test to `ror_mod_table`
c2 # preview the test results
c2$p.value < .05 # confirm p-value below .05
```
We can preview the result and provide a confirmation of the $p$-value. This evidence suggests that there is a difference between the distribution of dative realization according to modality.
We can also see more details about the test.
```{r i-bi-cat-test-information}
c2 %>% # statistical result
augment() # view detailed statistical test information
```
**Evaluation**
Now we want to calculate the effect size and the confidence interval to provide measures of assurance that our finding is robust.
```{r i-bi-cat-effect-ci}
effects <- effectsize(c2) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Cramers_v) # interpret the effect size
```
We get effect size and confidence interval information. Note that the effect size, reflected by Cramer's V, for this relationship is weak. This points out an important aspect of the evaluation of statistical tests: the fact that a test is significant does not mean that it is meaningful. A small effect size suggests that we should be cautious about the extent to which this significant finding is robust in the population from which the data is sampled.
### Continuous
For a bivariate analysis in which the dependent variable is not categorical, we will turn to the `sdac_disfluencies` dataset. The question we will pose to test is whether the use of fillers is related to the type of filler ('uh' or 'um').
**Descriptive assessment**
The key variables to assess in this case are `count` and `filler`. But before we start to explore this relationship we will need to transform the dataset such that each speaker's use of the levels of `filler` is summed. We will use `group_by()` to group `speaker_id` and `filler` combinations and then use `summarize()` to sum the counts for each filler type for each speaker.
```{r i-bi-cont-sdac-fillers}
sdac_fillers <-
sdac_disfluencies %>% # dataset
group_by(speaker_id, filler) %>% # grouping parameters
summarize(sum = sum(count)) %>% # summed counts for each speaker-filler combination
ungroup() # remove the grouping parameters
```
Let's preview this transformation.
```{r i-bi-cont-sdac-fillers-preview, echo=FALSE}
sdac_fillers %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations from `sdac_fillers` dataset.")
```
Let's take a look at the two filler types together by grouping the dataset by `filler` and then using the custom skim function `num_skim()` on the numeric variable `sum`.
```{r i-bi-cont-description}
sdac_fillers %>% # dataset
group_by(filler) %>% # grouping parameter
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
We see here that the standard deviation and IQR for both 'uh' and 'um' are large relative to the respective means (71.4 and 15.7), suggesting the distributions are quite dispersed. Let's take a look at a boxplot to visualize the counts in `sum` for each level of `filler`.
```{r i-bi-cont-visual}
p1 <-
sdac_fillers %>% # dataset
ggplot(aes(x = filler, y = sum)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
labs(x = "Filler", y = "Counts") # labels
p2 <-
sdac_fillers %>% # dataset
ggplot(aes(x = filler, y = sum)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
ylim(0, 100) + # scale the y axis to trim outliers
labs(x = "Filler", y = "") # labels
p1 + p2
```
In the plot in the left pane we see a couple of things. First, it appears that there is in fact quite a bit of dispersion, as there are quite a few outliers (dots) above the lines extending from the boxes. Recall that the boxes represent the first and third quartiles (that is, the IQR) and that the notches represent the confidence interval around the median. Second, when we compare the boxes and their notches we see that there is little overlap (looking horizontally). In the right pane I've zoomed in a bit, trimming some outliers to get a better view of the relationship between the boxes. Since the overlap is minimal and in particular the notches do not overlap at all, this is a good indication that there is a significant trend.
From the descriptive statistics and the visual summary it appears that the filler 'uh' is more common than 'um'. It's now time to submit this to statistical interrogation.
**Statistical interrogation**
<!-- TODO:
- Investigate How to choose a family
- https://www.researchgate.net/post/How-to-determine-which-family-function-to-use-when-fitting-generalized-linear-model-glm-in-R
-->
In a bivariate (and multivariate) analysis where the dependent variable is non-categorical we apply Linear Regression Modeling (LM). The default assumption of linear models, however, is that the dependent variable is normally distributed. As we have seen our variable `sum` does not conform to the normal distribution. We know this because of our tests in the univariate case, but as mentioned at the end of that section, we are working with count data which by nature is understood as discrete and not continuous in a strict technical sense. So instead of using the linear model for our regression analysis we will use the Generalized Linear Model (GLM) [@Baayen2008a; @Gries2013a].
The function `glm()` implements generalized linear models. In addition to the formula (`sum ~ filler`) and the dataset to use, we also include an appropriate distribution family for the dependent variable. For count and frequency data the appropriate family is the "Poisson" distribution.
```{r i-bi-cont-test}
m1 <-
glm(formula = sum ~ filler, # formula
data = sdac_fillers, # dataset
family = "poisson") # distribution family
summary(m1) # preview the test results
```
Let's focus on the coefficients, specifically the 'fillerum' line. Since our factor `filler` has two levels, one level is used as the reference to contrast with the other level; by default the first level serves as the reference. Therefore the coefficient we see for 'fillerum' expresses 'um' in contrast to 'uh'. Without digging into the details of the other parameter statistics, let's focus on the last column, which contains the $p$-value. A convenient aspect of the `summary()` function when applied to regression model results is that it provides statistical significance codes. In this case we can see that the contrast between 'uh' and 'um' is significant at $p < .001$, which of course is lower than our standard threshold of $.05$.
Therefore we can say with some confidence that the filler 'uh' is more frequent than 'um'.
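Because the interpretation hinges on which level serves as the reference and on the log link used by the Poisson family, two quick checks can be helpful: confirming the factor level ordering and exponentiating the coefficients to express them as rate ratios. A minimal sketch using base R:
```{r i-bi-cont-reference-check, eval=FALSE}
levels(sdac_fillers$filler)   # the first level ('uh') serves as the reference
exp(coef(m1))                 # exponentiate log-scale coefficients to obtain rate ratios
```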
**Evaluation**
Given we have found a significant effect for `filler`, let's look at evaluating the effect size and the confidence interval. Again, we use the `effectsize()` function. We then can preview the `effects` object. Note that the effect size of interest is in the second row of the standardized coefficient column (`Std_Coefficient`), so we subset this column to extract only the effect coefficient for the `filler` contrast.
```{r i-bi-cont-eval-test}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[2]) # interpret the effect size
```
The coefficient statistic falls within the confidence interval and the effect size is strong so we can be confident that our findings are reliable given this data.
## Multivariate analysis
The last case to consider is when we have more than one independent variable whose potential relationship to the dependent variable we want to assess. Again we will consider both a categorical and a non-categorical dependent variable, but in this case the implementation methods are quite similar, as we will see.
### Categorical
For the categorical multivariate case we will again consider the `dative` dataset and build on the previous analyses. The question to be posed is whether modality in combination with the length of the recipient (`length_of_recipient`) together explain the distribution of the realization of the recipient (`realization_of_recipient`).
**Descriptive assessment**
Now that we have three variables, there is more to summarize to get our descriptive information. Luckily, however, the same process can be applied to three (or more) variables using the `group_by()` function and then passed to `skim()`. In this case we have two categorical variables and one numeric variable. So we will group by both the categorical variables and then pass the numeric variable to the custom `num_skim()` function --pulling out only the relevant descriptive information for numeric variables with `yank()`.
```{r i-multi-cat-description}
dative %>% # dataset
select(realization_of_recipient, modality, length_of_recipient) %>% # select key variables
group_by(realization_of_recipient, modality) %>% # grouping parameters
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
There is much more information now that we are considering multiple independent variables, but if we look over the measures of dispersion we can see that the medians and the IQRs are relatively similar to their respective means, suggesting that there are fewer outliers and relatively little skew.
Let's take a look at a visualization of this information. Since we are working with a categorical dependent variable and there is one non-categorical variable we can use a boxplot. The addition here is to include a `color` mapping which will provide a distinct box for each level of modality ('written' and 'spoken').
```{r i-multi-cat-visual}
p1 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, y = length_of_recipient, color = modality)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
labs(x = "Realization of recipient", y = "Length of recipient (in words)", color = "Modality") # labels
p2 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, y = length_of_recipient, color = modality)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
ylim(0, 15) + # scale the y axis to trim outliers
labs(x = "Realization of recipient", y = "", color = "Modality") # labels
p1 <- p1 + theme(legend.position = "none") # remove the legend from the left pane plot
p1 + p2
```
In the left pane we see the entire visualization including all outliers. From this view it appears that there is a potential trend that the length of the recipient is larger when the realization of the recipient is 'PP'. There is also a potential trend for modality, with written language showing longer recipient lengths overall. The pane on the right is scaled to get a better view of the boxes by scaling the y-axis down and, as such, trimming the outliers. This plot shows more clearly that the length of the recipient is longer when the recipient is realized as a 'PP'. Again, the contrast in modality is also a potential trend, but the boxes (of the same color), particularly for the spoken modality, overlap to some degree.
So we have some trends in mind which will help us interpret the statistical interrogation so let's move there next.
**Statistical interrogation**
Once we involve more than two variables, the choice of statistical method turns towards regression. In the case that the dependent variable is categorical, however, we will use Logistic Regression. The workhorse function `glm()` can be used for a series of regression models, including logistic regression. The requirement, however, is that we specify the family of the distribution; for logistic regression the family is "binomial". The formula includes the dependent variable as a function of our other two variables, separated by the `+` operator.
```{r i-multi-cat-test}
m1 <- glm(formula = realization_of_recipient ~ modality + length_of_recipient, # formula
data = dative, # dataset
family = "binomial") # distribution family
summary(m1) # preview the test results
```
The results from the model again provide a wealth of information, but the key information to focus on is the coefficients, in particular the coefficients for the independent variables `modality` and `length_of_recipient`. What we notice is that the $p$-value for `length_of_recipient` is significant, but the contrast between 'written' and 'spoken' for `modality` is not. If you recall, we used this same dataset to explore `modality` as a single independent variable earlier --and it was found to be significant. So why is it not significant now? The answer is that when multiple variables are used to explain the distribution of a measure (the dependent variable), each variable adds its own contribution of information. Since `length_of_recipient` is significant, this suggests that the explanatory power of `modality` is weak, especially when compared to `length_of_recipient`. This makes sense: we saw in the earlier analysis that the effect size for `modality` was not strong, and that is even more evident now that `length_of_recipient` is included in the model.
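One way to appreciate how little `modality` contributes once `length_of_recipient` is in the model is to compare nested models. This is a sketch, not part of the original analysis; `m0` is a hypothetical reduced model fit without `modality`.
```{r i-multi-cat-compare, eval=FALSE}
m0 <- glm(formula = realization_of_recipient ~ length_of_recipient, # reduced model without `modality`
          data = dative,
          family = "binomial")
anova(m0, m1, test = "Chisq") # likelihood-ratio style comparison of the nested models
```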
**Evaluation**
Now let's move on and gauge the effect size and calculate the confidence interval for `length_of_recipient` in our model. We apply the `effectsize()` function to the model and then use `interpret_r()` on the coefficient of interest (which is in the third row of the `Std_Coefficient` column).
```{r i-multi-cat-effects}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[3]) # interpret the effect size for `length_of_recipient` (third coefficient row)
```
We see we have a coefficient that falls within the confidence interval and the effect size is large. So we can say with some confidence that the length of the recipient is a significant predictor of the use of 'PP' as the realization of the recipient in the dative alternation.
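To make the direction of this effect concrete, the fitted model could be used to generate predicted probabilities of a 'PP' realization at a few recipient lengths. This is a sketch under the assumption that 'PP' is the second factor level (so `type = "response"` returns its probability); the lengths 1, 3, and 5 are arbitrary illustrations.
```{r i-multi-cat-predict, eval=FALSE}
new_data <- tibble(modality = "spoken",              # hypothetical predictor values
                   length_of_recipient = c(1, 3, 5)) # recipient lengths to illustrate
predict(m1, newdata = new_data, type = "response")   # predicted probability of 'PP'
```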
### Continuous
The last case we will consider here is when the dependent variable is non-categorical and we have more than one independent variable. The question we will pose is whether the type of filler and the sex of the speaker can explain the use of fillers in conversational speech.
We will need to prepare the data before we get started, as our current data frame `sdac_fillers` has the filler and the summed count for each filler grouped by speaker --but it does not include the `sex` of each speaker. The `sdac_disfluencies` data frame does have the `sex` column, but it has not been grouped by speaker. So let's transform the `sdac_disfluencies` data frame, summarizing it to get only the distinct `speaker_id` and `sex` combinations. This should result in a data frame with 441 observations, one observation for each speaker in the corpus.
```{r i-multi-cont-transform-sdac}
sdac_speakers_sex <-
sdac_disfluencies %>% # dataset
distinct(speaker_id, sex) # summarize for distinct `speaker_id` and `sex` values
```
Let's preview the first 10 observations from this transformation.
```{r i-multi-cont-transform-sdac-preview, echo=FALSE}
sdac_speakers_sex %>%
arrange(speaker_id) %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of the `sdac_speakers_sex` data frame.")
```
Great, now we have each `speaker_id` and `sex` for all 441 speakers. One thing to note, however, is that speaker '155' does not have a value for `sex` --this seems to be an error in the metadata that we will need to deal with before we proceed in our analysis.
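To confirm that this is the only speaker affected, we can filter for missing `sex` values, assuming the empty metadata field was read in as `NA`. A quick check:
```{r i-multi-cont-missing-sex, eval=FALSE}
sdac_speakers_sex %>%   # dataset
  filter(is.na(sex))    # show speakers with a missing `sex` value
```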
With that noted, we can join our new `sdac_speakers_sex` data frame with the `sdac_fillers` dataset, effectively adding the column `sex`. We want to keep all the observations in `sdac_fillers` and add the column `sex` for observations that match between the two data frames on the `speaker_id` column, so we will use a left join with the function `left_join()`, with the `sdac_fillers` dataset on the left.
```{r i-multi-cont-join-fillers-sex}
sdac_fillers_sex <-
left_join(sdac_fillers, sdac_speakers_sex) # join
```
Now let's preview the first observations in this new `sdac_fillers_sex` data frame.
```{r i-multi-cont-sdac-fillers-sex-preview, echo=FALSE}
sdac_fillers_sex %>%
arrange(speaker_id) %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of the `sdac_fillers_sex` data frame.")
```
At this point let's drop this speaker from the `sdac_fillers_sex` data frame.
```{r i-multi-cont-drop-speaker}
sdac_fillers_sex <-
sdac_fillers_sex %>% # dataset
filter(speaker_id != "155") # drop speaker_id 155
```
We are now ready to proceed in our analysis.
**Descriptive assessment**
The process by now should be quite routine for getting our descriptive statistics: select the key variables, group by the categorical variables, and finally pull the descriptives for the numeric variable.
```{r i-multi-cont-description}
sdac_fillers_sex %>% # dataset
select(sum, filler, sex) %>% # select key variables
group_by(filler, sex) %>% # grouping parameters
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
Looking at these descriptives, it seems like there is quite a bit of variability for some combinations and not others. In short, it's a mixed bag. Let's try to make sense of these numbers with a boxplot.
```{r i-mulit-cont-visual}
p1 <-
sdac_fillers_sex %>% # dataset
ggplot(aes(x = filler, y = sum, color = sex)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
labs(x = "Filler", y = "Counts", color = "Sex") # labels
p2 <-
sdac_fillers_sex %>% # dataset
ggplot(aes(x = filler, y = sum, color = sex)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
ylim(0, 200) + # scale the y axis to trim outliers
labs(x = "Filler", y = "", color = "Sex") # labels
p1 <- p1 + theme(legend.position = "none") # drop the legend from the left pane plot
p1 + p2
```
We can see that 'uh' is used more than 'um' overall, but whereas men and women use 'uh' in similar ways, women use more 'um' than men. This is known as an interaction, so we will approach our statistical analysis with this in mind.
**Statistical interrogation**
We will again use a generalized linear model with the `glm()` function to conduct our test. The distribution family will be the same as before, since we are again using `sum`, which contains discrete count values, as our dependent variable. The formula we will use, however, is new. Instead of adding a new variable to our independent variables, we will test the possible interaction between `filler` and `sex` that we noted in the descriptive assessment. To encode an interaction the `*` operator is used, so our formula will take the form `sum ~ filler * sex`. Let's generate the model and view the summary of the test results as we have done before.
```{r i-multi-cont-test}
m1 <-
glm(formula = sum ~ filler * sex, # formula
data = sdac_fillers_sex, # dataset
family = "poisson") # distribution family
summary(m1) # preview the test results
```
Again looking at the coefficients, we see something new. There is a row for the `filler` contrast and the `sex` contrast, but also one for the interaction between `filler` and `sex` ('fillerum:sexMALE'). All rows show significant effects. It is important to note that when an interaction is explored and found to be significant, the simple effects, known as main effects ('fillerum' and 'sexMALE'), are not interpreted on their own; the interpretation focuses on the higher-order (interaction) effect.
Now, what does the 'fillerum:sexMALE' row mean? It means that there is an interaction between `filler` and `sex`. The directionality of that interaction should be interpreted using our descriptive assessment, in particular the boxplots we generated. In sum, women use more 'um' than men, or stated another way, men use 'um' less than women.
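A quick numeric check of this directionality can complement the boxplots: computing the mean filler count for each `filler` and `sex` combination. A minimal sketch; `mean_sum` is just an illustrative column name.
```{r i-multi-cont-interaction-means, eval=FALSE}
sdac_fillers_sex %>%                    # dataset
  group_by(filler, sex) %>%             # grouping parameters
  summarize(mean_sum = mean(sum)) %>%   # mean filler count for each filler-sex combination
  ungroup()                             # remove the grouping parameters
```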
**Evaluation**
We finalize our analysis by looking at the effect size and confidence intervals.
```{r i-multi-cont-effects}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[4]) # interpret the effect size
```
We can conclude, then, that there is a strong interaction effect for `filler` and `sex` and that women use more 'um' than men.
## Summary
In this chapter we have discussed various approaches to conducting inferential data analysis. Each configuration, however, always includes a descriptive assessment, statistical interrogation, and an evaluation of the results. We considered univariate, bivariate, and multivariate analyses using both categorical and non-categorical dependent variables to explore the similarities and differences between these approaches.
<!-- IDEAS:
- Use Switchboard Dialogue Act Corpus
- Disfluencies (uh/ um)
- languageR::dative
-
-------
- Use BELC dataset from "Approaching analysis" chapter
- Question(s):
-
- Santa Barbara Corpus of Spoken American English
- analyzr::sbc
- Love on the Spectrum/ Love is Blind
- Word frequency differences (from SUBTLEXus) as the dependent variable
- Use other datasets from the "Transform datasets" chapter
RESOURCES:
- Example for Fillers: https://github.com/WFU-TLC/analyzr/blob/master/vignettes/analysis-an-inference-example.Rmd
-
-->