# (PART) Analysis {-}
# Overview {-#analysis-overview}
In this section we turn to the analysis of datasets, the evaluation of results, and the interpretation of the findings. We will outline the three main types of statistical analyses: Inferential Data Analysis (IDA), Predictive Data Analysis (PDA), and Exploratory Data Analysis (EDA). Each of these analysis types has distinct, non-overlapping aims, and therefore the type of analysis should be determined at the outset of the research project and included as part of the research blueprint. The aim of this section is to establish a clearer picture of the goals, methods, and value of each of these approaches.
```{r, child="_common.Rmd"}
```
# Inference {#inference-chapter}
<p style="font-weight:bold; color:red;">INCOMPLETE DRAFT</p>
> People generally see what they look for, and hear what they listen for.
>
> -- Harper Lee, To Kill a Mockingbird
```{block, type="rmdkey"}
The essential questions for this chapter are:
- what are the three main types of inferential analysis approaches?
- how does the informational value of the dependent variable relate to the statistical approach adopted?
- how do descriptive, statistical, and evaluative steps work together to produce reliable results?
```
<!-- COURSE STRUCTURE
TUTORIALS:
- ...
SWIRL:
- ...
WORKED/ RECIPE:
- ...
PROJECT:
- ...
GOALS:
...
-->
```{r inference-data-packages, echo=FALSE, message=FALSE}
# Packages
pacman::p_load(tidyverse, patchwork, knitr, broom, effectsize, report, janitor, skimr, languageR)
```
In this chapter we consider approaches to deriving knowledge from information which can be generalized to the population from which the data is sampled. This process is known as statistical inference. The discussion here builds on concepts developed in [Chapter 3 "Approaching analysis"](#approaching-analysis) and implements descriptive assessments, statistical tests, and evaluation procedures for a series of contexts which are common in the analysis of corpus-based data. The chapter is structured into three main sections which correspond to the number of variables included in the statistical procedure. Each of these sections includes a subsection dedicated to the informational value of the dependent variable: the variable whose variation is to be explained.
For this discussion two datasets will be used as the basis for posing various questions to submit for statistical interrogation. It is of note that the questions in the subsequent sections are posed to highlight various descriptive, statistical, and evaluation procedures and do not reflect the standard approach to hypothesis testing, which assumes that the null and alternative hypotheses are developed at the outset of the research project.
The process for each inferential data analysis in this section will include three steps: (1) descriptive assessment, (2) statistical interrogation, and (3) evaluation of the results.
<!-- TODO:
- Add Bresnan citation for questions to interrogate `dative`?
- Add Tottie citation for questions to interrogate `sdac_disfluencies`?
-->
## Preparation
Let's now get familiar with the datasets and prepare them for analysis. The first dataset to consider is the `dative` dataset. This dataset can be loaded from the languageR package [@R-languageR].
```{r i-dative-load}
dative <-
languageR::dative %>% # load the `dative` dataset
as_tibble() # convert the data frame to a tibble object
glimpse(dative) # preview structure
```
From `glimpse()` we can see that this dataset contains 3,263 observations and 15 columns.
The R Documentation can be consulted using `?dative` in the R Console. The description states:
> Data describing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection.
For a bit more context, a dative is the phrase which reflects the entity that takes the recipient role in a ditransitive clause. In English, the recipient (dative) can be realized as either a noun phrase (NP) as seen in (1) or as a prepositional phrase (PP) as seen in (2) below.
1. They give [you ~NP~] a drug test.
2. They give a drug test [to you ~PP~].
Together these two syntactic options are known as the Dative Alternation.
The variable of interest for this dataset is `RealizationOfRecipient`, which is either 'NP' or 'PP'. For the purposes of this chapter I will select a subset of the key variables we will use in the upcoming analyses and drop the others.
```{r i-dative-simplify}
dative <-
dative %>% # dataset
select(RealizationOfRecipient, Modality, LengthOfRecipient, LengthOfTheme) %>% # select key variables
janitor::clean_names() # normalize variable names
```
```{r i-dative-preview, echo=FALSE}
dative %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of simplified `dative` dataset.")
```
In Table \@ref(tab:i-dative-dictionary) I've created a data dictionary describing the variables in our new `dative` dataset based on the variable descriptions in the `languageR::dative` documentation.
```{r i-dative-dictionary, echo=FALSE}
tribble(
~variable_name, ~name, ~description,
"realization_of_recipient", "Realization of Recipient", "A factor with levels NP and PP coding the realization of the dative.",
"modality", "Language Modality", "A factor with levels *spoken*, *written*.",
"length_of_recipient", "Length of Recipient", "A numeric vector coding the number of words comprising the recipient.",
"length_of_theme", "Length of Theme", "A numeric vector coding the number of words comprising the theme."
) %>%
kable(booktabs = TRUE,
caption = "Data dictionary for the `dative` dataset.")
```
The second dataset that we will use in this chapter is the `sdac_disfluencies` dataset that we derived in the previous chapter. Let's read in the dataset and preview the structure.
```{r i-sdac-disfluencies-read-show, eval=FALSE}
sdac_disfluencies <-
read_csv(file = "../data/derived/sdac/sdac_disfluencies.csv") # read transformed dataset
glimpse(sdac_disfluencies) # preview structure
```
```{r i-sdac-disfluencies-read-run, echo=FALSE, message=FALSE}
sdac_disfluencies <-
read_csv(file = "resources/09-inference/data/derived/sdac/sdac_disfluencies.csv") # read transformed dataset
glimpse(sdac_disfluencies) # preview structure
```
We prepared a data dictionary that reflects this transformed dataset. Let's read that file and then view it in Table \@ref(tab:i-sdac-disfluencies-dictionary).
```{r i-sdac-disfluencies-diciontary-read-show, eval=FALSE}
sdac_disfluencies_dictionary <- read_csv(file = "../data/derived/sdac/sdac_disfluencies_data_dictionary.csv") # read data dictionary
```
```{r i-sdac-disfluencies-diciontary-read-run, echo=FALSE, message=FALSE}
sdac_disfluencies_dictionary <- read_csv(file = "resources/09-inference/data/derived/sdac/sdac_disfluencies_data_dictionary.csv")
```
```{r i-sdac-disfluencies-dictionary, echo=FALSE}
sdac_disfluencies_dictionary %>%
kable(booktabs = TRUE,
caption = "Data dictionary for the `sdac_disfluencies` dataset.")
```
For our analysis purposes we will reduce this dataset, as we did for the `dative` dataset, retaining only the variables of interest for the upcoming analyses.
```{r i-sdac-disfluencies-simplify}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
select(speaker_id, filler, count, sex, birth_year, education) # select key variables
```
Let's preview this simplified `sdac_disfluencies` dataset.
```{r i-sdac-disfluencies-preview, echo=FALSE}
sdac_disfluencies %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of simplified `sdac_disfluencies` dataset.")
```
Now the `sdac_disfluencies` dataset needs some extra transformation to better prepare it for statistical interrogation. First, the variables `birth_year` and `education` are not maximally informative: it would be more useful if `birth_year` reflected the age of the speaker at the time of the conversation(s), and the coded values of `education` are not explicit about what the numeric values refer to.
The second issue has to do with preparing the `sdac_disfluencies` dataset for statistical analysis. This involves converting our columns to the correct vector types for statistical methods. Specifically, we need to convert our categorical variables to the R type 'factor' (fct). This includes our current variables which are character vectors, but also `speaker_id` and `education`, which appear numeric but do not reflect continuous variables: one is merely a code which uniquely labels each speaker and the other is an ordinal list of educational levels.
This will be a three-step process: first we will normalize `birth_year` to reflect the age of the speaker, second we will convert all the relevant categorical variables to factors, and third we will convert the `education` variable to a factor, adding meaningful labels for the levels of this factor.
Consulting the [online manual for this corpus](https://catalog.ldc.upenn.edu/docs/LDC97S62/swb1_manual.txt), we see that the recordings of these conversations took place in 1992, so we can simply subtract the `birth_year` from 1992 to get each participant's age. We'll name this new column `age` and drop the `birth_year` column.
```{r i-sdac-disfluencies-age-transform}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
mutate(age = (1992 - birth_year)) %>% # calculate age
select(-birth_year) # drop `birth_year` column
```
Now let's convert all the variables which are character vectors. We can do this using the `factor()` function; first on `speaker_id` and then, with the help of `mutate_if()`, on all the other variables which are character vectors.
```{r i-sdac-disfluencies-to-factors}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
mutate(speaker_id = factor(speaker_id)) %>% # convert numeric to factor
mutate_if(is.character, factor) # convert all character to factor
```
We know from the data dictionary that the `education` column contains five values (0, 1, 2, 3, and 9). Again, consulting the corpus manual we can see what these values mean.
```
EDUCATION COUNT
--------------------
0 14 less than high school
1 39 less than college
2 309 college
3 176 more than college
9 4 unknown
```
So let's convert `education` to a factor, adding these descriptions as factor level labels. The function `factor()` can take an argument `labels = ` to which we can manually assign the label names for the factor levels, in the order of the levels. Since the original values were numeric, the factor level ordering defaults to ascending order.
```{r i-sdac-disfluencies-education-transform}
sdac_disfluencies <-
sdac_disfluencies %>% # dataset
mutate(education = factor(education,
labels = c("less than high school", # value 0
"less than college", # value 1
"college", # value 2
"more than college", # value 3
"unknown"))) # value 9
```
So let's take a look at the `sdac_disfluencies` dataset we've prepared for analysis.
```{r i-sdac-disfluencies-transformed-preview}
glimpse(sdac_disfluencies)
```
Now the datasets `dative` and `sdac_disfluencies` are ready to be statistically interrogated.
## Univariate analysis
In what follows I will provide a description of inferential data analysis when only one variable is to be interrogated. This is known as a univariate analysis, or one-variable analysis. We will consider one case in which the variable is categorical and another in which it is continuous.
### Categorical
As an example of a univariate analysis where the variable used in the analysis is categorical, we will look at the `dative` dataset. In this analysis we may be interested in knowing whether the recipient role in a ditransitive construction is realized more often as an 'NP' or a 'PP'.
**Descriptive assessment**
The `realization_of_recipient` variable contains the relevant information. Let's take a first look using the skimr package.
```{r i-uni-cat-description-dative}
dative %>% # dataset
select(realization_of_recipient) %>% # select the variable
skim() %>% # get data summary
yank("factor") # only show factor-oriented information
```
The output from `skim()` produces various pieces of information that can be helpful. We get diagnostics that tell us if there are missing cases (`NA` values), what the proportion of complete cases is, whether the factor is ordered, how many distinct levels the factor has, as well as the level counts.
Looking at the `top_counts` we can see that of the 3,263 observations, in 2,414 the dative is expressed as an 'NP' and in 849 as a 'PP'. Numerically we can see that there is a difference between the use of the alternation types. A visualization is often helpful for descriptive purposes in statistical analysis. In this particular case, however, we are considering a single categorical variable with only two levels (values), so a visualization is not likely to be more informative than the numeric values we have already obtained. But for demonstration purposes and to get more familiar with building plots, let's create a visualization.
```{r i-uni-cat-visual-dative}
dative %>% # dataset
ggplot(aes(x = realization_of_recipient)) + # mapping
geom_bar() + # geometry
labs(x = "Realization of recipient", y = "Count") # labels
```
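If we want the numerical difference expressed as proportions rather than raw counts, a quick table can complement the bar plot. This is a minimal sketch; the column name `prop` is just illustrative.
```{r i-uni-cat-proportions-dative, eval=FALSE}
dative %>%                              # dataset
  count(realization_of_recipient) %>%   # counts for each level
  mutate(prop = n / sum(n))             # convert counts to proportions
```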
The question we want to address, however, is whether this numerical difference is in fact a statistical difference.
**Statistical interrogation**
<!-- TODO:
- consider binom.test() vs. chisq.test() for two outcome univariate
-->
To statistically assess the distribution of a categorical variable, we will turn to the Chi-squared test. This test aims to gauge whether the numerical difference between the 'NP' and 'PP' counts observed in the data is greater than what would be expected by chance. Chance, in the case where there are only two possible outcome levels, is a 50/50 split. For our particular data, where there are 3,263 observations, half would be 'NP' and the other half 'PP' --specifically 1,631.5 for each.
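As a quick sanity check on that chance expectation, we can compute the expected count per level directly and compare it to the observed counts. A minimal sketch using base R:
```{r i-uni-cat-expected-check, eval=FALSE}
nrow(dative) / 2                         # expected count per level under a 50/50 split
table(dative$realization_of_recipient)   # observed counts for comparison
```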
To run this test we first will need to create a cross-tabulation of the variable. We will use the `xtabs()` function to create the table.
```{r i-uni-cat-table-dative}
ror_table <-
xtabs(formula = ~ realization_of_recipient, # formula selecting the variable
data = dative) # dataset
ror_table # preview
```
No new information here, but the format (i.e. an object of class 'table') is what matters as the input argument for the `chisq.test()` function we will use to run the test.
```{r i-uni-cat-chi-squared-dative}
c1 <- chisq.test(x = ror_table) # apply the chi-squared test to `ror_table`
c1 # preview the test results
```
The preview of the `c1` object reveals the main information of interest, including the Chi-squared statistic, the degrees of freedom, and the $p$-value (albeit in scientific notation). However, `c1` is an 'htest' object and includes a number of other pieces of information about the test.
```{r i-uni-cat-htest-dative}
names(c1) # preview the names of the information stored in `c1`
```
For our purposes let's simply confirm that the $p$-value is lower than the standard .05 threshold for statistical significance.
```{r i-uni-cat-p-value-confirm}
c1$p.value < .05 # confirm p-value below .05
```
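If we want to report the exact $p$-value without scientific notation, one option is base R's `format()`. A minimal sketch:
```{r i-uni-cat-p-value-format, eval=FALSE}
format(c1$p.value, scientific = FALSE, digits = 3) # display the p-value in fixed notation
```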
Other information can be organized in a more readable format using the broom package's `augment()` function.
```{r i-uni-cat-tidy-results}
c1 %>% # statistical result
augment() # view detailed statistical test information
```
Here we can see the observed and expected counts and the proportions for each level of `realization_of_recipient`. We also get additional information concerning residuals, but we will leave these aside.
**Evaluation**
At this point we may think we are done. We have statistically interrogated the `realization_of_recipient` variable and found that the difference between 'NP' and 'PP' realization in the datives in this dataset is statistically significant. However, we need to evaluate the size ('effect size') and the reliability of the effect ('confidence interval'). The effectsize package provides a function `effectsize()` that can provide us both the effect size and confidence interval.
```{r i-uni-cat-eval-dative}
effects <-
effectsize(c1, type = "phi") # evaluate effect size and generate a confidence interval (phi type given 2x1 contingency table)
effects # preview effect size and confidence interval
```
`effectsize()` recognizes the type of test results in `c1`, calculates the appropriate effect size measure, and generates a confidence interval. Since the effect statistic ('Phi') falls within the 95\% confidence interval, this suggests the results are reliably interpreted (the chances of a Type I (false positive) or Type II (false negative) error are low).
Now, the remaining question is to evaluate whether the significant result here is a strong effect or not. To do this we can pass the effect size measure to the `interpret_r()` function.
```{r i-uni-cat-eval-dative-effect}
interpret_r(effects$phi) # interpret the effect size
```
It turns out we have a strong effect: the realization of the dative alternation heavily favors the 'NP' form in our data. The potential reasons why are not considered in this univariate analysis, but we will return to this question later as we add independent variables to the statistical analysis.
### Continuous
Now let's turn to a case when the variable we aim to interrogate is non-categorical. For this case we will turn to the `sdac_disfluencies` dataset. Specifically we will aim to test whether the use of fillers is normally distributed across speakers.
```{block, type="rmdtip"}
This is an important step when working with numeric dependent variables, as the type of distribution will dictate whether we use parametric or non-parametric tests when we consider the extent to which an independent variable (or variables) can explain the variation of the dependent variable.
```
Since the dataset is currently organized around fillers as the observational unit, I will first transform this dataset to sum the use of fillers for each speaker in the dataset.
```{r i-uni-cont-sdac-transform}
sdac_speaker_fillers <-
sdac_disfluencies %>% # dataset
group_by(speaker_id) %>% # group by each speaker
summarise(sum = sum(count)) %>% # add up all fillers used
ungroup() # remove grouping parameter
```
```{r i-uni-cont-sdac-transform-preview, echo=FALSE}
sdac_speaker_fillers %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of `sdac_speaker_fillers` dataset.")
```
**Descriptive assessment**
Let's perform some descriptive assessment of the variable of interest, `sum`. First let's apply the `skim()` function and retrieve just the relevant numeric descriptors with `yank()`. One twist here, however, is that I've customized the `skim()` function using `skim_with()` to remove the default histogram and add the Interquartile Range (IQR) to the output. This new skim function, `num_skim()`, will take the place of `skim()`.
```{r i-uni-cont-sdac-description}
num_skim <-
skim_with(numeric = sfl(hist = NULL, # remove hist skim
iqr = IQR)) # add IQR to skim
sdac_speaker_fillers %>% # dataset
select(sum) %>% # variable of interest
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
We see here that the mean use of fillers is 87.1 across speakers. However, the standard deviation and IQR are large relative to this mean, which indicates that the dispersion is quite large; in other words, there are large differences between speakers. Furthermore, since the median (p50) is smaller than the mean, the distribution is right skewed.
Let's look at a couple of visualizations of this distribution to appreciate these descriptives. A histogram will provide us a view of the distribution using the counts of the values of `sum`, and a density plot will provide a smooth curve which represents the scaled distribution of the observed data.
<!-- TODO:
- consider generating a normal distribution to overlay on the density plot
https://homepage.divms.uiowa.edu/~luke/classes/STAT4580/histdens.html
- consider log-transformed distribution?
- # View `sum` as order of magnitudes
-->
```{r i-uni-cont-sdac-visual}
p1 <-
sdac_speaker_fillers %>% # dataset
ggplot(aes(sum)) + # mapping
geom_histogram() + # geometry
labs(x = "Fillers", y = "Count")
p2 <-
sdac_speaker_fillers %>% # dataset
ggplot(aes(sum)) + # mapping
geom_density() + # geometry
geom_rug() + # visualize individual observations
labs(x = "Fillers", y = "Density")
p1 + p2 + plot_annotation("Filler count distributions.")
```
From these plots we can see that our initial intuitions about the distribution of `sum` are correct. There is large dispersion between speakers and the data distribution is right skewed.
```{block, type="rmdtip"}
Note that I've used the patchwork package for organizing the display of plots and including a plot annotation label.
```
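Because the counts span a wide range, it can also help to view the distribution on a log scale, which spreads out the long right tail. This is an optional sketch, not part of the main analysis; `scale_x_log10()` is a ggplot2 scale.
```{r i-uni-cont-sdac-visual-log, eval=FALSE}
sdac_speaker_fillers %>%                         # dataset
  ggplot(aes(sum)) +                             # mapping
  geom_histogram() +                             # geometry
  scale_x_log10() +                              # log-scale the x axis
  labs(x = "Fillers (log scale)", y = "Count")   # labels
```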
Since our aim is to test for normality, we can generate a Quantile-Quantile plot (QQ plot).
```{r i-uni-cont-sdac-qq-plot}
sdac_speaker_fillers %>% # dataset
ggplot(aes(sample = sum)) + # mapping
stat_qq() + # calculate expected quantile-quantile distribution
stat_qq_line() # plot the qq-line
```
Since many points do not fall on the expected normal distribution line we have even more evidence to support the notion that the distribution of `sum` is non-normal.
**Statistical interrogation**
Although the descriptives and visualizations strongly suggest that we do not have normally distributed data, let's run a formal normality test. For this we turn to the `shapiro.test()` function, which performs the Shapiro-Wilk test of normality. We pass the `sum` variable to this function to run the test.
```{r i-uni-cont-sdac-test}
s1 <- shapiro.test(sdac_speaker_fillers$sum) # apply the normality test to `sum`
s1 # preview the test results
```
As we saw with the results from the `chisq.test()` function, the `shapiro.test()` function produces an object with information about the test including the $p$-value. Let's run our logical test to see if the test is statistically significant.
```{r i-uni-cont-sdac-test-confirm}
s1$p.value < .05 # confirm p-value below .05
```
**Evaluation**
The results from the Shapiro-Wilk Normality Test tell us that the distribution of `sum` is statistically found to differ from the normal distribution. So in this case, statistical significance suggests that `sum` cannot be used as a parametric dependent variable. For our aims this is all the evaluation required. Effect size and confidence intervals are not applicable.
It is of note, however, that the expectation that the variable `sum` would conform to the normal distribution was low from the outset, as we are working with count data. Count data, or frequencies, are in a strict sense not continuous, but rather discrete --meaning that they are whole numbers which are never negative. This is a common informational type to encounter in text analysis.
## Bivariate analysis
A more common scenario in statistical analysis is the consideration of the relationship between two variables, known as bivariate analysis.
### Categorical
Let's build on our univariate analysis of `realization_of_recipient` and include an explanatory, or independent, variable to test whether it can explain our earlier finding that 'NP' datives are more common than 'PP' datives. The question to test, then, is whether modality explains the distribution of `realization_of_recipient`.
**Descriptive assessment**
Both the `realization_of_recipient` and `modality` variables are categorical, specifically nominal as we can see by using `skim()`.
```{r i-bi-cat-description}
dative %>%
select(realization_of_recipient, modality) %>% # select key variables
skim() %>% # get custom data summary
yank("factor") # only show factor-oriented information
```
For this reason measures of central tendency are not applicable and we will turn to a contingency table to summarize the relationship. The janitor package provides a set of functions for building and formatting contingency tables, the primary one being `tabyl()`. The other functions used here adorn the contingency table with totals and percentages and format the output for readability.
```{r i-bi-cat-contingency-table}
dative %>%
tabyl(realization_of_recipient, modality) %>% # cross-tabulate
adorn_totals(c("row", "col")) %>% # provide row and column totals
adorn_percentages("col") %>% # add percentages to the columns
adorn_pct_formatting(rounding = "half up", digits = 0) %>% # round the digits
adorn_ns() %>% # add observation number
adorn_title("combined") %>% # add a header title
kable(booktabs = TRUE, # pretty table
caption = "Contingency table for `realization_of_recipient` and `modality`.") # caption
```
To gain a better appreciation for this relationship let's generate a couple of plots: one which shows cross-tabulated counts and the other calculated proportions.
```{r i-bi-cat-visual}
p1 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, fill = modality)) + # mappings
geom_bar() + # geometry
labs(y = "Count", x = "Realization of recipient") # labels
p2 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, fill = modality)) + # mappings
geom_bar(position = "fill") + # geometry, with fill for proportion plot
labs(y = "Proportion", x = "Realization of recipient", fill = "Modality") # labels
p1 <- p1 + theme(legend.position = "none") # remove legend from left plot
p1 + p2 + plot_annotation("Relationship between Realization of recipient and Modality.")
```
Looking at the count plot (in the left pane) we see that the large difference between the realization of the dative as an 'NP' or 'PP' obscures our ability to see to what degree modality is related to the realization of the dative. The proportion plot (in the right pane) therefore standardizes each level of `realization_of_recipient` to provide a more comparable view. From the proportion plot we see that there appears to be a trend towards relatively more use of 'PP' in the written modality.
**Statistical interrogation**
Although the proportion plot is visually helpful, we use the raw counts to statistically analyze this relationship. Again, as we are working with categorical variables, now for a dependent and independent variable, we use the Chi-squared test. And as before we need to create the cross-tabulation table to pass to the `chisq.test()` to perform the test.
```{r i-bi-cat-test}
ror_mod_table <-
xtabs(formula = ~ realization_of_recipient + modality, # formula
data = dative) # dataset
c2 <- chisq.test(ror_mod_table) # apply the chi-squared test to `ror_mod_table`
c2 # preview the test results
c2$p.value < .05 # confirm p-value below .05
```
We can preview the result and provide a confirmation of the $p$-value. This evidence suggests that there is a difference between the distribution of dative realization according to modality.
We can also see more details about the test.
```{r i-bi-cat-test-information}
c2 %>% # statistical result
augment() # view detailed statistical test information
```
**Evaluation**
Now we want to calculate the effect size and the confidence interval to provide measures of assurance that our finding is robust.
```{r i-bi-cat-effect-ci}
effects <- effectsize(c2) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Cramers_v) # interpret the effect size
```
We get effect size and confidence interval information. Note that the effect size, reflected by Cramer's V, for this relationship is weak. This points out an important aspect of the evaluation of statistical tests: the fact that a test is significant does not mean that it is meaningful. A small effect size suggests that we should be cautious about the extent to which this significant finding is robust in the population from which the data is sampled.
### Continuous
For a bivariate analysis in which the dependent variable is not categorical, we will turn to the `sdac_disfluencies` dataset. The question we will pose to test is whether the use of fillers is related to the type of filler ('uh' or 'um').
**Descriptive assessment**
The key variables to assess in this case are `count` and `filler`. But before we start to explore this relationship we will need to transform the dataset such that each speaker's use of the levels of `filler` is summed. We will use `group_by()` to group `speaker_id` and `filler` combinations and then use `summarize()` to sum the counts for each filler type for each speaker.
```{r i-bi-cont-sdac-fillers}
sdac_fillers <-
sdac_disfluencies %>% # dataset
group_by(speaker_id, filler) %>% # grouping parameters
summarize(sum = sum(count)) %>% # summed counts for each speaker-filler combination
ungroup() # remove the grouping parameters
```
Let's preview this transformation.
```{r i-bi-cont-sdac-fillers-preview, echo=FALSE}
sdac_fillers %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations from `sdac_fillers` dataset.")
```
Let's take a look at the two filler types together by grouping the dataset by `filler` and then using the custom skim function `num_skim()` on the numeric variable `sum`.
```{r i-bi-cont-description}
sdac_fillers %>% # dataset
group_by(filler) %>% # grouping parameter
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
We see here that the standard deviation and IQR for both 'uh' and 'um' are large relative to the respective means (71.4 and 15.7), suggesting the distributions are quite dispersed. Let's take a look at a boxplot to visualize the counts in `sum` for each level of `filler`.
```{r i-bi-cont-visual}
p1 <-
sdac_fillers %>% # dataset
ggplot(aes(x = filler, y = sum)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
labs(x = "Filler", y = "Counts") # labels
p2 <-
sdac_fillers %>% # dataset
ggplot(aes(x = filler, y = sum)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
ylim(0, 100) + # scale the y axis to trim outliers
labs(x = "Filler", y = "") # labels
p1 + p2
```
In the plot in the left pane we see a couple of things. First, it appears that there is in fact quite a bit of dispersion, as there are quite a few outliers (dots) above the lines extending from the boxes. Recall that the boxes represent the first and third quartiles (that is, the IQR) and that the notches represent the confidence interval around the median. Second, when we compare the boxes and their notches we see that there is little overlap (looking horizontally). In the right pane I've zoomed in a bit, trimming some outliers to get a better view of the relationship between the boxes. Since the overlap is minimal and in particular the notches do not overlap at all, this is a good indication that there is a significant trend.
From the descriptive statistics and the visual summary it appears that the filler 'uh' is more common than 'um'. It's now time to submit this to statistical interrogation.
**Statistical interrogation**
<!-- TODO:
- Investigate How to choose a family
- https://www.researchgate.net/post/How-to-determine-which-family-function-to-use-when-fitting-generalized-linear-model-glm-in-R
-->
In a bivariate (and multivariate) analysis where the dependent variable is non-categorical we apply Linear Regression Modeling (LM). The default assumption of linear models, however, is that the dependent variable is normally distributed. As we have seen our variable `sum` does not conform to the normal distribution. We know this because of our tests in the univariate case, but as mentioned at the end of that section, we are working with count data which by nature is understood as discrete and not continuous in a strict technical sense. So instead of using the linear model for our regression analysis we will use the Generalized Linear Model (GLM) [@Baayen2008a; @Gries2013a].
The function `glm()` implements generalized linear models. In addition to the formula (`sum ~ filler`) and the dataset to use, we also include an appropriate distribution family for the dependent variable. For count and frequency data the appropriate family is the "Poisson" distribution.
```{r i-bi-cont-test}
m1 <-
glm(formula = sum ~ filler, # formula
data = sdac_fillers, # dataset
family = "poisson") # distribution family
summary(m1) # preview the test results
```
Let's focus on the coefficients, specifically the 'fillerum' line. Since our factor `filler` has two levels, one level is used as the reference to contrast with the other level; by default the first level serves as the reference. Therefore the coefficient we see for 'fillerum' expresses 'um' in contrast to 'uh'. Without digging into the details of the other parameter statistics, let's focus on the last column, which contains the $p$-value. A convenient aspect of the `summary()` function when applied to regression model results is that it provides statistical significance codes. In this case we can see that the contrast between 'uh' and 'um' is significant at $p < .001$, which of course is lower than our standard threshold of $.05$.
Therefore we can say with some confidence that the filler 'uh' is more frequent than 'um'.
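Because the interpretation hinges on which level serves as the reference and on the log link used by the Poisson family, two quick checks can be helpful: confirming the factor level ordering and exponentiating the coefficients to express them as rate ratios. A minimal sketch using base R:
```{r i-bi-cont-reference-check, eval=FALSE}
levels(sdac_fillers$filler)   # the first level ('uh') serves as the reference
exp(coef(m1))                 # exponentiate log-scale coefficients to obtain rate ratios
```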
**Evaluation**
Given we have found a significant effect for `filler`, let's look at evaluating the effect size and the confidence interval. Again, we use the `effectsize()` function. We then can preview the `effects` object. Note that the effect size of interest is in the second row of the standardized coefficient column (`Std_Coefficient`), so we subset this column to extract only the effect coefficient for the `filler` contrast.
```{r i-bi-cont-eval-test}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[2]) # interpret the effect size
```
The coefficient statistic falls within the confidence interval and the effect size is strong so we can be confident that our findings are reliable given this data.
## Multivariate analysis
The last case to consider is when we have more than one independent variable whose potential relationship to the dependent variable we want to assess. Again we will consider both a categorical and a non-categorical dependent variable, but in this case the implementation methods are quite similar, as we will see.
### Categorical
For the categorical multivariate case we will again consider the `dative` dataset and build on the previous analyses. The question to be posed is whether modality in combination with the length of the recipient (`length_of_recipient`) together explain the distribution of the realization of the recipient (`realization_of_recipient`).
**Descriptive assessment**
Now that we have three variables, there is more to summarize to get our descriptive information. Luckily, however, the same process can be applied to three (or more) variables using the `group_by()` function and then passed to `skim()`. In this case we have two categorical variables and one numeric variable. So we will group by both the categorical variables and then pass the numeric variable to the custom `num_skim()` function --pulling out only the relevant descriptive information for numeric variables with `yank()`.
```{r i-multi-cat-description}
dative %>% # dataset
select(realization_of_recipient, modality, length_of_recipient) %>% # select key variables
group_by(realization_of_recipient, modality) %>% # grouping parameters
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
There is much more information now that we are considering multiple independent variables, but if we look over the measures of dispersion we can see that the medians and the IQRs are relatively similar to their respective means, suggesting that there are fewer outliers and relatively little skew.
Let's take a look at a visualization of this information. Since we are working with a categorical dependent variable and there is one non-categorical variable we can use a boxplot. The addition here is to include a `color` mapping which will provide a distinct box for each level of modality ('written' and 'spoken').
```{r i-multi-cat-visual}
p1 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, y = length_of_recipient, color = modality)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
labs(x = "Realization of recipient", y = "Length of recipient (in words)", color = "Modality") # labels
p2 <-
dative %>% # dataset
ggplot(aes(x = realization_of_recipient, y = length_of_recipient, color = modality)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
ylim(0, 15) + # scale the y axis to trim outliers
labs(x = "Realization of recipient", y = "", color = "Modality") # labels
p1 <- p1 + theme(legend.position = "none") # remove the legend from the left pane plot
p1 + p2
```
In the left pane we see the entire visualization including all outliers. From this view it appears that there is a potential trend that the length of the recipient is larger when the realization of the recipient is 'PP'. There is also a potential trend for modality, with written language showing longer recipient lengths overall. The pane on the right is scaled to get a better view of the boxes by scaling the y-axis down and, as such, trimming the outliers. This plot shows more clearly that the length of the recipient is longer when the recipient is realized as a 'PP'. Again, the contrast in modality is also a potential trend, but the boxes (of the same color), particularly for the spoken modality, overlap to some degree.
So we have some trends in mind which will help us interpret the statistical interrogation so let's move there next.
**Statistical interrogation**
Once we involve more than two variables, the choice of statistical method turns towards regression. In the case that the dependent variable is categorical, however, we will use Logistic Regression. The workhorse function `glm()` can be used for a series of regression models, including logistic regression. The requirement, however, is that we specify the family of the distribution; for logistic regression the family is "binomial". The formula includes the dependent variable as a function of our other two variables, separated by the `+` operator.
```{r i-multi-cat-test}
m1 <- glm(formula = realization_of_recipient ~ modality + length_of_recipient, # formula
data = dative, # dataset
family = "binomial") # distribution family
summary(m1) # preview the test results
```
The results from the model again provide a wealth of information, but the key information to focus on is the coefficients, in particular the coefficients for the independent variables `modality` and `length_of_recipient`. What we notice is that the $p$-value for `length_of_recipient` is significant, but the contrast between 'written' and 'spoken' for `modality` is not. If you recall, we used this same dataset to explore `modality` as a single independent variable earlier --and it was found to be significant. So why is it not significant now? The answer is that when multiple variables are used to explain the distribution of a measure (the dependent variable), each variable adds its own contribution of information. Since `length_of_recipient` is significant, this suggests that the explanatory power of `modality` is weak, especially when compared to `length_of_recipient`. This makes sense: we saw in the earlier analysis that the effect size for `modality` was not strong, and that is even more evident now that `length_of_recipient` is included in the model.
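One way to appreciate how little `modality` contributes once `length_of_recipient` is in the model is to compare nested models. This is a sketch, not part of the original analysis; `m0` is a hypothetical reduced model fit without `modality`.
```{r i-multi-cat-compare, eval=FALSE}
m0 <- glm(formula = realization_of_recipient ~ length_of_recipient, # reduced model without `modality`
          data = dative,
          family = "binomial")
anova(m0, m1, test = "Chisq") # likelihood-ratio style comparison of the nested models
```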
**Evaluation**
Now let's move on and gauge the effect size and calculate the confidence interval for `length_of_recipient` in our model. We apply the `effectsize()` function to the model and then use `interpret_r()` on the coefficient of interest (which is in the third row of the `Std_Coefficient` column).
```{r i-multi-cat-effects}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[3]) # interpret the effect size for `length_of_recipient` (third coefficient row)
```
We see we have a coefficient that falls within the confidence interval and the effect size is large. So we can say with some confidence that the length of the recipient is a significant predictor of the use of 'PP' as the realization of the recipient in the dative alternation.
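To make the direction of this effect concrete, the fitted model could be used to generate predicted probabilities of a 'PP' realization at a few recipient lengths. This is a sketch under the assumption that 'PP' is the second factor level (so `type = "response"` returns its probability); the lengths 1, 3, and 5 are arbitrary illustrations.
```{r i-multi-cat-predict, eval=FALSE}
new_data <- tibble(modality = "spoken",              # hypothetical predictor values
                   length_of_recipient = c(1, 3, 5)) # recipient lengths to illustrate
predict(m1, newdata = new_data, type = "response")   # predicted probability of 'PP'
```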
### Continuous
The last case we will consider here is when the dependent variable is non-categorical and we have more than one independent variable. The question we will pose is whether the type of filler and the sex of the speaker can explain the use of fillers in conversational speech.
We will need to prepare the data before we get started, as our current data frame `sdac_fillers` has the filler and the summed count for each filler grouped by speaker --but it does not include the `sex` of each speaker. The `sdac_disfluencies` data frame does have the `sex` column, but it has not been grouped by speaker. So let's transform the `sdac_disfluencies` data frame, summarizing it to get only the distinct `speaker_id` and `sex` combinations. This should result in a data frame with 441 observations, one observation for each speaker in the corpus.
```{r i-multi-cont-transform-sdac}
sdac_speakers_sex <-
sdac_disfluencies %>% # dataset
distinct(speaker_id, sex) # summarize for distinct `speaker_id` and `sex` values
```
Let's preview the first 10 observations from this transformation.
```{r i-multi-cont-transform-sdac-preview, echo=FALSE}
sdac_speakers_sex %>%
arrange(speaker_id) %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of the `sdac_speakers_sex` data frame.")
```
Great, now we have each `speaker_id` and `sex` for all 441 speakers. One thing to note, however, is that speaker '155' does not have a value for `sex` --this seems to be an error in the metadata that we will need to deal with before we proceed in our analysis.
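To confirm that this is the only speaker affected, we can filter for missing `sex` values, assuming the empty metadata field was read in as `NA`. A quick check:
```{r i-multi-cont-missing-sex, eval=FALSE}
sdac_speakers_sex %>%   # dataset
  filter(is.na(sex))    # show speakers with a missing `sex` value
```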
With that noted, we can join our new `sdac_speakers_sex` data frame with the `sdac_fillers` dataset, effectively adding the column `sex`. We want to keep all the observations in `sdac_fillers` and add the column `sex` for observations that match between the two data frames on the `speaker_id` column, so we will use a left join with the function `left_join()`, with the `sdac_fillers` dataset on the left.
```{r i-multi-cont-join-fillers-sex}
sdac_fillers_sex <-
left_join(sdac_fillers, sdac_speakers_sex) # join
```
Now let's preview the first observations in this new `sdac_fillers_sex` data frame.
```{r i-multi-cont-sdac-fillers-sex-preview, echo=FALSE}
sdac_fillers_sex %>%
arrange(speaker_id) %>%
slice_head(n = 10) %>%
kable(booktabs = TRUE,
caption = "First 10 observations of the `sdac_fillers_sex` data frame.")
```
At this point let's drop this speaker from the `sdac_fillers_sex` data frame.
```{r i-multi-cont-drop-speaker}
sdac_fillers_sex <-
sdac_fillers_sex %>% # dataset
filter(speaker_id != "155") # drop speaker_id 155
```
We are now ready to proceed in our analysis.
**Descriptive assessment**
The process by now should be quite routine for getting our descriptive statistics: select the key variables, group by the categorical variables, and finally pull the descriptives for the numeric variable.
```{r i-multi-cont-description}
sdac_fillers_sex %>% # dataset
select(sum, filler, sex) %>% # select key variables
group_by(filler, sex) %>% # grouping parameters
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
Looking at these descriptives, it seems like there is quite a bit of variability for some combinations and not others. In short, it's a mixed bag. Let's try to make sense of these numbers with a boxplot.
```{r i-mulit-cont-visual}
p1 <-
sdac_fillers_sex %>% # dataset
ggplot(aes(x = filler, y = sum, color = sex)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
labs(x = "Filler", y = "Counts", color = "Sex") # labels
p2 <-
sdac_fillers_sex %>% # dataset
ggplot(aes(x = filler, y = sum, color = sex)) + # mappings
geom_boxplot(notch = TRUE) + # geometry
ylim(0, 200) + # scale the y axis to trim outliers
labs(x = "Filler", y = "", color = "Sex") # labels
p1 <- p1 + theme(legend.position = "none") # drop the legend from the left pane plot
p1 + p2
```
We can see that 'uh' is used more than 'um' overall, but whereas men and women use 'uh' in similar ways, women use more 'um' than men. This is known as an interaction, so we will approach our statistical analysis with this in mind.
**Statistical interrogation**
We will again use a generalized linear model with the `glm()` function to conduct our test. The distribution family will be the same as before, since we are again using `sum`, which contains discrete count values, as our dependent variable. The formula we will use, however, is new. Instead of adding a new variable to our independent variables, we will test the possible interaction between `filler` and `sex` that we noted in the descriptive assessment. To encode an interaction the `*` operator is used, so our formula will take the form `sum ~ filler * sex`. Let's generate the model and view the summary of the test results as we have done before.
```{r i-multi-cont-test}
m1 <-
glm(formula = sum ~ filler * sex, # formula
data = sdac_fillers_sex, # dataset
family = "poisson") # distribution family
summary(m1) # preview the test results
```
Again looking at the coefficients, we see something new. There is a row for the `filler` contrast and the `sex` contrast, but also one for the interaction between `filler` and `sex` ('fillerum:sexMALE'). All rows show significant effects. It is important to note that when an interaction is explored and found to be significant, the simple effects, known as main effects ('fillerum' and 'sexMALE'), are not interpreted on their own; the interpretation focuses on the higher-order (interaction) effect.
Now, what does the 'fillerum:sexMALE' row mean? It means that there is an interaction between `filler` and `sex`. The directionality of that interaction should be interpreted using our descriptive assessment, in particular the boxplots we generated. In sum, women use more 'um' than men, or stated another way, men use 'um' less than women.
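A quick numeric check of this directionality can complement the boxplots: computing the mean filler count for each `filler` and `sex` combination. A minimal sketch; `mean_sum` is just an illustrative column name.
```{r i-multi-cont-interaction-means, eval=FALSE}
sdac_fillers_sex %>%                    # dataset
  group_by(filler, sex) %>%             # grouping parameters
  summarize(mean_sum = mean(sum)) %>%   # mean filler count for each filler-sex combination
  ungroup()                             # remove the grouping parameters
```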
**Evaluation**
We finalize our analysis by looking at the effect size and confidence intervals.
```{r i-multi-cont-effects}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[4]) # interpret the effect size
```
We can conclude, then, that there is a strong interaction effect for `filler` and `sex` and that women use more 'um' than men.
## Summary
In this chapter we have discussed various approaches to conducting inferential data analysis. Each configuration, however, always includes a descriptive assessment, statistical interrogation, and an evaluation of the results. We considered univariate, bivariate, and multivariate analyses using both categorical and non-categorical dependent variables to explore the similarities and differences between these approaches.
<!-- IDEAS:
- Use Switchboard Dialogue Act Corpus
- Disfluencies (uh/ um)
- languageR::dative
-
-------
- Use BELC dataset from "Approaching analysis" chapter
- Question(s):
-
- Santa Barbara Corpus of Spoken American English
- analyzr::sbc
- Love on the Spectrum/ Love is Blind
- Word frequency differences (from SUBTLEXus) as the dependent variable
- Use other datasets from the "Transform datasets" chapter
RESOURCES:
- Example for Fillers: https://github.com/WFU-TLC/analyzr/blob/master/vignettes/analysis-an-inference-example.Rmd
-
-->