# Multiple Regression {#multiple-regression}
```{r, include=FALSE, purl=FALSE}
chap <- 6
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = "\\textwidth",
message = FALSE,
warning = FALSE
)
options(scipen = 99, digits = 3)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC
# solutions, which are written like markdown text. In theory, it shouldn't be
# necessary for knitr versions <=1.11.6, but I've found I still need to for
# everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL"
# (to show all solutions), or subsets of c('5-1', '5-2','5-3', '5-4'), including
# the null vector c("") to show no solutions.
solutions_shown <- c("")
show_solutions <- function(section){
return(solutions_shown == "ALL" | section %in% solutions_shown)
}
```
Now that we are equipped with data visualization skills from Chapter \@ref(viz), data wrangling skills from Chapter \@ref(wrangling), and an understanding of the "tidy" data format from Chapter \@ref(tidy), we proceed with data modeling. The fundamental premise of data modeling is *to model the relationship* between:
* An outcome variable $y$, also called a dependent variable and
* An explanatory/predictor variable $x$, also called an independent variable or covariate.
Another way to state this is using mathematical terminology: we will model the outcome variable $y$ *as a function* of the explanatory/predictor variable $x$. Why do we have two different labels, explanatory and predictor, for the variable $x$? That's because roughly speaking data modeling can be used for two purposes:
1. **Modeling for prediction**: You want to predict an outcome variable $y$ based on the information contained in a set of predictor variables. You don't care so much about understanding how all the variables relate and interact, but so long as you can make good predictions about $y$, you're fine. For example, if we know many individuals' risk factors for lung cancer, such as smoking habits and age, can we predict whether or not they will develop lung cancer? Here we wouldn't care so much about distinguishing the degree to which the different risk factors contribute to lung cancer, but instead only on whether or not they could be put together to make reliable predictions.
1. **Modeling for explanation**: You want to explicitly describe the relationship between an outcome variable $y$ and a set of explanatory variables, determine the significance of any found relationships, and have measures summarizing these. Continuing our example from above, we would now be interested in describing the individual effects of the different risk factors and quantifying the magnitude of these effects. One reason could be to design an intervention to reduce lung cancer cases in a population, such as targeting smokers of a specific age group with advertisement for smoking cessation programs. In this book, we'll focus more on this latter purpose.
Data modeling is used in a wide variety of fields, including statistical inference, causal inference, artificial intelligence, and machine learning. There are many techniques for data modeling, such as tree-based models, neural networks/deep learning, and more. However, we'll focus on one particular technique: *linear regression*, one of the most commonly used and easiest-to-understand approaches to modeling. Recall our discussion in Chapter \@ref(explore-dataframes) on numerical and categorical variables. Linear regression involves:
* An outcome variable $y$ that is *numerical*
* Explanatory variables $\vec{x}$ that are either *numerical* or *categorical*
Whereas there is always only one numerical outcome variable $y$, we have choices on both the number and the type of explanatory variables $\vec{x}$ to use. We're going to cover the following regression scenarios, where we have:
1. A single numerical explanatory variable $x$ in Section \@ref(model1). This scenario is known as *simple linear regression*.
1. A single categorical explanatory variable $x$ in Section \@ref(model2).
1. Two numerical explanatory variables $x_1$ and $x_2$ in Section \@ref(model3). This is the first of two *multiple regression* scenarios we'll cover, where there is now more than one explanatory variable.
1. One numerical and one categorical explanatory variable in Section \@ref(model4). This is the second *multiple regression* scenario we'll cover. We'll also introduce *interaction models* here.
We'll study all 4 of these regression scenarios using real data, all easily accessible via R packages! <!--We will also discuss the concept of *correlation* and how it is frequently incorrectly used to imply *causation*.-->
### Needed packages {-}
In this chapter we introduce a new package, `moderndive`, an accompaniment package to the ModernDive book that includes useful functions for linear regression as well as other functions and data used later in the book. However, because this is a relatively new package, we have to install it in a different way than was covered in Section \@ref(packages).
First, install the `remotes` package like you would normally install a package, then run the following to install the `moderndive` package:
```{r, eval=FALSE}
library(remotes)
remotes::install_github("moderndive/moderndive")
```
Let's now load all the packages needed for this chapter. If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(moderndive)
```
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(mvtnorm)
library(tidyr)
library(forcats)
library(gridExtra)
```
## Two numerical x {#model3}
Much like our teacher evaluation data in Section \@ref(model1), let's now attempt to identify factors that are associated with how much credit card debt an individual will hold. The textbook [An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani is an intermediate-level textbook on statistical and machine learning methods, freely available [here](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf). It has an accompanying R package called `ISLR` with datasets that the authors use to demonstrate various machine learning methods.
One dataset that is frequently used by the authors is the `Credit` dataset where predictions are made on the balance held by various credit card holders based on information like income, credit limit, and education level. Let's consider the following variables:
* Outcome variable $y$: Credit card balance in dollars
* Explanatory variables $x$:
+ income in $1000's
+ credit card limit in dollars
+ credit rating
+ card holder's age
+ years of education
Let's load the data and `select()` only the 6 variables listed above:
```{r, warning=FALSE, message=FALSE}
library(ISLR)
Credit <- Credit %>%
select(Balance, Limit, Income, Rating, Age, Education)
```
Unfortunately, no information on how this data was sampled was provided, so we're not able to generalize any results to a greater population, but we will still use this data to demonstrate *multiple regression*. Multiple regression is a form of linear regression where there is more than one explanatory variable, and thus the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables. We'll perform multiple regression with:
1. A numerical outcome variable $y$, in this case credit card `Balance`
1. Two explanatory variables:
1. A first numerical explanatory variable $x_1$, in this case credit `Limit`
1. A second numerical explanatory variable $x_2$, in this case `Income`
### Exploratory data analysis {#model3EDA}
Let's look at the raw data values, both by bringing up RStudio's spreadsheet viewer (although in Table \@ref(tab:model3-data-preview) we only show 5 randomly selected credit card holders out of `r nrow(Credit)`) and by using the `glimpse()` function:
```{r, eval=FALSE}
View(Credit)
```
```{r model3-data-preview, echo=FALSE}
Credit %>%
sample_n(5) %>%
knitr::kable(
digits = 3,
caption = "Random sample of 5 credit card holders",
booktabs = TRUE
)
```
```{r}
glimpse(Credit)
```
Let's look at some summary statistics:
```{r}
summary(Credit)
```
We observe, for example, that:
1. The typical credit card balance is around $500.
1. The typical credit card limit is around $5,000.
1. The typical income is around $40,000.
Since our outcome variable `Balance` and the explanatory variables `Limit` and `Income` are numerical, we can compute the correlation coefficient between pairs of these variables. However, instead of running the `cor()` command twice, once for each explanatory variable:
```{r, eval=FALSE}
cor(Credit$Balance, Credit$Limit)
cor(Credit$Balance, Credit$Income)
```
We can simultaneously compute them by returning a *correlation matrix* in
Table \@ref(tab:model3-correlation). We read off the correlation coefficient
for any pair of variables by looking them up in the appropriate row/column combination.
```{r, eval=FALSE}
Credit %>%
select(Balance, Limit, Income) %>%
cor()
```
```{r model3-correlation, echo=FALSE}
Credit %>%
select(Balance, Limit, Income) %>%
cor() %>%
knitr::kable(
digits = 3,
caption = "Correlations between credit card balance, credit limit, and credit rating",
booktabs = TRUE
)
```
For example, the correlation coefficients of:
1. `Balance` with itself is 1
1. `Balance` with `Limit` is 0.862, indicating a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card balances.
1. `Balance` with `Income` is 0.464, suggestive of another positive linear relationship, although not as strong as the relationship between `Balance` and `Limit`.
1. As an added bonus, we can read off the correlation coefficient of the two explanatory variables, `Limit` and `Income`, of 0.792. In this case, we say there is a high degree of *collinearity* between these two explanatory variables; a quick way to visualize this is sketched right after this list.
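One quick way to see this collinearity is to plot the two explanatory variables against each other. This is a minimal sketch (not a figure from the original text), reusing the `Credit` data loaded above:
```{r, eval=FALSE}
# Scatterplot of the two explanatory variables against each other; the strong
# upward trend reflects their correlation of 0.792.
ggplot(Credit, aes(x = Income, y = Limit)) +
  geom_point() +
  labs(x = "Income (in $1000)", y = "Credit limit (in $)",
       title = "Collinearity between income and credit limit")
```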
Let's visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots:
```{r, eval=FALSE}
ggplot(Credit, aes(x = Limit, y = Balance)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card balance (in $)",
title = "Relationship between balance and credit limit") +
geom_smooth(method = "lm", se = FALSE)
ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
title = "Relationship between balance and income") +
geom_smooth(method = "lm", se = FALSE)
```
```{r 2numxplot1, echo=FALSE, fig.height=4, fig.cap="Relationship between credit card balance and credit limit/income"}
model3_balance_vs_limit_plot <- ggplot(Credit, aes(x = Limit, y = Balance)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card balance (in $)",
title = "Balance vs credit limit") +
geom_smooth(method = "lm", se = FALSE)
model3_balance_vs_income_plot <- ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
title = "Balance vs income") +
geom_smooth(method = "lm", se = FALSE)
grid.arrange(model3_balance_vs_limit_plot, model3_balance_vs_income_plot, nrow = 1)
```
In both cases, there seems to be a positive relationship. However the two plots in Figure \@ref(fig:2numxplot1) only focus on the relationship of the outcome variable with each of the explanatory variables independently. To get a sense of the *joint* relationship of all three variables simultaneously through a visualization, let's display the data in a 3-dimensional (3D) scatterplot, where
1. The numerical outcome variable $y$ `Balance` is on the z-axis (vertical one)
1. The two numerical explanatory variables form the x and y axes (on the floor). In this case
1. The first numerical explanatory variable $x_1$ `Income` is on the x-axis
1. The second numerical explanatory variable $x_2$ `Limit` is on the y-axis
Click on the following image to open an interactive 3D scatterplot in your browser:
<center><a target="_blank" href="http://rpubs.com/moderndive/credit_card_balance_3D_scatterplot"><img src="images/credit_card_balance_3D_scatterplot.png" title="3D scatterplot" width="600"/></a></center>
```{r, eval=FALSE, echo=FALSE}
# Save as 798 x 562 images/credit_card_balance_3D_scatterplot.png
library(ISLR)
library(plotly)
plot_ly(showscale=FALSE) %>%
add_markers(
x = Credit$Income,
y = Credit$Limit,
z = Credit$Balance,
hoverinfo = 'text',
text = ~paste("x1 - Income: ", Credit$Income,
"</br> x2 - Limit: ", Credit$Limit,
"</br> y - Balance: ", Credit$Balance)
) %>%
layout(
scene = list(
xaxis = list(title = "x1 - Income (in $10K)"),
yaxis = list(title = "x2 - Limit ($)"),
zaxis = list(title = "y - Balance ($)")
)
)
```
Previously in Figure \@ref(fig:numxplot4), we plotted a "best-fitting" regression line through a set of points where the numerical outcome variable $y$ was teaching `score` and the single numerical explanatory variable $x$ was `bty_avg`. What is the analogous concept when we have *two* numerical predictor variables? Instead of a best-fitting line, we now have a best-fitting *plane*, the 3D generalization of the 2D line. Click on the following image to open an interactive plot of the regression plane in your browser.
<center><a target="_blank" href="https://beta.rstudioconnect.com/connect/#/apps/3214/"><img src="images/credit_card_balance_regression_plane.png" title="Regression plane" width="600"/></a></center>
```{r, eval=FALSE, echo=FALSE}
# Save as 798 x 562 images/credit_card_balance_regression_plane.png
library(ISLR)
library(plotly)
library(tidyverse)
# setup hideous grid required by plotly
model_lm <- lm(Balance ~ Income + Limit, data=Credit)
x_grid <- seq(from=min(Credit$Income), to=max(Credit$Income), length=100)
y_grid <- seq(from=min(Credit$Limit), to=max(Credit$Limit), length=200)
z_grid <- expand.grid(x_grid, y_grid) %>%
tbl_df() %>%
rename(
x_grid = Var1,
y_grid = Var2
) %>%
mutate(z = coef(model_lm)[1] + coef(model_lm)[2]*x_grid + coef(model_lm)[3]*y_grid) %>%
.[["z"]] %>%
matrix(nrow=length(x_grid)) %>%
t()
# plot points and plane
plot_ly(showscale=FALSE) %>%
add_markers(
x = Credit$Income,
y = Credit$Limit,
z = Credit$Balance,
hoverinfo = 'text',
text = ~paste("x1 - Income: ", Credit$Income, "</br> x2 - Limit: ",
Credit$Limit, "</br> y - Balance: ", Credit$Balance)
) %>%
layout(
scene = list(
xaxis = list(title = "x1 - Income (in $10K)"),
yaxis = list(title = "x2 - Limit ($)"),
zaxis = list(title = "y - Balance ($)")
)
) %>%
add_surface(
x = x_grid,
y = y_grid,
z = z_grid
)
```
### Multiple regression {#model3table}
Just as we did when we had a single numerical explanatory variable $x$ in Section \@ref(model1table) and when we had a single categorical explanatory variable $x$ in Section \@ref(model2table), we fit a regression model and obtain the regression table in our two numerical explanatory variable scenario. To fit a regression model and obtain a table using `get_regression_table()`, we now use a `+` to consider multiple explanatory variables. In this case, since we want to perform a regression on `Limit` and `Income` simultaneously, we input `Balance ~ Limit + Income`:
```{r, eval=FALSE}
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
get_regression_table(Balance_model)
```
```{r, echo=FALSE}
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
Credit_line <- get_regression_table(Balance_model) %>%
pull(estimate)
```
```{r model3-table-output, echo=FALSE}
get_regression_table(Balance_model) %>%
knitr::kable(
digits = 3,
caption = "Multiple regression table",
booktabs = TRUE
)
```
How do we interpret these three values that define the regression plane?
* Intercept: -$385.18. The intercept in our case represents the credit card balance for an individual who has both a credit `Limit` of $0 and `Income` of $0. In our data, however, the intercept has limited practical interpretation, since no individuals had `Limit` or `Income` values of $0 and, furthermore, no balance can be negative (the smallest credit card balance in the data was $0). Rather, the intercept is used to situate the regression plane in 3D space.
* Limit: $0.26. Now that we have multiple variables to consider, we have to add
a caveat to our interpretation: *all other things being equal, for every increase of one unit in credit `Limit` (dollars), there is an associated increase of on average $0.26 in credit card balance*. Note:
+ Just as we did in Section \@ref(model1table), we are not making any causal statements, only statements relating to the association between credit limit and balance
+ The *all other things being equal* is making a statement about all other explanatory variables, in this case only one: `Income`. This is equivalent to saying "holding `Income` constant, we observed an associated increase of $0.26 in credit card balance for every dollar increase in credit limit"
* Income: -$7.66. Similarly, *all other things being equal, for every increase of one unit in `Income` (in other words, $1000 in income), there is an associated decrease of on average $7.66 in credit card balance*. A short sketch after this list shows how these three values combine into fitted values.
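For concreteness, here is a minimal sketch (not part of the original text) of how the fitted plane turns a card holder's `Limit` and `Income` into a predicted `Balance`. The specific values of $5000 for `Limit` and $50,000 for income are made up purely for illustration:
```{r, eval=FALSE}
# Predicted balance for a hypothetical card holder with a $5000 credit limit
# and $50,000 income, using the fitted regression plane.
new_card_holder <- data.frame(Limit = 5000, Income = 50)
predict(Balance_model, newdata = new_card_holder)
# By hand with the rounded coefficients from the regression table (the result
# will differ slightly from predict() due to rounding):
-385.179 + 0.264 * 5000 - 7.663 * 50
```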
However, recall in Figure \@ref(fig:2numxplot1) that when considered separately, both `Limit` and `Income` had positive relationships with the outcome variable `Balance`: as card holders' credit limits increased, their credit card balances tended to increase as well, and a similar relationship held for incomes and balances. In the above multiple regression, however, the slope for `Income` is now -7.66, suggesting a *negative relationship* between income and credit card balance. What explains these contradictory results? This is known as *Simpson's Paradox*, a phenomenon in which a trend that appears in several different groups of data disappears or reverses when these groups are combined; we expand on this in Section \@ref(simpsonsparadox).
### Observed/fitted values and residuals {#model3points}
As we did previously, in Table \@ref(tab:model3-points-table) let's unpack the output of the `get_regression_points()` function for our model for credit card balance
* for all `r nrow(Credit)` card holders in the dataset, in other words
* for all `r nrow(Credit)` rows in the `Credit` data frame, in other words
* for all `r nrow(Credit)` points
```{r, eval=FALSE}
regression_points <- get_regression_points(Balance_model)
regression_points
```
```{r model3-points-table, echo=FALSE}
set.seed(76)
regression_points <- get_regression_points(Balance_model)
regression_points %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Regression points (first 5 rows of 400)",
booktabs = TRUE
)
```
Recall the format of the output:
* `Balance` corresponds to $y$ the observed value
* `Balance_hat` corresponds to $\widehat{y}$ the fitted value
* `residual` corresponds to the residual $y - \widehat{y}$
As we did in Section \@ref(model1points), let's prove to ourselves that the results make sense by manually replicating the internal computation that `get_regression_points()` performs, using the `mutate()` verb from Chapter \@ref(wrangling):
1. First we compute a duplicate of the fitted values $\widehat{y} = b_0 + b_1 \times x_1 + b_2 \times x_2$, where the intercept $b_0$ = `r Credit_line[1]`, the slope $b_1$ for `Limit` is `r Credit_line[2]`, and the slope $b_2$ for `Income` is `r Credit_line[3]`, and store these in `Balance_hat_2`
1. Then we compute a duplicate of the residuals $y - \widehat{y}$ and store these in `residual_2`
```{r, eval=FALSE}
regression_points <- regression_points %>%
mutate(Balance_hat_2 = -385.179 + 0.264 * Limit - 7.663 * Income) %>%
mutate(residual_2 = Balance - Balance_hat_2)
regression_points
```
```{r, echo=FALSE}
set.seed(76)
regression_points <- regression_points %>%
mutate(Balance_hat_2 = -385.179 + 0.264 * Limit - 7.663 * Income) %>%
mutate(residual_2 = Balance - Balance_hat_2)
regression_points %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Regression points (first five rows)",
booktabs = TRUE
)
```
We see that our manually computed variables `Balance_hat_2` and `residual_2` equal the original results (up to rounding error).
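If you'd like to verify this programmatically rather than by eye, a quick check along these lines is possible (a sketch, not part of the original text, assuming the fitted-value column is named `Balance_hat` as shown in the table above); any discrepancy should be tiny and attributable only to the rounding of the coefficients:
```{r, eval=FALSE}
# Largest absolute difference between the two sets of fitted values.
regression_points %>%
  summarize(max_discrepancy = max(abs(Balance_hat - Balance_hat_2)))
```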
### Residual analysis {#model3residuals}
Recall from Section \@ref(model1residuals) that our first residual analysis plot investigated the presence of any systematic pattern in the residuals when we had a single numerical predictor: `bty_avg`. For the `Credit` dataset, since we have two numerical predictors, `Limit` and `Income`, we must perform this twice:
```{r, eval=FALSE}
ggplot(regression_points, aes(x = Limit, y = residual)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Residual", title = "Residuals vs credit limit")
ggplot(regression_points, aes(x = Income, y = residual)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Residual", title = "Residuals vs income")
```
```{r, echo=FALSE, fig.height=4, fig.cap="Residuals vs credit limit and income"}
model3_residual_vs_limit_plot <- ggplot(regression_points, aes(x = Limit, y = residual)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Residual",
title = "Residuals vs credit limit")
model3_residual_vs_income_plot <- ggplot(regression_points, aes(x = Income, y = residual)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Residual",
title = "Residuals vs income")
grid.arrange(model3_residual_vs_limit_plot, model3_residual_vs_income_plot, nrow = 1)
```
In this case, there does appear to be a systematic pattern to the residuals: the scatter of the residuals around the line $y=0$ is clearly not consistent. This behavior of the residuals is further evidenced by the histogram of residuals in Figure \@ref(fig:model3_residuals_hist). We observe that the residuals have a slight right-skew (a long tail towards the right). Ideally, these residuals should be bell-shaped around a residual value of 0.
```{r model3_residuals_hist, warning=FALSE, fig.cap="Histogram of residuals"}
ggplot(regression_points, aes(x = residual)) +
geom_histogram(color = "white") +
labs(x = "Residual")
```
Another way to interpret this histogram is that since the residual is computed as $y - \widehat{y}$ = `Balance` - `Balance_hat`, we have some values where the fitted value $\widehat{y}$ is very much lower than the observed value $y$. In other words, we are under-predicting certain credit card holders' balances by a very large amount.
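As a quick follow-up (a sketch, not part of the original text), we can count how many card holders fall into this "badly under-predicted" group, using an arbitrary cutoff of $500 for illustration:
```{r, eval=FALSE}
# Number of card holders whose observed balance exceeds the fitted balance
# by more than $500.
regression_points %>%
  filter(residual > 500) %>%
  nrow()
```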
### Your turn
Repeat this same analysis, but using `Age` and `Rating` as the two numerical explanatory variables. Recall the four steps we've been following (a possible starting point is sketched after this list):
1. Perform an exploratory data analysis (EDA)
1. Fit a linear regression model and get information about the model
1. Get information on all $n$ data points considered in this analysis.
1. Perform a residual analysis and look for any systematic patterns in the residuals.
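Here is one possible starting point for this exercise, a sketch only (object names such as `Balance_model_2` are ours, and the results are intentionally not shown):
```{r, eval=FALSE}
# 1. EDA: correlations between the outcome and the new explanatory variables
Credit %>%
  select(Balance, Age, Rating) %>%
  cor()
# 2. Fit the model and get the regression table
Balance_model_2 <- lm(Balance ~ Age + Rating, data = Credit)
get_regression_table(Balance_model_2)
# 3. Get information on all n points
regression_points_2 <- get_regression_points(Balance_model_2)
# 4. Residual analysis, e.g. residuals vs each explanatory variable
ggplot(regression_points_2, aes(x = Rating, y = residual)) +
  geom_point()
```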
## One numerical & one categorical x {#model4}
Let's revisit the instructor evaluation data introduced in Section \@ref(model1), where we studied the relationship between
1. the numerical outcome variable $y$: instructor evaluation score `score`
1. the numerical explanatory variable $x$: beauty score `bty_avg`
This analysis suggested that there is a positive relationship between `bty_avg` and `score`, in other words as instructors had higher beauty scores, they also tended to have higher teaching evaluation scores. Now let's say instead of `bty_avg` we are interested in the numerical explanatory variable $x_1$ `age` and furthermore we want to use a second $x_2$, the (binary) categorical variable `gender`. Our modeling scenario now becomes
1. the numerical outcome variable $y$: instructor evaluation score `score`
1. Two explanatory variables:
1. A first numerical explanatory variable $x_1$: `age`
1. A second explanatory variable $x_2$ that is categorical, in this case `gender`
### Exploratory data analysis {#model4EDA}
```{r}
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
select(score, bty_avg, age, gender)
```
We've already performed some elements of our exploratory data analysis (EDA) in Section \@ref(model1EDA), in particular of the outcome variable `score`, so let's now only consider our two new explanatory variables:
```{r}
evals %>%
select(age, gender) %>%
summary()
```
```{r}
cor(evals$score, evals$age)
```
In Figure \@ref(fig:numxcatxplot1), we plot a scatterplot of `score` over `age`, but given that `gender` is a binary categorical variable:
1. We can assign a color to points from each of the two levels of `gender`: female and male.
1. Furthermore, the `geom_smooth(method = "lm", se = FALSE)` layer automatically fits a different regression line for each level of `gender`.
```{r numxcatxplot1, warning=FALSE, fig.cap="Instructor evaluation scores at UT Austin split by gender: Jittered"}
ggplot(evals, aes(x = age, y = score, col = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
```
We notice some interesting trends:
1. There are almost no women faculty over the age of 60.
1. Fitting separate regression lines for men and women, we see they have different slopes: the associated effect of increasing age seems to be much harsher for women than for men.
### Multiple regression {#model4table}
Let's now compute the *regression table*
```{r, eval=FALSE}
score_model_2 <- lm(score ~ age + gender, data = evals)
get_regression_table(score_model_2)
```
```{r, echo=FALSE}
score_model_2 <- lm(score ~ age + gender, data = evals)
get_regression_table(score_model_2) %>%
knitr::kable(
digits = 3,
caption = "Regression table",
booktabs = TRUE
)
```
The modeling equation for this is:
$$
\begin{align}
\widehat{y} &= b_0 + b_1 x_1 + b_2 x_2 \\
\widehat{score} &= b_0 + b_{age} age + b_{male} \mathbb{1}[\mbox{is male}] \\
\end{align}
$$
What this looks like is shown in Figure \@ref(fig:numxcatxplot2) below.
```{r numxcatxplot2, echo=FALSE, warning=FALSE, fig.cap="Instructor evaluation scores at UT Austin by gender: same slope"}
coeff <- lm(score ~ age + gender, data = evals) %>%
coef() %>%
as.numeric()
slopes <- evals %>%
group_by(gender) %>%
summarise(min = min(age), max = max(age)) %>%
mutate(intercept = coeff[1]) %>%
mutate(intercept = ifelse(gender == "male", intercept + coeff[3], intercept)) %>%
gather(point, age, -c(gender, intercept)) %>%
mutate(y_hat = intercept + age * coeff[2])
ggplot(evals, aes(x = age, y = score, col = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_line(data = slopes, aes(y = y_hat), size = 1)
```
We see that:
* Females are treated as the baseline for comparison for no other reason than "female" is alphabetically earlier than "male". The $b_{male} = 0.1906$ is the vertical "bump" that men get in their teaching evaluation scores. Or more precisely, it is the average difference in teaching score
that men get *relative to the baseline of women*
* Accordingly, the intercepts are (which in this case make no sense since no instructor can have age 0):
+ for women: $b_0$ = 4.484
+ for men: $b_0 + b_{male}$ = 4.484 + 0.191 = 4.675
* Both men and women have the same slope. In other words, *in this model* the associated effect of age is the same for men and women: all other things being equal, for every increase of 1 in age, there is on average an associated decrease of 0.0086 in teaching score (i.e. $b_{age} = -0.0086$). The short sketch after this list illustrates these parallel fitted values.
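To make the "parallel slopes" structure concrete, here is a minimal sketch (not part of the original text) computing fitted scores for a hypothetical 40-year-old female instructor and a hypothetical 40-year-old male instructor; the age of 40 and the object name `parallel_slopes_model` are made up for illustration, and the gap between the two fitted values is exactly $b_{male}$ = 0.191:
```{r, eval=FALSE}
# Fitted teaching scores under the parallel slopes model for a 40-year-old
# female and a 40-year-old male instructor.
parallel_slopes_model <- lm(score ~ age + gender, data = evals)
predict(parallel_slopes_model,
        newdata = data.frame(age = 40, gender = c("female", "male")))
```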
**Hold up**: Figure \@ref(fig:numxcatxplot2) is different from Figure \@ref(fig:numxcatxplot1)! What is going on? What we have in the original plot is an *interaction effect* between age and gender!
### Multiple regression with interaction effects
We say a model has an interaction effect if the associated effect of one variable *depends on the value of another variable*.
Let's now compute the *regression table*, this time replacing the `+` in the model formula with a `*` so that an interaction between `age` and `gender` is included:
```{r, eval=FALSE}
score_model_2 <- lm(score ~ age * gender, data = evals)
get_regression_table(score_model_2)
```
```{r, echo=FALSE}
get_regression_table(score_model_2) %>%
knitr::kable(
digits = 3,
caption = "Regression table",
booktabs = TRUE
)
```
The model formula is
$$
\begin{align}
\widehat{y} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1x_2\\
\widehat{score} &= b_0 + b_{age} age + b_{male} \mathbb{1}[\mbox{is male}] + b_{age,male}age\mathbb{1}[\mbox{is male}] \\
\end{align}
$$
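Under this interaction model, women and men each get their own intercept *and* their own slope for age. Here is a brief sketch (not part of the original text) of how to read the two slopes off the fitted coefficients, assuming R's default coding with "female" as the baseline level of `gender`:
```{r, eval=FALSE}
# Slopes for age implied by the interaction model.
coefs <- coef(lm(score ~ age * gender, data = evals))
coefs["age"]                            # slope for women (the baseline group)
coefs["age"] + coefs["age:gendermale"]  # slope for men
```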
### Observed/fitted values and residuals {#model4points}
### Residual analysis {#model4residuals}
<!--
## Two categorical $x$ {#model5}
Let's look at the `biopics` dataset in the
[`fivethirtyeight`](https://rudeboybert.github.io/fivethirtyeight/) package.
After loading the package, run `?biopics` in the console to read the help file.
This data is from the article ["Straight Outta Compton" Is The Rare Biopic Not About White Dudes](https://fivethirtyeight.com/features/straight-outta-compton-is-the-rare-biopic-not-about-white-dudes/).
First let's load the data and look at a random sample of 5 rows:
```{r}
library(fivethirtyeight)
biopics <- biopics %>%
select(title, box_office, person_of_color, subject_sex) %>%
# Remove those that are missing
filter(!is.na(box_office))
```
```{r, echo=FALSE}
biopics %>%
sample_n(5) %>%
knitr::kable(digits = 1)
```
Before we conduct an exploratory data analysis (EDA) of the `biopics` data, let's first have a discussion of $\log$-transformations.
### log-transformations
Let's consider a histogram of the box office revenues for the `biopics` dataset
```{r logtransform1, echo=TRUE, message=FALSE, warning=FALSE, fig.cap="Histogram of box office revenue"}
ggplot(biopics, aes(x = box_office)) +
geom_histogram(color = "white") +
labs(x = "Box office revenue")
```
In Figure \@ref(fig:logtransform1), we see there is a right-skew to the x-values. This is because there are a few Hollywood blockbusters being compared with many (likely) smaller-scale independent films. Let's look at the top 5 and bottom 5 grossing movies in this dataset:
```{r, echo=FALSE}
biopics %>%
arrange(desc(box_office)) %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Top 5 grossing movies in data",
booktabs = TRUE
)
```
```{r, echo=FALSE}
biopics %>%
arrange(box_office) %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Bottom 5 grossing movies in data",
booktabs = TRUE
)
```
The scales of these box office revenues are completely different! Hence, in Figure \@ref(fig:logtransform1), it's really hard to see what's going on at the lower end.
Let's unskew this variable and compare not *absolute* differences, but rather,
*relative* differences i.e. differences in "order of magnitude" using a
`log10()` transformation:
```{r logtransform2, echo=TRUE, message=FALSE, warning=FALSE, fig.cap="Histogram of log10(box office revenue)"}
ggplot(biopics, aes(x = log10(box_office))) +
geom_histogram(color = "white") +
labs(x = "log10(Box office revenue)")
```
We can see a little better what's going on at the lower end of the box office revenue scale.
However, the values on the axes require a little thinking to process. For example, $x=7$ corresponds to movies with revenue of $10^7 = 10,000,000$ dollars. So instead, let's *rescale* the x-axis so that it displays the data in its original units.
```{r logtransform3, echo=TRUE, message=FALSE, warning=FALSE, fig.cap="Histogram of box office revenue (log-10 scale)"}
ggplot(biopics, aes(x = box_office)) +
geom_histogram(color = "white") +
scale_x_log10() +
labs(x = "Box office revenue (log10-scale))")
```
Note that
* The two plots are identical, but the values on the x-axis are different.
* In both Figure \@ref(fig:logtransform2) and Figure \@ref(fig:logtransform3), equivalent distances on each axis correspond not to equivalent absolute differences, but to equivalent relative/multiplicative differences. So, for example, the horizontal distance on the plot from revenue = `1e+05` = $10^5$ to revenue = `1e+06` = $10^6$ is equal to the horizontal distance on the plot from revenue = `1e+06` = $10^6$ to revenue = `1e+07` = $10^7$.
### Exploratory data analysis {#model5EDA}
Let's now consider the box office gross earnings in the US of these movies on a log10
scale:
```{r 2catxplot, echo=TRUE, warning=FALSE, fig.cap="Box office revenue vs biopic subject info"}
ggplot(biopics, aes(x = subject_sex, y = box_office)) +
facet_wrap(~person_of_color, nrow = 1) +
scale_y_log10() +
geom_boxplot() +
labs(x = "Subject sex", y = "Box office revenue (log10-scale)", title =
"Person of color?")
```
It seems that in this dataset, biopics about men of color had the highest median box office gross. Let's look at a table of means instead of medians.
```{r, eval=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(mean_box_office = mean(box_office)) %>%
arrange(desc(mean_box_office))
```
```{r, echo=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(mean_box_office = mean(box_office)) %>%
arrange(desc(mean_box_office)) %>%
knitr::kable(
digits = 3,
caption = "Group means",
booktabs = TRUE
)
```
Keep in mind two things though. First, the sample sizes are different:
```{r, eval=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(n = n()) %>%
arrange(desc(n))
```
```{r, echo=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
knitr::kable(
digits = 3,
caption = "Number of movies of each type in dataset",
booktabs = TRUE
)
```
Second, can we *generalize* these results to *all movies*? How was this selection of movies sampled? Is it a representative sample of all movies, or was there a systematic reason these were included?
### Multiple regression {#model5table}
Let's now compute the *regression table*. Let's jump straight into considering
a model that incorporates an interaction term as described earlier.
```{r, eval=FALSE}
box_office_model <- lm(box_office ~ person_of_color * subject_sex, data = biopics)
get_regression_table(box_office_model)
```
```{r, echo=FALSE}
box_office_model <- lm(box_office ~ person_of_color * subject_sex, data = biopics)
get_regression_table(box_office_model) %>%
knitr::kable(
digits = 3,
caption = "Regression table",
booktabs = TRUE
)
```
The model formula is
$$
\begin{align}
\widehat{y} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1x_2\\
\widehat{score} &= b_0 + b_{color}\mathbb{1}[\mbox{of color}] + b_{male} \mathbb{1}[\mbox{is male}] + b_{color,male}\mathbb{1}[\mbox{of color}]\mathbb{1}[\mbox{is male}] \\
\end{align}
$$
Let's recreate the four group means from the table of group means above:
* Female not of color: 18088799 = 18088799 = $b_0$. Note: *Women not of color are the baseline group*.
* Male not of color: 22871074 = 18088799 + 4782275 = $b_0 + b_{male}$
* Female of color: 20035820 = 18088799 + 1947021 = $b_0 + b_{color}$
* Male of color: 31648028 = 18088799 + 1947021 + 4782275 + 6829933 = $b_0 + b_{color} + b_{male} + b_{color,male}$. Note: $b_{color,male}$ is the *interaction term*.
### Observed/fitted values and residuals {#model5points}
### Residual analysis {#model5residuals}
-->
## Other topics
### Simpson's Paradox {#simpsonsparadox}
Recall that in Section \@ref(model3) we saw the following two seemingly contradictory results when studying the relationship between credit card balance, credit limit, and income. On the one hand, the right-hand plot of Figure \@ref(fig:2numxplot1) suggested that credit card balance and income were positively related:
```{r echo=FALSE, fig.height=4, fig.cap="Relationship between credit card balance and credit limit/income"}
grid.arrange(model3_balance_vs_limit_plot, model3_balance_vs_income_plot, nrow = 1)
```
On the other hand, the multiple regression in Table \@ref(tab:model3-table-output) suggested that when modeling credit card balance as a function of both credit limit and income at the same time, income has a negative relationship with balance, as evidenced by the slope of -7.66 on `Income`. How can this be?
First, let's dive a little deeper into the explanatory variable `Limit`. Figure \@ref(fig:credit_limit_quartiles) shows a histogram of all `r nrow(Credit)` values of `Limit`, along with vertical red lines that cut up the data into quartiles, meaning:
1. 25% of credit limits were between \$0 and \$3088. Let's call this the "low" credit limit bracket.
1. 25% of credit limits were between \$3088 and \$4622. Let's call this the "medium-low" credit limit bracket.
1. 25% of credit limits were between \$4622 and \$5873. Let's call this the "medium-high" credit limit bracket.
1. 25% of credit limits were over \$5873. Let's call this the "high" credit limit bracket.
```{r, credit_limit_quartiles, echo=FALSE, fig.cap="Histogram of credit limits and quartiles"}
ggplot(Credit, aes(x = Limit)) +
geom_histogram(color = "white") +
geom_vline(xintercept = quantile(Credit$Limit, probs = c(0.25, 0.5, 0.75)), col = "red", linetype = "dashed")
```
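For reference, these quartile cutoffs can be recomputed directly (a quick sketch, not part of the original text; the dollar values quoted above are rounded):
```{r, eval=FALSE}
# 25th, 50th, and 75th percentiles of credit limit, i.e. the red dashed lines
# in the histogram above.
quantile(Credit$Limit, probs = c(0.25, 0.5, 0.75))
```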
Let's now display:
1. The scatterplot showing the relationship between credit card balance and income (the right-hand plot of Figure \@ref(fig:2numxplot1)).
1. The same scatterplot of credit card balance and income, now with a color aesthetic added corresponding to the credit limit bracket.
```{r, 2numxplot4, fig.height=4, echo=FALSE, fig.cap="Relationship between credit card balance and income for different credit limit brackets"}
Credit <- Credit %>%
mutate(limit_bracket = cut_number(Limit, 4)) %>%
mutate(limit_bracket = fct_recode(limit_bracket,
"low" = "[855,3.09e+03]",
"medium-low" = "(3.09e+03,4.62e+03]",
"medium-high" = "(4.62e+03,5.87e+03]",
"high" = "(5.87e+03,1.39e+04]"
))
model3_balance_vs_income_plot_colored <- ggplot(Credit, aes(x = Income, y = Balance, col = limit_bracket)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
color = "Credit limit\nbracket", title = "Balance vs income") +
theme(legend.position = "bottom")
grid.arrange(model3_balance_vs_income_plot, model3_balance_vs_income_plot_colored, nrow = 1)
#cowplot::plot_grid(model3_balance_vs_income_plot, model3_balance_vs_income_plot_colored, nrow = 1, rel_widths = c(2/5, 3/5))
```
The left-hand plot focuses on the relationship between balance and income in aggregate, whereas the right-hand plot breaks this relationship down *by credit limit bracket*. While in aggregate there is an overall positive relationship, when broken down we now see that for the
1. low
1. medium-low
1. medium-high
credit limit bracket groups, the strong positive relationship between credit card balance and income disappears! Only for the high bracket does the relationship stay somewhat positive. In this example, credit limit is a *confounding variable* for credit card balance and income.
<!--
Alternatively, we could also have used facets, where each facet has roughly 25% of people based
on the credit limit bracket. However, IMO the above plot is easier to read.
```{r, 2numxplot5, echo=FALSE, warning=FALSE, fig.cap="Relationship between credit card balance and income for different credit limit brackets"}
ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
facet_wrap(~limit_bracket) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)")
```
-->
### Script of R code
An R script file of all R code used in this chapter is available [here](https://moderndive.netlify.com/scripts/07-multiple-regression.R).