# Multiple Regression {#multiple-regression}
```{r, include=FALSE, purl=FALSE}
chap <- 6
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = "\\textwidth",
message = FALSE,
warning = FALSE
)
options(scipen = 99, digits = 3)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC
# solutions, which are written like markdown text. In theory, it shouldn't be
# necessary for knitr versions <=1.11.6, but I've found I still need to for
# everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL"
# (to show all solutions), or subsets of c('5-1', '5-2','5-3', '5-4'), including
# the null vector c("") to show no solutions.
solutions_shown <- c("")
show_solutions <- function(section){
return(solutions_shown == "ALL" | section %in% solutions_shown)
}
```
Now that we are equipped with data visualization skills from Chapter \@ref(viz), data wrangling skills from Chapter \@ref(wrangling), and an understanding of the "tidy" data format from Chapter \@ref(tidy), we proceed with data modeling. The fundamental premise of data modeling is *to model the relationship* between:
* An outcome variable $y$, also called a dependent variable and
* An explanatory/predictor variable $x$, also called an independent variable or covariate.
Another way to state this is using mathematical terminology: we will model the outcome variable $y$ *as a function* of the explanatory/predictor variable $x$. Why do we have two different labels, explanatory and predictor, for the variable $x$? That's because roughly speaking data modeling can be used for two purposes:
1. **Modeling for prediction**: You want to predict an outcome variable $y$ based on the information contained in a set of predictor variables. You don't care so much about understanding how all the variables relate and interact, but so long as you can make good predictions about $y$, you're fine. For example, if we know many individuals' risk factors for lung cancer, such as smoking habits and age, can we predict whether or not they will develop lung cancer? Here we wouldn't care so much about distinguishing the degree to which the different risk factors contribute to lung cancer, but instead only on whether or not they could be put together to make reliable predictions.
1. **Modeling for explanation**: You want to explicitly describe the relationship between an outcome variable $y$ and a set of explanatory variables, determine the significance of any found relationships, and have measures summarizing these. Continuing our example from above, we would now be interested in describing the individual effects of the different risk factors and quantifying the magnitude of these effects. One reason could be to design an intervention to reduce lung cancer cases in a population, such as targeting smokers of a specific age group with advertisement for smoking cessation programs. In this book, we'll focus more on this latter purpose.
Data modeling is used in a wide variety of fields, including statistical inference, causal inference, artificial intelligence, and machine learning. There are many techniques for data modeling, such as tree-based models, neural networks/deep learning, and more. However, we'll focus on one particular technique: *linear regression*, one of the most commonly used and easiest-to-understand approaches to modeling. Recall our discussion in Chapter \@ref(explore-dataframes) on numerical and categorical variables. Linear regression involves:
* An outcome variable $y$ that is *numerical*
* Explanatory variables $\vec{x}$ that are either *numerical* or *categorical*
Whereas there is always only one numerical outcome variable $y$, we have choices on both the number and the type of explanatory variables $\vec{x}$ to use. We're going to cover the following regression scenarios, where we have:
1. A single numerical explanatory variable $x$ in Section \@ref(model1). This scenario is known as *simple linear regression*.
1. A single categorical explanatory variable $x$ in Section \@ref(model2).
1. Two numerical explanatory variables $x_1$ and $x_2$ in Section \@ref(model3). This is the first of two *multiple regression* scenarios we'll cover, where there is now more than one explanatory variable.
1. One numerical and one categorical explanatory variable in Section \@ref(model4). This is the second *multiple regression* scenario we'll cover. We'll also introduce *interaction models* here.
We'll study all 4 of these regression scenarios using real data, all easily accessible via R packages! <!--We will also discuss the concept of *correlation* and how it is frequently incorrectly used to imply *causation*.-->
### Needed packages {-}
In this chapter we introduce a new package, `moderndive`, an accompaniment package to the ModernDive book that includes useful functions for linear regression as well as other functions and data used later in the book. However, because this is a relatively new package, we have to install it in a different way than was covered in Section \@ref(packages).
First, install the `remotes` package like you would normally install a package, then run the following to install the `moderndive` package:
```{r, eval=FALSE}
library(remotes)
remotes::install_github("moderndive/moderndive")
```
Let's now load all the packages needed for this chapter. If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(moderndive)
```
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(mvtnorm)
library(tidyr)
library(forcats)
library(gridExtra)
```
## Two numerical x {#model3}
Much like our teacher evaluation data in Section \@ref(model1), let's now attempt to identify factors that are associated with how much credit card debt an individual will hold. The textbook [An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani is an intermediate-level textbook on statistical and machine learning methods, freely available [here](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf). It has an accompanying R package called `ISLR` with datasets that the authors use to demonstrate various machine learning methods.
One dataset that is frequently used by the authors is the `Credit` dataset where predictions are made on the balance held by various credit card holders based on information like income, credit limit, and education level. Let's consider the following variables:
* Outcome variable $y$: Credit card balance in dollars
* Explanatory variables $x$:
+ income in $1000's
+ credit card limit in dollars
+ credit rating
+ card holder's age
+ years of education
Let's load the data and `select()` only the 6 variables listed above:
```{r, warning=FALSE, message=FALSE}
library(ISLR)
Credit <- Credit %>%
select(Balance, Limit, Income, Rating, Age, Education)
```
Unfortunately, no information on how this data was sampled was provided, so we're not able to generalize any results to a greater population, but we will still use this data to demonstrate *multiple regression*. Multiple regression is a form of linear regression where there is more than one explanatory variable, and thus the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables. We'll perform multiple regression with:
1. A numerical outcome variable $y$, in this case credit card `Balance`
1. Two explanatory variables:
1. A first numerical explanatory variable $x_1$, in this case credit `Limit`
1. A second numerical explanatory variable $x_2$, in this case `Income`
### Exploratory data analysis {#model3EDA}
Let's look at the raw data values, both by bringing up RStudio's spreadsheet viewer (although in Table \@ref(tab:model3-data-preview) we only show 5 randomly selected credit card holders out of `r nrow(Credit)`) and by using the `glimpse()` function:
```{r, eval=FALSE}
View(Credit)
```
```{r model3-data-preview, echo=FALSE}
Credit %>%
sample_n(5) %>%
knitr::kable(
digits = 3,
caption = "Random sample of 5 credit card holders",
booktabs = TRUE
)
```
```{r}
glimpse(Credit)
```
Let's look at some summary statistics:
```{r}
summary(Credit)
```
We observe, for example, that:
1. The typical credit card balance is around $500.
1. The typical credit card limit is around $5,000.
1. The typical income is around $40,000.
Since our outcome variable `Balance` and the explanatory variables `Limit` and `Income` are numerical, we can compute the correlation coefficient between pairs of these variables. However, instead of running the `cor()` command twice, once for each explanatory variable:
```{r, eval=FALSE}
cor(Credit$Balance, Credit$Limit)
cor(Credit$Balance, Credit$Income)
```
We can simultaneously compute them by returning a *correlation matrix* in
Table \@ref(tab:model3-correlation). We read off the correlation coefficient
for any pair of variables by looking them up in the appropriate row/column combination.
```{r, eval=FALSE}
Credit %>%
select(Balance, Limit, Income) %>%
cor()
```
```{r model3-correlation, echo=FALSE}
Credit %>%
select(Balance, Limit, Income) %>%
cor() %>%
knitr::kable(
digits = 3,
caption = "Correlations between credit card balance, credit limit, and credit rating",
booktabs = TRUE
)
```
For example, the correlation coefficients of:
1. `Balance` with itself is 1
1. `Balance` with `Limit` is 0.862, indicating a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card balances.
1. `Balance` with `Income` is 0.464, suggestive of another positive linear relationship, although not as strong as the relationship between `Balance` and `Limit`.
1. As an added bonus, we can read off the correlation coefficient of the two explanatory variables, `Limit` and `Income`, of 0.792. In this case, we say there is a high degree of *collinearity* between these two explanatory variables; a quick way to visualize this is sketched right after this list.
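One quick way to see this collinearity is to plot the two explanatory variables against each other. This is a minimal sketch (not a figure from the original text), reusing the `Credit` data loaded above:
```{r, eval=FALSE}
# Scatterplot of the two explanatory variables against each other; the strong
# upward trend reflects their correlation of 0.792.
ggplot(Credit, aes(x = Income, y = Limit)) +
  geom_point() +
  labs(x = "Income (in $1000)", y = "Credit limit (in $)",
       title = "Collinearity between income and credit limit")
```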
Let's visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots:
```{r, eval=FALSE}
ggplot(Credit, aes(x = Limit, y = Balance)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card balance (in $)",
title = "Relationship between balance and credit limit") +
geom_smooth(method = "lm", se = FALSE)
ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
title = "Relationship between balance and income") +
geom_smooth(method = "lm", se = FALSE)
```
```{r 2numxplot1, echo=FALSE, fig.height=4, fig.cap="Relationship between credit card balance and credit limit/income"}
model3_balance_vs_limit_plot <- ggplot(Credit, aes(x = Limit, y = Balance)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card balance (in $)",
title = "Balance vs credit limit") +
geom_smooth(method = "lm", se = FALSE)
model3_balance_vs_income_plot <- ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
title = "Balance vs income") +
geom_smooth(method = "lm", se = FALSE)
grid.arrange(model3_balance_vs_limit_plot, model3_balance_vs_income_plot, nrow = 1)
```
In both cases, there seems to be a positive relationship. However the two plots in Figure \@ref(fig:2numxplot1) only focus on the relationship of the outcome variable with each of the explanatory variables independently. To get a sense of the *joint* relationship of all three variables simultaneously through a visualization, let's display the data in a 3-dimensional (3D) scatterplot, where
1. The numerical outcome variable $y$ `Balance` is on the z-axis (vertical one)
1. The two numerical explanatory variables form the x and y axes (on the floor). In this case
1. The first numerical explanatory variable $x_1$ `Income` is on the x-axis
1. The second numerical explanatory variable $x_2$ `Limit` is on the y-axis
Click on the following image to open an interactive 3D scatterplot in your browser:
<center><a target="_blank" href="http://rpubs.com/moderndive/credit_card_balance_3D_scatterplot"><img src="images/credit_card_balance_3D_scatterplot.png" title="3D scatterplot" width="600"/></a></center>
```{r, eval=FALSE, echo=FALSE}
# Save as 798 x 562 images/credit_card_balance_3D_scatterplot.png
library(ISLR)
library(plotly)
plot_ly(showscale=FALSE) %>%
add_markers(
x = Credit$Income,
y = Credit$Limit,
z = Credit$Balance,
hoverinfo = 'text',
text = ~paste("x1 - Income: ", Credit$Income,
"</br> x2 - Limit: ", Credit$Limit,
"</br> y - Balance: ", Credit$Balance)
) %>%
layout(
scene = list(
xaxis = list(title = "x1 - Income (in $10K)"),
yaxis = list(title = "x2 - Limit ($)"),
zaxis = list(title = "y - Balance ($)")
)
)
```
Previously in Figure \@ref(fig:numxplot4), we plotted a "best-fitting" regression line through a set of points where the numerical outcome variable $y$ was teaching `score` and the single numerical explanatory variable $x$ was `bty_avg`. What is the analogous concept when we have *two* numerical predictor variables? Instead of a best-fitting line, we now have a best-fitting *plane*, the 3D generalization of the 2D line. Click on the following image to open an interactive plot of the regression plane in your browser.
<center><a target="_blank" href="https://beta.rstudioconnect.com/connect/#/apps/3214/"><img src="images/credit_card_balance_regression_plane.png" title="Regression plane" width="600"/></a></center>
```{r, eval=FALSE, echo=FALSE}
# Save as 798 x 562 images/credit_card_balance_regression_plane.png
library(ISLR)
library(plotly)
library(tidyverse)
# setup hideous grid required by plotly
model_lm <- lm(Balance ~ Income + Limit, data=Credit)
x_grid <- seq(from=min(Credit$Income), to=max(Credit$Income), length=100)
y_grid <- seq(from=min(Credit$Limit), to=max(Credit$Limit), length=200)
z_grid <- expand.grid(x_grid, y_grid) %>%
tbl_df() %>%
rename(
x_grid = Var1,
y_grid = Var2
) %>%
mutate(z = coef(model_lm)[1] + coef(model_lm)[2]*x_grid + coef(model_lm)[3]*y_grid) %>%
.[["z"]] %>%
matrix(nrow=length(x_grid)) %>%
t()
# plot points and plane
plot_ly(showscale=FALSE) %>%
add_markers(
x = Credit$Income,
y = Credit$Limit,
z = Credit$Balance,
hoverinfo = 'text',
text = ~paste("x1 - Income: ", Credit$Income, "</br> x2 - Limit: ",
Credit$Limit, "</br> y - Balance: ", Credit$Balance)
) %>%
layout(
scene = list(
xaxis = list(title = "x1 - Income (in $10K)"),
yaxis = list(title = "x2 - Limit ($)"),
zaxis = list(title = "y - Balance ($)")
)
) %>%
add_surface(
x = x_grid,
y = y_grid,
z = z_grid
)
```
### Multiple regression {#model3table}
Just as we did when we had a single numerical explanatory variable $x$ in Section \@ref(model1table) and when we had a single categorical explanatory variable $x$ in Section \@ref(model2table), we fit a regression model and obtain the regression table in our two numerical explanatory variable scenario. To fit a regression model and obtain a table using `get_regression_table()`, we now use a `+` to consider multiple explanatory variables. In this case, since we want to perform a regression on `Limit` and `Income` simultaneously, we input `Balance ~ Limit + Income`:
```{r, eval=FALSE}
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
get_regression_table(Balance_model)
```
```{r, echo=FALSE}
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
Credit_line <- get_regression_table(Balance_model) %>%
pull(estimate)
```
```{r model3-table-output, echo=FALSE}
get_regression_table(Balance_model) %>%
knitr::kable(
digits = 3,
caption = "Multiple regression table",
booktabs = TRUE
)
```
How do we interpret these three values that define the regression plane?
* Intercept: -$385.18. The intercept in our case represents the credit card balance for an individual who has both a credit `Limit` of $0 and `Income` of $0. In our data, however, the intercept has limited practical interpretation, since no individuals had `Limit` or `Income` values of $0 and, furthermore, no balance can be negative (the smallest credit card balance in the data was $0). Rather, the intercept is used to situate the regression plane in 3D space.
* Limit: $0.26. Now that we have multiple variables to consider, we have to add
a caveat to our interpretation: *all other things being equal, for every increase of one unit in credit `Limit` (dollars), there is an associated increase of on average $0.26 in credit card balance*. Note:
+ Just as we did in Section \@ref(model1table), we are not making any causal statements, only statements relating to the association between credit limit and balance
+ The *all other things being equal* is making a statement about all other explanatory variables, in this case only one: `Income`. This is equivalent to saying "holding `Income` constant, we observed an associated increase of $0.26 in credit card balance for every dollar increase in credit limit"
* Income: -$7.66. Similarly, *all other things being equal, for every increase of one unit in `Income` (in other words, $1000 in income), there is an associated decrease of on average $7.66 in credit card balance*. A short sketch after this list shows how these three values combine into fitted values.
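For concreteness, here is a minimal sketch (not part of the original text) of how the fitted plane turns a card holder's `Limit` and `Income` into a predicted `Balance`. The specific values of $5000 for `Limit` and $50,000 for income are made up purely for illustration:
```{r, eval=FALSE}
# Predicted balance for a hypothetical card holder with a $5000 credit limit
# and $50,000 income, using the fitted regression plane.
new_card_holder <- data.frame(Limit = 5000, Income = 50)
predict(Balance_model, newdata = new_card_holder)
# By hand with the rounded coefficients from the regression table (the result
# will differ slightly from predict() due to rounding):
-385.179 + 0.264 * 5000 - 7.663 * 50
```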
However, recall in Figure \@ref(fig:2numxplot1) that when considered separately, both `Limit` and `Income` had positive relationships with the outcome variable `Balance`: as card holders' credit limits increased, their credit card balances tended to increase as well, and a similar relationship held for incomes and balances. In the above multiple regression, however, the slope for `Income` is now -7.66, suggesting a *negative relationship* between income and credit card balance. What explains these contradictory results? This is known as *Simpson's Paradox*, a phenomenon in which a trend that appears in several different groups of data disappears or reverses when these groups are combined; we expand on this in Section \@ref(simpsonsparadox).
### Observed/fitted values and residuals {#model3points}
As we did previously, in Table \@ref(tab:model3-points-table) let's unpack the output of the `get_regression_points()` function for our model for credit card balance
* for all `r nrow(Credit)` card holders in the dataset, in other words
* for all `r nrow(Credit)` rows in the `Credit` data frame, in other words
* for all `r nrow(Credit)` points
```{r, eval=FALSE}
regression_points <- get_regression_points(Balance_model)
regression_points
```
```{r model3-points-table, echo=FALSE}
set.seed(76)
regression_points <- get_regression_points(Balance_model)
regression_points %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Regression points (first 5 rows of 400)",
booktabs = TRUE
)
```
Recall the format of the output:
* `Balance` corresponds to $y$ the observed value
* `Balance_hat` corresponds to $\widehat{y}$ the fitted value
* `residual` corresponds to the residual $y - \widehat{y}$
As we did in Section \@ref(model1points), let's prove to ourselves that the results make sense by manually replicating the internal computation that `get_regression_points()` performs, using the `mutate()` verb from Chapter \@ref(wrangling):
1. First we compute a duplicate of the fitted values $\widehat{y} = b_0 + b_1 \times x_1 + b_2 \times x_2$, where the intercept $b_0$ = `r Credit_line[1]`, the slope $b_1$ for `Limit` is `r Credit_line[2]`, and the slope $b_2$ for `Income` is `r Credit_line[3]`, and store these in `Balance_hat_2`
1. Then we compute a duplicate of the residuals $y - \widehat{y}$ and store these in `residual_2`
```{r, eval=FALSE}
regression_points <- regression_points %>%
mutate(Balance_hat_2 = -385.179 + 0.264 * Limit - 7.663 * Income) %>%
mutate(residual_2 = Balance - Balance_hat_2)
regression_points
```
```{r, echo=FALSE}
set.seed(76)
regression_points <- regression_points %>%
mutate(Balance_hat_2 = -385.179 + 0.264 * Limit - 7.663 * Income) %>%
mutate(residual_2 = Balance - Balance_hat_2)
regression_points %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Regression points (first five rows)",
booktabs = TRUE
)
```
We see that our manually computed variables `Balance_hat_2` and `residual_2` equal the original results (up to rounding error).
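If you'd like to verify this programmatically rather than by eye, a quick check along these lines is possible (a sketch, not part of the original text, assuming the fitted-value column is named `Balance_hat` as shown in the table above); any discrepancy should be tiny and attributable only to the rounding of the coefficients:
```{r, eval=FALSE}
# Largest absolute difference between the two sets of fitted values.
regression_points %>%
  summarize(max_discrepancy = max(abs(Balance_hat - Balance_hat_2)))
```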
### Residual analysis {#model3residuals}
Recall from Section \@ref(model1residuals) that our first residual analysis plot investigated the presence of any systematic pattern in the residuals when we had a single numerical predictor: `bty_avg`. For the `Credit` dataset, since we have two numerical predictors, `Limit` and `Income`, we must perform this twice:
```{r, eval=FALSE}
ggplot(regression_points, aes(x = Limit, y = residual)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Residual", title = "Residuals vs credit limit")
ggplot(regression_points, aes(x = Income, y = residual)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Residual", title = "Residuals vs income")
```
```{r, echo=FALSE, fig.height=4, fig.cap="Residuals vs credit limit and income"}
model3_residual_vs_limit_plot <- ggplot(regression_points, aes(x = Limit, y = residual)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Residual",
title = "Residuals vs credit limit")
model3_residual_vs_income_plot <- ggplot(regression_points, aes(x = Income, y = residual)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Residual",
title = "Residuals vs income")
grid.arrange(model3_residual_vs_limit_plot, model3_residual_vs_income_plot, nrow = 1)
```
In this case, there does appear to be a systematic pattern to the residuals: the scatter of the residuals around the line $y=0$ is clearly not consistent. This behavior of the residuals is further evidenced by the histogram of residuals in Figure \@ref(fig:model3_residuals_hist). We observe that the residuals have a slight right-skew (a long tail towards the right). Ideally, these residuals should be bell-shaped around a residual value of 0.
```{r model3_residuals_hist, warning=FALSE, fig.cap="Histogram of residuals"}
ggplot(regression_points, aes(x = residual)) +
geom_histogram(color = "white") +
labs(x = "Residual")
```
Another way to interpret this histogram is that since the residual is computed as $y - \widehat{y}$ = `Balance` - `Balance_hat`, we have some values where the fitted value $\widehat{y}$ is very much lower than the observed value $y$. In other words, we are under-predicting certain credit card holders' balances by a very large amount.
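As a quick follow-up (a sketch, not part of the original text), we can count how many card holders fall into this "badly under-predicted" group, using an arbitrary cutoff of $500 for illustration:
```{r, eval=FALSE}
# Number of card holders whose observed balance exceeds the fitted balance
# by more than $500.
regression_points %>%
  filter(residual > 500) %>%
  nrow()
```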
### Your turn
Repeat this same analysis, but using `Age` and `Rating` as the two numerical explanatory variables. Recall the four steps we've been following (a possible starting point is sketched after this list):
1. Perform an exploratory data analysis (EDA)
1. Fit a linear regression model and get information about the model
1. Get information on all $n$ data points considered in this analysis.
1. Perform a residual analysis and look for any systematic patterns in the residuals.
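Here is one possible starting point for this exercise, a sketch only (object names such as `Balance_model_2` are ours, and the results are intentionally not shown):
```{r, eval=FALSE}
# 1. EDA: correlations between the outcome and the new explanatory variables
Credit %>%
  select(Balance, Age, Rating) %>%
  cor()
# 2. Fit the model and get the regression table
Balance_model_2 <- lm(Balance ~ Age + Rating, data = Credit)
get_regression_table(Balance_model_2)
# 3. Get information on all n points
regression_points_2 <- get_regression_points(Balance_model_2)
# 4. Residual analysis, e.g. residuals vs each explanatory variable
ggplot(regression_points_2, aes(x = Rating, y = residual)) +
  geom_point()
```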
## One numerical & one categorical x {#model4}
Let's revisit the instructor evaluation data introduced in Section \@ref(model1), where we studied the relationship between
1. the numerical outcome variable $y$: instructor evaluation score `score`
1. the numerical explanatory variable $x$: beauty score `bty_avg`
This analysis suggested that there is a positive relationship between `bty_avg` and `score`, in other words as instructors had higher beauty scores, they also tended to have higher teaching evaluation scores. Now let's say instead of `bty_avg` we are interested in the numerical explanatory variable $x_1$ `age` and furthermore we want to use a second $x_2$, the (binary) categorical variable `gender`. Our modeling scenario now becomes
1. the numerical outcome variable $y$: instructor evaluation score `score`
1. Two explanatory variables:
1. A first numerical explanatory variable $x_1$: `age`
1. A second explanatory variable $x_2$ that is categorical, in this case `gender`
### Exploratory data analysis {#model4EDA}
```{r}
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
select(score, bty_avg, age, gender)
```
We've already performed some elements of our exploratory data analysis (EDA) in Section \@ref(model1EDA), in particular of the outcome variable `score`, so let's now only consider our two new explanatory variables:
```{r}
evals %>%
select(age, gender) %>%
summary()
```
```{r}
cor(evals$score, evals$age)
```
In Figure \@ref(fig:numxcatxplot1), we plot a scatterplot of `score` over `age`, but given that `gender` is a binary categorical variable:
1. We can assign a color to points from each of the two levels of `gender`: female and male.
1. Furthermore, the `geom_smooth(method = "lm", se = FALSE)` layer automatically fits a different regression line for each level of `gender`.
```{r numxcatxplot1, warning=FALSE, fig.cap="Instructor evaluation scores at UT Austin split by gender: Jittered"}
ggplot(evals, aes(x = age, y = score, col = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
```
We notice some interesting trends:
1. There are almost no women faculty over the age of 60.
1. Fitting separate regression lines for men and women, we see they have different slopes: the associated effect of increasing age seems to be much harsher for women than for men.
### Multiple regression {#model4table}
Let's now compute the *regression table*
```{r, eval=FALSE}
score_model_2 <- lm(score ~ age + gender, data = evals)
get_regression_table(score_model_2)
```
```{r, echo=FALSE}
score_model_2 <- lm(score ~ age + gender, data = evals)
get_regression_table(score_model_2) %>%
knitr::kable(
digits = 3,
caption = "Regression table",
booktabs = TRUE
)
```
The modeling equation for this is:
$$
\begin{align}
\widehat{y} &= b_0 + b_1 x_1 + b_2 x_2 \\
\widehat{score} &= b_0 + b_{age} age + b_{male} \mathbb{1}[\mbox{is male}] \\
\end{align}
$$
What this looks like is shown in Figure \@ref(fig:numxcatxplot2) below.
```{r numxcatxplot2, echo=FALSE, warning=FALSE, fig.cap="Instructor evaluation scores at UT Austin by gender: same slope"}
coeff <- lm(score ~ age + gender, data = evals) %>%
coef() %>%
as.numeric()
slopes <- evals %>%
group_by(gender) %>%
summarise(min = min(age), max = max(age)) %>%
mutate(intercept = coeff[1]) %>%
mutate(intercept = ifelse(gender == "male", intercept + coeff[3], intercept)) %>%
gather(point, age, -c(gender, intercept)) %>%
mutate(y_hat = intercept + age * coeff[2])
ggplot(evals, aes(x = age, y = score, col = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_line(data = slopes, aes(y = y_hat), size = 1)
```
We see that:
* Females are treated as the baseline for comparison for no other reason than "female" is alphabetically earlier than "male". The $b_{male} = 0.1906$ is the vertical "bump" that men get in their teaching evaluation scores. Or more precisely, it is the average difference in teaching score
that men get *relative to the baseline of women*
* Accordingly, the intercepts are (which in this case make no sense since no instructor can have age 0):
+ for women: $b_0$ = 4.484
+ for men: $b_0 + b_{male}$ = 4.484 + 0.191 = 4.675
* Both men and women have the same slope. In other words, *in this model* the associated effect of age is the same for men and women: all other things being equal, for every increase of 1 in age, there is on average an associated decrease of 0.0086 in teaching score (i.e. $b_{age} = -0.0086$). The short sketch after this list illustrates these parallel fitted values.
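To make the "parallel slopes" structure concrete, here is a minimal sketch (not part of the original text) computing fitted scores for a hypothetical 40-year-old female instructor and a hypothetical 40-year-old male instructor; the age of 40 and the object name `parallel_slopes_model` are made up for illustration, and the gap between the two fitted values is exactly $b_{male}$ = 0.191:
```{r, eval=FALSE}
# Fitted teaching scores under the parallel slopes model for a 40-year-old
# female and a 40-year-old male instructor.
parallel_slopes_model <- lm(score ~ age + gender, data = evals)
predict(parallel_slopes_model,
        newdata = data.frame(age = 40, gender = c("female", "male")))
```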
**Hold up**: Figure \@ref(fig:numxcatxplot2) is different from Figure \@ref(fig:numxcatxplot1)! What is going on? What we have in the original plot is an *interaction effect* between age and gender!
### Multiple regression with interaction effects
We say a model has an interaction effect if the associated effect of one variable *depends on the value of another variable*.
Let's now compute the *regression table*, this time replacing the `+` in the model formula with a `*` so that an interaction between `age` and `gender` is included:
```{r, eval=FALSE}
score_model_2 <- lm(score ~ age * gender, data = evals)
get_regression_table(score_model_2)
```
```{r, echo=FALSE}
get_regression_table(score_model_2) %>%
knitr::kable(
digits = 3,
caption = "Regression table",
booktabs = TRUE
)
```
The model formula is
$$
\begin{align}
\widehat{y} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1x_2\\
\widehat{score} &= b_0 + b_{age} age + b_{male} \mathbb{1}[\mbox{is male}] + b_{age,male}age\mathbb{1}[\mbox{is male}] \\
\end{align}
$$
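Under this interaction model, women and men each get their own intercept *and* their own slope for age. Here is a brief sketch (not part of the original text) of how to read the two slopes off the fitted coefficients, assuming R's default coding with "female" as the baseline level of `gender`:
```{r, eval=FALSE}
# Slopes for age implied by the interaction model.
coefs <- coef(lm(score ~ age * gender, data = evals))
coefs["age"]                            # slope for women (the baseline group)
coefs["age"] + coefs["age:gendermale"]  # slope for men
```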
### Observed/fitted values and residuals {#model4points}
### Residual analysis {#model4residuals}
<!--
## Two categorical $x$ {#model5}
Let's look at the `biopics` dataset in the
[`fivethirtyeight`](https://rudeboybert.github.io/fivethirtyeight/) package.
After loading the package, run `?biopics` in the console to read the help file.
This data is from the article ["Straight Outta Compton" Is The Rare Biopic Not About White Dudes](https://fivethirtyeight.com/features/straight-outta-compton-is-the-rare-biopic-not-about-white-dudes/).
First let's load the data and look at a random sample of 5 rows:
```{r}
library(fivethirtyeight)
biopics <- biopics %>%
select(title, box_office, person_of_color, subject_sex) %>%
# Remove those that are missing
filter(!is.na(box_office))
```
```{r, echo=FALSE}
biopics %>%
sample_n(5) %>%
knitr::kable(digits = 1)
```
Before we conduct an exploratory data analysis (EDA) of the `biopics` data, let's first have a discussion of $\log$-transformations.
### log-transformations
Let's consider a histogram of the box office revenues for the `biopics` dataset
```{r logtransform1, echo=TRUE, message=FALSE, warning=FALSE, fig.cap="Histogram of box office revenue"}
ggplot(biopics, aes(x = box_office)) +
geom_histogram(color = "white") +
labs(x = "Box office revenue")
```
In Figure \@ref(fig:logtransform1), we see there is a right-skew to the x-values. This is because there are a few Hollywood blockbusters being compared with many (likely) smaller-scale independent films. Let's look at the top 5 and bottom 5 grossing movies in this dataset:
```{r, echo=FALSE}
biopics %>%
arrange(desc(box_office)) %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Top 5 grossing movies in data",
booktabs = TRUE
)
```
```{r, echo=FALSE}
biopics %>%
arrange(box_office) %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Bottom 5 grossing movies in data",
booktabs = TRUE
)
```
The scales of these box office revenues are completely different! Hence, in Figure \@ref(fig:logtransform1), it's really hard to see what's going on at the lower end.
Let's unskew this variable and compare not *absolute* differences, but rather,
*relative* differences i.e. differences in "order of magnitude" using a
`log10()` transformation:
```{r logtransform2, echo=TRUE, message=FALSE, warning=FALSE, fig.cap="Histogram of log10(box office revenue)"}
ggplot(biopics, aes(x = log10(box_office))) +
geom_histogram(color = "white") +
labs(x = "log10(Box office revenue)")
```
We can see a little better what's going on at the lower end of the box office revenue scale.
However, the values on the axes require a little thinking to process. For example, $x=7$ corresponds to movies with revenue of $10^7 = 10,000,000$ dollars. So instead, let's *rescale* the x-axis so that it displays the data in its original units.
```{r logtransform3, echo=TRUE, message=FALSE, warning=FALSE, fig.cap="Histogram of box office revenue (log-10 scale)"}
ggplot(biopics, aes(x = box_office)) +
geom_histogram(color = "white") +
scale_x_log10() +
labs(x = "Box office revenue (log10-scale))")
```
Note that
* The two plots are identical, but the values on the x-axis are different.
* In both Figure \@ref(fig:logtransform2) and Figure \@ref(fig:logtransform3), equivalent distances on each axis correspond not to equivalent absolute differences, but to equivalent relative/multiplicative differences. So, for example, the horizontal distance on the plot from revenue = `1e+05` = $10^5$ to revenue = `1e+06` = $10^6$ is equal to the horizontal distance on the plot from revenue = `1e+06` = $10^6$ to revenue = `1e+07` = $10^7$.
### Exploratory data analysis {#model5EDA}
Let's now consider the box office gross earnings in the US of these movies on a log10
scale:
```{r 2catxplot, echo=TRUE, warning=FALSE, fig.cap="Box office revenue vs biopic subject info"}
ggplot(biopics, aes(x = subject_sex, y = box_office)) +
facet_wrap(~person_of_color, nrow = 1) +
scale_y_log10() +
geom_boxplot() +
labs(x = "Subject sex", y = "Box office revenue (log10-scale)", title =
"Person of color?")
```
It seems that in this dataset, biopics about men of color had the highest median box office gross. Let's look at a table of means instead of medians.
```{r, eval=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(mean_box_office = mean(box_office)) %>%
arrange(desc(mean_box_office))
```
```{r, echo=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(mean_box_office = mean(box_office)) %>%
arrange(desc(mean_box_office)) %>%
knitr::kable(
digits = 3,
caption = "Group means",
booktabs = TRUE
)
```
Keep in mind two things though. First, the sample sizes are different:
```{r, eval=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(n = n()) %>%
arrange(desc(n))
```
```{r, echo=FALSE}
biopics %>%
group_by(person_of_color, subject_sex) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
knitr::kable(
digits = 3,
caption = "Number of movies of each type in dataset",
booktabs = TRUE
)
```
Second, can we *generalize* these results to *all movies*? How was this selection of movies sampled? Is it a representative sample of all movies, or was there a systematic reason these were included?
### Multiple regression {#model5table}
Let's now compute the *regression table*. Let's jump straight into considering
a model that incorporates an interaction term as described earlier.
```{r, eval=FALSE}
box_office_model <- lm(box_office ~ person_of_color * subject_sex, data = biopics)
get_regression_table(box_office_model)
```
```{r, echo=FALSE}
box_office_model <- lm(box_office ~ person_of_color * subject_sex, data = biopics)
get_regression_table(box_office_model) %>%
knitr::kable(
digits = 3,
caption = "Regression table",
booktabs = TRUE
)
```
The model formula is
$$
\begin{align}
\widehat{y} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1x_2\\
\widehat{score} &= b_0 + b_{color}\mathbb{1}[\mbox{of color}] + b_{male} \mathbb{1}[\mbox{is male}] + b_{color,male}\mathbb{1}[\mbox{of color}]\mathbb{1}[\mbox{is male}] \\
\end{align}
$$
Let's recreate the four group means from the table of group means above:
* Female not of color: 18088799 = 18088799 = $b_0$. Note: *Women not of color are the baseline group*.
* Male not of color: 22871074 = 18088799 + 4782275 = $b_0 + b_{male}$
* Female of color: 20035820 = 18088799 + 1947021 = $b_0 + b_{color}$
* Male of color: 31648028 = 18088799 + 1947021 + 4782275 + 6829933 = $b_0 + b_{color} + b_{male} + b_{color,male}$. Note: $b_{color,male}$ is the *interaction term*.
### Observed/fitted values and residuals {#model5points}
### Residual analysis {#model5residuals}
-->
## Other topics
### Simpson's Paradox {#simpsonsparadox}
Recall that in Section \@ref(model3) we saw the following two seemingly contradictory results when studying the relationship between credit card balance, credit limit, and income. On the one hand, the right-hand plot of Figure \@ref(fig:2numxplot1) suggested that credit card balance and income were positively related:
```{r echo=FALSE, fig.height=4, fig.cap="Relationship between credit card balance and credit limit/income"}
grid.arrange(model3_balance_vs_limit_plot, model3_balance_vs_income_plot, nrow = 1)
```
On the other hand, the multiple regression in Table \@ref(tab:model3-table-output) suggested that when modeling credit card balance as a function of both credit limit and income at the same time, income has a negative relationship with balance, as evidenced by the slope of -7.66 on `Income`. How can this be?
First, let's dive a little deeper into the explanatory variable `Limit`. Figure \@ref(fig:credit_limit_quartiles) shows a histogram of all `r nrow(Credit)` values of `Limit`, along with vertical red lines that cut up the data into quartiles, meaning:
1. 25% of credit limits were between \$0 and \$3088. Let's call this the "low" credit limit bracket.
1. 25% of credit limits were between \$3088 and \$4622. Let's call this the "medium-low" credit limit bracket.
1. 25% of credit limits were between \$4622 and \$5873. Let's call this the "medium-high" credit limit bracket.
1. 25% of credit limits were over \$5873. Let's call this the "high" credit limit bracket.
```{r, credit_limit_quartiles, echo=FALSE, fig.cap="Histogram of credit limits and quartiles"}
ggplot(Credit, aes(x = Limit)) +
geom_histogram(color = "white") +
geom_vline(xintercept = quantile(Credit$Limit, probs = c(0.25, 0.5, 0.75)), col = "red", linetype = "dashed")
```
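For reference, these quartile cutoffs can be recomputed directly (a quick sketch, not part of the original text; the dollar values quoted above are rounded):
```{r, eval=FALSE}
# 25th, 50th, and 75th percentiles of credit limit, i.e. the red dashed lines
# in the histogram above.
quantile(Credit$Limit, probs = c(0.25, 0.5, 0.75))
```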
Let's now display:
1. The scatterplot showing the relationship between credit card balance and income (the right-hand plot of Figure \@ref(fig:2numxplot1)).
1. The same scatterplot of credit card balance and income, now with a color aesthetic added corresponding to the credit limit bracket.
```{r, 2numxplot4, fig.height=4, echo=FALSE, fig.cap="Relationship between credit card balance and income for different credit limit brackets"}
Credit <- Credit %>%
mutate(limit_bracket = cut_number(Limit, 4)) %>%
mutate(limit_bracket = fct_recode(limit_bracket,
"low" = "[855,3.09e+03]",
"medium-low" = "(3.09e+03,4.62e+03]",
"medium-high" = "(4.62e+03,5.87e+03]",
"high" = "(5.87e+03,1.39e+04]"
))
model3_balance_vs_income_plot_colored <- ggplot(Credit, aes(x = Income, y = Balance, col = limit_bracket)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
color = "Credit limit\nbracket", title = "Balance vs income") +
theme(legend.position = "bottom")
grid.arrange(model3_balance_vs_income_plot, model3_balance_vs_income_plot_colored, nrow = 1)
#cowplot::plot_grid(model3_balance_vs_income_plot, model3_balance_vs_income_plot_colored, nrow = 1, rel_widths = c(2/5, 3/5))
```
The left-hand plot focuses on the relationship between balance and income in aggregate, whereas the right-hand plot breaks this relationship down *by credit limit bracket*. While in aggregate there is an overall positive relationship, when broken down we now see that for the
1. low
1. medium-low
1. medium-high
credit limit bracket groups, the strong positive relationship between credit card balance and income disappears! Only for the high bracket does the relationship stay somewhat positive. In this example, credit limit is a *confounding variable* for credit card balance and income.
<!--
Alternatively, we could also have used facets, where each facet has roughly 25% of people based
on the credit limit bracket. However, IMO the above plot is easier to read.
```{r, 2numxplot5, echo=FALSE, warning=FALSE, fig.cap="Relationship between credit card balance and income for different credit limit brackets"}
ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
facet_wrap(~limit_bracket) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)")
```
-->
### Script of R code
An R script file of all R code used in this chapter is available [here](https://moderndive.netlify.com/scripts/07-multiple-regression.R).