# (PART) Data Modeling via moderndive {-}
# Basic Regression {#regression}
```{r, include=FALSE, purl=FALSE}
chap <- 6
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = "\\textwidth",
message = FALSE,
warning = FALSE
)
options(scipen = 99, digits = 3)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC
# solutions, which are written like markdown text. In theory, it shouldn't be
# necessary for knitr versions <=1.11.6, but I've found I still need to for
# everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL"
# (to show all solutions), or subsets of c('5-1', '5-2','5-3', '5-4'), including
# the null vector c("") to show no solutions.
solutions_shown <- c("")
show_solutions <- function(section){
return(solutions_shown == "ALL" | section %in% solutions_shown)
}
```
Now that we are equipped with data visualization skills from Chapter \@ref(viz), data wrangling skills from Chapter \@ref(wrangling), and an understanding of the "tidy" data format from Chapter \@ref(tidy), we can proceed with data modeling. The fundamental premise of data modeling is *to model the relationship* between:
* An outcome variable $y$, also called a dependent variable and
* An explanatory/predictor variable $x$, also called an independent variable or covariate.
Another way to state this is using mathematical terminology: we will model the outcome variable $y$ *as a function* of the explanatory/predictor variable $x$. Why do we have two different labels, explanatory and predictor, for the variable $x$? That's because roughly speaking data modeling can be used for two purposes:
1. **Modeling for prediction**: You want to predict an outcome variable $y$ based on the information contained in a set of predictor variables. You don't care so much about understanding how all the variables relate and interact; as long as you can make good predictions about $y$, you're fine. For example, if we know many individuals' risk factors for lung cancer, such as smoking habits and age, can we predict whether or not they will develop lung cancer? Here we wouldn't care so much about distinguishing the degree to which the different risk factors contribute to lung cancer, but only about whether or not they could be put together to make reliable predictions.
1. **Modeling for explanation**: You want to explicitly describe the relationship between an outcome variable $y$ and a set of explanatory variables, determine the significance of any found relationships, and have measures summarizing these. Continuing our example from above, we would now be interested in describing the individual effects of the different risk factors and quantifying the magnitude of these effects. One reason could be to design an intervention to reduce lung cancer cases in a population, such as targeting smokers of a specific age group with advertisement for smoking cessation programs. In this book, we'll focus more on this latter purpose.
Data modeling is used in a wide variety of fields, including statistical inference, causal inference, artificial intelligence, and machine learning. There are many techniques for data modeling, such as tree-based models, neural networks/deep learning, and more. However, we'll focus on one particular technique: *linear regression*, one of the most commonly-used and easy to understand approaches to modeling. Recall our discussion in Chapter \@ref(explore-dataframes) on numerical and categorical variables. Linear regression involves:
* An outcome variable $y$ that is *numerical*
* Explanatory variables $\vec{x}$ that are either *numerical* or *categorical*
Whereas there is always only one numerical outcome variable $y$, we have choices on both the number and the type of explanatory variables $\vec{x}$ to use. We're going to cover the following regression scenarios:
* In this chapter, Chapter \@ref(regression) on basic regression, where we'll always have only one explanatory variable:
+ A single numerical explanatory variable $x$ in Section \@ref(model1). This scenario is known as *simple linear regression*.
+ A single categorical explanatory variable $x$ in Section \@ref(model2).
* In the next chapter: Chapter \@ref(multiple-regression) on *multiple regression*, where we'll have more than one explanatory variable:
+ Two numerical explanatory variables $x_1$ and $x_2$ in Section \@ref(model3).
+ One numerical and one categorical explanatory variable in Section \@ref(model3). We'll also introduce *interaction models* here, where the effect of one explanatory variable depends on the value of another.
We'll study all 4 of these regression scenarios using real data, all easily accessible via R packages! <!--We will also discuss the concept of *correlation* and how it is frequently incorrectly used to imply *causation*.-->
### Needed packages {-}
In this chapter we introduce a new package, `moderndive`, an accompanying package to the ModernDive book that includes useful functions for linear regression as well as other functions and data used later in the book.
<!--
However, because this is a relatively new package, we have to install it in a different way than was covered in Section \@ref(packages).
First, install the `remotes` package like you would normally install a package, then run the following line to install the `moderndive` package:
-->
```{r, message=FALSE, warning=FALSE, include=FALSE}
# library(remotes)
# remotes::install_github("moderndive/moderndive")
```
Let's now load all the packages needed for this chapter. If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(moderndive)
```
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(mvtnorm)
library(tidyr)
library(forcats)
library(gridExtra)
```
## One numerical x {#model1}
Why do some professors and instructors at universities and colleges get high teaching evaluations from students while others don't? What factors can explain these differences? Are there biases? These are questions of interest to university/college administrators, as teaching evaluations are among the many criteria considered in determining which professors and instructors should get promotions. Researchers at the University of Texas at Austin tried to answer this question: what factors can explain differences in instructors' teaching evaluation scores? To this end, they collected information on n = 463 instructors. A full description of the study can be found at [openintro.org](https://www.openintro.org/stat/data/?data=evals).
We'll keep things simple for now and try to explain differences in instructor evaluation scores as a function of one numerical variable: their "beauty score," which we'll describe shortly. Could it be that instructors with higher beauty scores also have higher teaching evaluations? Could it be instead that instructors with higher beauty scores tend to have lower teaching evaluations? Or could it be that there is no relationship between beauty score and teaching evaluations?
We'll achieve this by modeling the relationship between these two variables with a particular kind of linear regression called *simple linear regression*. Simple linear regression is the most basic form of linear regression where we have
1. A numerical outcome variable $y$. In this case their teaching score.
1. A single numerical explanatory variable $x$. In this case their beauty score.
### Exploratory data analysis {#model1EDA}
However, a crucial step before doing any kind of modeling or analysis is performing an *exploratory data analysis*, or EDA, of all our data. Exploratory data analysis can give you a sense of the distribution of the data and whether there are outliers and/or missing values, but most importantly it can inform how to build your model. There are many approaches to exploratory data analysis; here are three:
1. Most fundamentally: just looking at the raw values, in a spreadsheet for example. While this may seem trivial, many people ignore this crucial step!
1. Compute summary statistics like means, medians, and standard deviations.
1. Create data visualizations.
Let's load the data, `select()` only the two variables of interest, and look at the raw values. Recall you can look at the raw values by running `View(evals)` in the console to bring up RStudio's spreadsheet viewer. Here however we present only a snapshot of 5 randomly chosen rows:
```{r}
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
select(score, bty_avg)
```
```{r, echo=FALSE}
set.seed(76)
evals %>%
sample_n(5) %>%
knitr::kable(
digits = 3,
caption = "Random sample of 5 instructors",
booktabs = TRUE
)
```
While a full description of each of these variables can be found at [openintro.org](https://www.openintro.org/stat/data/?data=evals), let's summarize what each of these variables means:
1. `score`: Numerical variable of the average teaching score based on students' evaluations between 1 and 5. This is the outcome variable $y$ of interest.
1. `bty_avg`: Numerical variable of average "beauty" rating based on a panel of 6 students' scores between 1 and 10. This is the numerical explanatory variable $x$ of interest.
<!--
1. `ethnicity`: Categorical variable of with two levels, `minority` or `non-minority`.
1. `gender`: Categorical variable of (binary) gender `men` or `women`
1. `language`: Categorical variable of the instructor's native language: `english` or `non-english`.
1. `age`: A numerical variable of age (recall our discussion in Chapter \@ref(explore-dataframes) on why `age` is numerical here).
-->
Another way to look at the raw values is using the `glimpse()` function, which gives us a slightly different view of the data. We see that `Observations: 463`, indicating that there are 463 observations in `evals`, each corresponding to a particular instructor at UT Austin. Expressed differently, each row in the data frame `evals` corresponds to one of the 463 instructors.
```{r}
glimpse(evals)
```
Since both the outcome variable `score` and the explanatory variable `bty_avg` are numerical, we can compute summary statistics about them such as the mean and median. Let's take `evals`, then select only the two variables of interest for now, and pipe them into the `summary()` command which returns: the minimum (smallest) value, the first quartile, the median, the mean (average), the third quartile, and the maximum (largest) value.
```{r}
evals %>%
summary()
```
We get an idea of how the values in both variables are distributed. For example, the average teaching score was 4.17 out of 5 whereas the average beauty score was 4.42 out of 10. Furthermore, the middle 50% of teaching scores were between 3.80 and 4.6 (the first and third quartiles) while the middle 50% of beauty scores were between 3.17 and 5.5 out of 10.
The `summary()` function however only returns what are called *univariate* summaries i.e. summaries about single variables at a time. Since we are considering the relationship between two numerical variables, it would be nice to have a summary statistic that simultaneously considers both variables. The *correlation coefficient* is a *bivariate* summary statistic that fits this bill. *Coefficients* in general are quantitative expressions of a specific property of a phenomenon. A correlation coefficient is a quantitative expression between -1 and 1 that summarizes the *strength of the linear relationship between two numerical variables*:
* -1 indicates a perfect *negative relationship*: as the value of one variable goes up, the value of the other variable goes down.
* 0 indicates no relationship: the values of both variables go up/down independently of each other.
* +1 indicates a perfect *positive relationship*: as the value of one variable goes up, the value of the other variable goes up as well.
Figure \@ref(fig:correlation1) gives examples of different correlation coefficient values for hypothetical numerical variables $x$ and $y$. We see that while for a correlation coefficient of -0.75 there is still a negative relationship between $x$ and $y$, it is not as strong as the negative relationship between $x$ and $y$ when the correlation coefficient is -1.
```{r correlation1, echo=FALSE, fig.cap="Different correlation coefficients"}
correlation <- c(-0.9999, -0.75, 0, 0.75, 0.9999)
n_sim <- 100
values <- NULL
for(i in 1:length(correlation)){
rho <- correlation[i]
sigma <- matrix(c(5, rho * sqrt(50), rho * sqrt(50), 10), 2, 2)
sim <- rmvnorm(
n = n_sim,
mean = c(20,40),
sigma = sigma
) %>%
as_data_frame() %>%
mutate(correlation = round(rho, 2))
values <- bind_rows(values, sim)
}
ggplot(data = values, mapping = aes(V1, V2)) +
geom_point() +
facet_wrap(~ correlation, nrow = 2) +
labs(x = "x", y = "y") +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank()
)
```
The correlation coefficient is computed using the `cor()` function, where in this case the inputs to the function are the two numerical variables we want to calculate the correlation coefficient of. Recall from Chapter \@ref(explore-dataframes) that the `$` pulls out specific variables from a data frame:
```{r}
cor(evals$score, evals$bty_avg)
```
In our case, the correlation coefficient of `r cor(evals$score, evals$bty_avg) %>% round(3)` indicates that the relationship between teaching evaluation score and beauty average is "weakly positive." There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that aren't close to -1, 0, and 1. For help developing such intuition and more discussion on the correlation coefficient see Section \@ref(correlationcoefficient) below.
Let's now proceed by visualizing this data. Since both the `score` and `bty_avg` variables are numerical, a scatterplot is an appropriate graph to visualize this data. Let's do this using `geom_point()` and set informative axes labels and title.
```{r numxplot1, warning=FALSE, fig.cap="Instructor evaluation scores at UT Austin"}
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point() +
labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship of teaching and beauty scores")
```
However, Figure \@ref(fig:numxplot1) suffers from *overplotting*. Recall from the data visualization chapter (Section \@ref(overplotting)) that overplotting occurs when several points are stacked directly on top of each other, thereby obscuring the number of points. For example, let's focus on the 6 points in the top-right of the plot with a beauty score of around 8 out of 10: are there truly only 6 points, or are there many more just stacked on top of each other? You can think of these as *ties*. Let's break up these ties with a little random "jitter" added to the points in Figure \@ref(fig:numxplot2). Jittering adds a little random bump to each of the points to break up these ties. Note that `geom_jitter()` only alters the visual display of the points; the values in the data frame stay the same.
```{r numxplot2, warning=FALSE, fig.cap="Instructor evaluation scores at UT Austin: Jittered"}
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_jitter() +
labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship of teaching and beauty scores")
```
From Figure \@ref(fig:numxplot2) we make several observations:
1. Focusing our attention on the top-right of the plot again, we now see that those originally unjittered 6 points were actually 12!
1. A further interesting trend is that the jittering revealed a large number of instructors with beauty scores of between 3 and 4.5, towards the lower end of the beauty scale.
1. Most beauty scores lie between 2 and 8.
1. Most teaching scores lie between 3 and 5.
1. Recall our earlier computation of the correlation coefficient which describes the strength of the linear relationship between two numerical variables. Looking at Figure \@ref(fig:numxplot2), it is not immediately apparent that these two variables are positively related. This is to be expected given the positive, but rather weak (close to 0), correlation coefficient of `r cor(evals$score, evals$bty_avg) %>% round(3)`.
Let's improve on Figure \@ref(fig:numxplot2) by adding a "regression line" in Figure \@ref(fig:numxplot3). This is easily done by adding a new layer to the `ggplot` code that created Figure \@ref(fig:numxplot2): `+ geom_smooth(method="lm")`. A regression line is a "best fitting" line in that of all possible lines you could draw on this plot, it is "best" in terms of some mathematical criteria. We discuss the criteria for "best" in Section \@ref(leastsquares), but we suggest you read this only after covering the concept of a *residual* coming up in Chapter \@ref(model1points).
```{r numxplot3, warning=FALSE, fig.cap="Regression line"}
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_jitter() +
labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship of teaching and beauty scores") +
geom_smooth(method = "lm")
```
When viewed on this plot, the regression line is a visual summary of the relationship between two numerical variables, in our case the outcome variable `score` and the explanatory variable `bty_avg`. The positive slope of the blue line is consistent with our observed correlation coefficient of `r cor(evals$score, evals$bty_avg) %>% round(3)` suggesting that there is a positive relationship between `score` and `bty_avg`. We'll see later however that while the correlation coefficient is not equal to the slope of this line, they always have the same sign: positive or negative.
What are the grey bands surrounding the blue line? These are *standard error* bands, which can be thought of as error/uncertainty bands. Let's skip this idea for now and suppress these grey bands by adding the argument `se = FALSE` to `geom_smooth(method = "lm")`. We'll introduce standard errors in Chapter \@ref(sampling) on sampling, use them for constructing *confidence intervals* and conducting *hypothesis tests* in Chapters \@ref(ci) and \@ref(ht), and consider them when we revisit regression in Chapter \@ref(inference-for-regression).
```{r numxplot4, warning=FALSE, fig.cap="Regression line without error bands"}
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_jitter() +
labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship of teaching and beauty scores") +
geom_smooth(method = "lm", se = FALSE)
```
### Simple linear regression {#model1table}
In case you've forgotten from high school algebra, in general the equation of a line is $y = a + bx$, which is defined by two coefficients (which we defined earlier as "quantitative expressions of a specific property of a phenomenon"):
* the intercept coefficient $a$, or the value of $y$ when $x=0$, and
* the slope coefficient $b$, or the increase in $y$ for every increase of one in $x$
However, when defining a line specifically for regression, like the blue regression line in Figure \@ref(fig:numxplot4), we use slightly different notation: the equation of the regression line is $\widehat{y} = b_0 + b_1 x$ where
* the intercept coefficient is $b_0$, or the value of $\widehat{y}$ when $x=0$, and
* the slope coefficient $b_1$, or the increase in $\widehat{y}$ for every increase of one in $x$
Why do we put a "hat" on top of the $y$? It's a form of notation specific to regression, which we'll introduce in the next subsection \@ref(model1points) when we discuss *fitted values*. For now, let's ignore the hat and treat the equation of the line as you would from high school algebra. We know from looking at Figure \@ref(fig:numxplot4) that the slope coefficient corresponding to `bty_avg` should be positive. Why? Because as `bty_avg` increases, instructors tend to have higher teaching evaluation `score`s. However, what are the specific values of the intercept and slope coefficients? Let's not worry about computing these by hand, but instead let the computer do the work for us, specifically R!
Let's get the value of the intercept and slope coefficients by outputting something called the *linear regression table*. This is always done in a two-step process:
1. First "fit" the linear regression model to the `data` using the `lm()` function. `lm` stands for "linear model", given that we are dealing with lines.
1. Then apply the `get_regression_table()` function from the `moderndive` R package to `score_model`
```{r, eval=FALSE}
score_model <- lm(score ~ bty_avg, data = evals)
score_model %>%
  get_regression_table(digits = 2)
```
```{r, echo=FALSE}
score_model <- lm(score ~ bty_avg, data = evals)
evals_line <- score_model %>%
get_regression_table() %>%
pull(estimate)
```
```{r numxplot4b, echo=FALSE}
score_model %>%
get_regression_table() %>%
knitr::kable(
digits = 3,
caption = "Linear regression table",
booktabs = TRUE
)
```
Whoa! There is a lot going on, both in terms of the inputs and outputs! Let's unpack this slowly. First, the `lm()` function that "fits" the linear regression model is typically used as `lm(y ~ x, data = DATA_FRAME_NAME)` where:
* `y` is the outcome variable, followed by a tilde `~`, the key to the left of "1" on your keyboard. In our case, `y` is set to `score`.
* `x` is the explanatory variable. In our case, `x` is set to `bty_avg`. We call the combination `y ~ x` a *model formula*.
* `DATA_FRAME_NAME` is the name of the data frame that contains the variables `y` and `x`. In our case, this is the `evals` data frame.
Then we pipe this output to be the input of the `get_regression_table()` function, just as we discussed with piping in Section \@ref(piping) in the data wrangling chapter. An additional argument to the `get_regression_table()` function is `digits`, where we specify the number of digits of precision (number of digits after the decimal point) we want the regression table to have. `digits` defaults to 3, meaning if you don't specify this argument, `digits = 3` is used by default. We'd like to point out that all the `get_regression_table()` function in the `moderndive` package does is generate regression table outputs that are clean and easy-to-read while hiding a lot of the code necessary to do so, and not much else. This is known as a *wrapper function* in computer programming: a function that takes other functions and "wraps" them in a single function. While not necessary to understand regression, if you are curious to know what is going on under the hood of `get_regression_table()`, see Section \@ref(underthehood) below.
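To give a rough sense of the "wrapper" idea, here is a minimal sketch of the kind of wrapping a function like `get_regression_table()` performs. This is only an illustration on our part using the `broom` package, not the actual `moderndive` source code:
```{r, eval=FALSE}
# An illustrative sketch (not moderndive's actual implementation): the
# broom package's tidy() function also converts an lm object into a data
# frame of terms, estimates, standard errors, test statistics, p-values,
# and confidence intervals.
library(broom)
lm(score ~ bty_avg, data = evals) %>%
  tidy(conf.int = TRUE)
```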
<!--
1. `print` is a `TRUE`/`FALSE` value that sets whether or not the output should be in a print friendly format (specifically in Markdown table format suitable for use in R Markdown documents). `print` defaults to `FALSE`, meaning if you don't specify this argument, `print = FALSE` is used by default. We'll ignore the use of the `print` argument until Chapter \@ref(thinking-with-data) on "thinking with data."
-->
Now let's consider the outputted regression table, which has two rows denoted by the first column `term`: one corresponding to the intercept coefficient $b_0$ and one corresponding to the slope coefficient $b_1$ for `bty_avg`. The second column `estimate` gives us the "fitted" (or computed) values for both these coefficients. Therefore the blue regression line in Figure \@ref(fig:numxplot4) is $\widehat{\text{score}} = b_0 + b_{\text{bty avg}} \text{bty avg} = `r evals_line[1]` + `r evals_line[2]`\text{bty avg}$ where
* The intercept coefficient $b_0$ = `r evals_line[1]`, meaning instructors with a hypothetical beauty score of 0 would on average have a teaching score of `r evals_line[1]`. In this case however, while the intercept has a mathematical interpretation when defining the regression line, there is no *practical* interpretation: since `bty_avg` is an average of a panel of 6 students' ratings from 1 to 10, a `bty_avg` of 0 is impossible. Furthermore, no instructor had a beauty score anywhere near 0.
* Of more interest is the slope coefficient associated with `bty_avg`: $b_{\text{bty avg}}$ = `r evals_line[2]`. This is a numerical quantity that summarizes the relationship between the outcome and explanatory variables. It is interpreted as follows: for every increase of 1 unit in `bty_avg`, there is an *associated* increase of *on average* `r evals_line[2]` units of `score`. We note in particular that the sign of this slope is positive, suggesting a positive relationship between beauty scores and teaching scores. We are very careful with our wording (see also the quick numerical check after this list):
+ We only stated that there is an *associated* increase, and not necessarily a *causal* increase. For example, perhaps it's not that beauty directly affects teaching scores, but instead that individuals from wealthier backgrounds tend to have had better education and training, and hence have higher teaching scores, while these same individuals also have higher beauty scores. Avoiding such reasoning can be summarized by the adage "correlation is not necessarily causation". In other words, just because two variables are correlated, it doesn't mean one directly causes the other. We discuss these ideas more in Section \@ref(correlation-is-not-causation).
+ We say that this associated increase is *on average* `r evals_line[2]` units of teaching `score` and not that the associated increase is *exactly* `r evals_line[2]` units of `score` across all values of `bty_avg`. This is because the slope is the average increase across all points as shown by the regression line in Figure \@ref(fig:numxplot4).
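As a quick numerical check of this slope interpretation, we can plug two beauty scores that differ by 1 unit into the fitted equation using the rounded coefficients from the regression table (the particular values 5 and 6 are arbitrary choices of ours; any two values of `bty_avg` one unit apart would do):
```{r, eval=FALSE}
# Fitted teaching scores at beauty scores of 5 and 6, using the rounded
# coefficients b0 = 3.880 and b1 = 0.067 from the regression table:
b0 <- 3.880
b1 <- 0.067
b0 + b1 * 5
b0 + b1 * 6
# Their difference is exactly the slope coefficient b1:
(b0 + b1 * 6) - (b0 + b1 * 5)
```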
But what about the remaining 5 columns: `std_error`, `statistic`, `p_value`, `conf_low` and `conf_high`? They give you information on the *statistical significance* of these results, or their "meaningfulness." We'll revisit these in Chapter \@ref(inference-for-regression) on (statistical) inference for regression after we've covered standard errors in Chapter \@ref(sampling) (`std_error`), confidence intervals in Chapter \@ref(ci) (`conf_low` and `conf_high`), and hypothesis testing in Chapter \@ref(hypo) (`statistic` and `p_value`). For now, we'll only focus on the `term` and `estimate` columns.
### Observed/fitted values and residuals {#model1points}
We just saw how to get the value of the intercept and the slope of the regression line from the regression table generated by `get_regression_table()`. Now instead, say we want information on individual points, in this case one of the `n=463` instructors in this dataset, one corresponding to each row of `evals`. For example, say we are interested in the 21st instructor in this dataset:
```{r, echo=FALSE}
index <- which(evals$bty_avg == 7.333 & evals$score == 4.9)
target_point <- score_model %>%
get_regression_points() %>%
slice(index)
x <- target_point$bty_avg
y <- target_point$score
y_hat <- target_point$score_hat
resid <- target_point$residual
evals %>%
slice(index) %>%
knitr::kable(
digits = 3,
caption = "Data for 21st instructor",
booktabs = TRUE
)
```
What is the value on the blue line corresponding to this instructor's `bty_avg` of `r x`? In Figure \@ref(fig:numxplot5) we mark three values in particular corresponding to this instructor:
* The red circle corresponds to the *observed value* $y$ = `r y`, this instructor's observed teaching score.
* The red square corresponds to the *fitted value* $\widehat{y}$ = `r y_hat`, the value on the regression line for $x = `r x`$. We compute this value using the intercept and slope in the regression table above: $\widehat{y} = b_0 + b_1 x = `r evals_line[1]` + `r evals_line[2]` \times `r x` = `r y_hat`$
* The length of the blue arrow is the *residual* $y - \widehat{y}$ = `r y` - `r y_hat` = `r y-y_hat`. The residual can be thought of as the error/lack-of-fit for this particular point. So the bigger the residual, the less the fitted value $\widehat{y}$ matches up with the observed value $y$.
```{r numxplot5, echo=FALSE, warning=FALSE, fig.cap="Example of observed value, fitted value, and residual"}
best_fit_plot <- ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point() +
labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship of teaching and beauty scores") +
geom_smooth(method = "lm", se = FALSE) +
annotate("point", x = x, y = y, col = "red", size = 3) +
annotate("point", x = x, y = y_hat, col = "red", shape = 15, size = 3) +
annotate("segment", x = x, xend = x, y = y, yend = y_hat, color = "blue",
arrow = arrow(type = "closed", length = unit(0.02, "npc")))
best_fit_plot
```
What if we want both
1. the fitted value $\widehat{y} = b_0 + b_1 \times x$
1. the residual $y - \widehat{y}$
not only for the 21st instructor but
* for all `r nrow(evals)` instructors in the study, in other words
* for all `r nrow(evals)` rows in the `evals` data frame, in other words
* for all `r nrow(evals)` points in the regression plot in Figure \@ref(fig:numxplot4)?
We could repeat the above calculations 463 times, but that would be tedious and time consuming. Instead, let's use the `get_regression_points()` function that we've included in the `moderndive` R package. Note that in the table below we only present the results for 5 arbitrarily chosen rows out of `r nrow(evals)`.
```{r, eval=FALSE}
regression_points <- get_regression_points(score_model)
regression_points
```
```{r, echo=FALSE}
set.seed(76)
regression_points <- get_regression_points(score_model)
regression_points %>%
slice(c(index, sample(1:nrow(evals), 4))) %>%
knitr::kable(
digits = 3,
caption = "Regression points (5 arbitrarily chosen rows out of 463)",
booktabs = TRUE
)
```
The inputs to the `get_regression_points()` function are the same as those for `get_regression_table()`; the outputs, however, are different. Let's inspect the individual columns:
* The `score` column represents the observed value of the outcome variable $y$
* The `bty_avg` column represents the values of the explanatory variable $x$
* The `score_hat` column represents the fitted values $\widehat{y}$
* The `residual` column represents the residuals $y - \widehat{y}$
To prove to ourselves that these results make sense, let's manually replicate the internal computation that `get_regression_points()` performs using the `mutate()` verb from Chapter \@ref(wrangling):
1. First we compute a duplicate of the fitted values $\widehat{y} = b_0 + b_1 \times x$, where the intercept $b_0$ = `r evals_line[1]` and the slope $b_1$ = `r evals_line[2]`, and store these in `score_hat_2`.
1. Then we compute a duplicate of the residuals $y - \widehat{y}$ and store these in `residual_2`.
```{r, eval=FALSE}
regression_points <- regression_points %>%
mutate(score_hat_2 = 3.880 + 0.067 * bty_avg) %>%
mutate(residual_2 = score - score_hat_2)
regression_points
```
```{r, echo=FALSE}
set.seed(76)
regression_points <- regression_points %>%
mutate(score_hat_2 = 3.880 + 0.067 * bty_avg) %>%
mutate(residual_2 = score - score_hat_2)
regression_points %>%
slice(c(index, sample(1:nrow(evals), 4))) %>%
knitr::kable(
digits = 3,
caption = "Regression points (5 arbitrarily chosen rows out of 463)",
booktabs = TRUE
)
```
We see that our manually computed variables `score_hat_2` and `residual_2` equal the original results (up to rounding error).
At this point, we suggest you read Section \@ref(leastsquares), where we explicitly define how a regression line is a "best" fitting line.
### Residual analysis {#model1residuals}
Recall that the residuals can be thought of as the error or the "lack-of-fit" between the observed value $y$ and the fitted value $\widehat{y}$ on the blue regression line in Figure \@ref(fig:numxplot4). Ideally when we fit a regression model, we'd like there to be *no systematic pattern* to these residuals. In other words, the error should be seemingly random. What does this mean?
1. The residuals should be on average 0. In other words, sometimes we'll make a positive error, sometimes we'll make a negative error, but on average the error is 0 (see the quick numerical check after this list).
1. The spread of the residuals should be consistent.
1. The value of the residuals should not depend on the value of $x$. If it did, then the errors would not be seemingly random.
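As a minimal numerical check of the first condition, we can average the `residual` column of `regression_points`; for a regression line fit in this way, the residuals average out to essentially zero:
```{r, eval=FALSE}
# The residuals should average out to (essentially) zero.
regression_points %>%
  summarize(mean_residual = mean(residual))
```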
Investigating any such patterns is known as *residual analysis*. Let's perform our residual analysis in two ways. First, recall in Figure \@ref(fig:numxplot5) above we plotted:
* On the y-axis: $y$ teaching score
* On the x-axis: $x$ beauty score
* Blue arrow: one example of a residual
Instead, in Figure \@ref(fig:numxplot6) below let's plot
* On the y-axis: the residual $y-\widehat{y}$ instead
* On the x-axis: $x$ beauty score (same as before)
```{r numxplot6, echo=FALSE, warning=FALSE, fig.cap="Plot of residuals over beauty score"}
ggplot(regression_points, aes(x = bty_avg, y = residual)) +
geom_point() +
labs(x = "Beauty Score", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1) +
annotate("point", x = x, y = resid, col = "red", size = 3) +
annotate("point", x = x, y = 0, col = "red", shape = 15, size = 3) +
annotate("segment", x = x, xend = x, y = resid, yend = 0, color = "blue",
arrow = arrow(type = "closed", length = unit(0.02, "npc")))
```
You can think of Figure \@ref(fig:numxplot6) as Figure \@ref(fig:numxplot5), but
with the blue line flattened out. Does it seem like there is *no systematic pattern*
to the residuals? This question is rather qualitative and subjective in nature; thus different people may respond differently. However, it can be argued that there isn't a *drastic* pattern in the residuals.
Here are some hypothetical examples where there are *drastic* patterns to the
residuals:
```{r numxplot7, echo=FALSE, warning=FALSE, fig.cap="Examples of less than ideal residual patterns"}
resid_ex <- evals
resid_ex$ex_1 <- ((evals$bty_avg - 5) ^ 2 - 6 + rnorm(nrow(evals), 0, 0.5)) * 0.4
resid_ex$ex_2 <- (rnorm(nrow(evals), 0, 0.075 * evals$bty_avg ^ 2)) * 0.4
resid_ex <- resid_ex %>%
select(bty_avg, ex_1, ex_2) %>%
gather(type, eps, -bty_avg) %>%
mutate(type = ifelse(type == "ex_1", "Example 1", "Example 2"))
ggplot(resid_ex, aes(x = bty_avg, y = eps)) +
geom_point() +
labs(x = "Beauty Score", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1) +
facet_wrap(~type)
```
The second way to look at the residuals is using a histogram:
```{r model1_residuals_hist, warning=FALSE, fig.cap="Histogram of residuals"}
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual")
```
This histogram seems to indicate that we have more positive residuals than negative ones. Since the residual $y-\widehat{y} > 0$ when $y > \widehat{y}$, it seems the fitted teaching scores from the regression model tend to *underestimate* the true teaching scores. This histogram has a slight *left-skew* in that there is a long tail on the left. Another way to say this is that these data exhibit a *negative skew*. Here are examples of ideal and less than ideal patterns of residuals:
```{r numxplot9, echo=FALSE, warning=FALSE, fig.cap="Examples of ideal and less than ideal residual patterns"}
resid_ex <- evals
resid_ex$`Ideal` <- rnorm(nrow(resid_ex), 0, sd = sd(regression_points$residual))
resid_ex$`Less than ideal` <-
rnorm(nrow(resid_ex), 0, sd = sd(regression_points$residual))^2
resid_ex <- resid_ex %>%
select(bty_avg, `Ideal`, `Less than ideal`) %>%
gather(type, eps, -bty_avg)
ggplot(resid_ex, aes(x = eps)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual") +
facet_wrap( ~ type, scales = "free")
```
In fact, we'll see later on that we would like the residuals to be *normally distributed* with mean 0. In other words, be bell-shaped and centered at 0! While this requirement and residual analysis in general may strike some of you as not being overly critical at this point, we'll see later, after we've covered *statistical inference*, that for the last five columns of the regression table from earlier (std error, statistic, p-value, conf low, and conf high) to have valid interpretations, the above three conditions should roughly hold. More on this in Section .
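If you want an additional, optional visual check of this "bell-shaped and centered at 0" condition beyond the histogram above, one common tool (not otherwise covered in this chapter, and assuming a recent version of `ggplot2`) is a quantile-quantile (Q-Q) plot of the residuals:
```{r, eval=FALSE}
# Optional extra check: in a Q-Q plot, residuals that are roughly normally
# distributed fall close to the reference line.
ggplot(regression_points, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line()
```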
### Your Turn
Repeat the above analysis where teaching `score` is the outcome variable, but now use `age` as the explanatory variable. Recall the steps:
```{r}
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
select(score, age)
```
1. Perform an exploratory data analysis (EDA) by
a) Looking at the raw values
a) Computing summary statistics of the variables of interest
a) Creating informative visualizations
1. Fit a linear regression model and get information about the "best-fitting" line from the regression table by using the `get_regression_table()` function
1. Get information on all $n$ data points AKA all $n$ rows in the dataset AKA all instructors, in particular the fitted values and the residuals, using the `get_regression_points()` function.
1. Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern.
## One categorical x {#model2}
It's an unfortunate truth that life expectancy is not the same across various countries in the world; there are a multitude of factors that are associated with how long people live. International development agencies are very interested in studying these differences in the hope of understanding where governments should allocate resources to address this problem. One way to compare differences in life expectancy is between continents: "Do countries in certain continents have higher life expectancy?" or "Do certain continents have a lot of variation in life expectancy?"
To this end, let's study the `gapminder` data frame in the `gapminder` package. Recall we introduced this dataset in Chapter \@ref(gapminder) when we first studied the "Grammar of Graphics"; in particular Figure \@ref(fig:gapminder). This dataset has development data about various countries for 5-year intervals between 1952 and 2007. Let's have
* Outcome variable $y$: Mean life expectancy in 2007 for $n=142$ countries
* Explanatory variables $x$
+ continent
+ GDP per capita
Let's load the `gapminder` data, `filter()` for only observations in 2007, `select()` only a subset of the variables, and save this in a data frame `gapminder2007`.
```{r, warning=FALSE, message=FALSE}
library(gapminder)
gapminder2007 <- gapminder %>%
filter(year == 2007) %>%
select(country, continent, lifeExp, gdpPercap)
```
In this section we'll try to explain between-country differences in life expectancy as a function of which continent the country is located in. We'll model the relationship between these two variables with linear regression again, but note that our explanatory variable $x$ is now categorical, and not numerical like when we covered simple linear regression in Section \@ref(model1). We now have:
1. A numerical outcome variable $y$, in this case `lifeExp`
1. A single *categorical* explanatory variable $x$, in this case `continent`
The concept of a "best-fitting" line is a little different when the explanatory variable $x$ is categorical; we'll study these differences shortly.
### Exploratory data analysis {#model2EDA}
Let's look at the raw data values, both by bringing up RStudio's spreadsheet viewer (although in Table \@ref(tab:model2-data-preview) we only show 5 randomly selected countries out of `r nrow(gapminder2007)`) and by using the `glimpse()` function:
```{r, eval=FALSE}
View(gapminder2007)
```
```{r model2-data-preview, echo=FALSE}
gapminder2007 %>%
sample_n(5) %>%
knitr::kable(
digits = 3,
caption = "Random sample of 5 countries",
booktabs = TRUE
)
```
```{r}
glimpse(gapminder2007)
```
We see that the variable `continent` is indeed categorical, as it is encoded as `fctr` which stands for "factor": R's way of storing categorical variables. Let's look at a summary of the explanatory variable `continent`:
```{r}
summary(gapminder2007$continent)
```
We observe that all continents have 25 countries or more, except Oceania, which only has two: Australia and New Zealand. Let's now compute some summary statistics of the outcome variable `lifeExp`, in particular the worldwide median and mean life expectancy:
```{r, eval=TRUE}
lifeExp_worldwide <- gapminder2007 %>%
summarize(median = median(lifeExp), mean = mean(lifeExp))
```
```{r, echo=FALSE}
lifeExp_worldwide %>%
knitr::kable(
digits = 3,
caption = "Worldwide life expectancy",
booktabs = TRUE
)
```
Worldwide roughly half the countries had life expectancies of `r lifeExp_worldwide$median %>% round(3)` years or less, while roughly half were higher. The mean however is lower at `r lifeExp_worldwide$mean %>% round(3)`. Why are these two values different? Let's look at a histogram of `lifeExp` to see why.
```{r}
ggplot(gapminder2007, aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color = "white") +
labs(x = "Life expectancy", y = "Number of countries", title = "Worldwide life expectancy")
```
We say that this data is *left-skewed*: there are a few countries with very low life expectancies that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers. Hence the median is greater than the mean in this case. Let's proceed by comparing median and mean life expectancy between continents by adding a `group_by(continent)` to the above code:
```{r, eval=TRUE}
lifeExp_by_continent <- gapminder2007 %>%
group_by(continent) %>%
summarize(median = median(lifeExp), mean = mean(lifeExp))
```
```{r catxplot0, echo=FALSE}
lifeExp_by_continent %>%
knitr::kable(
digits = 3,
caption = "Life expectancy by continent",
booktabs = TRUE
)
```
```{r, echo=FALSE}
median_africa <- lifeExp_by_continent %>%
filter(continent == "Africa") %>%
pull(median)
mean_africa <- lifeExp_by_continent %>%
filter(continent == "Africa") %>%
pull(mean)
n_countries <- gapminder2007 %>% nrow()
n_countries_africa <- gapminder2007 %>% filter(continent == "Africa") %>% nrow()
```
We now see that there are differences in life expectancies between the continents. For example, while the median life expectancy across all $n = `r n_countries`$ countries in 2007 was `r lifeExp_worldwide$median %>% round(3)`, the median life expectancy across only the $n =`r n_countries_africa `$ countries in Africa was only `r median_africa`.
Let's create an appropriate visualization. One way to compare the life expectancies of countries in different continents would be via a faceted histogram:
```{r catxplot0b, warning=FALSE, fig.cap="Life expectancy in 2007"}
ggplot(gapminder2007, aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color = "white") +
labs(x = "Life expectancy", y = "Number of countries", title = "Life expectancy by continent") +
facet_wrap(~continent, nrow = 2)
```
Another way would be via a boxplot, which is well suited for visualizing one numerical and one categorical variable:
```{r catxplot1, warning=FALSE, fig.cap="Life expectancy in 2007"}
ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
labs(x = "Continent", y = "Life expectancy (years)", title = "Life expectancy by continent")
```
We can now easily compare life expectancies between continents with single horizontal lines. Something that people new to boxplots often forget is that the solid bars in the middle of the boxes correspond to **medians and not means**. Let's modify this plot in two ways. First, let's treat the median life expectancy for Africa of `r median_africa` years as a *baseline for comparison* and mark this value on the y-axis with a horizontal line via `geom_hline()`:
```{r catxplot1b, warning=FALSE, fig.cap="Life expectancy in 2007"}
ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
labs(x = "Continent", y = "Life expectancy (years)", title = "Life expectancy by continent") +
geom_hline(yintercept = 52.93, color = "red")
```
Second, let's recenter the y-axis at Africa's median life expectancy by subtracting `r median_africa` from all values. This allows us to focus on relative differences from Africa's median life expectancy.
```{r catxplot2, warning=FALSE, fig.cap="Difference in life expectancy relative to African median of 52.93 years"}
ggplot(gapminder2007, aes(x = continent, y = lifeExp - 52.93)) +
geom_boxplot() +
geom_hline(yintercept = 52.93 - 52.93, col = "red") +
labs(x = "Continent", y = "Difference in life expectancy vs Africa (years)",
title = "Life expectancy relative to Africa")
```
Using the "eyeball test", we make the following observations:
* Differences in median life expectancy vs. the baseline for comparison, Africa's median life expectancy of `r median_africa` years:
1. The median life expectancy of the Americas is roughly 20 years greater.
1. The median life expectancy of Asia is roughly 20 years greater.
1. The median life expectancy of Europe is roughly 25 years greater.
1. The median life expectancy of Oceania is roughly 27.8 years greater.
* Africa and Asia have much more spread/variation in life expectancy as indicated by the interquartile range (the height of the boxes).
* Oceania has almost no spread/variation, but this might in large part be due to the fact there are only two countries in Oceania.
### Linear regression {#model2table}
In Section \@ref(model1table) we introduced *simple* linear regression, which involves modeling the relationship between a numerical outcome variable $y$ and a numerical explanatory variable $x$. In our life expectancy example, we now have a categorical explanatory variable $x$, `continent`. While we can still fit a regression model, given our categorical explanatory variable we no longer have the concept of a "best-fitting" line, but rather differences relative to a baseline for comparison.
Before we fit our regression model, let's create a table similar to Table \@ref(tab:catxplot0), but
1. Report the mean life expectancy for each continent.
1. Report the difference in mean life expectancy *relative* to Africa's mean life expectancy of `r mean_africa` in the column "mean vs Africa"; this column is simply the "mean" column minus `r mean_africa`.
Think back to your observations from the eyeball test of Figure \@ref(fig:catxplot2) at the end of the last section. The column "mean vs Africa" applies the same idea of comparing a summary statistic to a baseline for comparison, in this case the countries of Africa, but using means instead of medians. (A sketch of how to construct this table yourself follows it below.)
```{r continent-mean-life-expectancies, echo=FALSE}
gapminder2007 %>%
group_by(continent) %>%
summarize(mean = mean(lifeExp)) %>%
mutate(`mean vs Africa` = mean - mean_africa) %>%
knitr::kable(
digits = 3,
caption = "Mean life expectancy by continent",
booktabs = TRUE
)
```
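If you'd like to build such a comparison table yourself, here is a minimal sketch using the `dplyr` verbs from Chapter \@ref(wrangling) (the object name `mean_africa_lifeExp` is ours, chosen just for this illustration):
```{r, eval=FALSE}
# Africa's mean life expectancy, to be used as the baseline for comparison:
mean_africa_lifeExp <- gapminder2007 %>%
  filter(continent == "Africa") %>%
  summarize(mean = mean(lifeExp)) %>%
  pull(mean)
# Mean life expectancy by continent and its difference from the baseline:
gapminder2007 %>%
  group_by(continent) %>%
  summarize(mean = mean(lifeExp)) %>%
  mutate(`mean vs Africa` = mean - mean_africa_lifeExp)
```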
Now, let's use the `get_regression_table()` function we introduced in Section \@ref(model1table) to obtain the *regression table* for `gapminder2007` analysis:
```{r, eval=FALSE}
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)
```
```{r, echo=FALSE}
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
evals_line <- get_regression_table(lifeExp_model) %>%
pull(estimate)
```
```{r catxplot4b, echo=FALSE}
get_regression_table(lifeExp_model) %>%
knitr::kable(
digits = 3,
caption = "Linear regression table",
booktabs = TRUE
)
```
Just as before, we have the `term` and `estimate` columns of interest, but unlike before, we now have 5 rows corresponding to the 5 outputs in our table: an intercept like before, but also `continentAmericas`, `continentAsia`, `continentEurope`, and `continentOceania`. What are these values?
1. `intercept = 54.8` corresponds to the mean life expectancy for Africa. This mean life expectancy is treated as a baseline for comparison for the other continents.
1. `continentAmericas = 18.8` is the difference in mean life expectancies of the Americas minus Africa. Note that 18.8 = 73.6 - 54.8 is the 2nd "mean vs Africa" value in Table \@ref(tab:continent-mean-life-expectancies).
1. `continentAsia = 15.9` is the difference in mean life expectancy of Asia minus Africa. Note that 15.9 = 70.7 - 54.8 is the 3rd "mean vs Africa" value in Table \@ref(tab:continent-mean-life-expectancies).
1. `continentEurope = 22.8` is the difference in mean life expectancy of Europe minus Africa. Note that 22.8 = 77.6 - 54.8 is the 4th "mean vs Africa" value in Table \@ref(tab:continent-mean-life-expectancies).
1. `continentOceania = 25.9` is the difference in mean life expectancy of Oceania minus Africa. Note that 25.9 = 80.7 - 54.8 is the 5th "mean vs Africa" value in Table \@ref(tab:continent-mean-life-expectancies).
Let's generalize this idea a bit. If we fit a linear regression model using an explanatory variable $x$ that has $k$ levels, a regression model will return an intercept and $k-1$ "slope" coefficients of differences in means relative to a baseline mean for comparison. In our case, since there are $k=5$ continents, the regression model returns an intercept corresponding to the baseline for comparison Africa and $k-1=4$ slope coefficients corresponding to the Americas, Asia, Europe, and Oceania. Africa was chosen as the baseline by R for no other reason than it is first alphabetically of the 5 continents. Note you can manually specify which continent to use as baseline, but we leave that to a more advanced course.
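If you're curious to see the $k-1=4$ indicator ("dummy") variables that R creates behind the scenes for the 5-level factor `continent`, you can peek at the model matrix. This is an optional aside and not needed for anything that follows:
```{r, eval=FALSE}
# Each row gets an intercept column of 1's plus four 0/1 indicator columns,
# one for every continent other than the baseline, Africa.
model.matrix(lifeExp ~ continent, data = gapminder2007) %>%
  head()
```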
### Observed/fitted values and residuals {#model2points}
What do fitted values $\widehat{y}$ and residuals $y - \widehat{y}$ correspond to when the explanatory variable $x$ is categorical? Let's investigate these values for the first 10 countries in the `gapminder2007` dataset:
```{r, echo=FALSE}
gapminder2007 %>%
slice(1:10) %>%
knitr::kable(
digits = 3,
caption = "First 10 out of 142 countries",
booktabs = TRUE
)
```
Recall the `get_regression_points()` function we used in Section \@ref(model1points), which returns the fitted values and residuals:
* for all `r nrow(gapminder2007)` countries in our dataset
* for all `r nrow(gapminder2007)` rows in the `gapminder2007` data frame
* for all `r nrow(gapminder2007)` data points used in the five boxplots in Figure \@ref(fig:catxplot2)
```{r, eval=FALSE}
regression_points <- get_regression_points(lifeExp_model)
regression_points
```
```{r, echo=FALSE}
regression_points <- get_regression_points(lifeExp_model)
regression_points %>%
slice(1:10) %>%
knitr::kable(
digits = 3,
caption = "Regression points (First 10 out of 142 countries)",
booktabs = TRUE
)
```
Notice:
* The fitted values $\widehat{\text{lifeexp}}$. Countries in Africa have the same fitted value of 54.8, which is the mean life expectancy of Africa; countries in Asia have the same fitted value of 70.7, which is the mean life expectancy of Asia; this similarly holds for countries in the Americas, Europe, and Oceania.
* The `residual` column is simply $y - \widehat{y}$ = `lifeexp - lifeexp_hat`. These values can be interpreted as a particular country's deviation from its continent's mean life expectancy. For example, the first row of this dataset corresponds to Afghanistan, and the residual of -26.9 = 43.8 - 70.7 is Afghanistan's life expectancy minus the mean life expectancy of all Asian countries (see the quick check after this list).
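As a quick check of this interpretation, we can recompute Afghanistan's residual "by hand" from the raw data (the object names below are ours, chosen just for this illustration):
```{r, eval=FALSE}
# Afghanistan's life expectancy:
afghanistan_lifeExp <- gapminder2007 %>%
  filter(country == "Afghanistan") %>%
  pull(lifeExp)
# Mean life expectancy across all Asian countries:
asia_mean_lifeExp <- gapminder2007 %>%
  filter(continent == "Asia") %>%
  summarize(mean_lifeExp = mean(lifeExp)) %>%
  pull(mean_lifeExp)
# Their difference matches the residual of roughly -26.9 reported above:
afghanistan_lifeExp - asia_mean_lifeExp
```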
### Residual analysis {#model2residuals}
Recall our discussion on residuals from Section \@ref(model1residuals) where our goal was to investigate whether or not there was a *systematic pattern* to the residuals, as ideally since residuals can be thought of as error, there should be no such pattern. While there are many ways to do such residual analysis, we focused on two approaches based on visualizations.
1. A plot with residuals on the y-axis and the predictor (in this case continent) on the x-axis
1. A histogram of all residuals
First, let's plot the residuals versus continent in Figure \@ref(fig:catxplot7), but let's also plot all `r nrow(gapminder2007)` points with a little horizontal random jitter by setting the `width = 0.5` parameter in `geom_jitter()`:
```{r catxplot7, warning=FALSE, fig.cap="Plot of residuals over continent"}
ggplot(regression_points, aes(x = continent, y = residual)) +
geom_jitter(width = 0.5) +
labs(x = "Continent", y = "Residual") +
geom_hline(yintercept = 0, col = "blue")
```
We observe:
1. While not perfectly balanced above and below the line $y=0$, there seems to
be a rough balance of both positive and negative residuals for all 5 continents.
1. However, there is one clear outlier in Asia. It has the smallest residual,
hence also has the smallest life expectancy in Asia.
Let's investigate the 5 countries in Asia with the shortest life expectancy:
```{r, eval=FALSE}
gapminder2007 %>%
filter(continent == "Asia") %>%
arrange(lifeExp)
```
```{r, echo=FALSE}
gapminder2007 %>%
filter(continent == "Asia") %>%
arrange(lifeExp) %>%
slice(1:5) %>%
knitr::kable(
digits = 3,
caption = "Countries in Asia with shortest life expectancy",
booktabs = TRUE
)
```
This is the outlier we identified earlier: Afghanistan, with a residual of -26.9. Unfortunately, given recent geopolitical turmoil, individuals who live in Afghanistan have a drastically lower life expectancy.
Second, let's look at a histogram of all `r nrow(gapminder2007)` values of
residuals in Figure \@ref(fig:catxplot8). In this case, the residuals form a
rather nice bell-shape, although there are a couple of very low and very high
values at the tails. As we said previously, searching for patterns in residuals
can be somewhat subjective, but ideally we hope there are no "drastic" patterns.
```{r catxplot8, warning=FALSE, fig.cap="Histogram of residuals"}
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 5, color = "white") +
labs(x = "Residual")
```
### Your turn
Repeat this same analysis but using GDP per capita `gdpPercap` as the outcome variable. Recall the four steps we've been following:
1. Perform an exploratory data analysis (EDA)
1. Fit a linear regression model and get information about the model
1. Get information on all $n$ data points considered in this analysis.
1. Perform a residual analysis and look for any systematic patterns in the residuals.
## Other topics
### Correlation coefficient {#correlationcoefficient}
Let's replot Figure \@ref(fig:correlation1), but now consider a broader range of correlation coefficient values in Figure \@ref(fig:correlation2).
```{r correlation2, echo=FALSE, fig.cap="Different Correlation Coefficients"}
correlation <- c(-0.9999, -0.9, -0.75, -0.3, 0, 0.3, 0.75, 0.9, 0.9999)
n_sim <- 100
values <- NULL
for(i in 1:length(correlation)){
rho <- correlation[i]
sigma <- matrix(c(5, rho * sqrt(50), rho * sqrt(50), 10), 2, 2)
sim <- rmvnorm(
n = n_sim,
mean = c(20,40),
sigma = sigma
) %>%
as_data_frame() %>%
mutate(correlation = round(rho,2))
values <- bind_rows(values, sim)
}
ggplot(data = values, mapping = aes(V1, V2)) +
geom_point() +
facet_wrap(~ correlation, ncol = 3) +
labs(x = "x", y = "y") +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank()
)
```
As we suggested in Section \@ref(model1EDA), interpreting correlation coefficients that are not close to the extreme values of -1 and 1 can be subjective. To develop your sense of correlation coefficients, we suggest you play the following 80's-style video game called "Guess the correlation"! Click on the image below:
<center><a target="_blank" href="http://guessthecorrelation.com/"><img src="images/guess_the_correlation.png" title="Guess the correlation" width="600"/></a></center>
What if instead we looked at the correlation coefficient between `Income` and
credit card `Balance`, where `Income` was in dollars and not thousands of dollars,
i.e. we multiply `Income` by 1000?
```{r cor-credit, eval=FALSE}
library(ISLR)
Credit %>%
select(Balance, Limit, Income) %>%
mutate(Income = Income * 1000) %>%
cor()
```
```{r cor-credit-2, echo=FALSE}
library(ISLR)
Credit %>%
select(Balance, Limit, Income) %>%
mutate(Income = Income * 1000) %>%
cor() %>%
knitr::kable(
digits = 3,
caption = "Correlation between income (in $) and credit card balance",
booktabs = TRUE
)
```
We see it is the same! We say that the correlation coefficient is invariant to (positive) linear transformations, as the short demonstration below also shows. In other words:
* The correlation between $x$ and $y$ will be the same as
* The correlation between $a\times x + b$ and $y$, where $a$ and $b$ are numerical values (real numbers in mathematical terms) and $a$ is positive.
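Here is a minimal demonstration of this invariance using the `Credit` data from the `ISLR` package (the specific multiplier and shift, 1000 and 7, are arbitrary choices of ours):
```{r, eval=FALSE}
library(ISLR)
# The correlation is unchanged by rescaling and shifting Income with a
# positive multiplier:
cor(Credit$Balance, Credit$Income)
cor(Credit$Balance, 1000 * Credit$Income + 7)
```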
### Correlation is not necessarily causation {#correlation-is-not-causation}
Causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of potential confounding variables. Both of these approaches attempt either to remove all confounding variables or to take them into account as best they can, and only focus on the behavior of an outcome variable in the presence of the levels of the other variable(s). Be careful as you read studies to make sure that the writers aren't falling into this fallacy of correlation implying causation. If you spot one, you may want to send them a link to [Spurious Correlations](http://www.tylervigen.com/spurious-correlations).
### Best fitting line {#leastsquares}
Regression lines are also known as "best-fitting lines". But what do we mean by best? Let's unpack the criterion that regression uses to determine "best". Recall the plot in Figure \@ref(fig:numxplot5) where for an instructor
with a beauty average score of $x=7.333$:
* The observed value $y=4.9$ was marked with a red circle
* The fitted value $\widehat{y} = 4.369$ on the regression line was marked with a red square
* The residual $y-\widehat{y} = 4.9-4.369 = 0.531$ was the length of the blue arrow.
Let's do this for another arbitrarily chosen instructor whose beauty score was
$x=2.333$. The residual in this case is $2.7 - 4.036 = -1.336$.
```{r echo=FALSE}
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
select(score, bty_avg)
index <- which(evals$bty_avg == 2.333 & evals$score == 2.7)
target_point <- get_regression_points(score_model) %>%
slice(index)
x <- target_point$bty_avg
y <- target_point$score
y_hat <- target_point$score_hat
resid <- target_point$residual
best_fit_plot <- best_fit_plot +
annotate("point", x = x, y = y, col = "red", size = 3) +
annotate("point", x = x, y = y_hat, col = "red", shape = 15, size = 3) +
annotate("segment", x = x, xend = x, y = y, yend = y_hat, color = "blue",
arrow = arrow(type = "closed", length = unit(0.02, "npc")))
best_fit_plot
```
Let's do this for another arbitrarily chosen instructor whose beauty score was
$x=3.667$. The residual in this case is $4.4 - 4.125 = 0.275$.
```{r, echo=FALSE}
index <- which(evals$bty_avg == 3.667 & evals$score == 4.4)
score_model <- lm(score ~ bty_avg, data = evals)
target_point <- get_regression_points(score_model) %>%
slice(index)
x <- target_point$bty_avg
y <- target_point$score
y_hat <- target_point$score_hat
resid <- target_point$residual