-
Notifications
You must be signed in to change notification settings - Fork 42
/
20-regression.qmd
834 lines (576 loc) · 35.8 KB
/
20-regression.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
# Regression {#sec-regression}
## Case study: is height hereditary?
```{r, message=FALSE, warning=FALSE, cache = FALSE}
library(tidyverse)
library(HistData)
set.seed(1983)
galton_heights <- GaltonFamilies |>
filter(gender == "male") |>
group_by(family) |>
sample_n(1) |>
ungroup() |>
select(father, childHeight) |>
rename(son = childHeight)
```
Suppose we were asked to summarize the father and son data. Since both distributions are well approximated by the normal distribution, we could use the two averages and two standard deviations as summaries:
```{r, message=FALSE, warning=FALSE}
galton_heights |>
summarize(mean(father), sd(father), mean(son), sd(son))
```
However, this summary fails to describe an important characteristic of the data: the trend that the taller the father, the taller the son.
```{r scatterplot, fig.height = 3, fig.width = 3, out.width="40%"}
galton_heights |> ggplot(aes(father, son)) +
geom_point(alpha = 0.5)
```
## The correlation coefficient {#sec-corr-coef}
The correlation coefficient is defined for a list of pairs $(x_1, y_1), \dots, (x_n,y_n)$ as the average of the product of the standardized values:
$$
\rho = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i-\mu_x}{\sigma_x} \right)\left( \frac{y_i-\mu_y}{\sigma_y} \right)
$$
```{r, eval=FALSE}
rho <- mean(scale(x) * scale(y))
```
Let's see why this makes sense
```{r}
#| echo: false
n <- 250
height <- rnorm(n, 69, 3)
weight <- 180 + 10*(height - mean(height)) + rnorm(n, 0, 20)
plot(height, weight)
abline(v = mean(height), h = mean(weight), lty = 1)
rect(0, 0, mean(height), mean(weight), col = rgb(0, 0, 1, alpha = 0.25))
rect(mean(height), mean(weight), 10^5, 10^5, col = rgb(0, 0, 1, alpha = 0.25))
rect(mean(height), 0, 10^5, mean(weight), col = rgb(1, 0, 0, alpha = 0.25))
rect(0, mean(weight), mean(height), 10^5, col = rgb(1, 0, 0, alpha = 0.25))
abline(v = mean(height) + seq(-3,3)*sd(height), h = mean(weight) + seq(-3,3)*sd(weight), lty = 2)
```
The correlation coefficient is always between -1 and 1. We can show this mathematically: consider that we can't have higher correlation than when we compare a list to itself (perfect correlation) and in this case the correlation is:
$$
\rho = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i-\mu_x}{\sigma_x} \right)^2 =
\frac{1}{\sigma_x^2} \frac{1}{n} \sum_{i=1}^n \left( x_i-\mu_x \right)^2 =
\frac{1}{\sigma_x^2} \sigma^2_x =
1
$$
A similar derivation, but with $x$ and its exact opposite, proves the correlation has to be bigger or equal to -1.
For other pairs, the correlation is in between -1 and 1. The correlation, computed with the function `cor`, between father and son's heights is about 0.5:
```{r}
galton_heights |> summarize(r = cor(father, son)) |> pull(r)
```
:::{callout-warning}
For reasons similar to those explained in Section @sec-population-sd for the standard deviation, `cor(x,y)` divides by `length(x)-1` rather than `length(x)`.
:::
To see what data looks like for different values of $\rho$, here are six examples of pairs with correlations ranging from -0.9 to 0.99:
```{r what-correlation-looks-like, echo=FALSE}
n <- 250
cors <- c(-0.9,-0.5,0,0.5,0.9,0.99)
sim_data <- lapply(cors,function(r) MASS::mvrnorm(n,c(0,0), matrix(c(1,r,r,1),2,2)))
sim_data <- Reduce(rbind, sim_data)
sim_data <- cbind( rep(cors, each = n), sim_data)
colnames(sim_data) <- c("r","x","y")
as.data.frame(sim_data) |>
ggplot(aes(x,y)) +
facet_wrap(~r) +
geom_point(alpha = 0.5) +
geom_vline(xintercept = 0,lty = 2) +
geom_hline(yintercept = 0,lty = 2)
```
### Sample correlation is a random variable
Let's consider our Galton heights to be the population and take random samples of size 25:
```{r}
r <- sample_n(galton_heights, 25, replace = TRUE) |>
summarize(r = cor(father, son)) |> pull(r)
```
`r`` is a random variable. We can run a Monte Carlo simulation to see its distribution:
```{r sample-correlation-distribution}
B <- 1000
N <- 25
r <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) |>
summarize(r = cor(father, son)) |>
pull(r)
})
hist(r, breaks = 20)
```
We see that the expected value of `r`` is the population correlation:
```{r}
mean(r)
```
and that it has a relatively high standard error relative to the range of values `R` can take:
```{r}
sd(r)
```
So, when interpreting correlations, remember that correlations derived from samples are estimates containing uncertainty.
Does CLT apply?
The `SE` can be shown to be
$$
\sqrt{\frac{1-r^2}{N-2}}
$$
In our example, $N=25$ does not seem to be large enough to make the approximation a good one:
```{r small-sample-correlation-not-normal, out.width="40%"}
ggplot(aes(sample = r), data = data.frame(r)) +
stat_qq() +
geom_abline(intercept = mean(r), slope = sqrt((1 - mean(r)^2)/(N - 2)))
```
If you increase $N$, you will see the distribution converging to normal.
```{r sample-correlation-distribution-2}
B <- 1000
N <- 50
r <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) |>
summarize(r = cor(father, son)) |>
pull(r)
})
ggplot(aes(sample = r), data = data.frame(r)) +
stat_qq() +
geom_abline(intercept = mean(r), slope = sqrt((1 - mean(r)^2)/(N - 2)))
```
### Correlation is not always a useful summary {#sec-ascombe}
Correlation is not always a good summary of the relationship between two variables. The following four artificial datasets, referred to as Anscombe's quartet, famously illustrate this point. All these pairs have a correlation of 0.82:
```{r ascombe-quartet, echo=FALSE}
anscombe |> mutate(row = seq_len(n())) |>
gather(name, value, -row) |>
separate(name, c("axis", "group"), sep = 1) |>
spread(axis, value) |> select(-row) |>
ggplot(aes(x,y)) +
facet_wrap(~group) +
geom_smooth(method = "lm", fill = NA, fullrange = TRUE) +
geom_point()
```
Here are some other fun examples:
```{r}
library(datasauRus)
ggplot(datasaurus_dozen, aes(x = x, y = y, color = dataset)) +
geom_point() +
facet_wrap(~dataset)
```
Correlation is only meaningful in a particular context. To help us understand when it is that correlation is meaningful as a summary statistic, we will return to the example of predicting a son's height using his father's height. This will help motivate and define linear regression. We start by demonstrating how correlation can be useful for prediction.
## Conditional expectations {#sec-conditional-expectation}
Suppose we are asked to guess the height of a randomly selected son and we don't know his father's height. How do we do this?
Mathematically, we can show that the conditional expectation
$$
f(x) = \mbox{E}(Y \mid X = x)
$$
minimizes the expected squared error (MSE):
$$
\mbox{E}[\{Y -f(X)\}^2]
$$
with $x$ representing the fixed value that defines that subset, for example 72 inches.
We denote the standard deviation of the strata with
$$
\mbox{SD}(Y \mid X = x) = \sqrt{\mbox{Var}(Y \mid X = x)}
$$
Because the conditional expectation $E(Y\mid X=x)$ is the best predictor for the random variable $Y$ for an individual in the strata defined by $X=x$, many data science challenges reduce to estimating this quantity. The conditional standard deviation quantifies the precision of the prediction.
However, we often have a limited number of points to estimate this conditional expectation. For example
```{r}
sum(galton_heights$father == 72)
```
fathers that are exactly 72-inches. If we change the number to 72.5, we get even fewer data points:
```{r}
sum(galton_heights$father == 72.5)
```
A practical way to improve these estimates of the conditional expectations, is to define strata of with similar values of $x$. In our example, we can round father heights to the nearest inch and assume that they are all 72 inches. If we do this, we end up with the following prediction for the son of a father that is 72 inches tall:
```{r}
conditional_avg <- galton_heights |>
filter(round(father) == 72) |>
summarize(avg = mean(son)) |>
pull(avg)
conditional_avg
```
Note that a 72-inch father is taller than average but smaller than 72. We see the same for someone slightly shorted than average:
```{r}
conditional_avg <- galton_heights |>
filter(round(father) == 67) |>
summarize(avg = mean(son)) |>
pull(avg)
conditional_avg
```
The sons of have **regressed** some to the average height. We notice that the reduction in how many SDs taller is about 0.5, which happens to be the correlation. As we will see in a later section, this is not a coincidence.
If we want to make a prediction of any height, not just 72, we could apply the same approach to each strata. Stratification followed by boxplots lets us see the distribution of each group:
```{r boxplot-1, fig.height = 3, fig.width = 3, out.width="40%"}
galton_heights |>
mutate(father_strata = factor(round(father))) |>
ggplot(aes(father_strata, son)) +
geom_boxplot() +
geom_point()
```
Not surprisingly, the centers of the groups are increasing with height. Furthermore, these centers appear to follow a linear relationship. Below we plot the averages of each group. If we take into account that these averages are random variables with standard errors, the data is consistent with these points following a straight line:
```{r conditional-averages-follow-line, echo=FALSE, fig.height = 3, fig.width = 3, out.width="40%"}
galton_heights |>
mutate(father = round(father)) |>
group_by(father) |>
summarize(son_conditional_avg = mean(son)) |>
ggplot(aes(father, son_conditional_avg)) +
geom_point()
```
The fact that these conditional averages follow a line is not a coincidence. In the next section, we explain that the line these averages follow is what we call the *regression line*, which improves the precision of our estimates. However, it is not always appropriate to estimate conditional expectations with the regression line so we also describe Galton's theoretical justification for using the regression line.
## The regression line
If we are predicting a random variable $Y$ knowing the value of another $X=x$ using a regression line, then we predict that for every standard deviation, $\sigma_X$, that $x$ increases above the average $\mu_X$, our prediction $\hat{Y}$ increase $\rho$ standard deviations $\sigma_Y$ above the average $\mu_Y$ with $\rho$ the correlation between $X$ and $Y$. The formula for the regression is therefore:
$$
\left( \frac{\hat{Y}-\mu_Y}{\sigma_Y} \right) = \rho \left( \frac{x-\mu_X}{\sigma_X} \right)
$$
We can rewrite it like this:
$$
\hat{Y} = \mu_Y + \rho \left( \frac{x-\mu_X}{\sigma_X} \right) \sigma_Y
$$
Note that if the correlation is positive and lower than 1, our prediction is closer, in standard units, to the average height than the value used to predict, $x$, is to the average of the $x$s. This is why we call it *regression*: the son regresses to the average height. In fact, the title of Galton's paper was: *Regression toward mediocrity in hereditary stature*. To add regression lines to plots, we will need the above formula in the form:
$$
\hat{Y} = b + mx \mbox{ with slope } m = \rho \frac{\sigma_y}{\sigma_x} \mbox{ and intercept } b=\mu_y - m \mu_x
$$
Here we add the regression line to the original data:
```{r regression-line, fig.height = 3, fig.width = 3, out.width="40%"}
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
galton_heights |>
ggplot(aes(father, son)) +
geom_point(alpha = 0.5) +
geom_abline(slope = r * s_y/s_x, intercept = mu_y - r * s_y/s_x * mu_x)
```
The regression formula implies that if we first standardize the variables, that is subtract the average and divide by the standard deviation, then the regression line has intercept 0 and slope equal to the correlation $\rho$. You can make same plot, but using standard units like this:
```{r regression-line-standard-units, fig.height = 3, fig.width = 3, out.width="40%"}
galton_heights |>
ggplot(aes(scale(father), scale(son))) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = r)
```
## Regression improves precision
Let's compare the two approaches to prediction that we have presented:
1. Round fathers' heights to closest inch, stratify, and then take the average.
2. Compute the regression line and use it to predict.
We use a Monte Carlo simulation sampling $N=50$ families:
```{r}
B <- 1000
N <- 50
set.seed(1983)
conditional_avg <- replicate(B, {
dat <- sample_n(galton_heights, N)
dat |> filter(round(father) == 72) |>
summarize(avg = mean(son)) |>
pull(avg)
})
regression_prediction <- replicate(B, {
dat <- sample_n(galton_heights, N)
mu_x <- mean(dat$father)
mu_y <- mean(dat$son)
s_x <- sd(dat$father)
s_y <- sd(dat$son)
r <- cor(dat$father, dat$son)
mu_y + r*(72 - mu_x)/s_x*s_y
})
```
Although the expected value of these two random variables is about the same:
```{r}
mean(conditional_avg, na.rm = TRUE)
mean(regression_prediction)
```
The standard error for the regression prediction is substantially smaller:
```{r}
sd(conditional_avg, na.rm = TRUE)
sd(regression_prediction)
```
The regression line is therefore much more stable than the conditional mean. There is an intuitive reason for this. The conditional average is computed on a relatively small subset: the fathers that are about 72 inches tall. In fact, in some of the permutations we have no data, which is why we use `na.rm=TRUE`. The regression always uses all the data.
## Bivariate normal distribution
Correlation and the regression slope are a widely used summary statistic, but they are often misused or misinterpreted. Anscombe's examples provide over-simplified cases of dataset in which summarizing with correlation would be a mistake. But there are many more real-life examples.
The main way we motivate the use of correlation involves what is called the *bivariate normal distribution*.
When a pair of random variables is approximated by the bivariate normal distribution, scatterplots look like ovals. As we saw in Section @sec-corr-coef), they can be thin (high correlation) or circle-shaped (no correlation.
A more technical way to define the bivariate normal distribution is the following: if $X$ is a normally distributed random variable, $Y$ is also a normally distributed random variable, and the conditional distribution of $Y$ for any $X=x$ is approximately normal, then the pair is approximately bivariate normal. When three or more variables have the property that each pair is bivariate normal, we say the variables follow a *multivariate* normal distribution or that they are *jointly normal*.
If we think the height data is well approximated by the bivariate normal distribution, then we should see the normal approximation hold for each strata. Here we stratify the son heights by the standardized father heights and see that the assumption appears to hold:
```{r qqnorm-of-strata}
galton_heights |>
mutate(z_father = round((father - mean(father)) / sd(father))) |>
filter(z_father %in% -2:2) |>
ggplot() +
stat_qq(aes(sample = son)) +
facet_wrap( ~ z_father)
```
Now we come back to defining correlation. Galton used mathematical statistics to demonstrate that, when two variables follow a bivariate normal distribution, computing the regression line is equivalent to computing conditional expectations. We don't show the derivation here, but we can show that under this assumption, for any given value of $x$, the expected value of the $Y$ in pairs for which $X=x$ is:
$$
\mbox{E}(Y | X=x) = \mu_Y + \rho \frac{x-\mu_X}{\sigma_X}\sigma_Y
$$
This is the regression line, with slope $$\rho \frac{\sigma_Y}{\sigma_X}$$ and intercept $\mu_y - m\mu_X$. It is equivalent to the regression equation we showed earlier which can be written like this:
$$
\frac{\mbox{E}(Y \mid X=x) - \mu_Y}{\sigma_Y} = \rho \frac{x-\mu_X}{\sigma_X}
$$
This implies that, if our data is approximately bivariate, the regression line gives the conditional probability. Therefore, we can obtain a much more stable estimate of the conditional expectation by finding the regression line and using it to predict.
In summary, if our data is approximately bivariate, then the conditional expectation, the best prediction of $Y$ given we know the value of $X$, is given by the regression line.
## Variance explained
The bivariate normal theory also tells us that the standard deviation of the *conditional* distribution described above is:
$$
\mbox{SD}(Y \mid X=x ) = \sigma_Y \sqrt{1-\rho^2}
$$
To see why this is intuitive, notice that without conditioning, $\mbox{SD}(Y) = \sigma_Y$, we are looking at the variability of all the sons. But once we condition, we are only looking at the variability of the sons with a tall, 72-inch, father. This group will all tend to be somewhat tall so the standard deviation is reduced.
Specifically, it is reduced to $\sqrt{1-\rho^2} = \sqrt{1 - 0.25}$ = 0.87 of what it was originally. We could say that father heights "explain" 13% of the variability observed in son heights.
The statement "$X$ explains such and such percent of the variability" is commonly used in academic papers. In this case, this percent actually refers to the variance (the SD squared). So if the data is bivariate normal, the variance is reduced by $1-\rho^2$, so we say that $X$ explains $1- (1-\rho^2)=\rho^2$ (the correlation squared) of the variance.
But it is important to remember that the "variance explained" statement only makes sense when the data is approximated by a bivariate normal distribution.
## There are two regression lines
We computed a regression line to predict the son's height from father's height. We used these calculations:
```{r}
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1*mu_x
```
which gives us the function $\mbox{E}(Y\mid X=x) =$ `r round(b_1, 1)` + `r round(m_1, 2)` $x$.
What if we want to predict the father's height based on the son's? It is important to know that this is not determined by computing the inverse function: $x = \{ \mbox{E}(Y\mid X=x) -$ `r round(b_1, 1)` $\} /$ `r round(m_1, 1)`.
We need to compute $\mbox{E}(X \mid Y=y)$. Since the data is approximately bivariate normal, the theory described above tells us that this conditional expectation will follow a line with slope and intercept:
```{r}
m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2 * mu_y
```
So we get $\mbox{E}(X \mid Y=y) =$ `r round(b_2, 1)` + `r round(m_2, 2)`y. Again we see regression to the average: the prediction for the father is closer to the father average than the son heights $y$ is to the son average.
Here is a plot showing the two regression lines, with blue for the predicting son heights with father heights and red for predicting father heights with son heights:
```{r two-regression-lines, fig.height = 3, fig.width = 3, out.width="40%"}
galton_heights |>
ggplot(aes(father, son)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = b_1, slope = m_1, col = "blue") +
geom_abline(intercept = -b_2/m_2, slope = 1/m_2, col = "red")
```
## Linear models
* Let's learn about the connection between regression and _linear models_.
* If data is bivariate normal then the conditional expectations follow the regression line. This resul is derived, not assumed
* In practice it is common to explicitly write down a model that describes the relationship between two or more variables using a *linear model*.
* _linear_ here does not refer to lines exclusively, but rather to the fact that the conditional expectation is a linear combination of known quantities. For example,
$$2 + 3x - 4y + 5z$$
is a linear combination of $x$, $y$, and $z$.
* If we write
$$
Y = \beta_0 + \beta_1 x + \varepsilon
$$
then if we assume $\varepsilon$ follows a normal distribution with expected value 0 and fixed standard deviation, then $Y$ has the same properties as the regression setup gave us.
:::{.callout-note}
In statistical textbooks, the $\varepsilon$s are referred to as "errors," which originally represented measurement errors in the initial applications of these models. These errors were associated with inaccuracies in measuring height, weight, or distance. However, the term "error" is now used more broadly, even when the $\varepsilon$s do not necessarily signify an actual error. For instance, in the case of height, if someone is 2 inches taller than expected based on their parents' height, those 2 inches should not be considered an error. Despite its lack of descriptive accuracy, the term "error" is employed to elucidate the unexplained variability in the model, unrelated to other included terms.
:::
* Linear model for Galton's data, we would denote the $N$ observed father heights with $x_1, \dots, x_n$, then we model the $N$ son heights we are trying to predict with:
$$
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \, i=1,\dots,N.
$$
*We can further assume that $\varepsilon_i$ are independent from each other and all have the same standard deviation.
* To have a useful model for prediction, we need $\beta_0$ and $\beta_1$. We estimate these from the data.
* In practice, linear models are just assumed without necessarily assuming normality: the distribution of the $\varepsilon$s is not necessarily specified. Nevertheless, if your data is bivariate normal, the above linear model holds. If your data is not bivariate normal, then you will need to have other ways of justifying the model.
* One reason linear models are popular is that they are _interpretable_.
* In the case of Galton's data, we can interpret the data like this: due to inherited genes, the son's height prediction grows by $\beta_1$ for each inch we increase the father's height $x$.
* Because not all sons with fathers of height $x$ are of equal height, we need the term $\varepsilon$, which explains the remaining variability.
* This remaining variability includes the mother's genetic effect, environmental factors, and other biological randomness.
* Given how we wrote the model above, the intercept $\beta_0$ is not very interpretable as it is the predicted height of a son with a father with no height. Due to regression to the mean, the prediction will usually be a bit larger than 0.
* To make the slope parameter more interpretable, we can rewrite the model slightly as:
$$
Y_i = \beta_0 + \beta_1 (x_i - \bar{x}) + \varepsilon_i, \, i=1,\dots,N
$$
* In this case $\beta_0$ represents the height when $x_i = \bar{x}$, which is the height of the son of an average father.
## Least Squares Estimates {#sec-lse}
The standard approach to estimate the $\beta$s is to find the values that minimize the distance of the fitted model to the data:
$$
RSS = \sum_{i=1}^n \left\{ y_i - \left(\beta_0 + \beta_1 x_i \right)\right\}^2
$$
* This quantity is called the residual sum of squares (RSS).
* Once we find the values that minimize the RSS, we will call the values the least squares estimates (LSE) and denote them with $\hat{\beta}_0$ and $\hat{\beta}_1$.
* Let's demonstrate this with the previously defined dataset:
```{r, message=FALSE}
library(HistData)
set.seed(1983)
galton_heights <- GaltonFamilies |>
filter(gender == "male") |>
group_by(family) |>
sample_n(1) |>
ungroup() |>
select(father, childHeight) |>
rename(son = childHeight)
```
Let's write a function that computes the RSS for any pair of values $\beta_0$ and $\beta_1$.
```{r}
rss <- function(beta0, beta1, data){
resid <- galton_heights$son - (beta0 + beta1*galton_heights$father)
return(sum(resid^2))
}
```
So for any pair of values, we get an RSS. Here is a plot of the RSS as a function of $\beta_1$ when we keep the $\beta_0$ fixed at 25.
```{r rss-versus-estimate}
beta1 = seq(0, 1, length = nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
results |> ggplot(aes(beta1, rss)) + geom_line() +
geom_line(aes(beta1, rss))
```
We can see a clear minimum for $\beta_1$ at around 0.65. However, this minimum for $\beta_1$ is for when $\beta_0 = 25$, a value we arbitrarily picked. We don't know if (25, 0.65) is the pair that minimizes the equation across all possible pairs. We don't need trial and error because we can use calculus.
## The `lm` function
In R, we can obtain the least squares estimates using the `lm` function. To fit the model:
$$
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$
with $Y_i$ the son's height and $x_i$ the father's height, we can use this code to obtain the least squares estimates.
```{r}
fit <- lm(son ~ father, data = galton_heights)
fit$coef
```
* The most common way we use `lm` is by using the character `~` to let `lm` know which is the variable we are predicting (left of `~`) and which we are using to predict (right of `~`).
* The intercept is added automatically to the model that will be fit. To fit a model with no intercept you need to use `-1` in the rightside of the formula.
* The object `fit` includes more information about the fit. We can use the function `summary` to extract more of this information (not shown):
```{r}
summary(fit)
```
* To understand some of the information included in this summary we need to remember that the LSE are random variables. Mathematical statistics gives us some ideas of the distribution of these random variables.
## LSE are random variables
The LSE is derived from the data $y_1,\dots,y_N$, which are a realization of random variables $Y_1, \dots, Y_N$. This implies that our estimates are random variables. Here is a Monte Carlo simulation demonstrating it:
```{r}
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) |>
lm(son ~ father, data = _) |>
coef()
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])
```
We can see the variability of the estimates by plotting their distributions:
```{r lse-distributions, out.width="100%", fig.width=6, fig.height=3, echo=FALSE}
library(gridExtra)
p1 <- lse |> ggplot(aes(beta_0)) +
geom_histogram(binwidth = 5, color = "black")
p2 <- lse |> ggplot(aes(beta_1)) +
geom_histogram(binwidth = 0.1, color = "black")
grid.arrange(p1, p2, ncol = 2)
```
* The reason these look normal is because the central limit theorem applies here as well.
* The standard errors are a bit complicated to compute, but mathematical theory does allow us to compute them and they are included in the summary provided by the `lm` function.
* Here it is for one of our simulated data sets:
```{r}
sample_n(galton_heights, N, replace = TRUE) |>
lm(son ~ father, data = _) |>
summary() |>
coef()
```
/8 You can see that the standard errors estimates reported by the `summary` are close to the standard errors from the simulation:
```{r}
lse |> summarize(se_0 = sd(beta_0), se_1 = sd(beta_1))
```
The `summary` function also reports t-statistics (`t value`) and p-values (`Pr(>|t|)`). The t-statistic is not actually based on the central limit theorem but rather on the assumption that the $\varepsilon$s follow a normal distribution. Under this assumption, mathematical theory tells us that the LSE divided by their standard error, $\hat{\beta}_0 / \hat{\mbox{SE}}(\hat{\beta}_0 )$ and $\hat{\beta}_1 / \hat{\mbox{SE}}(\hat{\beta}_1 )$, follow a t-distribution with $N-p$ degrees of freedom, with $p$ the number of parameters in our model. In the case of height $p=2$, the two p-values are testing the null hypothesis that $\beta_0 = 0$ and $\beta_1=0$, respectively.
:::{.callout-warning}
Although we do not show examples in this lecture, hypothesis testing with regression models is commonly used in epidemiology and economics to make statements such as "the effect of A on B was statistically significant after adjusting for X, Y, and Z". However, several assumptions have to hold for these statements to be true.
:::
## Predicted values are random variables
Once we fit our model, we can obtain prediction of $Y$ by plugging in the estimates into the regression model. For example, if the father's height is $x$, then our prediction $\hat{Y}$ for the son's height will be:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
When we plot $\hat{Y}$ versus $x$, we see the regression line.
The **ggplot2** layer `geom_smooth(method = "lm")` that we previously used plots $\hat{Y}$ and surrounds it by confidence intervals:
```{r father-son-regression}
galton_heights |> ggplot(aes(son, father)) +
geom_point() +
geom_smooth(method = "lm")
```
* The R function `predict` takes an `lm` object as input and returns the prediction. If requested, the standard errors and other information from which we can construct confidence intervals is provided:
```{r father-son-predictor}
fit <- galton_heights |> lm(son ~ father, data = _)
y_hat <- predict(fit, se.fit = TRUE)
names(y_hat)
```
## Diagnostic plots
When the linear model is assumed rather than derived, all interpretations depend on the usefulness of the model. The `lm` function will fit the model and return summaries even when the model is wrong and unuseful.
* Visually inspecting residuals, defined as the difference between observed values and predicted values
$$
r = Y - \hat{Y} = Y - \left(\hat{\beta}_0 - \hat{\beta}_1 x_i\right),
$$
and summaries of the residuals, is a powerful way to diagnose if the model is useful.
* Note that the residuals can be thought of estimates of the errors since
$$
\varepsilon = Y - \left(\beta_0 + \beta_1 x_i \right).
$$
In fact, residuals are often denoted as $\hat{\varepsilon}$. This motivates several _diagnostic_ plots. Becasue we obervere, $r$ but don't observe $\varepsilon$, we based the plots on the residuals.
1. Because the errors are assumed not to depend on the expected value of $Y$, a plot of $r$ versus the fitted values $\hat{Y}$ should show no relationship.
2. In cases in which we assume the errors follow a normal distribtuion a qqplot of standardized $r$ should fall on a line when plotted against theoretical quantiles.
3. Because we assume the standard deviation of the errors is constant, if we plot the absolute value of the residuals, it should appear constant.
We prefer plots rather than summaries based on, for example, correlation because, as noted in Section @ascombe, correlation is not always the best summary of association. The function `plot` applied to an `lm` object automatically plots these.
```{r diagnotics}
#| eval: false
plot(fit, which = 1:3)
```
This function can produce six different plots, and the argument `which` let's you specify which you want to see. You can learn more by reading the `plot.lm` help file.
## The regression fallacy
Wikipedia defines the *sophomore slump* as:
> A sophomore slump or sophomore jinx or sophomore jitters refers to an instance in which a second, or sophomore, effort fails to live up to the standards of the first effort. It is commonly used to refer to the apathy of students (second year of high school, college or university), the performance of athletes (second season of play), singers/bands (second album), television shows (second seasons) and movies (sequels/prequels).
In Major League Baseball, the rookie of the year (ROY) award is given to the first-year player who is judged to have performed the best. The *sophmore slump* phrase is used to describe the observation that ROY award winners don't do as well during their second year. For example, [this Fox Sports article](https://web.archive.org/web20160815063904/http://www.foxsports.com/mlb/story/kris-bryant-carlos-correa-rookies-of-year-award-matt-duffy-francisco-lindor-kang-sano-120715) asks "Will MLB's tremendous rookie class of 2015 suffer a sophomore slump?".
Does the data confirm the existence of a sophomore slump? Let's take a look. Examining the data for widely used measure of success, the batting average, we see that this observation holds true for the top performing ROYs:
```{r, echo=FALSE, cache=FALSE}
#The data is available in the Lahman library, but we have to do some work to create a table with the statistics for all the ROY. First we create a table with player ID, their names, and their most played position.
library(Lahman)
playerInfo <- Fielding |>
group_by(playerID) |>
arrange(desc(G)) |>
slice(1) |>
ungroup() |>
left_join(People, by = "playerID") |>
select(playerID, nameFirst, nameLast, POS)
ROY <- AwardsPlayers |>
filter(awardID == "Rookie of the Year") |>
left_join(playerInfo, by = "playerID") |>
rename(rookie_year = yearID) |>
right_join(Batting, by = "playerID") |>
mutate(AVG = H/AB) |>
filter(POS != "P")
##We also will keep only the rookie and sophomore seasons and remove players that did not play sophomore seasons:
ROY <- ROY |>
filter(yearID == rookie_year | yearID == rookie_year + 1) |>
group_by(playerID) |>
mutate(rookie = ifelse(yearID == min(yearID), "rookie", "sophomore")) |>
filter(n() == 2) |>
ungroup() |>
select(playerID, rookie_year, rookie, nameFirst, nameLast, AVG)
## Finally, we will use the `spread` function to have one column for the rookie and sophomore years batting averages:
ROY <- ROY |> spread(rookie, AVG) |> arrange(desc(rookie))
tmp <- ROY |> slice(1:5) |>
select(nameFirst, nameLast, rookie_year, rookie, sophomore)
if(knitr::is_html_output()){
knitr::kable(tmp, "html") |>
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = FALSE)
} else{
knitr::kable(tmp, "latex", booktabs = TRUE) |>
kableExtra::kable_styling(font_size = 8)
}
```
In fact, the proportion of players that have a lower batting average their sophomore year is `r mean(ROY$sophomore - ROY$rookie <= 0)`.
So is it "jitters" or "jinx"? To answer this question, let's turn our attention to all players that played the 2013 and 2014 seasons and batted more than 130 times (minimum to win Rookie of the Year).
```{r, echo=FALSE}
# We perform similar operations to what we did above
two_years <- Batting |>
filter(yearID %in% 2013:2014) |>
group_by(playerID, yearID) |>
filter(sum(AB) >= 130) |>
summarize(AVG = sum(H)/sum(AB)) |>
ungroup() |>
spread(yearID, AVG) |>
filter(!is.na(`2013`) & !is.na(`2014`)) |>
left_join(playerInfo, by = "playerID") |>
filter(POS != "P") |>
select(-POS) |>
arrange(desc(`2013`)) |>
select(nameFirst, nameLast, `2013`, `2014`)
```
The same pattern arises when we look at the top performers: batting averages go down for most of the top performers.
```{r, echo=FALSE}
tmp <- two_years |> slice(1:5)
knitr::kable(tmp, "html") |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
```
But these are not rookies! Also, look at what happens to the worst performers of 2013:
```{r, echo=FALSE}
tmp <- arrange(two_years, `2013`) |> slice(1:5)
knitr::kable(tmp, "html") |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
```
Their batting averages mostly go up! Is this some sort of reverse sophomore slump? It is not. There is no such thing as the sophomore slump. This is all explained with a simple statistical fact: the correlation for performance in two separate years is high, but not perfect:
```{r regression-fallacy, echo=FALSE, fig.height=3, fig.width=3, out.width="40%"}
two_years |> ggplot(aes(`2013`, `2014`)) + geom_point()
```
The correlation is `r cor(two_years$"2013",two_years$"2014")` and the data look very much like a bivariate normal distribution, which means we predict a 2014 batting average $Y$ for any given player that had a 2013 batting average $X$ with:
$$ \frac{Y - .255}{.032} = 0.46 \left( \frac{X - .261}{.023}\right) $$
Because the correlation is not perfect, regression tells us that, on average, expect high performers from 2013 to do a bit worse in 2014. It's not a jinx; it's just due to chance. The ROY are selected from the top values of $X$ so it is expected that $Y$ will regress to the mean.
## Exercises
1\. Load the `GaltonFamilies` data from the **HistData**. The children in each family are listed by gender and then by height. Create a dataset called `galton_heights` by picking a male and female at random.
2\. Make a scatterplot for heights between mothers and daughters, mothers and sons, fathers and daughters, and fathers and sons.
3\. Compute the correlation in heights between mothers and daughters, mothers and sons, fathers and daughters, and fathers and sons.