-
Notifications
You must be signed in to change notification settings - Fork 42
/
18-inference.qmd
474 lines (315 loc) · 17.8 KB
/
18-inference.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
# Inference
The day before the 2008 presidential election, Nate Silver's FiveThirtyEight stated that
>> Barack Obama appears poised for a decisive electoral victory.
They went further and predicted that Obama would win the election with 349 electoral votes to 189, and the popular vote by a margin of 6.1%. FiveThirtyEight also attached a probabilistic statement to their prediction claiming that Obama had a 91% chance of winning the election.
Political commentator Joe Scarborough said during [his show](https://www.youtube.com/watch?v=TbKkjm-gheY)
> > Anybody that thinks that this race is anything but a toss-up right now is such an ideologue ... they're jokes.
To which Nate Silver responded via Twitter:
> > If you think it's a toss-up, let's bet. If Obama wins, you donate \$1,000 to the American Red Cross. If Romney wins, I do. Deal?
In 2016, Silver was not as certain and gave Hillary Clinton only a 71% of winning. In contrast, many other forecasters were almost certain she would win. She lost. But 71% is still more than 50%, so was Mr. Silver wrong? And what does probability mean in this context anyway? Are dice being tossed or cards being dealt somewhere?
## Parameters and Estimates
[Real Clear Politics](http://www.realclearpolitics.com)
is an example of a news aggregator that organizes and publishes poll results. For example, they present the following poll results reporting estimates of the popular vote for the [2016 presidential election](http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html)
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
url <- "https://web.archive.org/web/20161108012231/https://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html"
library(rvest)
tab <- read_html(url) |> html_node("table") |> html_table()
tab <- tab |> mutate(Poll = stringr::str_remove(Poll, "\\/.*")) |>
mutate(Poll = case_when(
Poll == "BloombergBloomberg" ~ "Bloomberg",
Poll == "FOX NewsFOX News" ~ "FOX News",
Poll == "MonmouthMonmouth" ~ "Monmouth",
Poll == "CBS NewsCBS News" ~ "CBS News",
TRUE ~ Poll))
names(tab) <- stringr::str_remove_all(names(tab), "\\s(.*)")
knitr::kable(tab, "html") |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
```
Let's make some observations about the table above.
* Different polls, all taken days before the election, report a different *spread*
* Clinton won the popular vote by 2.1%.
* We also see a column titled **MoE** which stands for *margin of error*.
## Predict percents competition
What percent of the beads are red?
![](http://rafalab.dfci.harvard.edu/dsbook-part-2/inference/img/urn.jpg){width=40%}
Rules:
* Winners gets a $25 gift certificate
* Provide an estimate and an interval.
* If the true percent is not in the interval you are eliminated
* Smallest interval wins.
* You can take a sample (with replacement) from the urn.
* It costs you \$0.10 per each bead you sample. Example: if your sample size is 250, and you win, you will break even since you will pay \$25 to collect your \$25 prize.
The **dslabs** package includes a function that shows a random draw from this urn:
```{r first-simulated-poll, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE}
set.seed(1)
library(tidyverse)
library(dslabs)
take_poll(25)
```
## Populations, samples, parameters, and estimates
We want to predict the proportion of red beads in the urn. Let's call this quantity $p$, which then tells us the proportion of blue beads $1-p$, and the spread $p - (1-p)$, which simplifies to $2p - 1$.
In statistical textbooks:
* the beads in the urn are called the *population*.
* The proportion of red beads in the population $p$ is called a *parameter*.
* The 25 beads we see in the previous plot are called a *sample*.
The task of statistical inference is to predict the parameter $p$ using the observed data in the sample.
## The sample average
$$\bar{X} = \frac{1}{N} \sum_{i=1}^N X_i$$
Has some desirable properties:
$$
\mbox{E}(\bar{X}) = p
$$
$$
\mbox{SE}(\bar{X}) = \sqrt{p(1-p)/N}
$$
This result reveals the power of polls. The expected value of the sample proportion $\bar{X}$ is the parameter of interest $p$ and we can make the standard error as small as we want by increasing $N$. The law of large numbers tells us that with a large enough poll, our estimate converges to $p$.
Here is the standard error for $p=.25, .5,$ and $0.75$:
```{r standard-error-versus-sample-size, echo=FALSE, message=FALSE, warning=FALSE}
N <- 10^seq(1, 5, len = 100)
map_df(c(0.1, 0.25, 0.5), function(p) data.frame(N = N, SE = sqrt(p*(1 - p)/N), p = as.character(p))) |>
ggplot(aes(N, SE, color = p)) +
geom_line() +
scale_x_continuous(breaks = c(10, 100, 1000, 10000), trans = "log10", labels = scales::comma) +
geom_hline(yintercept = 0.01, lty = 2)
```
From the plot we see that we would need a very large polls to get the standarr error below 1%.
But there is one more very useful property based on CLT:
$$
\bar{X} \sim \mbox{Normal}\left(p, \sqrt{p(1-p)/N}\right)
$$
## CLT in practice
Now we can ask a more practical questions such as what is the probability that our estimate is within 1% of the actual $p$. We can write it like this and actually use CLT to answer the question:
$$
\mbox{Pr}(| \bar{X} - p| \leq .01)
$$
which is the same as:
$$
\mbox{Pr}(\bar{X} - p\leq .01) - \mbox{Pr}(\bar{X} - p \leq - .01)
$$
we standardize $\bar{X}$ to get a approximately standard normal $Z$:
$$
\mbox{Pr}\left(Z \leq \frac{ \,.01} {\mbox{SE}(\bar{X})} \right) -
\mbox{Pr}\left(Z \leq - \frac{ \,.01} {\mbox{SE}(\bar{X})} \right)
$$
We are almost ready to get a number except since we don't know $p$, we don't know $\mbox{SE}(\bar{X})$.
But it turns out that the CLT still works if we estimate the standard error by using $\bar{X}$ in place of $p$. We say that we *plug-in* the estimate. Our estimate of the standard error is therefore:
$$
\hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N}
$$
Now we continue with our calculation, but dividing by $\hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N})$ instead. In our first sample we had 12 blue and 13 red so $\bar{X} = 0.52$ and our estimate of standard error is:
```{r}
x_hat <- 0.52
se <- sqrt(x_hat*(1 - x_hat)/25)
se
```
And now we can answer the question of the probability of being close to $p$. The answer is:
```{r}
pnorm(0.01/se) - pnorm(-0.01/se)
```
Earlier we mentioned the *margin of error*. Now we can define it because it is simply two times the standard error, which we can now estimate. In our case it is:
```{r}
1.96*se
```
Why do we multiply by $1.96 \hat{\mbox{SE}}(\bar{X})$? Because if you ask what is the probability that we are within 1.96 standard errors from $p$, we get:
$$
\mbox{Pr}\left(Z \leq \, 1.96\,\hat{\mbox{SE}}(\bar{X}) / \hat{\mbox{SE}}(\bar{X}) \right) -
\mbox{Pr}\left(Z \leq - 1.96\, \hat{\mbox{SE}}(\bar{X}) / \hat{\mbox{SE}}(\bar{X}) \right)
$$
which is:
$$
\mbox{Pr}\left(Z \leq 1.96 \right) -
\mbox{Pr}\left(Z \leq - 1.96\right)
$$
which we know is about 95%:
```{r}
pnorm(1.96) - pnorm(-1.96)
```
Hence, there is a 95% probability that $\bar{X}$ will be within $1.96\times \hat{SE}(\bar{X})$, in our case within about `r round(1.96*se, 2)`, of $p$. Note that 95% is somewhat of an arbitrary choice and sometimes other percentages are used, but it is the most commonly used value to define margin of error. We often round 1.96 up to 2 for simplicity of presentation.
In summary, the CLT tells us that our poll based on a sample size of $25$ is not very useful. We don't really learn much when the margin of error is this large. All we can really say is that the popular vote will not be won by a large margin. This is why pollsters tend to use larger sample sizes.
### A Monte Carlo simulation
Suppose we want to use a Monte Carlo simulation to corroborate the tools we have built using probability theory. To create the simulation, we would write code like this:
```{r, eval=FALSE}
B <- 10000
N <- 1000
p <- 0.45
x_hat <- replicate(B, {
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
mean(x)
})
```
The problem is, of course, we don't know `p`. We could construct an urn like the one pictured above and run an analog (without a computer) simulation. It would take a long time, but you could take 10,000 samples, count the beads and keep track of the proportions of blue. We can use the function `take_poll(n=1000)` instead of drawing from an actual urn, but it would still take time to count the beads and enter the results.
One thing we therefore do to corroborate theoretical results is to pick one or several values of `p` and run the simulations. Let's set `p=0.45`. We can then simulate a poll:
```{r}
p <- 0.45
N <- 1000
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x)
```
In this particular sample, our estimate is `x_hat`. We can use that code to do a Monte Carlo simulation:
```{r}
B <- 10000
x_hat <- replicate(B, {
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
mean(x)
})
```
To review, the theory tells us that $\bar{X}$ is approximately normally distributed, has expected value $p=$ `r p` and standard error $\sqrt{p(1-p)/N}$ = `r sqrt(p*(1-p)/N)`. The simulation confirms this:
```{r}
mean(x_hat)
mean((x_hat-mean(x_hat))^2)
```
A histogram and qq-plot confirm that the normal approximation is accurate as well:
```{r normal-approximation-for-polls, echo=FALSE, warning=FALSE, message=FALSE, out.width="100%", fig.height=3, cache=FALSE}
library(tidyverse)
library(gridExtra)
p1 <- data.frame(x_hat=x_hat) |>
ggplot(aes(x_hat)) +
geom_histogram(binwidth = 0.005, color="black")
p2 <- data.frame(x_hat=x_hat) |>
ggplot(aes(sample=x_hat)) +
stat_qq(dparams = list(mean=mean(x_hat), sd=sd(x_hat))) +
geom_abline() +
ylab("x_hat") +
xlab("Theoretical normal")
grid.arrange(p1,p2, nrow=1)
```
Of course, in real life we would never be able to run such an experiment because we don't know $p$. But we could run it for various values of $p$ and $N$ and see that the theory does indeed work well for most values. You can easily do this by re-running the code above after changing `p` and `N`.
## Confidence intervals
We want to know the probability that the interval
$$
[\bar{X} - 1.96\hat{\mbox{SE}}(\bar{X}), \bar{X} + 1.96\hat{\mbox{SE}}(\bar{X})]
$$
contains the true proportion $p$. First, consider that the start and end of these intervals are random variables: every time we take a sample, they change. To illustrate this, run the Monte Carlo simulation above twice. We use the same parameters as above:
```{r}
p <- 0.45
N <- 1000
```
And notice that the interval here:
```{r}
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x)
se_hat <- sqrt(x_hat * (1 - x_hat) / N)
x_hat + c(-1, 1)*se_hat * 1.96
```
is different from this one:
```{r}
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x)
se_hat <- sqrt(x_hat * (1 - x_hat) / N)
c(x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
```
Keep sampling and creating intervals and you will see the random variation.
To determine the probability that the interval includes $p$, we need to compute this:
$$
\mbox{Pr}\left(\bar{X} - 1.96\hat{\mbox{SE}}(\bar{X}) \leq p \leq \bar{X} + 1.96\hat{\mbox{SE}}(\bar{X})\right)
$$
By subtracting and dividing the same quantities in all parts of the equation, we get that the above is equivalent to:
$$
\mbox{Pr}\left(-1.96 \leq \frac{\bar{X}- p}{\hat{\mbox{SE}}(\bar{X})} \leq 1.96\right)
$$
The term in the middle is an approximately normal random variable with expected value 0 and standard error 1, which we have been denoting with $Z$, so we have:
$$
\mbox{Pr}\left(-1.96 \leq Z \leq 1.96\right)
$$
which we can quickly compute using :
```{r}
pnorm(1.96) - pnorm(-1.96)
```
proving that we have a 95% probability.
If we want to have a larger probability, say 99%, we need to multiply by whatever `z` satisfies the following:
$$
\mbox{Pr}\left(-z \leq Z \leq z\right) = 0.99
$$
Using:
```{r}
z <- qnorm(0.995)
z
```
will achieve this because by definition `pnorm(qnorm(0.995))` is 0.995 and by symmetry `pnorm(1-qnorm(0.995))` is 1 - 0.995. As a consequence, we have that:
```{r}
pnorm(z) - pnorm(-z)
```
is `0.995 - 0.005 = 0.99`.
We can use this approach for any probability, not just 0.95 and 0.99. In statistics textbooks, these are usually written for any probability as $1-\alpha$. We can then obtain the $z$ for the equation above noting using `z = qnorm(1 - alpha / 2)` because $1 - \alpha/2 - \alpha/2 = 1 - \alpha$.
So, for example, for $\alpha=0.05$, $1 - \alpha/2 = 0.975$ and we get the 1.96 we have been using:
```{r}
qnorm(0.975)
```
### A Monte Carlo simulation
We can run a Monte Carlo simulation to confirm that, in fact, a 95% confidence interval includes $p$ 95% of the time.
```{r, echo=FALSE}
set.seed(1)
```
```{r}
N <- 1000
B <- 10000
inside <- replicate(B, {
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x)
se_hat <- sqrt(x_hat * (1 - x_hat) / N)
between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
})
mean(inside)
```
The following plot shows the first 100 confidence intervals. In this case, we created the simulation so the black line denotes the parameter we are trying to estimate:
```{r confidence-interval-coverage, message=FALSE, echo=FALSE, fig.height=6}
set.seed(1)
tab <- replicate(100, {
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
x_hat <- mean(x)
se_hat <- sqrt(x_hat * (1 - x_hat) / N)
hit <- between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
c(x_hat, x_hat - 1.96 * se_hat, x_hat + 2 * se_hat, hit)
})
tab <- data.frame(poll=1:ncol(tab), t(tab))
names(tab)<-c("poll", "estimate", "low", "high", "hit")
tab <- mutate(tab, p_inside = ifelse(hit, "Yes", "No") )
ggplot(tab, aes(poll, estimate, ymin=low, ymax=high, col = p_inside)) +
geom_point()+
geom_errorbar() +
coord_flip() +
geom_hline(yintercept = p)
```
## Exercises
(@) Write a line of code that gives you the standard error `se` for the proportion of red beads for several values of $p$, specifically for `p <- seq(0, 1, length = 100)`. Make a plot of `se` versus `p`.
(@) If we are interested in the difference in proportions, $\mu = p - (1-p)$, our estimate is $\hat{\mu} = \bar{X} - (1-\bar{X})$ expected value and SE of $\hat{\mu}$.
(@) If the actual $p=.45$, it means the Republicans are winning by a relatively large margin since $\mu = -.1$, which is a 10% margin of victory. In this case, what is the standard error of $2\hat{X}-1$ if we take a sample of $N=25$?
(@) Given the answer to 9, which of the following best describes your strategy of using a sample size of $N=25$?
a. The expected value of our estimate $2\bar{X}-1$ is $\mu$, so our prediction will be right on.
b. Our standard error is larger than the difference, so the chances of $2\bar{X}-1$ being positive and throwing us off were not that small. We should pick a larger sample size.
c. The difference is 10% and the standard error is about 0.2, therefore much smaller than the difference.
d. Because we don't know $p$, we have no way of knowing that making $N$ larger would actually improve our standard error.
For the next exercises, we will use actual polls from the 2016 election. You can load the data from the **dslabs** package.
```{r}
library(dslabs)
```
Specifically, we will use all the national polls that ended within one week before the election.
```{r, message=FALSE, message=FALSE}
library(tidyverse)
polls <- polls_us_election_2016 |>
filter(enddate >= "2016-10-31" & state == "U.S.")
```
(@) For the first poll, you can obtain the samples size and estimated Clinton percentage with:
```{r, eval=FALSE}
N <- polls$samplesize[1]
x_hat <- polls$rawpoll_clinton[1]/100
```
Assume there are only two candidates and construct a 95% confidence interval for the election night proportion $p$.
(@) Now use `dplyr` to add a confidence interval as two columns, call them `lower` and `upper`, to the object `poll`. Then use `select` to show the `pollster`, `enddate`, `x_hat`,`lower`, `upper` variables. Hint: define temporary columns `x_hat` and `se_hat`.
(@) The final tally for the popular vote was Clinton 48.2% and Trump 46.1%. Add a column, call it `hit`, to the previous table stating if the confidence interval included the true proportion $p=0.482$ or not.
(@) For the table you just created, what proportion of confidence intervals included $p$?
(@) If these confidence intervals are constructed correctly, and the theory holds up, what proportion should include $p$?
(@). A much smaller proportion of the polls than expected produce confidence intervals containing $p$. If you look closely at the table, you will see that most polls that fail to include $p$ are underestimating. The reason for this is undecided voters, individuals polled that do not yet know who they will vote for or do not want to say. Because, historically, undecideds divide evenly between the two main candidates on election day, it is more informative to estimate the spread or the difference between the proportion of two candidates $\mu$, which in this election was $0. 482 - 0.461 = 0.021$. Assume that there are only two parties and that $\mu = 2p - 1$, redefine `polls` as below and re-do exercise 1, but for the difference.
```{r, message=FALSE, comment=FALSE}
polls <- polls_us_election_2016 |>
filter(enddate >= "2016-10-31" & state == "U.S.") |>
mutate(mu_hat = rawpoll_clinton / 100 - rawpoll_trump / 100)
```
(@) Now recalculate confidence interavls for the difference and see how often the polls include the election night difference of 0.21.
```{r}
polls |> mutate(p = (mu_hat + 1)/2, se = 2*sqrt(p*(1-p))/sqrt(samplesize)) |>
mutate(lower = mu_hat - 1.96*se, upper = mu_hat + 1.96*se)
```
(@) Make a plot of the error, the difference between each poll's estimate and the actual $mu=0.021$ stratified by pollster.
(@) Redo the previous plot but only for pollsters that took five or more polls.