# (PART) Statistical Inference via infer {-}
# Sampling {#sampling}
```{r setup_infer, include=FALSE, purl=FALSE}
chap <- 7
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
  tidy = FALSE,
  out.width = '\\textwidth',
  fig.height = 4,
  warning = FALSE
)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC
# solutions, which are written like markdown text. In theory, it shouldn't be
# necessary for knitr versions <=1.11.6, but I've found I still need to for
# everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
  if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL"
# (to show all solutions), or subsets of c('5-1', '5-2','5-3', '5-4'), including
# the null vector c('') to show no solutions.
solutions_shown <- c('')
show_solutions <- function(section){
  return(solutions_shown == "ALL" | section %in% solutions_shown)
}
```
In this chapter we introduce new concepts that will serve as the basis for the remainder of the text: **sampling** and **resampling**. We will see that the tools you learned in the Data Exploration part of this book (tidy data, data visualization, and data wrangling) also play an important role here. As mentioned before, the concepts throughout this text build on one another, allowing you to create better stories with data.
We begin with definitions that help explain why statistical inference exists and why it is needed. We then introduce the second of our main datasets (in addition to the `nycflights13` data you've been working with), on OkCupid dating profiles, to see how the distribution of a sample can be thought of as an approximation of the distribution of the population. We will also focus on representative, random samples versus convenience samples in this context.
We then shift to a famous example from statistics lore about a lady tasting tea. This section introduces the key ideas without a lot of statistical jargon. The chapter concludes with a summary of the `mosaic` package functions introduced here.
### Needed packages {-}
Let's load all the packages needed for this chapter (this assumes you've already installed them). If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(okcupiddata)
library(mosaic)
```
```{r message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(knitr)
```
## Random sampling
Whenever you hear the phrases "random sampling" or just "sampling" (with regards to statistics), you should think about tasting soup. This likely sounds a little bonkers. Let's dig into why tasting soup is such an excellent analogy to random sampling.
### Sampling
Consider the following "sampling bowl" consisting of 2400 balls
```{r tub1, echo=FALSE, fig.cap="A sampling bowl containing 2400 balls", purl=FALSE}
knitr::include_graphics("images/sampling_bowl_1.jpg")
```
from which we draw samples of size $n=50$ using the following shovel
```{r tub2, echo=FALSE, fig.cap="A shovel used to draw samples of size n = 50", purl=FALSE}
knitr::include_graphics("images/sampling_bowl_2.jpg")
```
to extract this sample ...
```{r tub3, echo=FALSE, fig.cap="A sample of 50 balls drawn with the shovel", purl=FALSE}
knitr::include_graphics("images/sampling_bowl_3.jpg")
```
### Tasting soup
```{r soupimg, echo=FALSE, fig.cap="A bowl of Indian chicken and vegetable soup", purl=FALSE}
knitr::include_graphics("images/soup.jpg")
```
Imagine that you have invited a group of friends over to try a new recipe for soup that you've never made before. As in the image above, downloaded from [here](http://readthespirit.wpengine.netdna-cdn.com/feed-the-spirit/wp-content/uploads/sites/19/2015/02/Chicken-soup-Indian-by-Fifth-Floor-Kitchen.jpg), you'd like to make a bowl of Indian chicken soup with many different kinds of vegetables included.
You've carefully followed the recipe, but you are concerned because you don't have much experience making Indian food. The cooking time given in the recipe is nearly up, and you begin to wonder:
- "Did I add too much curry spice?"
- "Are the carrots cooked enough?"
- "Does this actually taste good?"
How can we answer these questions? Does it matter where we take a bite of soup from? Is there anything we should do to the soup before we taste? Is one taste enough?
### Common terms
The process of sampling brings with it many common terms that we define now. As you read over these definitions, think about how they each apply to the tasting soup example above.
**Definition: population**
The *population* is the (usually) large pool of observational units that we are interested in.
**Definition: sample**
A *sample* is a smaller collection of observational units that is selected from the population.
**Definition: sampling**
*Sampling* refers to the process of selecting observations from a population. There are both random and non-random ways this can be done.
**Definition: representative sample**
A sample is said to be a *representative sample* if the characteristics of the observational units selected are a good approximation of the characteristics of the original population.
**Definition: bias**
*Bias* corresponds to a favoring of one group in a population over another group.
**Definition: generalizability**
*Generalizability* refers to the largest group about which it makes sense to draw inferences from the sample collected. This is directly related to how the sample was selected.
**Definition: parameter**
A *parameter* is a calculation based on one or more variables measured in the population. Parameters are almost always denoted symbolically using Greek letters such as $\mu$, $\pi$, $\sigma$, $\rho$, and $\beta$.
**Definition: statistic**
A *statistic* is a calculation based on one or more variables measured in the sample. Statistics are usually denoted by lowercase Roman letters, sometimes with additional symbols such as bars or hats. These include $\bar{x}$, $\hat{p}$, $s$, $p$, and $b$.
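For reference, each population parameter has a corresponding sample statistic that estimates it. The standard pairings for the symbols just listed are:

- population mean $\mu$ and sample mean $\bar{x}$
- population proportion $\pi$ and sample proportion $\hat{p}$
- population standard deviation $\sigma$ and sample standard deviation $s$
- population regression slope $\beta$ and sample slope $b$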
```{block lc6-0a, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Explain in your own words how tasting soup relates to the concepts of sampling covered here.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Describe a different scenario (not food or drink related) that is analogous to sampling concepts covered here.
```{block, type='learncheck', purl=FALSE}
```
Let's explore these terms for our tasting soup example:
*Population* - the entire container of soup that we have cooked.
*Sample* - any smaller portion of soup collected that isn't the whole container of soup. We could say that each spoonful of soup represents one sample.
*Sampling* - the process of selecting spoonfuls from the container of soup.
*Representative sample* - A sample we select will only be representative if it tastes the way the soup tastes overall. If our spoonful contains only a carrot, we might not have a representative sample.
*Bias* - As we noted with the carrot example above, we may select a sample that is not representative. If you watch chefs cook, or if you cook frequently yourself, you'll notice that they stir the soup before tasting it precisely to avoid this kind of bias.
*Generalizability* - If we stir our soup before we taste a spoonful (and if we make sure we don't just pick our favorite item in the soup), results from our sample can be generalized (by and large) to the larger pot of soup. When we say "Yum! This is good!" after a couple spoonfuls, we can be pretty confident that each bowl of soup for our friends will taste good too.
*Parameter* - An example here could be the proportion of curry added to the entire pot of soup. A measurement of how salty the pot of soup is on average is also a parameter. How crunchy, on average, the carrots are in the pot of soup is one more example.
*Statistic* - To convert a parameter to a statistic, you need only to think about the same measurement on a spoonful:
- The proportion of curry to non-curry in a spoonful of soup
- How salty the spoonful of soup is that we collected as our sample
- How crunchy the carrots are in our spoonful of soup
```{block lc6-0b, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why isn't our population all bowls of soup? All bowls of Indian chicken soup?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Describe a way in which we could select a sample of flights from `nycflights13` that is not representative.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** If we treat all of the flights in `nycflights13` as the population, give examples of three _parameters_ we could calculate.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** If we treat all of the flights in `nycflights13` as the population, give examples of three _statistics_ we could calculate.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What biases might we see if we only select flights to Boston when we are interested in looking at mean flight delays from NYC?
```{block, type='learncheck', purl=FALSE}
```
---
## Visualizing sampling {#vizsamp}
Let's explore how sampling and these other terms relate to working with data and data visualization. Here we introduce the `okcupiddata` R package [@R-okcupiddata]. Note that permission to use this data to create the R package was explicitly granted by OkCupid. More information about this package is available [here](https://github.com/rudeboybert/okcupiddata). The `profiles` data frame in this R data package contains data about 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. We will be focusing on the `height` variable, which corresponds to a self-reported height for each individual on their profile. Note that this is measured in inches.
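Before plotting, it can help to take a quick glance at the variable itself. Here is a minimal sketch using `glimpse()` from `dplyr` (loaded above), which shows the number of rows and the variable's type:

```{r height-glance}
profiles %>%
  select(height) %>%
  glimpse()
```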
Let's take a look at the distribution of `height` using a histogram and `ggplot2`:
```{r height-hist, warning=FALSE}
ggplot(data = profiles, mapping = aes(x = height)) +
  geom_histogram(bins = 20, color = "white")
```
We see that, because these heights are self-reported, the data are a little messy.
```{block lc6-0c, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why does the histogram go all the way back to 0 for height and all the way up to 100?
```{block, type='learncheck', purl=FALSE}
```
To clean up the data a bit, let's focus on just looking at heights between 55 inches and 85 inches. Remember that the `filter` function in `dplyr` allows us to focus on a subset of rows. The specific subset of rows we are interested in corresponds to the argument to the `filter` function. We will create a new data frame called `profiles_subset` that contains all rows with heights between 55 and 85 inclusive.
```{r filter-profiles}
profiles_subset <- profiles %>% filter(between(height, 55, 85))
```
Next, let's produce the same histogram as above but using the `profiles_subset` data frame instead.
```{r height-hist2, warning=FALSE}
ggplot(data = profiles_subset, mapping = aes(x = height)) +
  geom_histogram(bins = 20, color = "white")
```
We can think of this data as representing the *population* of interest. Let's now take a random sample of size 100 from this population and look to see if this sample represents the overall shape of the population. In other words, we are going to use data visualization as our guide to understand the *representativeness* of the sample selected.
```{r sample-profiles}
set.seed(2017)
profiles_sample1 <- profiles_subset %>%
  resample(size = 100, replace = FALSE)
```
The `set.seed` function is used to ensure that all users get the same random sample when they run the code above. It is a way of interfacing with the pseudo-random number generation scheme that R uses to generate "random" numbers. If that command were not run, you would likely obtain a different random sample than someone else running the same code.
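To see what `set.seed` buys us, here is a small standalone sketch (shown but not evaluated, so it doesn't disturb the random draws used in the rest of the chapter): resetting the same seed reproduces exactly the same draw.

```{r seed-demo, eval=FALSE}
set.seed(2017)
sample(1:10, size = 3)  # one "random" draw of three values
set.seed(2017)
sample(1:10, size = 3)  # resetting the seed gives the identical draw
```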
We have introduced the `resample` function from the `mosaic` package here [@R-mosaic]. This function can be used for sampling both with and without replacement. Here we have chosen to sample without replacement: after a row is chosen at random from the `profiles_subset` data frame, it cannot be chosen again in the remaining 99 draws. Let's now visualize the 100 values of the `height` variable in the `profiles_sample1` data frame. To keep this visualization on the same horizontal scale as the original population in `profiles_subset`, we use the `coord_cartesian` function along with the `c` function to specify the limits of the horizontal axis.
```{r plot-sample1}
ggplot(data = profiles_sample1, mapping = aes(x = height)) +
  geom_histogram(bins = 20, color = "white", fill = "red") +
  coord_cartesian(xlim = c(55, 85))
```
```{block lc6-0d, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Does this random sample of `height` represent the population `height` variable well? Explain why or why not in a couple of sentences.
```{block, type='learncheck', purl=FALSE}
```
We now repeat this sampling process to see how another random sample of `height` compares to the original population distribution.
```{r sample-profiles2}
profiles_sample2 <- profiles_subset %>% resample(size = 100, replace = FALSE)
ggplot(data = profiles_sample2, mapping = aes(x = height)) +
  geom_histogram(bins = 20, color = "black", fill = "yellow") +
  coord_cartesian(xlim = c(55, 85))
```
Remember that a sample can never truly capture all of the properties of a population, since it contains less data and, thus, less information. We can, however, use the overall shape of the sample as a rough guide to how representative it is of the population. We see that the two random samples of size 100 above have roughly the same shape as the original population `height` data. Let's next explore what is known as a convenience sample and how its distribution compares to the population distribution.
A **convenience sample** is a sample chosen because it is easy for the person collecting it to obtain. While certainly less work, convenience samples are generally not representative of the population since they exclude some (usually large) portion of it. Let's take as our sample all values of `height` in our `profiles_subset` population that are at least 6 feet (72 inches).
```{r sample-profiles3}
profiles_sample3 <- profiles_subset %>% filter(height >= 72)
ggplot(data = profiles_sample3, mapping = aes(x = height)) +
  geom_histogram(bins = 20, color = "white", fill = "blue") +
  coord_cartesian(xlim = c(55, 85))
```
This is a clear example of a sample that is not representative of the population. The population `height` variable is roughly symmetric, whereas this distribution is right-skewed. Further, since it only selects large heights it has completely excluded the small and middle heights. We have seen here that data visualization provides an excellent tool in judging the representativeness of a sample.
### Sampling distribution
The representativeness of a sample plays an even larger role than just looking at the shapes of distributions. Let's suppose we were interested in estimating the mean `height` of all profiles in the `profiles_subset` data frame. To do so, we could look at the mean of the `height` variable in the `profiles_sample1` data frame:
```{r mean1}
profiles_sample1 %>% summarize(mean(height))
```
But, we could also use `profiles_sample2`:
```{r mean2}
profiles_sample2 %>% summarize(mean(height))
```
Or maybe even `profiles_sample3`:
```{r mean3}
profiles_sample3 %>% summarize(mean(height))
```
We see a clear difference between the mean of `height` in `profiles_sample3` and the means in `profiles_sample1` and `profiles_sample2`. This comes from the bias introduced by choosing only the tallest profiles for `profiles_sample3`. If we had used this sample as our only sample, we would be quite far off from the actual mean `height` in our `profiles_subset` population.
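To see just how far off, we can compute the population mean directly, since in this example we have the luxury of access to the entire `profiles_subset` population:

```{r mean-population}
profiles_subset %>% summarize(mean(height))
```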
We also see that even random samples produce means that aren't exactly the same. This sampling variability can be shown via what is called a *sampling distribution*, defined as the behavior of a statistic under repeated sampling. To build the sampling distribution for this example, we've created an interactive app below using the `shiny` R package, available at <http://ismay.shinyapps.io/okcupidheights/>. You can specify the sample size you'd like to work with (100 is chosen by default) and then generate a random sample. You can then see the mean of this generated sample plotted in the bottom visualization. Repeating this process many times, you can start to see the shape of the sampling distribution take form. A screenshot of the app is below, but you are encouraged to go to <http://ismay.shinyapps.io/okcupidheights/> and test it out on your own.
```{r shiny, echo=FALSE, out.width="100%", fig.cap=ifelse(knitr:::is_html_output(), "Sampling distribution app at http://ismay.shinyapps.io/okcupidheights/.", "Sampling distribution app"), screenshot.opts=list(delay=20), dev='png', purl=FALSE}
library(knitr)
#if(knitr:::is_html_output()){
# include_app("http://ismay.shinyapps.io/okcupidheights/", height = "1300px")
#} else{
include_graphics("images/shinyapp.png")
#}
```
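The app simply automates a loop: draw a sample of 100 heights, record its mean, and repeat. Here is a minimal base-R sketch of that loop, shown but not evaluated since we will do the same thing more cleanly with `mosaic` in the next section:

```{r sampling-dist-base, eval=FALSE}
# Repeat the "sample 100 heights, record the mean" process 1000 times
sample_means_base <- replicate(1000, {
  mean(sample(profiles_subset$height, size = 100, replace = FALSE))
})
summary(sample_means_base)
```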
### Repeated sampling via `do`
We have looked at two random samples above, but using `mosaic` we can repeat this process over and over again with the `do` function. Below, we repeat this sampling process 5000 times. We can then plot the different values of the sample means to get a sense of the range of plausible values for the population mean `height` in the `profiles_subset` data frame.
```{r do-first, include=FALSE}
if(!file.exists("rds/sample_means.rds")){
sample_means <- do(5000) *
(profiles_subset %>% resample(size = 100, replace = FALSE) %>%
summarize(mean_height = mean(height))
)
saveRDS(object = sample_means, "rds/sample_means.rds")
} else {
sample_means <- readRDS("rds/sample_means.rds")
}
```
```{r do-first-read, eval=FALSE}
sample_means <- do(5000) *
  (profiles_subset %>% resample(size = 100, replace = FALSE) %>%
     summarize(mean_height = mean(height)))
```
```{r do-plot}
ggplot(data = sample_means, mapping = aes(x = mean_height)) +
  geom_histogram(color = "white", bins = 20)
```
Note how the range of sample mean height values is much narrower than the original range of `height` in the `profiles_subset` data frame. We also see a characteristic shape to this distribution of `mean_height`: the normal curve. This idea is commonly associated with statistics, and hopefully you now have a sense of how this distribution comes about. As before, if you aren't quite sure yet, go back and explore the shiny app above a bit more. Many values of the sample mean appear near the center of the distribution and only a few appear out in the tails, giving the bell shape associated with the normal distribution. You'll see more examples of this in the chapters to come and in Appendix \@ref(appendixB).
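We can also put a rough number on how much narrower the spread is. As a quick check using the objects created above, the standard deviation of the 5000 sample means should be close to the standard deviation of the individual heights divided by $\sqrt{100}$, the square root of the sample size:

```{r spread-check}
sd(profiles_subset$height)              # spread of the individual heights
sd(sample_means$mean_height)            # spread of the 5000 sample means
sd(profiles_subset$height) / sqrt(100)  # individual spread scaled by sqrt(n)
```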
```{block lc6-0e, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why do the sample mean values have a much smaller spread than the original population data? You may want to play with the shiny app above a bit to understand why this is the case.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is random sampling so important here to create a distribution of sample means that provide a range of plausible values for the population mean height?
```{block, type='learncheck', purl=FALSE}
```
---
## Simulation
Next, we will introduce the ideas behind hypothesis testing, which we will delve into more formally in the chapters to come. What follows is taken from the book _The Lady Tasting Tea_ [@salsburg2001]:
> It was a summer afternoon in Cambridge, England, in the late
1920s. A group of university dons, their wives, and some guests
were sitting around an outdoor table for afternoon tea. One of
the women was insisting that tea tasted different depending upon
whether the tea was poured into the milk or whether the milk was
poured into the tea. The scientific minds among the men scoffed
at this as sheer nonsense. What could be the difference? They
could not conceive of any difference in the chemistry of the mixtures
that could exist. A thin, short man, with thick glasses and a
Vandyke beard beginning to turn gray, pounced on the problem.
"Let us test the proposition," he said excitedly. He began to outline
an experiment in which the lady who insisted there was a difference
would be presented with a sequence of cups of
tea, in some of which the milk had been
poured into the tea and in others of which
the tea had been poured into the milk...
> So it was that sunny summer afternoon in Cambridge.
The lady might or might not have been correct about the tea infusion.
The fun would be in finding a way to determine if she was
right, and, under the direction of the man with the Vandyke beard,
they began to discuss how they might make that determination.
> Enthusiastically, many of them joined with him in setting up
the experiment. Within a few minutes, they were pouring different
patterns of infusion in a place where the lady could not see which
cup was which. Then, with an air of finality, the man with the
Vandyke beard presented her with her first cup. She sipped for a
minute and declared that it was one where the milk had been
poured into the tea. He noted her response without comment and
presented her with the second cup...
> The man with
the Vandyke beard was Ronald Aylmer Fisher, who was in his late
thirties at the time. He would later be knighted Sir Ronald Fisher.
In 1935, he wrote a book entitled _The Design of Experiments_, and
he described the experiment of the lady tasting tea in the second
chapter of that book. In his book, Fisher discusses the lady and her
belief as a hypothetical problem. He considers the various ways in
which an experiment might be designed to determine if she could
tell the difference. The problem in designing the experiment is
that, if she is given a single cup of tea, she has a 50 percent chance
of guessing correctly which infusion was used, even if she cannot
tell the difference. If she is given two cups of tea, she still might
guess correctly. In fact, if she knew that the two cups of tea were
each made with a different infusion, one guess could be completely
right (or completely wrong).
> Similarly, even if she could tell the difference, there is some
chance that she might have made a mistake, that one of the cups
was not mixed as well or that the infusion was made when the tea
was not hot enough. She might be presented with a series of ten
cups and correctly identify only nine of them, even if she could tell
the difference.
> In his book, Fisher discusses the various possible outcomes of
such an experiment. He describes how to decide how many cups
should be presented and in what order and how much to tell the
lady about the order of presentations. He works out the probabilities
of different outcomes, depending upon whether the lady is or
is not correct. Nowhere in this discussion does he indicate that
such an experiment was ever run. Nor does he describe the outcome
of an actual experiment.
It's remarkable that there is no evidence that such an event actually took place. Still, this problem is a great introduction to inference, and we can proceed by testing how likely it is for a person to guess correctly, say, 9 out of 10 times, assuming that person is just guessing. In other words, is the person just lucky, or do we have reason to suspect that they can actually detect whether the milk was put in first or not?
We need to think about this problem from the standpoint of hypothesis testing. First, we'll need to identify some important parts of a hypothesis test before we proceed with the analysis.
```{block lc6-1, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does "by chance" mean in this context?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is our observed statistic?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is this statistic trying to estimate?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How could we test to see whether the person is just guessing or if they have some special talent of identifying milk before tea or vice-versa?
```{block, type='learncheck', purl=FALSE}
```
Let's begin with an experiment. I will flip a coin 10 times. Your job is to try to predict the sequence of my 10 flips. Write down 10 H's and T's corresponding to your predictions. We could compare your guesses with my actual flips and then we would note how many correct guesses you have.
You may be asking yourself how this models a way to test whether the person was just guessing or not. All we are trying to do is see how likely it is to have 9 matches out of 10 if the person was truly guessing. When we say "truly guessing" we are assuming that we have a 50/50 chance of guessing correctly. This can be modeled using a coin flip and then seeing whether we guessed correctly for each of the coin flips. If we guessed correctly, we can think of that as a "success."
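Before introducing any new functions, here is a minimal base-R sketch of one round of this game: simulate my 10 flips, simulate your 10 guesses, and count the matches. (The specific sequences of H's and T's don't matter; only the number of matches does.)

```{r one-round-sketch}
flips   <- sample(c("H", "T"), size = 10, replace = TRUE)  # my 10 flips
guesses <- sample(c("H", "T"), size = 10, replace = TRUE)  # your 10 guesses
sum(flips == guesses)  # number of correct guesses ("successes") in this round
```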
We often don't have time to do the physical flipping over and over again and we'd like to be able to do more than just 20 different simulations or so. Luckily, we can use R to simulate this process many times. The `mosaic` package includes a function called `rflip()`, which can be used to flip one coin. Well, not exactly. It uses pseudo-random number generation to "flip" a virtual coin. In order for us all to get the same results here, we can set the seed of the pseudo-random number generator. Let's see an example of this: (Remember to load the `mosaic` package!)
```{r message=FALSE}
set.seed(2017)
do(1) * rflip(1)
```
This shows us the proportion of "successes" in one flip of a coin. The `do` function in the `mosaic` package will be useful and you can begin to understand what it does with another example.
```{r}
do(13) * rflip(10)
```
We've now simulated the guessing game: 13 different simulations of flipping a coin 10 times, one for each of 13 students' sets of guesses. Note here that `heads` now corresponds to the number of correct guesses and `tails` corresponds to the number of incorrect guesses. (This can be tricky to understand at first, since we've switched what "heads" and "tails" mean.)
If you look at the output above for our simulation of 13 student guesses, we can begin to get a sense for what an "expected" sample proportion of successes may be. Around five out of 10 seems to be the most likely value. What does this say about what we actually observed with a success rate of 9/10? To better answer this question, we can simulate 5000 student guesses and then look at the distribution of the simulated sample proportion of successes, also known as the **null distribution**.
```{r include=FALSE}
if(!file.exists("rds/simGuesses.rds")){
simGuesses <- do(5000) * rflip(10)
saveRDS(object = simGuesses, "rds/simGuesses.rds")
} else {
simGuesses <- readRDS("rds/simGuesses.rds")
}
```
```{r eval=FALSE}
simGuesses <- do(5000) * rflip(10)
```
```{r}
simGuesses %>%
  group_by(heads) %>%
  summarize(count = n())
```
We can see here that we have created a count of how many of each of the 5000 sets of 10 flips resulted in 0, 1, 2, $\ldots$, up to 10 heads. Note the use of the `group_by` and `summarize` functions from Chapter \@ref(wrangling) here.
In addition, we can plot the distribution of these simulated `heads` using the ideas from Chapter \@ref(viz). `heads` is a quantitative variable. Think about which type of plot is most appropriate here before reading further.
The data summarization in the chunk above already suggests an appropriate plot. We'd like to see how many heads occurred in each of the 5000 sets of 10 flips and, in particular, how frequently 9 or more heads occurred:
```{r fig.cap="Histogram of number of heads in simulation - needs tweaking"}
ggplot(data = simGuesses, mapping = aes(x = heads)) +
  geom_histogram(binwidth = 1, color = "white")
```
The horizontal axis labels are a little confusing here. What does 2.5 or 7.5 heads mean? In `simGuesses`, `heads` is a numerical variable, so `ggplot` expects the values to be on a continuous scale. We can switch the scale to a discrete one by invoking the `factor` function and using `geom_bar` instead of `geom_histogram`:
```{r fig.cap="Barplot of number of heads in simulation"}
ggplot(data = simGuesses, mapping = aes(x = factor(heads))) +
  geom_bar()
```
You'll frequently need to make this conversion to `factor` when making a barplot with quantitative variables. Remember from "Getting Used to R, RStudio, and R Markdown" [@usedtor2016] that a `factor` variable is useful when a variable has a natural ordering and takes on only discrete values, not fractional values like 2.5. Our `heads` variable has a natural ordering: 0, 1, 2, $\ldots$, 10.
Again, note that the shape of this distribution of the number of heads follows what appears to be a normal distribution. We'll see in a related example that, if appropriate conditions/assumptions are met, we can expect a normal distribution to result. When those conditions aren't met, the simulation methodology we've presented here still works well, whereas the traditional normal-based methods start to fall apart.
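We can also use this null distribution right away to put a rough number on the original question: out of the 5000 simulated sets of 10 guesses, how often did someone who was just guessing match 9 or more? (A quick sketch; the exact proportion depends on the simulated values.)

```{r nine-or-more}
simGuesses %>%
  summarize(prop_9_or_more = mean(heads >= 9))
```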
We will delve further into hypothesis testing in the next few chapters. This null distribution in combination with the **sampling distribution** concept covered earlier will be of utmost importance going forward.
---
## Review of `mosaic` simulation functions
In this chapter, we've discussed three functions in the `mosaic` package that are useful stepping stones to statistical inference: `do`, `rflip`, and `resample`. We will also work with the `shuffle` function in later chapters, so we summarize it here as well; brief usage sketches follow the list below.
- `do`: Its main use is in replicating a process many times. Its argument `n` specifies how many times to replicate the process. It is followed by `*`, which can be read as "times", and the expression immediately after `*` is the process being replicated.
- `rflip`: This is used to simulate the flipping of a coin many times. By default, it flips a fair coin one time giving an equal chance to heads and tails. It is frequently used with `do() *` to simulate many coin flips in multiple sets.
- `resample`: This is used to sample from a larger dataset with or without replacement. When we are thinking about the concept of random sampling, we sample without replacement. We can also sample with replacement, meaning each value is returned to the pool after it is drawn and so may be drawn again. This will be of great importance when we discuss **bootstrapping** with confidence intervals.
- `shuffle`: Its main purpose is to permute the values of one variable across the values of another variable. This acts in much the same way as shuffling a deck of cards and then presenting the shuffled deck to two (or more) players.
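Here are minimal usage sketches for each (assuming `mosaic` is loaded; the output is random, so yours will differ):

```{r mosaic-sketches, eval=FALSE}
do(3) * rflip(10)                         # three sets of ten fair coin flips
resample(1:5, size = 5)                   # sample with replacement (the default)
resample(1:5, size = 5, replace = FALSE)  # sample without replacement
shuffle(1:5)                              # permute the values
```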
```{block lc-mosaic, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Recreate `rflip` using only the `resample` function and specifying the appropriate arguments.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Recreate `shuffle` using only the `resample` function and specifying the appropriate arguments.
```{block, type='learncheck', purl=FALSE}
```
---
## Conclusion
### What's to come?
This chapter has served as an introduction to inferential techniques that will be discussed in greater detail in Chapter \@ref(hypo) for hypothesis testing and in Chapter \@ref(ci) for confidence intervals. In these chapters, we will see how we can use the related concept of **resampling** when working with the distributions of two groups. All of these concepts will be further reinforced in Chapter \@ref(regression) as well.
### Script of R code
An R script file of all R code used in this chapter is available [here](https://moderndive.netlify.com/scripts/08-sampling.R).