# Applications: Data {#data-applications}
```{r, include = FALSE}
source("_common.R")
```
::: {.chapterintro}
The [Preliminaries](rstudio) Chapter introduced the RStudio statistical computing environment.
In this chapter, we will explore tidy data in R---how to import and export data sets,
filter the data set into groups, and select or create new variables.
We encourage you to work through the code in this chapter in
your own RStudio session.
:::
<!-- ## Data in R {#data-in-r} -->
R is a powerful, open source software tool for working with data.
We use the R programming environment within the RStudio integrated
development environment (IDE), which offers a suite of features
beyond the R programming console.
Throughout this text, we provide some guidance on how to use R within the
context of the statistical content that is being covered in each of the
"Applications" chapters at the end of each part.
As educators, we see the value of teaching with modern software to
empower students to take optimal advantage of the concepts they are learning.
Generally, we will present the R techniques in the "Applications" chapter at the
end of each part. There are times in the text when the concepts are not
distinguishable from the software, and in those cases, we have provided the
R code within the main body of the chapter.
We start with an introduction to R, focused on how data sets are structured in
R and how the user can work with a data object in R.
## Dataframes in R
Throughout the text, we will work with many different data sets. Some data sets
are pre-loaded into R, some get loaded through R packages, and some data sets
will be created by the student. Data sets can be viewed through the RStudio
environment, and can also be investigated through various R **functions**.
::: {.onebox}
Similar to the notation for a mathematical function, an R **function** takes the form:
> `function_name(arguments to the function)`
The `function_name` is the name of the function, such as `mean`, `read.csv`, or `lm`.
You can access the help file for any named function by preceding it with a question mark: `?read.csv`.
The arguments to the function are the inputs to the function. These can be data sets, parameter values, or other options.
:::
```{r include=FALSE}
terms_chp_3 <- c("function")
```
In R, all functions take arguments in round parentheses (as opposed to
subsetting observations or variables from data objects, which happens
with square brackets).
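To make the distinction concrete, here is a minimal sketch using a made-up vector of scores (the object names `scores`, `avg_named`, `avg_positional`, and `third_score` are ours, chosen only for illustration).

```r
# A small numeric vector to work with (values are made up for illustration)
scores <- c(52, 47, 68, 57, 44)

# Round parentheses call a function; `x` is the name of mean()'s first argument
avg_named <- mean(x = scores)
avg_positional <- mean(scores)   # same call, argument matched by position

# Square brackets subset an object: here, the third element of the vector
third_score <- scores[3]

avg_named     # 53.6
third_score   # 68
```

Naming the argument (`x = scores`) is optional here, but it can make longer function calls easier to read.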
::: {.workedexample}
In the R console, type `?read.csv`. This will bring up the help file for the `read.csv()` function. What does this function do? What are the first two arguments to this function?
--------------------------------------
According to the "Description" section of the help file, the `read.csv` function
reads a data file and creates a data frame object in R, with cases corresponding
to lines and variables to fields in the file. The `read.csv` function in particular
is designed to read in `.csv` (comma separated value) data files.
The first argument is `file` --- the name of the file or website from which the data are to be read. The second argument is `header` --- a **logical** argument (`TRUE` or `FALSE`) for whether the file contains a header row of variable names or not.
The default for the function is `header = TRUE`, so if this argument isn't used,
R will assume the .csv file has a header row.
:::
```{r include=FALSE}
functions_chp_3 <- c("`read.csv()`")
```
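The effect of the `header` argument can be seen with a small round trip. The sketch below is only an illustration: it writes a made-up data set to a temporary file with `write.csv()` and reads it back twice. The objects `csv_path`, `toy`, `with_header`, and `without_header` are hypothetical names.

```r
# Write a tiny made-up data set to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(id = 1:3, score = c(52, 47, 68))
write.csv(toy, file = csv_path, row.names = FALSE)

# header = TRUE (the default): the first line supplies the variable names
with_header <- read.csv(file = csv_path, header = TRUE)

# header = FALSE: the first line is treated as data, so R invents names
without_header <- read.csv(file = csv_path, header = FALSE)

names(with_header)      # "id" "score"
nrow(with_header)       # 3
names(without_header)   # "V1" "V2"
nrow(without_header)    # 4
```

Reading a file that has a header row with `header = FALSE` adds a spurious first row, which is one reason to check the result with `head()` after importing.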
Let's start by exploring a data set containing information on 50 emails
sent to David Diez's Gmail account (one of the authors of the _OpenIntro_
textbooks). The data set, \data{email50}, is built into the `openintro`
package in R and is a subset of emails from the larger `email`
data set that we'll see in Chapter \@ref(explore-categorical).
Data sets that are built into R packages can be loaded using the `data()` function.
::: {.data}
The `email50` data can be found in the [openintro](http://openintrostat.github.io/openintro/reference/index.html) package: `email50`^[The `email` data set found in the `openintro` package defines the variable `spam` as 0 (not spam) or 1 (spam), and `format` as 0 (not HTML) or 1 (HTML). When variables are defined in this way---coded as the numbers 0 and 1 rather than the category names---they are called **indicator variables** or **dummy variables**. In this section of the textbook, we have re-coded `spam` to be the variable `type`, which takes on values "not spam" or "spam", and we have re-coded the variable `format` to take on values "not HTML" or "HTML".].
Load these data into your RStudio session using the following commands:
`library(openintro)`
`data(email50)`
:::
We can use the `glimpse()` function to see the variables included in the data set
and their data types. Or, we could use the `head()` function to see the first
few rows of the data set.
```{r echo=TRUE}
data(email50)
glimpse(email50)
```
```{r echo = TRUE}
head(email50)
```
```{r include=FALSE}
functions_chp_3 <- c(functions_chp_3, "`data()`", "`glimpse()`", "`head()`")
```
Sometimes it is necessary to extract a column or a row from a data set.
In R, the `$` operator can be used to extract a column from a data set.
For example, `data$variable` would extract the `variable` column from the `data` dataframe.
When extracted, these columns can be thought of as vectors. To pull a specific entry out of such a vector, use square brackets (`[ ]`) containing the index (number) of the entry you wish to extract.
For example, `data$variable[2]` would extract the second entry (row) of the `variable` column.
Because a dataframe can be (roughly) thought of as a set of many different vectors, you can extract rows and columns from a dataframe using matrix notation (e.g. `[row, column]`).
For example `data[i,j]` will extract the entry in row $i$ and column $j$;
`data[i, ]` will extract the $i^{th}$ row, and `data[ , j]` will extract the $j^{th}$ column.
Notice that when extracting an entire row (or column), you do not need to specify the columns (or rows) you would like, which is why that position inside the brackets is left blank.
```{r echo=TRUE}
email50$num_char # The num_char variable column
email50[47,3] # The entry in the 47th row and 3rd column
email50[47,] # The 47th row
```
```{r include=FALSE}
functions_chp_3 <- c(functions_chp_3, "`$`", "`[ , ]`")
```
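The same notation can be tried on a data set small enough to check by eye. This is a sketch on a made-up base R data frame (`toy` is a hypothetical name). Note that `email50` is stored as a tibble, which prints slightly differently.

```r
# A made-up data frame with three rows and two columns
toy <- data.frame(
  name  = c("a", "b", "c"),
  score = c(52, 47, 68)
)

toy$score      # the score column as a vector: 52 47 68
toy$score[2]   # second entry of that column: 47
toy[2, 2]      # row 2, column 2: also 47
toy[2, ]       # the entire second row (still a data frame)
toy[ , 2]      # the entire second column (a vector for a base data frame)
```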
## Tidy structure of data {#datastruc}
For plotting, analyses, model building, etc., the data should be structured according to certain principles.
Hadley Wickham provides a thorough discussion and advice for cleaning up the data in @wickham2014.
::: {.onebox}
**Tidy data.**
A data set in which the rows are the observational units and the columns are the variables.
The key is that *every* row is a case and *every* column is a variable.
No exceptions.
:::
Creating tidy data is often not trivial! However, data sets provided in this course will always be provided in the tidy data form.
```{r include=FALSE}
terms_chp_3 <- c(terms_chp_3, "tidy data")
```
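As a small illustration of what tidying can involve, suppose the observational unit is a single test score, but the scores arrive with one column per test. The sketch below assumes the `tidyr` package (part of the `tidyverse`) is installed. The data and the names `untidy` and `tidy` are made up for this example.

```r
library(tidyr)

# Hypothetical untidy table: one row per student, one column per test
untidy <- data.frame(
  student = c("A", "B"),
  test1   = c(52, 47),
  test2   = c(68, 57)
)

# Reshape so that each row is one student-test observation
tidy <- pivot_longer(
  untidy,
  cols      = c(test1, test2),
  names_to  = "test",
  values_to = "score"
)

tidy   # 4 rows; columns: student, test, score
```

After reshaping, every row is a case (one score) and every column is a variable.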
## Using the pipe to chain
Within R (and really within any computing language: Python, SQL, Java, etc.), it is important to understand how to build and store results using the patterns of the language.
In R, the symbol `<-` is called the **assignment** operator; it is used to store the output of some function or set of operations in what we call an **object**. Some things to consider:
* `object_name <- anything` is a way of *assigning* `anything` to the new `object_name`.
* `object_name <- function_name(data_table, arguments)` is a way of using a function to create a new object.
* `object_name <- data_table %>% function_name(arguments)` uses *chaining syntax* called the **pipe operator** (`%>%`) as an extension of the ideas of functions.
```{r include=FALSE}
terms_chp_3 <- c(terms_chp_3, "assignment", "object", "pipe operator")
```
```{r include=FALSE}
functions_chp_3 <- c(functions_chp_3, "%>%", "<-")
```
The pipe (`%>%`) takes a data frame (or data table) on its left and sends it to a function on its right. By default, it fills the function's first argument; a dot (`.`) can mark a different position.
For example:
* `x %>% f(y)` is the same as `f(x, y)`
* `y %>% f(x, ., z)` is the same as `f(x, y, z)`

In chaining, the value on the left side of `%>%` becomes the *first argument* to the function on the right side unless a dot places it elsewhere. This can be extended to multiple lines of code:
```
object_name <- data_table %>%
  function_name(arguments) %>%
  another_function_name(other_arguments)
```
This is called extended chaining.
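These equivalences can be checked directly in the console. The following is a minimal sketch using base R functions (`round()`, `paste()`, `sqrt()`, `sum()`) in place of the generic `f`. It assumes `dplyr` is installed, since loading it makes `%>%` available. The object names are ours.

```r
library(dplyr)   # loading dplyr makes %>% available

x <- c(1.234, 5.678)

# x %>% f(y) is the same as f(x, y)
piped  <- x %>% round(1)
direct <- round(x, 1)
identical(piped, direct)   # TRUE

# y %>% f(x, ., z) is the same as f(x, y, z): the dot marks the position
placed <- "y" %>% paste("x", ., "z")
placed                     # "x y z"

# Extended chaining: take square roots, then round, then sum
result <- c(4, 9, 16) %>%
  sqrt() %>%
  round(0) %>%
  sum()
result                     # 9
```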
Some properties of the pipe syntax (`%>%`) to keep in mind:
* `%>%` is never at the front of a line; it always connects one idea with the continuation of that idea on the next line.
* The spot to the left of the first `%>%` is always a data table.
* The pipe (`%>%`) should be read as *then*.
Pipes are used commonly with functions in the `dplyr` package
(included in the `tidyverse` package) and they allow us to
sequentially build data wrangling operations.
Pipes are also helpful when creating data visualizations in the `ggplot2` package.
## A data wrangling example
Consider the `hsb2` data set from the `openintro` package --- the "High School and Beyond" survey.
Two hundred observations were randomly sampled from the
"High School and Beyond" survey, a survey conducted on high
school seniors by the National Center for Education Statistics.
Of interest is the proportion of students at each of the
two types of school, `public` and `private`.
First, load the data into your R session:
```{r echo=TRUE}
data(hsb2) # Load the data
```
We use the `table()` function to tabulate how many of each type
of school are in the data set. Notice that the same result is
produced by `table()` with the `$` operator and by the chaining
syntax with `%>%`.
```{r echo=TRUE}
table(hsb2$schtyp)
```
is equivalent to
```{r echo=TRUE}
hsb2 %>%
  select(schtyp) %>%   # Select the schtyp variable
  table()
```
```{r include=FALSE}
functions_chp_3 <- c(functions_chp_3, "`table()`", "`select()`")
```
In the code above, the `select()` function selected the `schtyp`
variable (column) from the `hsb2` data set.
Since the data set is the first argument to the `select()` function
(examine the help file by typing `?select`) and the pipe supplies it,
we only need to include the second argument to `select()`---the variable
we would like to select.
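To see that the pipe really is filling in the first argument, we can compare the two forms on a made-up data frame. This is only a sketch: `schools` is a hypothetical stand-in with five rows, not the `hsb2` data.

```r
library(dplyr)

# Made-up data standing in for a real data set
schools <- data.frame(
  id     = 1:5,
  schtyp = c("public", "public", "private", "public", "private")
)

# Data set given explicitly as the first argument
explicit <- select(schools, schtyp)

# Data set supplied by the pipe, so only the column is named
piped <- schools %>% select(schtyp)

identical(explicit, piped)   # TRUE
table(piped)                 # counts: private 2, public 3
```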
### Filtering {-}
What if we are interested only in `public` schools?
First, we should take note of another piece of R syntax:
the double equal sign: `==`.
This is the logical test for "is equal to".
In other words, we first determine if school type is equal
to public for each of the observations in the data set and
`filter` for those where this is true.
```{r echo=TRUE}
# Filter for public schools
hsb2_public <- hsb2 %>%          # Save final result as hsb2_public
  filter(schtyp == "public")     # Only include observations
                                 # where the variable schtyp
                                 # is equal to "public"
```
We can read this as: "take the `hsb2` data frame and `pipe` it
into the `filter` function. Filter the data for
`cases where school type is equal to public`. Then,
`assign` the resulting data frame to a new object
called `hsb2 underscore public`."
```{r include=FALSE}
functions_chp_3 <- c(functions_chp_3, "`==`", "`filter()`")
```
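It can help to look at the logical test on its own before filtering. The sketch below uses a made-up five-row data frame (`schools` and `public_only` are hypothetical names) so that the result is easy to verify by eye.

```r
library(dplyr)

# Made-up data standing in for a real data set
schools <- data.frame(
  id     = 1:5,
  schtyp = c("public", "public", "private", "public", "private")
)

# == compares each value to "public" and returns one TRUE/FALSE per row
schools$schtyp == "public"   # TRUE TRUE FALSE TRUE FALSE

# filter() keeps only the rows where the test is TRUE
public_only <- schools %>%
  filter(schtyp == "public")

nrow(public_only)   # 3
public_only$id      # 1 2 4
```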
### Mutating {-}
Suppose we are not interested in the actual reading score
of students, but instead whether their reading score is below
average or at or above average.
First, we need to calculate the average reading score with
the `mean()` function.
This will give us the mean value, `r mean(hsb2$read)`.
However, in order to be able to refer back to this value later on,
we might want to store it as an object that we can refer to by name.
```{r echo=TRUE}
# Calculate average reading score and show the value
mean(hsb2$read)
```
So instead of just printing the result, let’s save it as
a new object called `avg_read`.
```{r echo=TRUE}
# Calculate average reading score and store as avg_read
avg_read <- mean(hsb2$read)
```
Next we need to determine whether each student is below or at
or above average. For example, reading scores of 57 and 68 are above
average, but 44 is below.
Obviously, going through each record like this would be tedious
and error prone.
Instead we can create this new variable with the `mutate()`
function from the `dplyr` package.
We start with the data frame, `hsb2`, and pipe it into
`mutate()`, to create a new variable called `read_cat` (`cat` for `cat`egorical).
Note that we are using a new variable name here in order
to not overwrite the existing reading score variable.
The new variable `read_cat` will be a column in the
existing data frame `hsb2`.
```{r echo=TRUE}
hsb2 <- hsb2 %>%
  mutate(read_cat = ifelse(read < avg_read,
                           "below average",
                           "at or above average"))
```
The decision criterion for this new variable is based on a
TRUE/FALSE question: if the reading score of the student
is below the average reading score, label "below average",
otherwise, label "at or above average".
This can be accomplished using the `ifelse()` function in R.
Look at its help file by typing `?ifelse` into the R console window.
* The first argument of the function is the logical test.
* The second argument is what to do if the result of the logical test is TRUE, in other words, if the student’s score is below the average score.
* The last argument is what to do if the result is FALSE.
```{r include=FALSE}
functions_chp_3 <- c(functions_chp_3, "`mutate()`", "`ifelse()`")
```
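The three arguments of `ifelse()` can be tried on a short vector before applying them to a whole column. In this sketch the cutoff of 52 is a hypothetical value chosen for illustration, not the actual mean of the data, and the object names are ours.

```r
read_scores <- c(57, 68, 44, 52)   # a few made-up reading scores
cutoff <- 52                       # hypothetical average, for illustration only

# Argument 1: the logical test; argument 2: label if TRUE; argument 3: label if FALSE
labels <- ifelse(read_scores < cutoff,
                 "below average",
                 "at or above average")

labels
# "at or above average" "at or above average" "below average" "at or above average"
```

A score exactly equal to the cutoff fails the test `read_scores < cutoff`, so it receives the "at or above average" label.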
## Terms {-}
We introduced the following terms in the chapter.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.
However, you should be able to easily spot them as **bolded text**.
```{r}
make_terms_table(terms_chp_3)
```
## R functions {-}
The following R functions were introduced in this chapter.
```{r}
make_terms_table(functions_chp_3)
```