# Tidyverse
```{r}
#| message: false
library(tidyverse)
library(dslabs)
```
The tidyverse makes data analysis simpler and code easier to read by sacrificing some flexibility. The general idea is to define functions that perform the most common tasks and to require that they take a data frame as their first argument and return a data frame: data frame in, data frame out.
Another way it makes code more readable is by using _non-standard evaluation_. We will define this term when we look at examples.
## Tidy data
This is tidy:
```{r, echo=FALSE}
tidy_data <- gapminder |>
  filter(country %in% c("South Korea", "Germany") & !is.na(fertility)) |>
  select(country, year, fertility)
head(tidy_data, 6)
```
Originally, the data was in the following format:
```{r, echo=FALSE, message=FALSE}
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
select(wide_data, country, `1960`:`1970`) |> as.data.frame()
```
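This format is not tidy: each row holds several observations, one per year, and the year values are stored in the column names.
Part of what we learn in the data wrangling part of the class is how to make data tidy.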
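As a preview, here is a minimal sketch of that reshaping on a made-up table. The object `toy_wide` and its values are hypothetical, and `pivot_longer` from __tidyr__ is only one possible approach, not necessarily the one used later in the class.

```{r}
library(tibble)
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
toy_wide <- tibble(
  country = c("A", "B"),
  `1960` = c(2.4, 6.2),
  `1961` = c(2.5, 6.0)
)

# Move the year columns into two variables: year and fertility
toy_tidy <- pivot_longer(toy_wide, cols = -country,
                         names_to = "year", values_to = "fertility")
toy_tidy
```

Each row of `toy_tidy` now holds a single observation. Note that `year` comes out as a character column because it was built from column names.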
## Adding a column with `mutate`
```{r, message=FALSE}
murders <- mutate(murders, rate = total/population*100000)
```
Notice that here we used `total` and `population` inside the function, which are objects that are **not** defined in our workspace. So why don't we get an error? This is non-standard evaluation: the context, here the data frame passed as the first argument, is used to determine what the variable names mean.
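To make the contrast concrete, here is a small sketch on a made-up data frame. The names `toy_nse`, `n_murders`, and `pop_size` are hypothetical and chosen so they do not clash with anything in the workspace.

```{r}
library(dplyr)

# Hypothetical toy table; n_murders and pop_size exist only as columns
toy_nse <- data.frame(state = c("A", "B"),
                      n_murders = c(10, 30),
                      pop_size = c(1e6, 2e6))

# Is there a workspace object called n_murders?
exists("n_murders")

# Base R: columns have to be reached through the data frame
base_rate <- toy_nse$n_murders / toy_nse$pop_size * 100000

# mutate: bare names are looked up among the columns of toy_nse
toy_nse_rate <- mutate(toy_nse, rate = n_murders / pop_size * 100000)
toy_nse_rate
```

Both approaches compute the same numbers. The `mutate` version simply saves us from repeating the name of the data frame.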
## Subsetting with `filter`
```{r}
filter(murders, rate <= 0.71)
```
## Selecting columns with `select`
```{r}
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
```
## The pipe: `|>` or `%>%`
We use the pipe to chain a series of operations. For example, if we want to select columns and then filter rows, we chain the two steps like this:
$$ \mbox{original data }
\rightarrow \mbox{ select }
\rightarrow \mbox{ filter } $$
The code looks like this:
```{r}
murders |> select(state, region, rate) |> filter(rate <= 0.71)
```
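The object on the left of the pipe is used as the first argument of the function on the right.
The arguments we write inside the call on the right are passed after it, so what would normally be the second argument is written first, the third is written second, and so on: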
```{r}
16 |> sqrt() |> log(base = 2)
```
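To see that the pipe only changes how the code is written, this sketch compares the piped call with the equivalent nested call. The object names `nested` and `piped` are just for illustration.

```{r}
# Nested call: read from the inside out
nested <- log(sqrt(16), base = 2)

# Piped call: read from left to right
piped <- 16 |> sqrt() |> log(base = 2)

identical(nested, piped)
```

In both versions `sqrt(16)` becomes the first argument of `log`, and `base = 2` is passed along unchanged.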
## Summarizing data
Here is how it works:
```{r}
murders |> summarize(avg = mean(rate))
```
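Let's compute the murder rate for the US. Is the value above the answer?
No. The US rate is **not** the average of the state rates, because that average gives every state the same weight regardless of its population. We need the total number of murders divided by the total population: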
```{r}
murders |> summarize(rate = sum(total)/sum(population)*100000)
```
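A made-up example with two states of very different sizes shows how far apart the two calculations can be. The table `toy_states` and its numbers are hypothetical.

```{r}
library(dplyr)

# Hypothetical two-state example with very different population sizes
toy_states <- data.frame(
  state = c("Small", "Large"),
  total = c(10, 100),
  population = c(100000, 10000000)
)

toy_rates_cmp <- toy_states |>
  mutate(rate = total / population * 100000) |>
  summarize(
    avg_of_rates = mean(rate),                           # every state counts equally
    overall_rate = sum(total) / sum(population) * 100000 # every person counts equally
  )
toy_rates_cmp
```

The small state has a rate of 10 and the large one a rate of 1, so the average of rates is 5.5, while the overall rate stays close to 1 because almost everyone lives in the large state.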
### Multiple summaries
We want the median, minimum, and maximum population size:
```{r}
murders |> summarize(median = median(population), min = min(population), max = max(population))
```
Why don't we use `quantile`?
```{r}
murders |> summarize(quantiles = quantile(population, c(0.5, 0, 1)))
```
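`summarize` expects each summary to return a single value per group, but `quantile` returns three values here. Since __dplyr__ 1.1.0 this use is deprecated and produces a warning. For summaries that return several rows we use `reframe`: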
```{r}
murders |> reframe(quantiles = quantile(population, c(0.5, 0, 1)))
```
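One drawback of the output above is that the rows are not labeled. As a sketch on made-up numbers, we can add a column with the probabilities so that each row identifies itself. The objects `toy_pop` and `probs` are hypothetical.

```{r}
library(dplyr)

# Hypothetical population sizes
toy_pop <- data.frame(population = c(1, 2, 3, 4, 100))
probs <- c(0.5, 0, 1)

# One row per requested quantile, labeled by its probability
toy_quantiles <- toy_pop |>
  reframe(prob = probs,
          value = quantile(population, probs, names = FALSE))
toy_quantiles
```

The result has three rows: the median, the minimum, and the maximum, in the order given by `probs`.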
However, if we want one column per summary, as in the `summarize` call above,
we have to define a function that returns a data frame, like this:
```{r}
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}
```
Then we can call `summarize` as above:
```{r}
murders |> summarize(median_min_max(population))
```
### Group then summarize with `group_by` {#sec-group-by}
Let's compute murder rate by region.
Take a close look at this output. How is it different from the original?
```{r}
murders |> group_by(region)
```
Note the `Groups: region [4]` when we print the object. Although not immediately obvious from its appearance, this is now a special data frame called a _grouped data frame_, and __dplyr__ functions, in particular `summarize`, will behave differently when acting on this object.
```{r}
murders |> group_by(region) |> summarize(rate = sum(total) / sum(population) * 100000)
```
The `summarize` function applies the summarization to each group separately.
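To check what "each group separately" means, this sketch on a made-up table compares the grouped summary with the same calculation done by hand for one group. The table `toy_regions` and its numbers are hypothetical.

```{r}
library(dplyr)

# Hypothetical table with two regions
toy_regions <- data.frame(
  region = c("East", "East", "West"),
  total = c(10, 30, 5),
  population = c(1e6, 3e6, 1e6)
)

# One row per region
by_group <- toy_regions |>
  group_by(region) |>
  summarize(rate = sum(total) / sum(population) * 100000)

# The same number for one region, computed by filtering first
east_only <- toy_regions |>
  filter(region == "East") |>
  summarize(rate = sum(total) / sum(population) * 100000)

by_group
east_only
```

The `East` row of `by_group` matches `east_only`: grouping saves us from filtering and summarizing once per region.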
For another example, let's compute the median, minimum, and maximum population in the four regions of the country using the `median_min_max` defined above:
```{r}
murders |> group_by(region) |> summarize(median_min_max(population))
```
## `ungroup`
You can also compute a group summary without collapsing the dataset by using `mutate` instead of `summarize`. Here is an example where we add a column with the population of each region and a column with the number of states in the region, shown for each state. When we do this, we usually want to ungroup before continuing our analysis.
```{r}
murders |> group_by(region) |>
  mutate(region_pop = sum(population), n = n()) |>
  ungroup()
```
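The reason to ungroup is that the grouping persists after `mutate` and silently affects later steps. Here is a small sketch on a made-up table; `toy_grouped` is a hypothetical name.

```{r}
library(dplyr)

# Hypothetical table, grouped and mutated but not ungrouped
toy_grouped <- data.frame(region = c("East", "East", "West"),
                          population = c(1, 3, 5)) |>
  group_by(region) |>
  mutate(region_pop = sum(population))

group_vars(toy_grouped)  # the table is still grouped by region

# Still grouped: one row per region
still_grouped <- toy_grouped |> summarize(n_rows = n())

# After ungroup(): one row for the whole table
after_ungroup <- toy_grouped |> ungroup() |> summarize(n_rows = n())

still_grouped
after_ungroup
```

The same `summarize` call returns two rows before `ungroup` and one row after it.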
### `pull`
Tidyverse functions such as `summarize` always return a data frame, even if the result is just one number.
```{r}
murders |>
  summarize(rate = sum(total)/sum(population)*100000) |>
  class()
```
To get the number itself, use `pull`:
```{r}
murders |>
  summarize(rate = sum(total)/sum(population)*100000) |>
  pull(rate)
```
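The difference matters as soon as we want to reuse the value in ordinary arithmetic. This sketch, on a made-up table called `toy_tbl`, stores both versions and checks their classes.

```{r}
library(dplyr)

# Hypothetical two-row table
toy_tbl <- data.frame(total = c(10, 30), population = c(1e6, 3e6))

# A one-row, one-column data frame
as_df <- toy_tbl |> summarize(rate = sum(total) / sum(population) * 100000)

# A plain numeric vector of length one
as_num <- as_df |> pull(rate)

class(as_df)
class(as_num)
```

For a single column, `as_df$rate` and `as_df[["rate"]]` extract the same vector. `pull` is convenient because it fits at the end of a pipe.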
## Sorting data frames
Here are the states ordered by rate:
```{r}
murders |> arrange(rate) |> head()
```
If we want decreasing order, we can either negate the variable or, for more readability, use `desc`:
```{r}
murders |> arrange(desc(rate)) |> head()
```
We can use two variables as well:
```{r}
murders |> arrange(region, desc(rate)) |> head(11)
```
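A small made-up table makes the ordering rules easy to verify by eye. The table `toy_rates` and its values are hypothetical.

```{r}
library(dplyr)

# Hypothetical table with two regions
toy_rates <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("West", "East", "West", "East"),
  rate = c(2.0, 0.5, 3.5, 1.0)
)

# For a numeric column, negating and desc() give the same order
identical(arrange(toy_rates, -rate), arrange(toy_rates, desc(rate)))

# Regions in alphabetical order; within each region, highest rate first
toy_sorted <- arrange(toy_rates, region, desc(rate))
toy_sorted
```

The first variable decides the order, and the second one only breaks ties within it. Unlike the negative sign, `desc` also works on character columns.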
## Exercises
Let's redo the exercises from the previous chapter, but now with the tidyverse:
(@) Show the subset of `murders` with states that have fewer than 1 murder per 100,000 people. Show all variables.
```{r}
murders <- mutate(murders, rate = total/population*10^5)
filter(murders, rate < 1)
```
(@) Show the subset of `murders` with states that have fewer than 1 murder per 100,000 people and are in the West of the US. Don't show the `region` variable.
```{r}
filter(murders, rate < 1 & region == "West") |> select(-region)
```
(@) Show the largest state with a rate less than 1 per 100,000.
```{r}
murders |> filter(rate < 1) |> slice_max(population)
```
(@) Show the state with a population of more than 10 million with the lowest rate.
```{r}
murders |> filter(population > 10^7) |> slice_min(rate)
```
(@) Compute the rate for each region of the US.
```{r}
murders |> group_by(region) |> summarize(rate = sum(total)/sum(population)*10^5)
```
For the next exercises we will be using the data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960’s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and they complete the health examination component of the survey. Part of the data is made available via the __NHANES__ package. Once you install the __NHANES__ package, you can load the data like this:
```{r}
library(NHANES)
```
(@) Check for consistency between `Race1` and `Race3`. Do any rows have different entries?
```{r}
NHANES |> filter(!is.na(Race1) & !is.na(Race3)) |>
filter(as.character(Race1) != as.character(Race3)) |>
count(Race1, Race3)
```
(@) Define a new `Race` variable that has as few `NA` and `Other` as possible.
```{r}
dat <- NHANES |>
  mutate(Race = Race3) |>
  mutate(Race = if_else(is.na(Race), Race1, Race))
```
(@) Compute the proportion of individuals who smoked at the time of the survey, by race category and gender. Keep track of how many people answered the question. Order the result by the number that answered. Read the help file for NHANES carefully before doing this one. To be clear: we want the proportion of people who have smoked at any time in their life and are smoking now.
```{r}
dat |> group_by(Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)),
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE)) |>
  mutate(smoke = smoke/n) |>
  arrange(desc(n))
```
(@) Create a new dataset that combines the `Mexican` and `Hispanic` categories and removes the `Other` category. Hint: use the function `forcats::fct_collapse()`.
```{r}
dat <- dat |>
  mutate(Race = forcats::fct_collapse(Race, Hispanic = c("Hispanic", "Mexican"))) |>
  filter(Race != "Other") |>
  mutate(Race = droplevels(Race))
```
(@) Recompute the proportion of individuals that smoke now by race category and gender. Order by rate of smokers.
```{r}
dat |> group_by(Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)),
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE)) |>
  mutate(smoke = smoke/n) |>
  arrange(desc(smoke))
```
(@) Compute the median age by race category and gender. Order by gender and then by age.
```{r}
dat |> group_by(Gender, Race) |>
  summarize(Age = median(Age)) |>
  arrange(Gender, Age)
```
(@) Now redo the smoking rate calculation by age group. But first, remove individuals with no age group and remove any age groups for which fewer than 10 people answered the question. Within each age group and gender, order by the percent that smokes.
```{r}
res <- dat |>
  filter(!is.na(AgeDecade)) |>
  group_by(AgeDecade) |>
  mutate(n = sum(!is.na(Smoke100))) |>
  ungroup() |>
  filter(n >= 10) |>
  group_by(AgeDecade, Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)),
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE),
            .groups = "drop") |> ## This is similar to running ungroup() in a next step
  mutate(smoke = smoke/n) |>
  arrange(AgeDecade, Gender, desc(smoke))
## Bonus: a plot
## AgeDecade is a factor, so we set group = Race to connect the points of each race with a line
res |> ggplot(aes(AgeDecade, smoke, color = Race, group = Race)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Gender)
```