-
Notifications
You must be signed in to change notification settings - Fork 24
/
dplyr.qmd
executable file
·511 lines (349 loc) · 18.5 KB
/
dplyr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
---
output: html_document
editor_options:
chunk_output_type: console
---
# Manipulating data tables with `dplyr`
```{r echo=FALSE}
source("libs/Common.R")
options(width = 80)
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "lubridate","forcats", "stringr"))
```
The data file *FAO_grains_NA.csv* will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
Run the following line to load the FAO data file into your current R session.
```{r}
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
```
Before tackling the examples that follow, make sure to load the `dplyr` package.
```{r, message=FALSE}
library(dplyr)
```
## The dplyr basics
The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The `dplyr` package offers some nifty and simple querying functions. Some of `dplyr`'s key data manipulation functions are summarized in the following table:
`dplyr` function | Description
-------------------|-------------
`filter()` | Subset by row values
`arrange()` | Sort rows by column values
`select()` | Subset columns
`mutate()` | Add columns
`summarise()` | Summarize columns
Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes are discussed later in this chapter). For example:
When used alone, dataframe `dat` is the first argument **inside** the `dplyr` function.
```{r eval=FALSE}
dat2 <- select(dat, Crop)
```
When used in a pipe, dataframe `dat` is **outside** of the `dplyr` function.
```{r eval=FALSE}
dat2 <- dat %>% select(Crop)
```
If you are using R 4.1 or greater, you can make use of the native pipe.
```{r eval=FALSE}
dat2 <- dat |> select(Crop)
```
### `filter`: Subset by rows
Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with the value `Oats` in the `Crop` column:
```{r}
dat.query1 <- filter(dat, Crop == "Oats")
```
We can use the `unique()` function to identify all unique values in the `Crop` column. This should return a single unique value: `Oats`.
```{r}
unique(dat.query1$Crop)
```
Note that R is case sensitive, so make sure that you respect each letter's case (i.e. upper or lower).
We can expand our query by including both `Oats`, `Buckwheat` and limiting the `Country` column to `Canada`.
```{r}
dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
Country == "Canada")
```
```{r}
unique(dat.query1$Crop)
unique(dat.query1$Country)
```
The character `|` is the Boolean operator *OR*. So in our example, the query can be read as "... crop equals oats *OR* crop equals buckwheat". Had we used the AND operator, `&` as in `Crop == "Oats" & Crop == "Buckwheat"`, the output would have returned zero rows since a `Crop` value cannot be both `Oats` AND `Buckwheat`.
We can expand this query by limiting our output to the years 2005 to 2010
```{r}
dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
Country == "Canada",
Year >= 2005 & Year <= 2010)
```
```{r}
unique(dat.query3$Crop)
unique(dat.query3$Country)
range(dat.query3$Year)
```
Note the use of the AND Boolean operator (`&`) instead of the OR operator (`|`) for the `Year` query. We want the `Year` value to satisfy two criteria simultaneously: *greater than or equal to 2005* **AND** *less than or equal to 2010*. Had we used the `|` operator, R would have returned all years since all year values satisfy at least one of the two critera.
### `arrange`: Sort rows by column value
You can sort a table based on a column's values using the `arrange` function. For example, to sort `dat` by crop name type:
```{r}
dat.sort1 <- arrange(dat, Crop)
head(dat.sort1)
tail(dat.sort1)
```
By default, `arrange` sorts by ascending order. To sort by descending order, wrap the column name with the function `desc()`. For example, to sort the table by `Crop` in ascending order, then by `Year` in descending order, type:
```{r}
dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2)
tail(dat.sort2)
```
### `select`: Subset by column
You can subset a table by column(s) using the `select` function. To extract the columns `Crop`, `Year` and `Value`, type:
```{r}
dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol, 2)
```
If you want all columns *other* than `Crop`, `Year` and `Value`, add the negative `-` symbol before the column name:
```{r}
dat.subcol <- select(dat, -Crop, -Year, -Value)
head(dat.subcol, 2)
```
### `mutate`: Creating and/or calculating column values
You can add columns (and compute their values) using the `mutate` function. For example, to add a column `Ctr_abbr` and assign it the abbreviated values `CAN` for Canada and `USA` for the United States of America based on the values in column `Country` type:
```{r}
dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended,3)
tail(dat.extended,3)
```
Here, we make use of an embedded function, `ifelse`, which performs a conditional operation: if the `Country` value is `Canada` return `CAN` if not, return `USA`. `ifelse` is covered in section 9.3.
Note that if you wish to *rename* a column, you can use the `rename()` function instead of `mutate`.
You can also use `mutate` to recompute column values. For example, to replace the `Country` column values with `CAN` or `USA` type:
```{r}
dat.overwrite <- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.overwrite, 3)
tail(dat.overwrite, 3)
```
### `mutate` across muliple columns
You might find yourself wanting to apply a same set of *mutate* operations across multiple variables. For example, given the following sample dataset,
```{r}
met <- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
Dir = c(181, 220, 15, 15, 99 ),
Prec = c(99 , 0.5, 0, 99, 99))
met
```
where the value `99` is a placeholder for a missing value for the variables `Wind` and `Prec` but a valid value for `Dir`, we want to replace all missing values with `NA`. We could either create two mutate operations as in,
```{r}
met %>% mutate(Wind = ifelse(Wind == 99, NA, Wind),
Prec = ifelse(Prec == 99, NA, Prec))
```
or, we could reduce the separate `mutate` operations into a single operation by adding an `across()` function to its argument.
```{r}
met %>% mutate(across( c(Wind, Prec),
~ ifelse( . == 99, NA, .)))
```
`across` takes two arguments: the columns the mutate operation is to operate on, and the mutate operation to perform on these columns. Here, the tilde `~` can be interpreted as *"to the listed columns, apply the following function ..."*. The dot `.` in the `ifelse` function is a placeholder for each column listed in `across`'s first argument.
### `summarise`: Summarize columns
You can summarize (or "collapse") one or more columns using the `summarise` function. For instance, to get the minimum and maximum years from the `Year` column, type:
```{r}
summarise(dat, yr_min = min(Year), yr_max=max(Year))
```
## Combining data manipulation functions using the pipe `%>%` {#sec-pipe}
In most cases, you will find yourself wanting to combine several of `dplyr`'s data manipulation functions. For example,
```{r}
dat.yield <- filter(dat, Information == "Yield (Hg/Ha)",
Crop == "Oats",
Year == 2012)
dat.rename <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.final <- select(dat.rename, Country, Value)
head(dat.final, 3)
```
The downside to this approach is the creation of several intermediate objects (e.g. `dat.yield` and `dat.rename`). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.
Another approach to combining `dplyr` operations is to use the **piping operator** ,`%>%`, which daisy chains `dplyr` operations. So our previous workflow could look like:
```{r}
dat.final <- dat %>%
filter(Information == "Yield (Hg/Ha)",
Crop == "Oats",
Year == 2012) %>%
mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
select(Country, Value)
head(dat.final, 3)
```
The chunk of code can be read as *"... with the `dat` table, `filter` by ..., then `mutate` ...., then `select` ..."* with the result from one operation being passed on to the next using the `%>%` operator. Note that the `filter`, `mutate` and `select` functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table `dat` appears just once at the beginning of the pipe.
### R has a native pipe too
R has recently (as of version `4.1`) added its own native pipe to its base function. Its infix operator is written as `|>`. In most code chunks covered in these tutorials, you can substitute `%>%` with `|>`. For example, you can write the previous code chunk as:
```{r eval = FALSE}
dat.final <- dat |>
filter(Information == "Yield (Hg/Ha)",
Crop == "Oats",
Year == 2012) |>
mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) |>
select(Country, Value)
```
There are, however, a few subtle differences between the two. A deeper dive in how they differ can be found [here](https://mgimond.github.io/CRUG_Sep_2021/crug_sep_2021.html#the-native-pipe).
In this course, we'll stick with the `%>%` operator given that `|>` is new and is not yet as widely adopted as `%>%`.
### Updating a factor's levels in a dataframe
Subsetting a dataframe by row might result in a factor losing all values associated with a level. You learned in section `3.3.1.2` that the `droplevels` function can be used to remove any levels no longer present in the factor. `droplevels` can also be used in a piping operation, however, the function will drop levels for *all* factors in the dataframe--this might not be the intended goal.
If you want to selectively drop the levels for a specific set of factors in a dataframe, you might want to make use of the `fct_drop` function from the `forcats` package. For example, let's first create a subset of the built-in `mtcars` dataset and create factors from a couple of its variables. We'll use the base methods to tackle this task.
```{r}
mt <- mtcars[ , c("cyl", "gear")]
mt$cyl <- factor(mtcars$cyl)
mt$gear <- factor(mtcars$gear)
```
Now, let's filter by `cyl`.
```{r}
mt_sub <- mt %>%
filter(cyl == 8)
```
Exploring `mt_sub`'s summary, we see that some levels have no matching values in the factors:
```{r}
summary(mt_sub)
```
Applying `droplevels` to the dataframe will remove those levels from *both* factors;
```{r}
mt_sub <- mt %>%
filter(cyl == 8) %>%
droplevels()
summary(mt_sub)
```
Using `fct_drop` controls which factor(s) you want to clean up the levels for. Note, however, that `fct_drop` must be used inside of a `mutate` function.
```{r}
library(forcats)
mt_sub <- mt %>%
filter(cyl == 8) %>%
mutate( cyl = fct_drop(cyl))
summary(mt_sub)
```
## Conditional statements
### The base `ifelse`
Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of `1` *if* an input value is less than `23` and a value of `0` otherwise, you can make use of the `ifelse` function as follows:
```{r}
x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)
```
`ifelse` takes three arguments:
+ The condition to evaluate (`x < 23` in the above example);
+ The value to output if the condition is `TRUE` (`1` in our example);
+ The value to output if the condition is `FALSE` (`0` in our example).
The base `ifelse` function works as expected when the input/output values are numeric or character, but it does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.
```{r}
x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)
```
The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. `apple` =1, `banana`=2, etc...). Likewise, if you wish to replace an erroneous date you will get:
```{r}
library(lubridate)
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
```
Here, `ifelse` converts the date object to its internal numeric representation as number of days since 1970.
### `dplyr`'s `if_else`
`ifelse` does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector's `class`. A safer alternative is to use `dplyr`'s `if_else` function.
Reworking the above examples:
```{r}
if_else( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
```
The date class is preserved. Now let's check the output of a factor.
```{r error=TRUE}
if_else(x == "pear", "apple", x)
```
Note that when working with factors, however, `if_else` will strip the `factor` class of an input factor. But, instead of returning the factor's underlying integer values, it outputs the associated levels as a `character` data type.
The workaround is to convert the `if_else` output to a factor.
```{r error=TRUE}
factor(if_else(x == "pear", "apple", x))
```
If you need to explicitly define the levels, add the `levels = c("pear",...)`.
NOTE: `dplyr` offers the `recode` function that preserves factors however, this function is being superseded according to the documentation as of version `1.1.4`.
### Changing values based on multiple conditions: `case_when`
`ifelse` and `if_else` work great when a single set of conditions is to be satisfied. But, if multiple sets of conditions are to be evaluated, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.
```{r}
unit <- c("F","F","C", "K")
if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))
```
A simpler solution is to use the `case_when` function.
```{r}
case_when(unit == "C" ~ "Celsius",
unit == "F" ~ "Fahrenheit",
unit == "K" ~ "Kelvin")
```
`case_when` can aso be used for more complex operations. For example, given two vectors, `unit` and `temp`, we would like to convert all `temp` values to Fahrenheit by applying a temperature conversion dependent on the `unit` value.
```{r}
temp <- c(45.2, 56.3, 11.0, 285)
```
```{r}
case_when(unit == "F" ~ temp,
unit == "C" ~ (temp * 9/5) + 32,
TRUE ~ (temp - 273.15) * 9/5 + 32)
```
The last argument, `TRUE ~ `, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return `NA` values by default). You only need to add a `TRUE ~` condition if you know that all previously listed conditions may not cover all possible outcomes. Here, we know that some observations are associated with `unit == "K"` yet that condition is not explicitly defined in the `case_when` arguments. We could have, of course, added the `unit == "K"` condition to the above code chunk thus alleviating the need for the `TRUE ~` condition.
Note that the order in which these conditions are listed matters since evaluation stops at the first `TRUE` outcome encountered. So, had the last condition been moved to the top of the stack, all `temp` values would be assigned the first conversion option.
```{r}
# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
unit == "F" ~ temp,
unit == "C" ~ (temp * 9/5) + 32)
```
Note that `case_when` can also be used inside of a `mutate` function. For example, to replace `Canada` and `United States of America` in variable `Country` with `CAN` and `USA` respectively and to create a new variable called `Type` which will take on the values of `1`, `2` or `3` depending on the values in variable `Source`, type the following:
```{r}
dat1 <- dat %>%
mutate(Country = case_when(Country == "Canada" ~ "CAN",
Country == "United States of America" ~ "USA"),
Type = case_when(Source == "Calculated data" ~ 1,
Source == "Official data" ~ 2,
TRUE ~ 3))
head(dat1)
```
## Replacing values with `NA`
So far, we've used the `ifelse` or `if_else` functions to replace certain values with `NA`. `dplyr` offers the `na_if()` function to simplify the syntax. For example, to replace `-999` with `NA`:
```{r}
val <- c(-999, 6, -1, -999)
na_if( val , -999 )
```
Likewise, to replace all empty character values:
```{r}
val <- c("ab", "", "A b", " ")
na_if( val , "" )
```
`na_if` will also preserve the object's class. For example:
```{r}
x <- as.factor( c("apple", "walnut", "banana", "pear", "apple"))
na_if(x , "walnut")
```
But, note that it does not automatically drop the level being replaced with `NA`.
`na_if` also works with dates, but don't forget to evaluate a date object with a date value. For example, to replace dates of `12/1/1901` with `NA`, we need to make a date object of that value. Here, we'll make use of the `mdy()` function as in `mdy("12/1/1901")`.
```{r}
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
na_if(y, mdy("12/1/1901"))
```
To use `na_if()` in a piping operation, it needs to be embedded in a `mutate()` function. For example, to replace `"Calculated data"` with `NA` in the `dat` dataframe, type:
```{r}
dat1 <- dat %>%
mutate(Source = na_if(Source, "Calculated data" ))
```
```{r}
unique(dat1$Source)
```
## Outputting a vector instead of a table using `pull`
Piping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a dataframe:
```{r}
oats <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
summarise(Oats_sum = sum(Value))
oats
class(oats)
```
There may be times when you want to output a vector element and not a dataframe. To output a vector, use the `pull()` function.
```{r}
oats <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
summarise(Oats_sum = sum(Value)) %>%
pull()
oats
class(oats)
```
The `pull` function can also be used to explicitly define the column to extract. For example, to extract the `Value` column type:
```{r}
# This outputs a multi-element vector
yield <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
pull(Value)
```
```{r}
head(yield)
class(yield)
```