dplyr.qmd

---
output: html_document
editor_options: 
  chunk_output_type: console
---
# Manipulating data tables with `dplyr`


```{r echo=FALSE}
source("libs/Common.R")
options(width = 80)
```


```{r echo = FALSE}
pkg_ver(c("dplyr", "lubridate","forcats", "stringr"))
```


The data file  *FAO_grains_NA.csv* will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014. 

Run the following line to load the FAO data file into your current R session.

```{r}
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
```

Before tackling the examples that follow, make sure to load the `dplyr` package. 

```{r, message=FALSE}
library(dplyr)
```

## The dplyr basics

The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The `dplyr` package offers some nifty and simple querying functions. Some of `dplyr`'s key data manipulation functions are summarized in the following table:

`dplyr` function   | Description
-------------------|-------------
`filter()`         | Subset by row values
`arrange()`        | Sort rows by column values
`select()`         | Subset columns
`mutate()`         | Add columns
`summarise()`      | Summarize columns

Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes are discussed later in this chapter). For example:

When used alone, dataframe `dat` is the first argument **inside** the `dplyr` function.

```{r eval=FALSE}
dat2 <- select(dat, Crop)
```

When used in a pipe, dataframe `dat` is **outside** of the `dplyr` function.

```{r eval=FALSE}
dat2 <- dat %>% select(Crop)
```

If you are using R 4.1 or greater, you can make use of the native pipe.

```{r eval=FALSE}
dat2 <- dat |> select(Crop)
```

### `filter`: Subset by rows 

Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with the value `Oats` in the `Crop` column: 

```{r}
dat.query1 <- filter(dat, Crop == "Oats")
```

We can use the `unique()` function to identify all unique values in the `Crop` column.  This should return a single unique value: `Oats`.

```{r}
unique(dat.query1$Crop)
```

Note that R is case sensitive, so make sure that you respect each letter's case (i.e. upper or lower).

We can expand our query by including both `Oats`, `Buckwheat` and limiting the `Country` column to `Canada`.

```{r}
dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada")
```

```{r}
unique(dat.query1$Crop)
unique(dat.query1$Country)
```

The character `|` is the Boolean operator *OR*. So in our example, the query can be read as "... crop equals oats *OR* crop equals buckwheat". Had we used the AND operator, `&` as in `Crop == "Oats" & Crop == "Buckwheat"`, the output would have returned zero rows since a `Crop` value cannot be both `Oats` AND `Buckwheat`.

We can expand this query by limiting our output to the years 2005 to 2010

```{r}
dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada", 
                          Year >= 2005 & Year <= 2010)
```

```{r}
unique(dat.query3$Crop)
unique(dat.query3$Country)
range(dat.query3$Year)
```

Note the use of the AND Boolean operator (`&`) instead of the OR operator (`|`) for the `Year` query. We want the `Year` value to satisfy two criteria simultaneously: *greater than or equal to 2005* **AND** *less than or equal to 2010*. Had we used the `|` operator, R would have returned all years since all year values satisfy at least one of the two critera.


### `arrange`: Sort rows by column value

You can sort a table based on a column's values using the `arrange` function. For example, to sort `dat` by crop name type:

```{r}
dat.sort1 <- arrange(dat, Crop)
head(dat.sort1)
tail(dat.sort1)
```

By default, `arrange` sorts by ascending order. To sort by descending order, wrap the column name with the function `desc()`. For example, to sort the table by `Crop` in ascending order, then by `Year` in descending order, type:

```{r}
dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2)
tail(dat.sort2)
```

### `select`: Subset by column 

You can subset a table by column(s) using the `select` function. To extract the columns `Crop`, `Year` and `Value`, type:

```{r}
dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol, 2)
```

If you want all columns *other* than `Crop`, `Year` and `Value`, add the negative `-` symbol before the column name:

```{r}
dat.subcol <- select(dat, -Crop, -Year, -Value)
head(dat.subcol, 2)
```

### `mutate`: Creating and/or calculating column values

You can add columns (and compute their values) using the `mutate` function. For example, to add a column `Ctr_abbr` and assign it the abbreviated values `CAN` for Canada and `USA` for the United States of America based on the values in column `Country` type:

```{r}
dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended,3)
tail(dat.extended,3)
```

Here, we make use of an embedded function, `ifelse`, which performs a conditional operation: if the `Country` value is `Canada` return `CAN` if not, return `USA`. `ifelse` is covered in section 9.3.

Note that if you wish to *rename* a column, you can use the `rename()` function instead of `mutate`.

You can also use `mutate` to recompute column values. For example, to replace the `Country` column values with `CAN` or `USA` type:

```{r}
dat.overwrite <- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.overwrite, 3)
tail(dat.overwrite, 3)
```

### `mutate` across muliple columns 

You might find yourself wanting to apply a same set of *mutate* operations across multiple variables. For example, given the following sample dataset,

```{r}
met <- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
                  Dir  = c(181, 220, 15,  15,  99 ),
                  Prec = c(99 , 0.5,  0,  99,  99))
met
```

where the value `99` is a placeholder for a missing value for the variables `Wind` and `Prec` but a valid value for `Dir`, we want to replace all missing values with `NA`. We could either create two mutate operations as in,

```{r}
met %>%  mutate(Wind = ifelse(Wind == 99, NA, Wind),
                Prec = ifelse(Prec == 99, NA, Prec))
```

or, we could reduce the separate `mutate` operations into a single operation by adding an `across()` function to its argument.

```{r}
met %>% mutate(across( c(Wind, Prec),
                       ~ ifelse( . == 99, NA, .)))
```

`across` takes two arguments: the columns the mutate operation is to operate on, and the mutate operation to perform on these columns. Here, the tilde `~` can be interpreted as *"to the listed columns, apply the following function ..."*. The dot `.` in the `ifelse` function is a placeholder for each column listed in `across`'s first argument.

### `summarise`: Summarize columns

You can summarize (or "collapse") one or more columns using the `summarise` function. For instance, to get the minimum and maximum years from the `Year` column, type:

```{r}
summarise(dat, yr_min = min(Year), yr_max=max(Year))
```

## Combining data manipulation functions using the pipe `%>%` {#sec-pipe}

In most cases, you will find yourself wanting to combine several of `dplyr`'s data manipulation functions. For example, 

```{r}
dat.yield  <- filter(dat, Information == "Yield (Hg/Ha)", 
                          Crop == "Oats",
                          Year == 2012)
dat.rename <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.final  <- select(dat.rename, Country, Value)

head(dat.final, 3)
```

The downside to this approach is the creation of several intermediate objects (e.g. `dat.yield` and `dat.rename`). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects. 

Another approach to combining `dplyr` operations is to use the **piping operator** ,`%>%`, which daisy chains `dplyr` operations. So our previous workflow could look like:

```{r}
dat.final <- dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Crop == "Oats",
         Year == 2012)  %>% 
  mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
  select(Country, Value)

head(dat.final, 3)
```

The chunk of code can be read as *"... with the `dat` table, `filter` by ..., then `mutate` ...., then `select` ..."* with the result from one operation being passed on to the next using the `%>%` operator. Note that the `filter`, `mutate` and `select` functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table  `dat` appears just once at the beginning of the pipe.

### R has a native pipe too

R has recently (as of version `4.1`) added its own native pipe to its base function. Its infix operator is written as `|>`. In most code chunks covered in these tutorials, you can substitute `%>%` with `|>`. For example, you can write the previous code chunk as:

```{r eval = FALSE}
dat.final <- dat  |> 
  filter(Information == "Yield (Hg/Ha)", 
         Crop == "Oats",
         Year == 2012)   |>  
  mutate(Country = ifelse(Country == "Canada", "CAN", "USA"))  |> 
  select(Country, Value)

```

There are, however, a few subtle differences between the two. A deeper dive in how they differ can be found [here](https://mgimond.github.io/CRUG_Sep_2021/crug_sep_2021.html#the-native-pipe).

In this course, we'll stick with the `%>%` operator given that `|>` is  new and is not yet as widely adopted as `%>%`.

&nbsp;&nbsp;


### Updating a factor's levels in a dataframe

Subsetting a dataframe by row might result in a factor losing all values associated with a level. You learned in section `3.3.1.2` that the `droplevels` function can be used to remove any levels no longer present in the factor.  `droplevels` can also be used in a piping operation, however, the function will drop levels for *all* factors in the dataframe--this might not be the intended goal.

If you want to selectively drop the levels for a specific set of factors in a dataframe, you might want to make use of the `fct_drop` function from the `forcats` package. For example, let's first create a subset of the built-in `mtcars` dataset and create factors from a couple of its variables. We'll use the base methods to tackle this task.

```{r}
mt <- mtcars[ , c("cyl", "gear")]
mt$cyl  <- factor(mtcars$cyl)
mt$gear <- factor(mtcars$gear)
```

Now, let's filter by `cyl`.

```{r}
mt_sub <- mt %>% 
   filter(cyl == 8)
```

Exploring `mt_sub`'s summary, we see that some levels have no matching values in the factors:

```{r}
summary(mt_sub)
```

Applying `droplevels` to the dataframe will remove those levels from *both* factors;

```{r}
mt_sub <- mt %>% 
   filter(cyl == 8) %>% 
   droplevels()

summary(mt_sub)
```

Using `fct_drop` controls which factor(s) you want to clean up the levels for. Note, however, that `fct_drop` must be used inside of a `mutate` function.

```{r}
library(forcats) 

mt_sub <- mt %>% 
   filter(cyl == 8) %>% 
   mutate( cyl = fct_drop(cyl))

summary(mt_sub)
```


## Conditional statements
&nbsp;&nbsp;

### The base `ifelse`

Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of `1` *if* an input value is less than `23` and a value of `0` otherwise, you can make use of the `ifelse` function as follows:

```{r}
x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)
```

`ifelse` takes three arguments: 

 + The condition to evaluate (`x < 23` in the above example);
 + The value to output if the condition is `TRUE` (`1` in our example);
 + The value to output if the condition is `FALSE` (`0` in our example).

The  base `ifelse` function works as expected when the input/output values are  numeric or character, but it does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.

```{r}
x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)
```

The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. `apple` =1, `banana`=2, etc...). Likewise, if you wish to replace an erroneous date you will get:

```{r}
library(lubridate)
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
```

Here, `ifelse` converts the date object to its internal numeric representation as number of days since 1970. 


### `dplyr`'s `if_else`

`ifelse` does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector's `class`. A safer alternative is to use `dplyr`'s `if_else` function.

Reworking the above examples: 

```{r}
if_else( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
```

The date class is preserved. Now let's check the output of a factor.

```{r error=TRUE}
if_else(x == "pear", "apple", x)
```

Note that when working with factors, however, `if_else` will strip the `factor` class of an input factor. But, instead of returning the factor's underlying integer values, it outputs the associated levels as a `character` data type.

The workaround is to convert the `if_else` output to a factor.

```{r error=TRUE}
factor(if_else(x == "pear", "apple", x))
```

If you need to explicitly define the levels, add the `levels = c("pear",...)`.

NOTE: `dplyr` offers the `recode` function that preserves factors however, this function is being superseded according to the documentation as of version `1.1.4`.


### Changing values based on multiple conditions: `case_when`

`ifelse` and `if_else` work great when a single set of conditions is to be satisfied. But, if multiple sets of conditions are to be evaluated, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.

```{r}
unit <- c("F","F","C", "K")
if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))
```

A simpler solution is to use the `case_when` function.

```{r} 
case_when(unit == "C" ~ "Celsius",
          unit == "F" ~ "Fahrenheit",
          unit == "K" ~ "Kelvin")
```

`case_when` can aso be used for more complex operations. For example, given two vectors, `unit` and `temp`, we would like to convert all `temp` values to Fahrenheit by applying a temperature conversion dependent on the `unit` value.

```{r}
temp <- c(45.2, 56.3, 11.0, 285)
```


```{r}
case_when(unit == "F" ~ temp,
          unit == "C" ~ (temp * 9/5) + 32,
          TRUE ~ (temp - 273.15) * 9/5 + 32)
```

The last argument, `TRUE ~ `, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return `NA` values by default). You only need to add a `TRUE ~` condition if you know that all previously listed conditions may not cover all possible outcomes. Here, we know that some observations are associated with `unit == "K"` yet that condition is not explicitly defined in the `case_when` arguments. We could have, of course, added the `unit == "K"` condition to the above code chunk thus alleviating the need for the `TRUE ~` condition.

Note that the order in which these conditions are listed matters since evaluation stops at the first `TRUE` outcome encountered. So, had the last condition been moved to the top of the stack, all `temp` values would be assigned the first conversion option.

```{r}
# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
          unit == "F" ~ temp,
          unit == "C" ~ (temp * 9/5) + 32)
```


Note that `case_when` can also be used inside of a `mutate` function. For example, to replace `Canada` and `United States of America` in variable `Country` with `CAN` and `USA` respectively and to create a new variable called `Type` which will take on the values of `1`, `2` or `3` depending on the values in variable `Source`, type the following: 

```{r}
dat1 <- dat %>% 
  mutate(Country = case_when(Country == "Canada" ~ "CAN",
                             Country == "United States of America" ~ "USA"),
         Type = case_when(Source == "Calculated data" ~ 1,
                          Source == "Official data" ~ 2,
                          TRUE ~ 3)) 
head(dat1)   
```

&nbsp;&nbsp;

## Replacing values with `NA`

So far, we've used the `ifelse` or `if_else` functions to replace certain values with `NA`. `dplyr` offers the `na_if()` function to simplify the syntax. For example, to replace `-999` with `NA`:

```{r}
val <- c(-999, 6, -1, -999)
na_if( val , -999 )
```

Likewise, to replace all empty character values:

```{r}
val <- c("ab", "", "A b", "  ")
na_if( val , "" )
```

`na_if` will also preserve the object's class. For example:

```{r}
x <- as.factor( c("apple", "walnut", "banana", "pear", "apple"))
na_if(x , "walnut")
```

But, note that it does not automatically drop the level being replaced with `NA`.

`na_if` also works with dates, but don't forget to evaluate a date object with a date value. For example, to replace dates of `12/1/1901` with `NA`, we need to make a date object of that value. Here, we'll make use of the `mdy()` function as in `mdy("12/1/1901")`.

```{r}
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
na_if(y, mdy("12/1/1901"))
```

To use `na_if()` in a piping operation, it needs to be embedded in a `mutate()` function. For example, to replace `"Calculated data"` with `NA` in the `dat` dataframe, type:

```{r}
dat1 <- dat %>% 
    mutate(Source = na_if(Source, "Calculated data" ))
```

```{r}
unique(dat1$Source)
```

## Outputting a vector instead of a table using `pull`

Piping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a dataframe:

```{r}
oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value))
oats
class(oats)
```

There may be times when you want to output a vector element and not a dataframe. To output a vector, use the `pull()` function.

```{r}
oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value)) %>% 
  pull()
oats

class(oats)
```

The `pull` function can also be used to explicitly define the column to extract. For example, to extract the `Value` column type:


```{r}
# This outputs a multi-element vector
yield <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  pull(Value)
```

```{r}
head(yield)
class(yield)
```