---
title: Data munging with **tidyr** and **dplyr**
author: Aaron A. King
output:
  html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: TRUE
      smooth_scroll: TRUE
    highlight: haddock
    number_sections: FALSE
    df_print: paged
    includes:
      after_body:
      - _includes/main_bottom.html
      - _includes/license.html
bibliography: tutorial.bib
csl: jss.csl
---

```{css echo=FALSE,purl=FALSE}
.folder {
  color: #3333ff;
  font-weight: bold;
}
```
```{r knitr-opts,include=FALSE,purl=FALSE,cache=FALSE}
prefix <- "munge"
source("_includes/setup.R",local=knitr::knit_global())
set.seed(5886884L)
```

----------------------------------

## How to use this document.

This is an extremely condensed introduction to the powerful data-munging tools developed by Hadley Wickham and contained in the packages **tidyr** and **dplyr**.
Run the code shown and study the output to learn about these tools.
For your convenience, the **R** code for this document is `r xfun::embed_file("munging.R",text="provided in a script")`, which you can download, edit, and run.


## Reshaping data with **tidyr**

One can easily move between wide- and long-format data using `pivot_longer` and `pivot_wider`.

### pivot_longer

`pivot_longer()` takes a wide data frame and makes it long.
Multiple columns are combined into one *value* column, with a *name* column keeping track of which column each value came from.
One specifies the columns to be pivoted in the second argument (`cols`);
one can include columns explicitly, or exclude them, using very simple syntax.

```{r results="markup"}
library(tidyr)

data.frame(
  a=letters[1:10],
  b=1:10,
  c=sample(LETTERS[1:3],10,replace=TRUE),
  d=sample(1:10,10,replace=TRUE)
) -> x
x

pivot_longer(x,c(b,d))
pivot_longer(x,-c(a,c)) -> y; y
```
```{r eval=FALSE}
pivot_longer(x,-a)
```
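
The new columns need not keep their default names, `name` and `value`.
As a minimal sketch, the following uses a small made-up data frame (`wide_toy` and its columns are hypothetical, not part of the course data) to show how `names_to` and `values_to` name the two new columns.

```{r}
library(tidyr)

## a hypothetical wide data frame: one row per site, one column per year
wide_toy <- data.frame(
  site=c("A","B"),
  y2001=c(1.5,2.5),
  y2002=c(3.0,4.0)
)

## pivot every column except 'site' and name the new columns explicitly
pivot_longer(wide_toy,-site,names_to="year",values_to="count") -> long_toy
long_toy
```

Each row of `wide_toy` contributes one row to `long_toy` for every pivoted column.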


### pivot_wider

`pivot_wider()` turns a long data frame into a wide one.
A single column (called the *value* column) is spread across multiple columns according to the entries of the *name* column.

```{r results="markup"}
pivot_wider(y)
```

```{r}
course.url <- "https://kinglab.eeb.lsa.umich.edu/480/data/"
read.csv(file.path(course.url,"energy_production.csv"),comment.char="#") -> energy

head(energy)

head(pivot_wider(energy,names_from=source,values_from=TJ))
```

One can do a lot just with the ability to pivot.
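
When some combinations are absent from the long data, the corresponding cells of the wide data frame are missing.
The following sketch, on a made-up data frame (`long_obs` is hypothetical), shows the default behavior and one way to fill such gaps using `values_fill`.

```{r}
library(tidyr)

## a hypothetical long data frame with the default column names
long_obs <- data.frame(
  id=c(1,1,2,2),
  name=c("height","weight","height","weight"),
  value=c(150,50,180,80)
)

## names_from and values_from default to 'name' and 'value'
pivot_wider(long_obs) -> wide_obs
wide_obs

## drop one row: the missing cell is filled with the supplied value
pivot_wider(long_obs[-4,],values_fill=0)
```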

### unite and separate

`unite()` and `separate()` allow one to combine two or more columns into one, or to split one column into two or more.

```{r results="markup"}
unite(x,ab,a,b) -> z; z
separate(z,ab,into=c("a","b"))
```

```{r}
unite(energy,src_reg,source,region,sep="/") -> nrg
head(nrg)
```
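
Note that `unite()` produces a character column, so a round trip through `unite()` and `separate()` does not, by default, restore numeric types.
A minimal sketch, assuming a made-up data frame `dates_toy`, of how `convert=TRUE` recovers them:

```{r}
library(tidyr)

## hypothetical data with numeric year and month columns
dates_toy <- data.frame(year=c(2001,2002),month=c(3,11),val=c(0.1,0.2))

unite(dates_toy,ym,year,month,sep="-") -> joined_toy
joined_toy

## convert=TRUE turns the separated pieces back into numbers
separate(joined_toy,ym,into=c("year","month"),sep="-",convert=TRUE)
```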

## Manipulating data with **dplyr**

**dplyr** implements a powerful, flexible, and intuitive grammar for data manipulation.
Once you get the hang of it, it's also very easy to read.
The "nouns" of this grammar are data frames.
The "verbs", which we describe in the following, are functions that perform operations on data frames.

### **dplyr** verbs

The following are the basic functions for manipulating data using **dplyr**.

- `arrange()` to sort a data frame,
- `filter()` to subset data based on its values,
- `select()` to select variables based on their names,
- `summarise()` and `reframe()` to replace a data frame with another, using computations on the values of the first data frame,
- `mutate()` and `transmute()` to modify a data frame by adding new variables, or modifying old ones,
- `rename()` to change the names of variables.

#### `arrange`

`arrange()` sorts a data frame according to specifications.

```{r results="markup"}
library(dplyr)

arrange(x,a)
arrange(x,c)
arrange(x,c,b,a)
arrange(x,c,-b)
```

```{r}
arrange(energy,year,region,source)
arrange(energy,-TJ,region)
```
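
The unary minus used above reverses the order of numeric variables only.
For other types, one can use `desc()`, as in this sketch on a made-up data frame (`pets` is hypothetical).

```{r}
library(dplyr)

## hypothetical data with a character column to sort on
pets <- data.frame(
  name=c("Rex","Tom","Ada","Bo"),
  kind=c("dog","cat","cat","dog"),
  age=c(3,7,2,7)
)

## unary minus fails for character columns, but desc() works for any type
arrange(pets,desc(kind),age) -> sorted_pets
sorted_pets
```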

#### `filter`

`filter()` picks out the rows of a data frame that satisfy some condition.
In other words, it allows you to pick out subsets of the data.

```{r results="markup"}
filter(x,d>4)
filter(x,d>1.2 & c != "B")
```

```{r}
filter(energy,year>2010)
filter(energy,year>2010 & source%in%c("Nuclear","Oil"))
```
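
Two details are worth knowing: conditions given as separate arguments are combined with "and", and rows for which the condition evaluates to `NA` are dropped.
The following sketch illustrates both on a made-up data frame (`obs_toy` is hypothetical).

```{r}
library(dplyr)

## hypothetical data with one missing value
obs_toy <- data.frame(
  id=1:5,
  val=c(10,NA,30,40,50),
  grp=c("a","a","b","b","b")
)

## conditions separated by commas must all hold
filter(obs_toy,val>15,grp=="b") -> kept
kept

## rows with NA conditions are dropped unless one asks for them explicitly
filter(obs_toy,is.na(val) | val>35)
```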

#### `select` and `rename`

`select()` changes which variables (columns) are in the data frame by name or position.
One can also reorder and rename the variables using `select()`.

```{r results="markup"}
select(x,a,b)
select(x,-c)
select(x,z=a,d)
```

`rename()` does not throw away any variables;
it just changes names.

```{r results="markup"}
rename(x,z=a)
```

```{r}
select(energy,src=source,year)
```
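
When there are many variables, it can be convenient to select them by pattern or by type rather than one at a time.
A brief sketch, using the selection helpers `starts_with()`, `where()`, and `everything()` on a made-up data frame (`survey_toy` is hypothetical):

```{r}
library(dplyr)

## hypothetical data frame with several similarly named columns
survey_toy <- data.frame(
  id=1:3,
  wt_before=c(70,80,90),
  wt_after=c(68,79,91),
  site=c("N","S","N")
)

## select by name prefix
select(survey_toy,id,starts_with("wt_")) -> weights_toy
weights_toy

## select by type
select(survey_toy,where(is.numeric))

## move one column to the front, keeping all the others
select(survey_toy,site,everything())
```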

#### `summarize` and `reframe`

Given a data frame, `summarize()` (synonym `summarise()`) produces a new data frame.
Usually, this new data frame summarizes aspects of the original one, in some way.

```{r results="markup"}
summarize(x,mean=mean(b),sd=sd(b),top=c[1])
```

```{r}
summarize(energy,tot=sum(TJ),n=length(TJ))
summarize(energy,min(year),max(year))
summarize(energy,min(year),max(year),interval=diff(range(year)))
```
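
**dplyr** also supplies a few functions meant for use inside `summarize()`.
As a sketch on made-up data (`trial_toy` is hypothetical), `n()` counts rows, `n_distinct()` counts distinct values, and `across()` applies one summary to several columns.

```{r}
library(dplyr)

## hypothetical dose-response data
trial_toy <- data.frame(dose=c(1,1,2,2,3),resp=c(0.2,0.4,0.5,0.7,0.9))

summarize(
  trial_toy,
  n=n(),
  doses=n_distinct(dose),
  mean_resp=mean(resp)
) -> trial_summary
trial_summary

## the same summary function applied to two columns at once
summarize(trial_toy,across(c(dose,resp),max))
```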

`reframe()` is like `summarize()`, but it can return a data frame with more than one row.

```{r}
reframe(x,b=fivenum(b),d=fivenum(d))
```

```{r}
reframe(energy,p=c(0.1,0.5,0.9),q=quantile(TJ,probs=p))
```


#### `mutate` and `transmute`

Given a data frame, `mutate` modifies, adds, or removes variables.
Variables not changed remain as they were.
`transmute` is similar, but keeps only the variables named in the call.


```{r results="markup"}
mutate(x,d=2*b,c=tolower(c),e=b+d) -> z; z
transmute(x,d=2*b,c=tolower(c),e=b+d)
```

```{r}
mutate(energy,hydrocarbon=(source%in%c("Coal","Gas","Oil"))) -> nrg
nrg
```
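
The examples above add and modify variables.
To remove one, assign `NULL` to it, as in this minimal sketch on a made-up data frame (`box_toy` is hypothetical).

```{r}
library(dplyr)

## hypothetical data frame
box_toy <- data.frame(len=c(1,2),wid=c(3,4),note=c("x","y"))

## new variables can refer to ones created earlier in the same call;
## assigning NULL removes a variable
mutate(box_toy,area=len*wid,double=2*area,note=NULL) -> box2
box2
```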

#### `count`

`count()` counts the number of times each combination of values of the specified variables occurs and returns a data frame.

```{r results="markup"}
count(x,c)
count(x,a,c)
```

```{r}
count(energy,source,region)
count(energy,source,TJ)
```
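
Two optional arguments of `count()` are often handy: `sort=TRUE` puts the most frequent combinations first, and `wt` sums a variable instead of counting rows.
A sketch on made-up data (`sales_toy` is hypothetical):

```{r}
library(dplyr)

## hypothetical sales records
sales_toy <- data.frame(
  shop=c("a","b","a","c","a","b"),
  units=c(1,5,2,4,3,1)
)

## number of records per shop, most frequent first
count(sales_toy,shop,sort=TRUE) -> by_shop
by_shop

## total units per shop
count(sales_toy,shop,wt=units) -> units_by_shop
units_by_shop
```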


### Helper functions

**dplyr** provides a large number of useful functions for manipulating individual variables.
These include `lead`, `lag`, `na_if`, `coalesce`, `if_else`, `recode`, and `case_when`.
We'll see examples of these as time goes on.
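
As a first taste, here is a minimal sketch of several of these helpers applied to a made-up vector (`codes_toy` is hypothetical, and the use of -99 as a missing-value code is an assumption made for illustration).

```{r}
library(dplyr)

## hypothetical measurements in which -99 codes for "missing"
codes_toy <- c(5,8,-99,12)

## replace the missing-value code with NA, then NA with a default
na_if(codes_toy,-99) -> codes_na
codes_na
coalesce(codes_na,0)

## shift values backward or forward by one position
lag(codes_toy)
lead(codes_toy)

## vectorized two-way and multi-way conditionals
if_else(codes_toy>6,"big","small")
case_when(
  codes_toy < 0 ~ "missing",
  codes_toy < 10 ~ "low",
  TRUE ~ "high"
) -> labels_toy
labels_toy
```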

In addition, two functions are not in **dplyr** (but really should be, since they're so useful):
`plyr::mapvalues` and `plyr::revalue`.

`plyr::revalue` allows you to change one or more of the levels of a factor without worrying about how the factors are coded.

`plyr::mapvalues` does the same, but works on vectors of any type.

`dplyr::recode` is similar, but with a slightly different syntax.

```{r}
mutate(
  energy,
  region=plyr::revalue(
                 region,
                 c(
                   `Asia and Oceania`="Asia",
                   `Central and South America`="Latin.America"
                 )
               )
) -> z
head(z)

mutate(
  energy,
  source=plyr::mapvalues(
                 source,
                 from=c("Coal","Gas","Oil"),
                 to=c("Carbon","Carbon","Carbon")
               )
) -> z
head(z)

mutate(
  energy,
  source=recode(
    source,
    Coal="Carbon",
    Gas="Carbon",
    Oil="Carbon"
  )
) -> z
head(z)
```

------------------------------------

### Grouping

We very commonly want to split a data set into groups based on some criterion, apply some operation to each group, and then recombine the results.
**dplyr** provides the `group_by` and `ungroup` operations to accomplish this.
For example:

```{r}
group_by(energy,source) -> z
summarize(z,TJ=mean(TJ))

group_by(energy,source,region) -> z
summarize(z,TJ=max(TJ))
```
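
Grouping affects other verbs besides `summarize()`.
In the following sketch, on made-up data (`plots_toy` is hypothetical), `mutate()` computes each row's share of its group total, and the `.groups` argument of `summarize()` controls whether the result remains grouped.

```{r}
library(dplyr)

## hypothetical masses measured in two plots
plots_toy <- data.frame(plot=c("p","p","q","q","q"),mass=c(2,4,1,2,3))
group_by(plots_toy,plot) -> grouped_toy

## mutate() on a grouped data frame computes within groups but keeps every row
mutate(grouped_toy,frac=mass/sum(mass)) -> fracs_toy
ungroup(fracs_toy) -> fracs_toy
fracs_toy

## .groups="drop" asks summarize() to return an ungrouped result
summarize(grouped_toy,total=sum(mass),.groups="drop") -> totals_toy
totals_toy
```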

-------------------------------------

### Join operations

Join operations combine two datasets.
There are various flavors of the join operation.
Using **dplyr**, one can perform a *left join*, a *right join*, an *inner join*, a *full join*, a *semi-join*, or an *anti-join*.
Read the documentation (`?join`) for explanations.

```{r results="markup"}
x <- expand.grid(a=1:3,b=1:5); head(x)
y <- expand.grid(a=1:2,b=1:5,c=factor(c("F","G"))); head(y)
```

```{r}
left_join(x,y,by=c('a','b'))
right_join(x,y,by=c('a','b'))
inner_join(x,y,by=c('a','b'))
full_join(x,y,by=c('a','b'))
full_join(x,y,by='a',relationship="many-to-many")
inner_join(x,y,by='a',relationship="many-to-many")
```

Left join is the operation corresponding to looking up items in a lookup table.
Right join is the same operation with the roles of the two datasets exchanged.
In an inner join, only those cases present in *both* datasets are kept.
In a full join, all the cases present in either dataset are kept.

```{r}
categories <- data.frame(
  source=c("Coal","Oil","Nuclear","Gas","Hydro","Other Renewables"),
  cat=c("dirty","dirty","dirty","dirty","clean","clean"))
left_join(energy,categories,by="source") -> nrg
```
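
The semi-join and anti-join differ from the others in that they only filter the first dataset; they never add columns from the second.
A minimal sketch on two made-up tables (`people_toy` and `depts_toy` are hypothetical):

```{r}
library(dplyr)

## hypothetical tables sharing the key 'dept'
people_toy <- data.frame(
  name=c("Ann","Bob","Cy"),
  dept=c("bio","chem","math")
)
depts_toy <- data.frame(dept=c("bio","chem","phys"),floor=c(1,2,3))

## semi_join keeps rows of the first table that have a match in the second
semi_join(people_toy,depts_toy,by="dept") -> matched
matched

## anti_join keeps rows of the first table that have no match
anti_join(people_toy,depts_toy,by="dept") -> unmatched
unmatched
```

An anti-join is a convenient check before a lookup: any rows it returns are cases the lookup table does not cover.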

-------------------------------------

##### Exercise

Use a combination of the tools we've seen so far to generate a table comparing the regions of the world in terms of the *rate of increase in usage* of "clean" vs "dirty" energy sources over the past few decades.

-------------------------------------


## Pipelines

When calculations get complex, it is often easier and more natural to view them as a chain of operations instead of using nested function calls or defining intermediate variables.
As of version 4.1, **R** has a *pipe operator*, which allows one to chain operations together.

### The `|>` operator

```
f(g(data, a, b, c, ...), d, e, ...)
```

is equivalent to

```
data |> g(a, b, c, ...) |> f(d, e, ...)
```
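
As a concrete instance of this equivalence, the following sketch computes the same quantity in nested and in piped form, using only base **R** functions and a made-up vector (`x_toy` is hypothetical).

```{r}
x_toy <- c(1,4,9,16)

## nested form: read from the inside out
round(mean(sqrt(x_toy)),1) -> nested

## piped form: the left-hand side becomes the first argument of each call
x_toy |> sqrt() |> mean() |> round(1) -> piped

identical(nested,piped)
```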

### Examples

Study the following examples.
What does each one accomplish?

```{r}
energy |> 
  filter(year>=1990) |>
  group_by(source,year) |>
  summarize(TJ=sum(TJ)) |>
  ungroup() |>
  group_by(source) |>
  summarize(TJ=mean(TJ))
```

```{r}
energy |> 
  filter(year>=1990) |>
  group_by(region,source) |>
  summarize(TJ=mean(TJ)) |>
  ungroup() |>
  group_by(source) |>
  reframe(
    region=region,
    fraction=TJ/sum(TJ)
  ) |>
  ungroup() |>
  pivot_wider(names_from=source,values_from=fraction)
```

```{r}
energy |>
  left_join(categories,by="source") |>
  group_by(region,year,cat) |>
  summarize(TJ=sum(TJ)) |>
  ungroup() |>
  group_by(region,cat) |>
  mutate(change=TJ-lag(TJ,1)) |>
  filter(year>=1990) |>
  summarize(increase=mean(change)) |>
  pivot_wider(names_from=cat,values_from=increase) |>
  mutate(
    overall=clean+dirty,
    factor=dirty/clean
  ) |>
  ungroup()
```

-----------------------------------

Produced with **R** version `r getRversion()`.

--------------------------

## References