-
Notifications
You must be signed in to change notification settings - Fork 41
/
01-intro.Rmd
212 lines (165 loc) · 10.1 KB
/
01-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
---
author: "Jenny Bryan"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: github_document
---
```{r setup, include = FALSE, cache = FALSE}
knitr::opts_chunk$set(error = TRUE, collapse = TRUE, comment = "#>")
```
```{r prepare-tidy-data, include = FALSE}
library(tidyverse)
if (!file.exists(file.path("data", "lotr_clean.tsv"))) {
download.file(paste0("https://raw.githubusercontent.com/jennybc/",
"lotr/master/lotr_clean.tsv"),
destfile = file.path("data", "lotr_clean.tsv"),
method = "curl")
}
lotr_dat <- read_tsv(file.path("data", "lotr_clean.tsv"), col_types = cols(
Film = col_character(),
Chapter = col_character(),
Character = col_character(),
Race = col_character(),
Words = col_integer()
))
females <- c("Galadriel", "Arwen", "Lobelia Sackville-Baggins", "Rosie",
"Mrs. Bracegirdle", "Eowyn", "Freda", "Rohan Maiden")
lotr_dat <-
mutate(lotr_dat,
Gender = ifelse(Character %in% females, "Female", "Male"))
(lotr_tidy <- lotr_dat %>%
dplyr::filter(Race %in% c("Elf", "Hobbit", "Man")) %>%
group_by(Film, Gender, Race) %>%
summarize(Words = sum(Words)))
(all_combns <- lotr_tidy %>%
select(-Words) %>%
map(unique) %>%
lift_dl(crossing)())
lotr_tidy <- left_join(all_combns, lotr_tidy) %>%
replace_na(list(Words = 0)) %>%
mutate(Film = factor(Film, levels = c("The Fellowship Of The Ring",
"The Two Towers",
"The Return Of The King")),
Words = as.integer(Words)) %>%
arrange(Film, Race, Gender)
## let the version from 02-gather.Rmd rule the day
## non-substantive differences in row and/or variable order
#write_csv(lotr_tidy, file.path("data", "lotr_tidy.csv"))
```
```{r make-and-write-untidy-films, echo = FALSE}
untidy_films <- lotr_tidy %>%
split(.$Film) %>%
map(~ spread(.x, Gender, Words))
## leaves files behind for lesson on how to tidy
walk2(untidy_films,
file.path("data", paste0(gsub(" ", "_", names(untidy_films)), ".csv")),
~ write_csv(.x, .y))
## remove film name
untidy_films <- untidy_films %>%
map(~select(.x, -Film))
```
```{r make-and-write-untidy-gender, include = FALSE}
## leaves files behind for exercises re: how to tidy
untidy_gender <- lotr_tidy %>%
split(.$Gender) %>%
map(~ spread(.x, key = Race, value = Words)) %>%
map(~ select(.x, Gender, everything()))
walk2(untidy_gender, file.path("data", paste0(names(untidy_gender), ".csv")),
~ write_csv(.x, .y))
```
<blockquote class="twitter-tweet" lang="en"><p>If I had one thing to tell biologists learning bioinformatics, it would be "write code for humans, write data for computers".</p>— Vince Buffalo (@vsbuffalo) <a href="https://twitter.com/vsbuffalo/statuses/358699162679787521">July 20, 2013</a></blockquote>
An important aspect of "writing data for computers" is to make your data __tidy__. Key features of __tidy__ data:
* Each column is a variable
* Each row is an observation
If you are struggling to make a figure, for example, stop and think hard about whether your data is __tidy__. Untidiness is a common, often overlooked cause of agony in data analysis and visualization.
## Lord of the Rings example
I will give you a concrete example of some untidy data I created from [this data from the Lord of the Rings Trilogy](https://github.com/jennybc/lotr).
```{r load-xtable, echo = FALSE}
library(xtable)
```
<table border = 1>
<tr>
<td>
```{r results = 'asis', echo = FALSE}
print(xtable(untidy_films[["The Fellowship Of The Ring"]],
digits = 0, caption = "The Fellowship Of The Ring"),
caption.placement = "top", include.rownames = FALSE, type = 'html')
```
</td>
<td>
```{r results = 'asis', echo = FALSE}
print(xtable(untidy_films[["The Two Towers"]],
digits = 0, caption = "The Two Towers"),
caption.placement = "top", include.rownames = FALSE, type = 'html')
```
</td>
<td>
```{r results = 'asis', echo = FALSE}
print(xtable(untidy_films[["The Return Of The King"]],
digits = 0, caption = "The Return Of The King"),
caption.placement = "top", include.rownames = FALSE, type = 'html')
```
</td>
</tr>
</table>
We have one table per movie. In each table, we have the total number of words spoken, by characters of different races and genders.
You could imagine finding these three tables as separate worksheets in an Excel workbook. Or hanging out in some cells on the side of a worksheet that contains the underlying data raw data. Or as tables on a webpage or in a Word document.
This data has been formatted for consumption by *human eyeballs* (paraphrasing Murrell; see Resources). The format makes it easy for a *human* to look up the number of words spoken by female elves in The Two Towers. But this format actually makes it pretty hard for a *computer* to pull out such counts and, more importantly, to compute on them or graph them.
## Exercises
Look at the tables above and answer these questions:
* What's the total number of words spoken by male hobbits?
* Does a certain `Race` dominate a movie? Does the dominant `Race` differ across the movies?
How well does your approach scale if there were many more movies or if I provided you with updated data that includes all the `Races` (e.g. dwarves, orcs, etc.)?
## Tidy Lord of the Rings data
Here's how the same data looks in tidy form:
```{r echo = FALSE, results = 'asis'}
print(xtable(lotr_tidy, digits = 0), include.rownames = FALSE, type = 'html')
```
Notice that tidy data is generally taller and narrower. It doesn't fit nicely on the page. Certain elements get repeated alot, e.g. `Hobbit`. For these reasons, we often instinctively resist __tidy__ data as inefficient or ugly. But, unless and until you're making the final product for a textual presentation of data, ignore your yearning to see the data in a compact form.
## Benefits of tidy data
With the data in tidy form, it's natural to *get a computer* to do further summarization or to make a figure. This assumes you're using language that is "data-aware", which R certainly is. Let's answer the questions posed above.
#### What's the total number of words spoken by male hobbits?
```{r}
lotr_tidy %>%
count(Gender, Race, wt = Words)
## outside the tidyverse:
#aggregate(Words ~ Gender, data = lotr_tidy, FUN = sum)
```
Now it takes a small bit of code to compute the word total for both genders of all races across all films. The total number of words spoken by male hobbits is `r lotr_tidy %>% filter(Race == 'Hobbit', Gender == 'Male') %>% summarize(sum(Words))`. It was important here to have all word counts in a single variable, within a data frame that also included a variables for gender and race.
#### Does a certain race dominate a movie? Does the dominant race differ across the movies?
First, we sum across gender, to obtain word counts for the different races by movie.
```{r}
(by_race_film <- lotr_tidy %>%
group_by(Film, Race) %>%
summarize(Words = sum(Words)))
## outside the tidyverse:
#(by_race_film <- aggregate(Words ~ Race * Film, data = lotr_tidy, FUN = sum))
```
We can stare hard at those numbers to answer the question. But even nicer is to depict the word counts we just computed in a barchart.
```{r barchart-lotr-words-by-film-race}
p <- ggplot(by_race_film, aes(x = Film, y = Words, fill = Race))
p + geom_bar(stat = "identity", position = "dodge") +
coord_flip() + guides(fill = guide_legend(reverse = TRUE))
```
Hobbits are featured heavily in The Fellowhip of the Ring, where as Men had a lot more screen time in The Two Towers. They were equally prominent in the last movie, The Return of the King.
Again, it was important to have all the data in a single data frame, all word counts in a single variable, and associated variables for Film and Race.
## Take home message
Having the data in __tidy__ form was a key enabler for our data aggregations and visualization.
Tidy data is integral to efficient data analysis and visualization.
If you're skeptical about any of the above claims, it would be interesting to get the requested word counts, the barchart, or the insight gained from the chart *without* tidying or plotting the data. And imagine redoing all of that on the full dataset, which includes 3 more Races, e.g. Dwarves.
### Where to next?
In [the next lesson](02-gather.md), we'll show how to tidy this data.
Our summing over gender to get word counts for combinations of film and race is an example of __data aggregation__. It's a frequent companion task with tidying and reshaping. Learn more at:
* Simple aggregation with the tidyverse: `dplyr::count()` and `dplyr::group_by()` + `dplyr::summarize()`, [STAT 545 coverage](http://stat545.com/block010_dplyr-end-single-table.html#group_by-is-a-mighty-weapon), [Data transformation](http://r4ds.had.co.nz/transform.html) chapter in R for Data Science.
* General aggregation with the tidyverse: [STAT 545 coverage](http://stat545.com/block024_group-nest-split-map.html) of general Split-Apply-Combine via nested data frames.
* Simple aggregation with base R: `aggregate()`.
* General aggregation with base R: `tapply()`, `split()`, `by()`, etc.
The figure was made with ggplot2, a popular package that implements the Grammar of Graphics in R.
### Resources
* [Tidy data](http://r4ds.had.co.nz/tidy-data.html) chapter in R for Data Science, by Garrett Grolemund and Hadley Wickham
- [tidyr](https://github.com/hadley/tidyr) R package
- The tidyverse meta-package, within which `tidyr` lives: [tidyverse](https://github.com/hadley/tidyverse).
* [Bad Data Handbook](http://shop.oreilly.com/product/0636920024422.do) by By Q. Ethan McCallum, published by O'Reilly.
- Chapter 3: Data Intended for Human Consumption, Not Machine Consumption by Paul Murrell.
* Nine simple ways to make it easier to (re)use your data by EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp. *Ideas in Ecology and Evolution* 6(2): 1–10, 2013. doi:10.4033/iee.2013.6b.6.f <http://library.queensu.ca/ojs/index.php/IEE/article/view/4608>
- See the section "Use standard table formats"
* Tidy data by Hadley Wickham. Journal of Statistical Software. Vol. 59, Issue 10, Sep 2014. <http://www.jstatsoft.org/v59/i10>