Jenny Bryan 16 October, 2018

If I had one thing to tell biologists learning bioinformatics, it would be “write code for humans, write data for computers”.

— Vince Buffalo (@vsbuffalo) July 20, 2013

An important aspect of “writing data for computers” is to make your data tidy. Key features of tidy data:

Each column is a variable
Each row is an observation

If you are struggling to make a figure, for example, stop and think hard about whether your data is tidy. Untidiness is a common, often overlooked cause of agony in data analysis and visualization.

Lord of the Rings example

I will give you a concrete example of some untidy data that I created from data on the words spoken in the Lord of the Rings trilogy.

The Fellowship Of The Ring

Race	Female	Male
Elf	1229	971
Hobbit	14	3644
Man	0	1995

The Two Towers

Race	Female	Male
Elf	331	513
Hobbit	0	2463
Man	401	3589

The Return Of The King

Race	Female	Male
Elf	183	510
Hobbit	2	2673
Man	268	2459

We have one table per movie. In each table, we have the total number of words spoken, by characters of different races and genders.

You could imagine finding these three tables as separate worksheets in an Excel workbook. Or hanging out in some cells on the side of a worksheet that contains the underlying raw data. Or as tables on a webpage or in a Word document.

This data has been formatted for consumption by human eyeballs (paraphrasing Murrell; see Resources). The format makes it easy for a human to look up the number of words spoken by female elves in The Two Towers. But this format actually makes it pretty hard for a computer to pull out such counts and, more importantly, to compute on them or graph them.
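To make that concrete, here is a minimal sketch of how the three tables might land in R if you read them in as they are. The object names `fship`, `ttow`, and `rking` are hypothetical, chosen only for this illustration.

```r
# Each film's table stored as its own data frame, mirroring the layout above.
# The object names are hypothetical.
races <- c("Elf", "Hobbit", "Man")
fship <- data.frame(Race = races, Female = c(1229, 14, 0),  Male = c(971, 3644, 1995))
ttow  <- data.frame(Race = races, Female = c(331, 0, 401),  Male = c(513, 2463, 3589))
rking <- data.frame(Race = races, Female = c(183, 2, 268),  Male = c(510, 2673, 2459))

# Looking up a single count means knowing which object (film), which row (race),
# and which column (gender) it lives in.
female_elves_ttow <- ttow$Female[ttow$Race == "Elf"]
female_elves_ttow
```

Note that the film is stored only in the name of the object and the gender only in the column names, so neither is available as a variable to compute on.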

Exercises

Look at the tables above and answer these questions:

What’s the total number of words spoken by male hobbits?
Does a certain Race dominate a movie? Does the dominant Race differ across the movies?

How well does your approach scale if there were many more movies or if I provided you with updated data that includes all the Races (e.g. dwarves, orcs, etc.)?

Tidy Lord of the Rings data

Here’s how the same data looks in tidy form:

Film	Gender	Race	Words
The Fellowship Of The Ring	Female	Elf	1229
The Fellowship Of The Ring	Male	Elf	971
The Fellowship Of The Ring	Female	Hobbit	14
The Fellowship Of The Ring	Male	Hobbit	3644
The Fellowship Of The Ring	Female	Man	0
The Fellowship Of The Ring	Male	Man	1995
The Two Towers	Female	Elf	331
The Two Towers	Male	Elf	513
The Two Towers	Female	Hobbit	0
The Two Towers	Male	Hobbit	2463
The Two Towers	Female	Man	401
The Two Towers	Male	Man	3589
The Return Of The King	Female	Elf	183
The Return Of The King	Male	Elf	510
The Return Of The King	Female	Hobbit	2
The Return Of The King	Male	Hobbit	2673
The Return Of The King	Female	Man	268
The Return Of The King	Male	Man	2459

Notice that tidy data is generally taller and narrower. It doesn’t fit nicely on the page. Certain elements get repeated a lot, e.g. Hobbit. For these reasons, we often instinctively resist tidy data as inefficient or ugly. But, unless and until you’re making the final product for a textual presentation of data, ignore your yearning to see the data in a compact form.
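If you want to follow along, the tidy table can be typed in directly. This is only a sketch using base R; the name `lotr_tidy` matches the object used in the code later in this lesson, but building it by hand like this is an assumption made for illustration, not necessarily how the original object was created.

```r
# Build the tidy table shown above: one row per Film x Race x Gender combination.
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = factor(rep(films, each = 6), levels = films),  # keep films in release order
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,    # The Fellowship Of The Ring
             331, 513, 0, 2463, 401, 3589,    # The Two Towers
             183, 510, 2, 2673, 268, 2459)    # The Return Of The King
)

dim(lotr_tidy)   # 18 rows and 4 columns: taller and narrower than the three tables
head(lotr_tidy)
```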

Benefits of tidy data

With the data in tidy form, it’s natural to get a computer to do further summarization or to make a figure. This assumes you’re using a language that is “data-aware”, which R certainly is. Let’s answer the questions posed above.

What’s the total number of words spoken by male hobbits?

library(dplyr)  # provides %>% and count(); lotr_tidy holds the tidy table above
lotr_tidy %>% 
  count(Gender, Race, wt = Words)
#> # A tibble: 6 x 3
#>   Gender Race       n
#>   <chr>  <chr>  <int>
#> 1 Female Elf     1743
#> 2 Female Hobbit    16
#> 3 Female Man      669
#> 4 Male   Elf     1994
#> 5 Male   Hobbit  8780
#> 6 Male   Man     8043
## outside the tidyverse:
#aggregate(Words ~ Gender + Race, data = lotr_tidy, FUN = sum)

Now it takes a small bit of code to compute the word total for both genders of all races across all films. The total number of words spoken by male hobbits is 8780. It was important here to have all word counts in a single variable, within a data frame that also included variables for gender and race.
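If you only want the one number, you can also filter down to male hobbits before summing. The sketch below assumes dplyr is installed and rebuilds `lotr_tidy` by hand so that it runs on its own; `male_hobbits` is a name introduced just for this example.

```r
library(dplyr)

# Rebuild the tidy table by hand so this chunk is self-contained.
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = factor(rep(films, each = 6), levels = films),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,
             331, 513, 0, 2463, 401, 3589,
             183, 510, 2, 2673, 268, 2459)
)

# Keep only the rows for male hobbits, then add up their words across films.
male_hobbits <- lotr_tidy %>%
  filter(Gender == "Male", Race == "Hobbit") %>%
  summarize(Words = sum(Words))

male_hobbits$Words
#> [1] 8780
```

Because gender and race are ordinary variables, the filter is a plain logical condition on columns rather than a lookup that depends on table position.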

Does a certain race dominate a movie? Does the dominant race differ across the movies?

First, we sum across gender, to obtain word counts for the different races by movie.

(by_race_film <- lotr_tidy %>% 
   group_by(Film, Race) %>% 
   summarize(Words = sum(Words)))
#> # A tibble: 9 x 3
#> # Groups:   Film [?]
#>   Film                       Race   Words
#>   <fct>                      <chr>  <int>
#> 1 The Fellowship Of The Ring Elf     2200
#> 2 The Fellowship Of The Ring Hobbit  3658
#> 3 The Fellowship Of The Ring Man     1995
#> 4 The Two Towers             Elf      844
#> 5 The Two Towers             Hobbit  2463
#> 6 The Two Towers             Man     3990
#> 7 The Return Of The King     Elf      693
#> 8 The Return Of The King     Hobbit  2675
#> 9 The Return Of The King     Man     2727
## outside the tidyverse:
#(by_race_film <- aggregate(Words ~ Race * Film, data = lotr_tidy, FUN = sum))

We can stare hard at those numbers to answer the question. But even nicer is to depict the word counts we just computed in a barchart.

library(ggplot2)
p <- ggplot(by_race_film, aes(x = Film, y = Words, fill = Race))
p + geom_bar(stat = "identity", position = "dodge") +
  coord_flip() + guides(fill = guide_legend(reverse = TRUE))

Hobbits are featured heavily in The Fellowship of the Ring, whereas Men had a lot more screen time in The Two Towers. Hobbits and Men were about equally prominent in the last movie, The Return of the King.

Again, it was important to have all the data in a single data frame, all word counts in a single variable, and associated variables for Film and Race.
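The same question can also be answered numerically by keeping, for each film, the race with the largest word count. This is one possible sketch, assuming dplyr 1.0.0 or later for `slice_max()` and the `.groups` argument; the name `dominant` is introduced only for this example.

```r
library(dplyr)

# Rebuild the tidy table by hand so this chunk is self-contained.
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = factor(rep(films, each = 6), levels = films),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,
             331, 513, 0, 2463, 401, 3589,
             183, 510, 2, 2673, 268, 2459)
)

dominant <- lotr_tidy %>%
  group_by(Film, Race) %>%
  summarize(Words = sum(Words), .groups = "drop") %>%  # sum across gender
  group_by(Film) %>%
  slice_max(Words, n = 1) %>%                          # top race within each film
  ungroup()

dominant
```

This gives Hobbit for The Fellowship Of The Ring and Man for the other two films, although the margin in The Return Of The King is small (2727 vs. 2675 words).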

Take home message

Having the data in tidy form was a key enabler for our data aggregations and visualization.

Tidy data is integral to efficient data analysis and visualization.

If you’re skeptical about any of the above claims, it would be interesting to get the requested word counts, the barchart, or the insight gained from the chart without tidying or plotting the data. And imagine redoing all of that on the full dataset, which includes 3 more Races, e.g. Dwarves.

Where to next?

In the next lesson, we’ll show how to tidy this data.

Our summing over gender to get word counts for combinations of film and race is an example of data aggregation. It’s a frequent companion task with tidying and reshaping. Learn more at:

Simple aggregation with the tidyverse: dplyr::count() and dplyr::group_by() + dplyr::summarize(), STAT 545 coverage, Data transformation chapter in R for Data Science.
General aggregation with the tidyverse: STAT 545 coverage of general Split-Apply-Combine via nested data frames.
Simple aggregation with base R: aggregate().
General aggregation with base R: tapply(), split(), by(), etc.
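As a small taste of the base R route, here is a sketch of the film-by-race totals computed with `tapply()`. It rebuilds `lotr_tidy` by hand so it runs on its own; `totals` is a name introduced only for this example.

```r
# Rebuild the tidy table by hand so this chunk is self-contained.
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = factor(rep(films, each = 6), levels = films),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,
             331, 513, 0, 2463, 401, 3589,
             183, 510, 2, 2673, 268, 2459)
)

# Sum Words within each Film x Race cell; the result is a matrix
# with films as rows and races as columns.
totals <- with(lotr_tidy, tapply(Words, list(Film = Film, Race = Race), sum))
totals
totals["The Two Towers", "Man"]
#> [1] 3990
```

The result is a matrix rather than a data frame, which is handy for looking at but less convenient as input to further data frame operations or ggplot2.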

The figure was made with ggplot2, a popular package that implements the Grammar of Graphics in R.

Resources

Tidy data chapter in R for Data Science, by Garrett Grolemund and Hadley Wickham
- tidyr R package
- The tidyverse meta-package, within which tidyr lives: tidyverse.
Bad Data Handbook by Q. Ethan McCallum, published by O’Reilly.
- Chapter 3: Data Intended for Human Consumption, Not Machine Consumption by Paul Murrell.
Nine simple ways to make it easier to (re)use your data by EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp. Ideas in Ecology and Evolution 6(2): 1–10, 2013. doi:10.4033/iee.2013.6b.6.f http://library.queensu.ca/ojs/index.php/IEE/article/view/4608
- See the section “Use standard table formats”
Tidy data by Hadley Wickham. Journal of Statistical Software. Vol. 59, Issue 10, Sep 2014. http://www.jstatsoft.org/v59/i10
