Jenny Bryan 16 October, 2018

If I had one thing to tell biologists learning bioinformatics, it would be “write code for humans, write data for computers”.

— Vince Buffalo (@vsbuffalo) July 20, 2013

An important aspect of “writing data for computers” is to make your data tidy. Key features of tidy data:

  • Each column is a variable
  • Each row is an observation

If you are struggling to make a figure, for example, stop and think hard about whether your data is tidy. Untidiness is a common, often overlooked cause of agony in data analysis and visualization.

Lord of the Rings example

I will give you a concrete example of untidy data, which I created from data on the Lord of the Rings trilogy.

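The Fellowship Of The Ring

| Race   | Female | Male |
|--------|-------:|-----:|
| Elf    |   1229 |  971 |
| Hobbit |     14 | 3644 |
| Man    |      0 | 1995 |

The Two Towers

| Race   | Female | Male |
|--------|-------:|-----:|
| Elf    |    331 |  513 |
| Hobbit |      0 | 2463 |
| Man    |    401 | 3589 |

The Return Of The King

| Race   | Female | Male |
|--------|-------:|-----:|
| Elf    |    183 |  510 |
| Hobbit |      2 | 2673 |
| Man    |    268 | 2459 |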

We have one table per movie. In each table, we have the total number of words spoken, by characters of different races and genders.

You could imagine finding these three tables as separate worksheets in an Excel workbook. Or hanging out in some cells on the side of a worksheet that contains the underlying raw data. Or as tables on a webpage or in a Word document.

This data has been formatted for consumption by human eyeballs (paraphrasing Murrell; see Resources). The format makes it easy for a human to look up the number of words spoken by female elves in The Two Towers. But this format actually makes it pretty hard for a computer to pull out such counts and, more importantly, to compute on them or graph them.
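To make the difficulty concrete, here is a minimal sketch of how the three tables might land in R, one data frame per film. The object names `fship`, `ttow`, and `rking` are hypothetical, chosen only for illustration. Notice that the film is recorded only in the object name and the gender only in the column names, so neither is available as a variable to compute on.

```r
# One data frame per film, mirroring the three human-friendly tables.
# The object names are hypothetical; only the numbers come from the tables.
races <- c("Elf", "Hobbit", "Man")
fship <- data.frame(Race = races, Female = c(1229, 14, 0),
                    Male = c(971, 3644, 1995))
ttow  <- data.frame(Race = races, Female = c(331, 0, 401),
                    Male = c(513, 2463, 3589))
rking <- data.frame(Race = races, Female = c(183, 2, 268),
                    Male = c(510, 2673, 2459))

# A single lookup requires knowing which object holds the film,
# which row holds the race, and which column holds the gender.
female_elves_ttow <- ttow$Female[ttow$Race == "Elf"]
female_elves_ttow
#> [1] 331
```

The lookup works, but any question that spans films or genders means writing code that visits several objects and several columns by name.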

Exercises

Look at the tables above and answer these questions:

  • What’s the total number of words spoken by male hobbits?
  • Does a certain Race dominate a movie? Does the dominant Race differ across the movies?

How well would your approach scale if there were many more movies, or if I provided you with updated data that includes all the Races (e.g. dwarves, orcs, etc.)?

Tidy Lord of the Rings data

Here’s how the same data looks in tidy form:

| Film                       | Gender | Race   | Words |
|----------------------------|--------|--------|------:|
| The Fellowship Of The Ring | Female | Elf    |  1229 |
| The Fellowship Of The Ring | Male   | Elf    |   971 |
| The Fellowship Of The Ring | Female | Hobbit |    14 |
| The Fellowship Of The Ring | Male   | Hobbit |  3644 |
| The Fellowship Of The Ring | Female | Man    |     0 |
| The Fellowship Of The Ring | Male   | Man    |  1995 |
| The Two Towers             | Female | Elf    |   331 |
| The Two Towers             | Male   | Elf    |   513 |
| The Two Towers             | Female | Hobbit |     0 |
| The Two Towers             | Male   | Hobbit |  2463 |
| The Two Towers             | Female | Man    |   401 |
| The Two Towers             | Male   | Man    |  3589 |
| The Return Of The King     | Female | Elf    |   183 |
| The Return Of The King     | Male   | Elf    |   510 |
| The Return Of The King     | Female | Hobbit |     2 |
| The Return Of The King     | Male   | Hobbit |  2673 |
| The Return Of The King     | Female | Man    |   268 |
| The Return Of The King     | Male   | Man    |  2459 |

Notice that tidy data is generally taller and narrower. It doesn’t fit nicely on the page. Certain elements get repeated a lot, e.g. Hobbit. For these reasons, we often instinctively resist tidy data as inefficient or ugly. But, unless and until you’re making the final product for a textual presentation of data, ignore your yearning to see the data in a compact form.
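If you want to follow along, the tidy table can be typed in directly. This is only a sketch in base R, not the way the lesson itself produces the data; the name `lotr_tidy` matches the object used in the code later on, and the use of `rep()` to generate the repeated labels is my own shortcut.

```r
# Build the tidy table by hand: one row per Film x Race x Gender combination.
films <- c("The Fellowship Of The Ring", "The Two Towers",
           "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = rep(films, each = 6),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229,  971,  14, 3644,   0, 1995,   # The Fellowship Of The Ring
              331,  513,   0, 2463, 401, 3589,   # The Two Towers
              183,  510,   2, 2673, 268, 2459),  # The Return Of The King
  stringsAsFactors = FALSE
)

dim(lotr_tidy)
#> [1] 18  4
```

The repetition that looks wasteful on the page is exactly what `rep()` generates here: every row carries its own film, gender, and race.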

Benefits of tidy data

With the data in tidy form, it’s natural to get a computer to do further summarization or to make a figure. This assumes you’re using a language that is “data-aware”, which R certainly is. Let’s answer the questions posed above.

What’s the total number of words spoken by male hobbits?

lotr_tidy %>% 
  count(Gender, Race, wt = Words)
#> # A tibble: 6 x 3
#>   Gender Race       n
#>   <chr>  <chr>  <int>
#> 1 Female Elf     1743
#> 2 Female Hobbit    16
#> 3 Female Man      669
#> 4 Male   Elf     1994
#> 5 Male   Hobbit  8780
#> 6 Male   Man     8043
## outside the tidyverse:
#aggregate(Words ~ Gender + Race, data = lotr_tidy, FUN = sum)

It takes only a small bit of code to compute the word total for both genders of all races across all films. The total number of words spoken by male hobbits is 8780. It was important here to have all word counts in a single variable, within a data frame that also included variables for gender and race.
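If only the single number is wanted, the same tidy layout lets you filter rows and sum. The sketch below uses base R and rebuilds `lotr_tidy` by hand so that it runs on its own; the helper name `male_hobbits` is hypothetical.

```r
# Rebuild the tidy data so this example is self-contained.
films <- c("The Fellowship Of The Ring", "The Two Towers",
           "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = rep(films, each = 6),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,
             331, 513, 0, 2463, 401, 3589,
             183, 510, 2, 2673, 268, 2459),
  stringsAsFactors = FALSE
)

# Keep only the rows for male hobbits, then add up their word counts.
male_hobbits <- subset(lotr_tidy, Gender == "Male" & Race == "Hobbit")
sum(male_hobbits$Words)
#> [1] 8780
```

The filter condition reads almost like the question itself, which is possible only because gender and race are stored as ordinary variables.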

Does a certain race dominate a movie? Does the dominant race differ across the movies?

First, we sum across gender, to obtain word counts for the different races by movie.

(by_race_film <- lotr_tidy %>% 
   group_by(Film, Race) %>% 
   summarize(Words = sum(Words)))
#> # A tibble: 9 x 3
#> # Groups:   Film [?]
#>   Film                       Race   Words
#>   <fct>                      <chr>  <int>
#> 1 The Fellowship Of The Ring Elf     2200
#> 2 The Fellowship Of The Ring Hobbit  3658
#> 3 The Fellowship Of The Ring Man     1995
#> 4 The Two Towers             Elf      844
#> 5 The Two Towers             Hobbit  2463
#> 6 The Two Towers             Man     3990
#> 7 The Return Of The King     Elf      693
#> 8 The Return Of The King     Hobbit  2675
#> 9 The Return Of The King     Man     2727
## outside the tidyverse:
#(by_race_film <- aggregate(Words ~ Race * Film, data = lotr_tidy, FUN = sum))

We can stare hard at those numbers to answer the question. But even nicer is to depict the word counts we just computed in a barchart.

p <- ggplot(by_race_film, aes(x = Film, y = Words, fill = Race))
p + geom_bar(stat = "identity", position = "dodge") +
  coord_flip() + guides(fill = guide_legend(reverse = TRUE))

Hobbits are featured heavily in The Fellowship of the Ring, whereas Men spoke far more in The Two Towers. The two races were almost equally prominent in the last movie, The Return of the King.
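The same conclusion can be checked numerically by picking, for each film, the race with the largest word count. This is a sketch in base R under the assumption that "dominant" simply means "most words spoken"; the name `top_race` is hypothetical, and the data are rebuilt by hand so the example stands alone.

```r
# Rebuild the tidy data so this example is self-contained.
films <- c("The Fellowship Of The Ring", "The Two Towers",
           "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = rep(films, each = 6),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,
             331, 513, 0, 2463, 401, 3589,
             183, 510, 2, 2673, 268, 2459),
  stringsAsFactors = FALSE
)

# Sum over gender to get one word count per Race x Film combination.
by_race_film <- aggregate(Words ~ Race + Film, data = lotr_tidy, FUN = sum)

# Split by film and keep the row with the largest word count in each piece.
top_race <- do.call(
  rbind,
  lapply(split(by_race_film, by_race_film$Film),
         function(d) d[which.max(d$Words), ])
)
top_race[, c("Film", "Race", "Words")]
```

A ranking like this hides how close the contest is: in The Return Of The King the leading race is ahead by only 52 words (2727 versus 2675), which the barchart makes obvious at a glance.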

Again, it was important to have all the data in a single data frame, all word counts in a single variable, and associated variables for Film and Race.

Take home message

Having the data in tidy form was a key enabler for our data aggregations and visualization.

Tidy data is integral to efficient data analysis and visualization.

If you’re skeptical about any of the above claims, it would be interesting to get the requested word counts, the barchart, or the insight gained from the chart without tidying or plotting the data. And imagine redoing all of that on the full dataset, which includes 3 more Races, e.g. Dwarves.

Where to next?

In the next lesson, we’ll show how to tidy this data.

Our summing over gender to get word counts for combinations of film and race is an example of data aggregation. It’s a frequent companion to tidying and reshaping. Learn more at:

  • Simple aggregation with the tidyverse: dplyr::count() and dplyr::group_by() + dplyr::summarize(), STAT 545 coverage, Data transformation chapter in R for Data Science.
  • General aggregation with the tidyverse: STAT 545 coverage of general Split-Apply-Combine via nested data frames.
  • Simple aggregation with base R: aggregate().
  • General aggregation with base R: tapply(), split(), by(), etc.
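As a small taste of the base R options listed above, here is a sketch that applies `tapply()` and `split()` to the tidy data. The data frame is rebuilt by hand so the example is self-contained, and the result names `words_by_film_race` and `words_by_gender` are hypothetical.

```r
# Rebuild the tidy data so this example is self-contained.
films <- c("The Fellowship Of The Ring", "The Two Towers",
           "The Return Of The King")
lotr_tidy <- data.frame(
  Film   = rep(films, each = 6),
  Gender = rep(c("Female", "Male"), times = 9),
  Race   = rep(rep(c("Elf", "Hobbit", "Man"), each = 2), times = 3),
  Words  = c(1229, 971, 14, 3644, 0, 1995,
             331, 513, 0, 2463, 401, 3589,
             183, 510, 2, 2673, 268, 2459),
  stringsAsFactors = FALSE
)

# tapply(): sum Words within each Film x Race cell; the result is a matrix
# with films as rows and races as columns.
words_by_film_race <- tapply(lotr_tidy$Words,
                             list(lotr_tidy$Film, lotr_tidy$Race),
                             sum)
words_by_film_race["The Two Towers", "Man"]
#> [1] 3990

# split() + sapply(): break Words into one vector per gender, then sum each.
words_by_gender <- sapply(split(lotr_tidy$Words, lotr_tidy$Gender), sum)
words_by_gender
#> Female   Male 
#>   2428  18817
```

Note that `tapply()` hands back a compact film-by-race matrix, which is a human-friendly shape again; that is fine for a final display, but the tidy data frame remains the better starting point for further computation.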

The figure was made with ggplot2, a popular package that implements the Grammar of Graphics in R.

Resources

  • Tidy data chapter in R for Data Science, by Garrett Grolemund and Hadley Wickham
    • tidyr R package
    • The tidyverse meta-package, within which tidyr lives: tidyverse.
  • Bad Data Handbook by Q. Ethan McCallum, published by O’Reilly.
    • Chapter 3: Data Intended for Human Consumption, Not Machine Consumption by Paul Murrell.
  • Nine simple ways to make it easier to (re)use your data by EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp. Ideas in Ecology and Evolution 6(2): 1–10, 2013. doi:10.4033/iee.2013.6b.6.f http://library.queensu.ca/ojs/index.php/IEE/article/view/4608
    • See the section “Use standard table formats”
  • Tidy data by Hadley Wickham. Journal of Statistical Software. Vol. 59, Issue 10, Sep 2014. http://www.jstatsoft.org/v59/i10