Nov 9, 2021
To interactively work with the code below, open lecture12.ipynb in VSCode. Make sure to select the kernel for R so that you can execute R code. You should have already set this up following the software installation instructions here.
R is the second programming language, after Python, that we will learn in this course. We will use R over the next 5 lectures.
R is particularly well suited for reading, manipulating, and visualizing data in tabular and biological sequence formats. Many statistical tests are also available out of the box in R.
While "base" R is used widely, I almost exclusively use R for its two excellent package collections:
- Tidyverse - suited for tabular data
- Bioconductor - suited for biology-aware analyses
Today we will learn a few basic functions from tidyverse for working with tabular data.
Unlike pandas, which is a single package with a lot of functionality, tidyverse is a collection of packages that each focus on a specific task.
- ggplot2 - for plotting data
- dplyr - for filtering, aggregating, and transforming data
- readr - for reading and writing data
- tidyr - for cleaning and transforming data
- stringr - for manipulating strings
- purrr - for manipulating lists of R objects
- forcats - for manipulating categorical data
You can load all the above packages in one go:
library(tidyverse)
The readr package provides various functions for reading and writing data.
data <- read_tsv("data/example_dataset_1.tsv")
data
The tabular data structure is called a tibble in tidyverse, and is a souped-up version of the data.frame R data structure with additional nice features.
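To make the relationship between a tibble and a data.frame concrete, here is a minimal sketch. The table `toy` and its values are made up for illustration and are not from the lecture dataset.

```r
library(tibble)

# Hypothetical values, made up for illustration only
toy <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(0.5, 1.2, 2.0)
)

# A tibble is still a data.frame underneath, with extra classes on top
is.data.frame(toy)
class(toy)

# Converting back and forth between the two structures
df <- as.data.frame(toy)
tb <- as_tibble(df)
```

Because a tibble inherits from data.frame, functions written for data.frames generally accept tibbles too.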
The <- assignment operator is equivalent to the = assignment operator when assigning a value to a variable, and the two can be used interchangeably for that purpose. However, using the <- operator is more conventional in R code.
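A small sketch of the two assignment operators, using made-up variable names. It also shows one place where they are not interchangeable: inside a function call, = names an argument rather than assigning a variable.

```r
# Both forms assign the value 5 to a name
a <- 5
b = 5
identical(a, b)

# Inside a function call, `=` passes the argument named x to median();
# it does not create a variable in your workspace
m <- median(x = c(1, 2, 3))
m
```

This is one reason to keep <- for assignment and reserve = for function arguments.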
ggplot(data, aes(x = kozak_region, y = mean_ratio)) +
  geom_point()
Anatomy of a ggplot2 plot
- Begins with the ggplot function, with a tibble as the first argument.
- aes specifies the variables to plot.
- geom specifies the type of plot.
- + adds additional layers to the plot.
Key differences with Python
- No need to specify variables within quotes.
- Indentation convention is different.
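The anatomy described above can be sketched step by step. The table `toy` and its values are hypothetical, chosen only so that the example runs without the lecture data file.

```r
library(tibble)
library(ggplot2)

# Hypothetical values, made up for illustration only
toy <- tibble(
  kozak_region = c("A", "B", "C"),
  mean_ratio = c(0.5, 1.2, 2.0)
)

# Step 1: the tibble and the aesthetic mapping (column names are unquoted)
p <- ggplot(toy, aes(x = kozak_region, y = mean_ratio))

# Step 2: each `+` adds one more layer on top of the previous plot
p_points <- p + geom_point()
p_both <- p_points + geom_line(aes(group = 1))

# Typing the object name draws the plot
p_both
```

Storing intermediate plots in variables is optional. It is done here only to show that each + returns a new plot object.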
options(repr.plot.width = 5, repr.plot.height = 3)
Plotting a point graph with color
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence)) +
  geom_point()
Plotting a line graph
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line()
Plotting point and line graphs
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line() +
  geom_point()
options(repr.plot.width = 6, repr.plot.height = 3)
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 group = insert_sequence)) +
  geom_line() +
  geom_point() +
  facet_grid(~ insert_sequence)
(20 min)
See https://ggplot2.tidyverse.org/reference/labs.html
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line() +
  geom_point()
See https://ggplot2.tidyverse.org/reference/ggtheme.html
See https://ggplot2.tidyverse.org/reference/scale_continuous.html
Uses functions from the dplyr package.
data <- read_tsv("data/example_dataset_1.tsv")
data
select(data, strain, mean_ratio, insert_sequence, kozak_region)
data <- read_tsv("data/example_dataset_1.tsv") %>%
  select(strain, mean_ratio, insert_sequence, kozak_region)
Above is the same as the following:
data <- read_tsv("data/example_dataset_1.tsv") %>%
  select(., strain, mean_ratio, insert_sequence, kozak_region)
The %>% operator lets you chain different data analysis tasks together and makes the analysis logic easier to understand.
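To see why chaining helps readability, here is a minimal sketch comparing nested function calls with the same steps written using %>%. The vector `x` is made up for illustration.

```r
library(dplyr)  # attaches the %>% operator

x <- c(1, 4, 9, 16)

# Nested calls have to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps with the pipe read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same result
identical(nested, piped)
```

Each step receives the result of the previous step as its first argument, which is why round(1) needs only the number of digits.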
Side note: You can create keyboard shortcuts for <- and %>% in VSCode as explained here. I use Alt + - for <- and Alt + Shift + m for %>% following RStudio convention.
You can get a view of the transformed data using print() as the last step in a chain of commands.
data <- read_tsv("data/example_dataset_1.tsv") %>%
  select(strain, mean_ratio, insert_sequence, kozak_region) %>%
  print()
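Ending a chain with print() works together with assignment because print() displays the table and then passes it along unchanged. A small sketch with a hypothetical table (the `extra` column is made up) illustrates this:

```r
library(dplyr)

# Hypothetical values, made up for illustration only
toy <- tibble(
  strain = c("s1", "s2"),
  mean_ratio = c(0.5, 1.2),
  extra = c("x", "y")
)

# print() shows the selected columns and still lets the result be assigned
kept <- toy %>%
  select(strain, mean_ratio) %>%
  print()

names(kept)
```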
data <- read_tsv("data/example_dataset_1.tsv")
data %>%
  filter(kozak_region == "A")
data %>%
  filter(kozak_region == "A", insert_sequence == "10×AGA")
data %>%
  filter(kozak_region == "A") %>%
  filter(insert_sequence == "10×AGA")
data %>%
  arrange(mean_ratio)
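The two multi-condition filters above keep the same rows: conditions separated by commas in a single filter call must all be true, just as in two chained filter calls. Here is a minimal sketch on a hypothetical table, which also shows desc() for sorting in descending order (desc() is part of dplyr but is not used elsewhere in this lecture).

```r
library(dplyr)

# Hypothetical values, made up for illustration only
toy <- tibble(
  kozak_region = c("A", "A", "B", "B"),
  insert_sequence = c("10×AGA", "10×AAG", "10×AGA", "10×AAG"),
  mean_ratio = c(0.8, 1.5, 0.4, 2.1)
)

# Two conditions in one call: both must be true
one_call <- toy %>%
  filter(kozak_region == "A", insert_sequence == "10×AGA")

# The same conditions applied one after the other
chained <- toy %>%
  filter(kozak_region == "A") %>%
  filter(insert_sequence == "10×AGA")

# arrange() sorts ascending by default; desc() reverses the order
sorted_desc <- toy %>%
  arrange(desc(mean_ratio))
sorted_desc
```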
data <- read_tsv("data/example_dataset_2.tsv") %>%
  print()
data <- data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  print()
Use mutate to modify existing columns
data %>%
  mutate(mean_ratio = round(mean_ratio, 2))
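As a minimal sketch of both uses of mutate, the example below first creates a new column and then overwrites it within the same call. The fluorescence values are hypothetical.

```r
library(dplyr)

# Hypothetical values, made up for illustration only
toy <- tibble(
  mean_yfp = c(100, 250),
  mean_rfp = c(200, 100)
)

out <- toy %>%
  mutate(
    mean_ratio = mean_yfp / mean_rfp,   # creates a new column
    mean_ratio = round(mean_ratio, 1)   # overwrites the column just created
  )
out
```

Later arguments to mutate can refer to columns created by earlier arguments in the same call.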
Variants: inner_join, left_join, right_join, full_join
See https://dplyr.tidyverse.org/reference/mutate-joins.html
annotations <- read_tsv("data/example_dataset_3.tsv")
annotations
data %>%
  inner_join(annotations, by = "strain")
data %>%
  left_join(annotations, by = "strain")
data %>%
  right_join(annotations, by = "strain")
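To see how the join variants differ, it helps to try them on two small tables that only partly overlap. The tables `measurements` and `strain_labels` below are hypothetical and not from the lecture data.

```r
library(dplyr)

# Hypothetical tables, made up for illustration only
measurements <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(0.5, 1.2, 2.0)
)
strain_labels <- tibble(
  strain = c("s2", "s3", "s4"),
  kozak_region = c("A", "B", "C")
)

inner <- inner_join(measurements, strain_labels, by = "strain")  # strains in both tables
left  <- left_join(measurements, strain_labels, by = "strain")   # all rows of measurements
right <- right_join(measurements, strain_labels, by = "strain")  # all rows of strain_labels
full  <- full_join(measurements, strain_labels, by = "strain")   # strains from either table

# Compare the number of rows each variant keeps
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Rows without a match in the other table are filled with NA, for example kozak_region for strain s1 in the left join.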
But remember to use %>% in dplyr vs + in ggplot2!
data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  ggplot(aes(x = kozak_region, y = mean_ratio,
             color = insert_sequence, group = insert_sequence)) +
  geom_line() +
  geom_point()
All functions are named nicely and begin with str_. I find them easier to use than the equivalent Python regular expression functions.
See https://stringr.tidyverse.org/reference/index.html
data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  mutate(codon = str_extract(insert_sequence, "[A-Z]{3}$"))
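A few more str_ functions can be tried on a plain character vector before using them inside mutate. The vector `inserts` below is made up in the style of the insert_sequence column.

```r
library(stringr)

# Hypothetical strings in the style of the insert_sequence column
inserts <- c("10×AGA", "10×AAG")

# Extract the last three capital letters, as in the mutate step above
codons <- str_extract(inserts, "[A-Z]{3}$")
codons

# Test whether each string contains a pattern
has_aga <- str_detect(inserts, "AGA")
has_aga

# Replace the leading digits with a placeholder
str_replace(inserts, "^[0-9]+", "n")
```

All of these take the string as the first argument and the pattern as the second, which makes them easy to use with %>%.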
(20 min)
Google for log2 R to find the appropriate function.
data <- read_tsv("data/example_dataset_2.tsv")
2. Extract strain number from the strain column into a new column and sort numerically by strain number
Extract the strain number using a stringr function. Google for character to integer R to find the appropriate function to use in mutate. Then sort.
annotations <- read_tsv("data/example_dataset_3.tsv")
annotations
This requires a bit more reading and discussion, but it is a good example of how to learn new tidyverse functions on your own!
Use the fct_reorder function from the forcats package in a mutate step to order kozak_region by the strain number you created above, and then feed the result into ggplot.
data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  ggplot(aes(x = kozak_region, y = mean_ratio,
             color = insert_sequence, group = insert_sequence)) +
  geom_line() +
  geom_point()