-
Notifications
You must be signed in to change notification settings - Fork 10
Introduction to R: July 28, 2021
Wednesday July 28, 2021 10am-noon PDT/ 1-3pm EDT
Co-Instructors: Abhijna Parigi, Marisa Lim
Moderator: Saranya Canchi
Description:
This free two hour workshop will provide an overview of the basics of R programming language along with RStudio which is a user-friendly environment for working with R. You will be introduced to the syntax, variables, functions, packages along with the various data structures in R. You will also learn basics of data wrangling from reading data from files, subsetting, merging and exporting data out of the R environment.
While we wait to get started --
-
✔️ Have you checked out the pre-workshop resources page?
-
📝 Please fill out our pre-workshop survey if you have not already done so! Click the survey link here
-
💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button:
We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.
You can contact us at training@cfde.atlassian.net
If you have questions at any point,
- Drop them in the chat, or
- Direct messages to the moderator or helpers are welcome, or
- Unmute yourself and ask during the workshop
We're going to use the "raise hand" reaction in zoom to make sure people are on board during the hands-on activities.
What is R?
R is a statistical computing and data visualization programming language.
What is RStudio?
R is often used via integrated development environments (IDE), and RStudio is probably the most popular option. R and RStudio work on Mac, Linux, and Windows operating systems. The RStudio layout displays lots of useful information and allows us to run R code from a script and view and save outputs all from one interface.
- RStudio panels
- The 4 panels:
- View R scripts
- Console to run code
- Environment, history and collections
- Environment: collection of objects, variables, and functions
- View/load/manage R package list
- View file system, help docs, view/save plots
Let's customize the layout to see code better!
- Function: A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
- Variable/object: Variables are containers for storing data values
-
Assignment operator: The
<-
is called an assignment operator and it is an R convention for assigning values to variable names (i.e., in Python it is=
). - Packages: R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network)
- Library: Often used interchangeably with package. A library is usually a single file containing many functions or objects, and a package can contain libraries and other files.
We are using sequencing data from a European Nucleotide Archive dataset (PRJEB5348).
This dataset is for an RNA-Seq experiment comparing two yeast strains, SNF2 (mut) and WT. They sequenced 96 samples in 7 different lanes, with 48 wt and 48 mut strains.
So there are 48 biological replicates and 7 technical replicates, with a total of 672 samples!
Let's start by loading the packages we will be using!
Please note!
The Rstudio binder comes pre-loaded with R and the R packages we're using in today's workshop. On your local computer, you need to install R, Rstudio, and the R packages before using them. Here's the code to install R packages:
# install
install.packages('ggplot2')
install.packages('readr')
install.packages('dplyr')
All of these packages belong to a giant package called tidyverse
. On your local computer, you can install all these useful packages and more by simply intalling tidyverse
.
# install.packages('tidyverse')
We are not doing this today because tidyverse makes the RStudio binder very slow.
Some Useful Info 🏝️
- Nucleic Acid Conc. = RNA concentration in the sample in units of nanogram per microliter (ng/ul)
- 260/280 = ratio measured with a spectrophotometer (i.e., Nanodrop) to assess RNA sample purity. For RNA, the ratio should be ~2.0
- Total RNA = total RNA molecular weight in the sample in units of microgram (ugm)
- General information about RNA quality for RNA-Seq
library(dplyr)
library(readr)
library(ggplot2)
The notes about masked objects mean that two or more R packages have functions with the same name
- For example, both
stats
anddplyr
have a functions calledfilter
- check them out with
?stats::filter
and?dplyr::filter
)
#
are used for notes in scripts or "commented out" lines of code that we don't want to run
Next, we will examine the RNA quality and quantity that the authors measured for all their samples.
We will explore this dataset using some base R and an RStudio package called dplyr
.
The data file we're using today is stored in an online repository (https://osf.io/pzs7g/) and it is a tab-delimited .tsv
file. To read in the dataset in R, we are using the read_tsv()
function from the readr
R package.
read_tsv(file = 'https://osf.io/pzs7g/download/')
Functions like read_tsv
specify required and optional inputs within ( )
.
Let's assign the input data table values to the variable experiment_info
.
experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')
You will see a warning message, what does that means?
Warning: 3 parsing failures.
row col expected actual file
2 -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
3 -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
4 -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
Warning message:
Missing column names filled in: 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12]
# how big is the data set ?
dim(experiment_info)
# examine the first 6 rows across all columns of the data
head(experiment_info)
Tibbles are a type of data table -- a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. data.frame
is another type of data table in R.
- Real world data is messy and needs some formatting before you can use it for analysis.
- Often, you are only interested in a subset of the data and want to avoid loading the entire data set.
- Some programs/visualizations require data to be in a specific format.
dplyr
is a R package that contains sets of tools for dataframe manipulation.
For this workshop we will cover three dplyr
functions:
dplyr verb | Description |
---|---|
select() |
subset columns |
filter() |
subset rows matching some condition |
mutate() |
create new columns using information from other columns |
There are some empty columns in the experiment data file we have been working with. Let's create a subset of data without those empty columns using the select
function.
experiment_info_cleaned <- select(experiment_info,
Sample,
Yeast Strain,
Nucleic Acid Conc.,
Unit,
A260,
A280,
260/280,
Total RNA)
# As before, since R cannot parse spaces in column names, we need to enclose them in backticks to indicate that these words belong together.
experiment_info_cleaned <- select(experiment_info,
Sample,
`Yeast Strain`,
`Nucleic Acid Conc.`,
Unit,
A260,
A280,
260/280,
`Total RNA`)
# As a general rule, it is best to avoid column names that start with a number; we can use backticks for this column name.
experiment_info_cleaned <- select(experiment_info,
Sample,
`Yeast Strain`,
`Nucleic Acid Conc.`,
Unit,
A260,
A280,
`260/280`,
`Total RNA`)
# Check the dimensions of the subsetted dataframe
dim(experiment_info_cleaned)
You can also use select
to specify the columns you don't want. This can be done with the -
notation.
# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)
Select the columns Sample, Yeast Strain, A260 and A280 and assign them to a new object called "experiment_data".
Solution
experiment_data <- select(experiment_info_cleaned, Sample, `Yeast Strain`, A260, A280)
# Check the data
head(experiment_data)
We learned to subset entire columns. Using the filter
function, we can subset based on conditions, using a logical vector with TRUE and FALSE, or with operators.
The filter
function works on rows, so we can use it to subset data based on conditions in a column. Here's how you would use the function:
filter(experiment_info_cleaned, `Nucleic Acid Conc.` > 1500)
filter(experiment_info_cleaned, `A260/A280` >= 2.0 & `Nucleic Acid Conc.` > 1500)
Here's a list of operators you can use:
Conditional Subsetting | Description |
---|---|
> | greater than the value |
>= | greater than or equal to the value |
< | less than the value |
<= | less than or equal to the value |
l | applies logical operator OR
|
& |
applies logical operator AND
|
== |
equal to the value |
%in% |
operator used to identify if the value belongs to a vector |
!= |
not equal to the value |
mutate
can be used to create new columns in our data frame using existing data.
mutate(experiment_info_cleaned,
concentration_ug_uL = `Nucleic Acid Conc.` / 1000)
So far, we have applied the actions individually. What if we wanted to apply select
and filter
at the same time?
We can use the concept of a pipe which is represented by %>%
in R. This is equivalent to |
in shell.
Pipes are available via the magrittr
package, installed automatically with dplyr
. If you use RStudio, the shortcut to produce the pipe is:
- Ctrl + Shift + M on a PC or
- Cmd + Shift + M on a Mac
Here's how you would use it:
experiment_info_wt <- experiment_info_cleaned %>%
filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>%
select(Sample, `Yeast Strain`, A260, A280)
(when using pipes, the order of the command matters!)
Create a new data table called exp_info_wrangled
and only keep the samples that have 260/280 ratio values < 2.15. Add a new column called total_RNA_in_nano_grams
that has total RNA in nano grams. Include the columns sample
, yeast strain
and 260/280
and Total_RNA_in_nano_grams
Solution
# create data object
exp_info_wrangled<- experiment_info_cleaned %>%
filter(`260/280` < 2.15) %>%
mutate(total_RNA_in_nano_grams = `Total RNA`*1000) %>%
select(Sample, `Yeast Strain`, `260/280`, total_RNA_in_nano_grams)
# check the data
head(exp_info_wrangled)
==Instructor switch!==
ggplot
and its related packages are a flexible and widely-used plotting R package. You can make standard plots (i.e., scatter, histogram, box, bar, density plots), as well as a variety of other plot types like maps and phylogenetic trees!
It's a very helpful tool for exploring your data/results and generating publication-ready figures.
We're going to create some plots with the dataframe from the previous section. To make it easier to specify column names, let's rename to remove spaces (you could also specify the column names using backticks as we did above):
# current column names
colnames(experiment_info_cleaned)
# new names specified with a vector
colnames(experiment_info_cleaned) <- c('Sample', 'Yeast_strain', 'Nucleic_Acid_Conc.',
'ng/ul', 'A260', 'A280', 'A260_280',
'Total_RNA', 'ugm')
The ggplot syntax always begins with a function that specifies the data and then layers on additional functions to specify a whole bunch of customizable plot aesthetics. Here, we specify the data object as the experiment_info_cleaned
dataframe and set the axes with the mapping=
flag.
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.))
RStudio has some really helpful features for finding help docs, and for autofilling. Use the tab
key to autofill variable or R package function names for example. Find help docs about functions or R packages with the ?
key:
# help docs for ggplot() function
? ggplot
# help docs for ggplot2 R package
? ggplot2
Let's make a scatter plot by adding geom_point()
with a +
to the ggplot()
function:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
geom_point() +
ggtitle('Scatter plot')
Notice the indentation after the +
sign.
Add transparency to points (also works for other plot types!) with the alpha=
option (0=completely transparent, 1=solid color) and change point size with size=
:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
geom_point(alpha=0.6, size=4) +
ggtitle('Scatter with transparent points')
We can modify the code to fill points by yeast strain and make point size correspond to the 260/280 ratio:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., fill=Yeast_strain,
size=A260_280)) +
geom_point(alpha=0.6, pch=21, col='white') +
ggtitle('Scatter: color by strain \n and size by 260/280 ratio')
There are 2 strains in this dataset, WT
and snf2
.
unique(experiment_info_cleaned$Yeast_strain)
Let's make a boxplot by strain:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Yeast_strain, y=Total_RNA)) +
geom_boxplot() +
ggtitle('Boxplot')
Boxplots can be more informative if we add points to see the distribution of the data:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Yeast_strain, y=Total_RNA)) +
geom_point(alpha=0.6) +
geom_boxplot() +
ggtitle('Boxplot w/ points')
- What happens if you reverse the order of the
geom_point()
andgeom_boxplot()
functions for the boxplot code above? Solutions Above, the points are plotted first, then the boxplot. If we reverse the order, the points will be plotted on top of the boxplot.
- What happens if you use
geom_jitter()
instead ofgeom_point()
? Solutionsgeom_jitter()
'jitters' or spreads the points out so they are not overlapping as much.
- Modifying axis label text:
xlab()
andylab()
- Facets:
facet_wrap()
orfacet_grid()
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., col=Yeast_strain)) +
geom_point() +
facet_wrap(~Yeast_strain) +
xlab('Total RNA (ugm)') +
ylab('Nucleic Acid Conc. (ng/ul)') +
ggtitle('Facet scatter plot')
-
Try making 2 plots of the
A260_280
ratio values: 1) a histogram withgeom_histogram()
and 2) a density plot withgeom_density()
. -
Hint: a histogram plots counts on the y-axis so you only need to specify the x-axis variable in the
ggplot()
function. -
Bonus: how would you make 1 plot with a histogram for each strain? Solutions Here is the basic histogram code - you may need to modify the bin size:
ggplot(data=experiment_info_cleaned, mapping=aes(x=A260_280))+ geom_histogram(bins=20) + ggtitle('Histogram')
Here is the basic density plot code:
ggplot(data=experiment_info_cleaned, mapping=aes(x=A260_280)) + geom_density() + ggtitle('Density plot')
Extras! Below are more ways to customize these plots.
You could add colors and plot by strain type like this:
ggplot(data=experiment_info_cleaned, mapping=aes(x=A260_280, fill=Yeast_strain))+ geom_histogram(bins=20, col='black') + ggtitle('Histogram by strain')
And we could facet the plot by strain like this:
ggplot(data=experiment_info_cleaned, mapping=aes(x=A260_280, fill=Yeast_strain))+ geom_histogram(bins=20, col='black') + facet_grid(Yeast_strain~.) + #this sets whether the facet is horizontal or vertical. here, we will get 2 rows of histograms by yeast strain ggtitle('Facet histogram')
One more edit! Let's change the colors on the histogram. There are many built-in color palettes! Check more out here:
ggplot(data=experiment_info_cleaned, mapping=aes(x=A260_280, fill=Yeast_strain))+ geom_histogram(bins=20, col='black') + facet_grid(Yeast_strain~.) + scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) + # Add blue and yellow colors that are more colorblind friendly for plotting ggtitle('Histogram w/ custom colors')
Now, let's switch this to a density plot and change the plot background
ggplot(data=experiment_info_cleaned, mapping=aes(x=A260_280, fill=Yeast_strain))+ geom_density(alpha=0.6) + facet_grid(Yeast_strain~.) + scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) + theme_bw() + ggtitle('Density plot')
Hopefully this exercise demonstrates how flexible ggplot is!
Use the ggsave()
function to save images (variety of file options available), for example:
myplot <- ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280, fill=Yeast_strain))+
geom_density(alpha=0.6) +
facet_grid(Yeast_strain~.) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
theme_bw() +
ggtitle('Density plot')
ggsave(filename='mydensityplot.jpg', plot=myplot, height=4, width=4, units='in', dpi=600)
:::info
Want to save a figure with multiple plots? Check out cowplot
!
# install cowplot package
install.packages('cowplot')
# load library
library(cowplot)
# assign plots to variables
hist_plt <- ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280))+
geom_histogram(bins=20) +
ggtitle('Histogram')
density_plt <- ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280)) +
geom_density() +
ggtitle('Density plot')
# save plot as a jpeg file. note that the syntax is similar to ggsave() but resolution is specified with res=, instead of dpi=
jpeg('histogram_and_density_plt.jpg', height=5, width=7, units='in', res=500)
# arrange the plot objects in a grid, label plots, and specify plots in 1 row
plot_grid(hist_plt, density_plt, labels=c('A', 'B'), nrow=1)
# shut down plotting device
dev.off()
:::
There are many more options to edit plot aesthetics and there are lots of excellent tutorials and resources, including:
- Rstudio cheatsheet for ggplot
- R Cookbook graphs section
- It's important to consider colorblind-friendly color palettes for plots too!
- The ColorBrewer website is a fantastic tool for selecting nice color combinations
- Viridis has one of the best color contrasts between light and dark colors; it works reasonably well for printing color plots in black and white
- cowplot is a useful package for further customization of plot arrangements and publication-ready figures
- ggplot tutorial from Cedric Scherer
:::warning Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available when the binder session is closed!
For example, select a file, click "More", click "Export": :::
📝 Please fill out our post-workshop survey! We use the feedback to help improve our training materials and future workshops. Click here for survey!
Questions?
Acknowledgements: This lesson was inspired by material from 2019 ANGUS Intro to R lesson
- RStudio cheatsheet for readr
- RStudio cheatsheet for dplyr
- RStudio cheatsheet for data Wrangling with dplyr
- More tutorials:
Links for Moderator to share:
- binder: https://binder.pangeo.io/v2/gh/nih-cfde/training-R-binder/rstudio-binder?urlpath=rstudio
- hackmd workshop notes: https://hackmd.io/La_sJ1ghSQuh6LFNZSe1_Q?view
- pre-survey: https://forms.gle/ceB2mfBykUsc7byQ8
- post-survey: https://forms.gle/p12ndumycJfC97Xf9
clock time | time | section |
---|---|---|
10:00 - 10:15 | 15 mins | intro |
10:15 - 11:00 | 45 mins | data types/wrangling |
11:00 - 11:45 | 45 mins | plotting |
11:45 - 12:00 | 15 mins | wrap up/Q&A |
https://hackmd.io/7rMbbd2oTB6ryx_qQYTlDw?view
There are six data types in R:
Data Type | Description | Examples |
---|---|---|
Numeric | numerical values | 1, 1.5, 20, pi |
Character | letters or words, enclosed in quotes | “random”, “2”, “TRUE”, “@”, “#” |
Integer | for integer numbers; the L indicates to R that it’s an integer) | 2L, 500L, -17L |
Logical | for boolean type data | TRUE, FALSE, T, F |
Complex | numbers with real and imaginary parts | 4+ 3i |
Raw | contains raw bytes encoding |
# what type of data structure is this ?
class(experiment_info)
:::info There are other useful commands to get attribute information about a dataset.
# how many number of rows ?
nrow(experiment_info)
# how many number of columns ?
ncol(experiment_info)
# what are the column names of your dataset ?
colnames(experiment_info)
# what are the rownames of your dataset ?
rownames(experiment_info)
# what does the last few lines of my dataset look like ?
tail(experiment_info)
# what is the structure of my dataset ?
str(experiment_info)
:::
data.frame
is tabular data similar to an excel spreadsheet which can contain mixed data types.
vectors
are uni-dimensional data structures with uniform data types.
Let's create a vector from the table we have.
a260_vector <- experiment_info$A260
# confirm the data structure
class(a260_vector)
# get the data type of the vector
typeof(a260_vector)
We can access individual elements within the vector using it's position. Indices start at 1 and use the []
notation.
# obtain the fifth element in the vector
a260_vector[5]
# subset the fifth and tenth element
a260_vector[c(5,10)]
# subset the fifth through tenth elements
a260_vector[c(5:10)]
Another, two dimensional data structure is the matrix
. The major difference between matrix and dataframe is that matrix can have only one type of data type i.e. either character or numeric but not both!
Let's create a matrix containing the Nucleic Acid Conc., 260/280 ratio and Total RNA values.
# Create a matrix by subsetting the data frame by selecting the appropriate columns by column names
yeast_strain_matrix <- data.matrix(experiment_info[,c("Nucleic Acid Conc.",
"260/280",
"Total RNA")])
# View the data
head(yeast_strain_matrix )
The approach of split-apply-combine allows you to split the data into groups, apply a function/analysis and then combine the results. We can do this using group_by()
and summarize()
group_by() takes as argument the column name/s that contain categorical variables that can be used to group the data.
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`))
# summarize using more than one column
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`))
Next, we can use arrange()
to sort the data in either ascending (defalt order) or descending order.
# arrange new table in ascending mean concentrations
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(mean_concentration)
# arrange new table in descending mean concentrations
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(desc(mean_concentration))
We can also examine the number of categorical values using count()
experiment_info_cleaned %>%
count(`Yeast Strain`)
- Home
- Resources for Attendees
- Resources for Instructors
- Training Workshop Notes
-
HuBMAP Tools
-
R
-
RNA-Seq Concepts, Design and Workflows
-
RNA-Seq in the Cloud
-
Snakemake Part I & II
-
UNIX