-
Notifications
You must be signed in to change notification settings - Fork 295
/
Copy pathBioconductor_intro.Rmd
386 lines (278 loc) · 12.3 KB
/
Bioconductor_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
---
title: 'Bioinformatics for Big Omics Data: Introduction to Bioconductor'
author: "Raphael Gottardo"
date: "January 21, 2015"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```
## The Bioconductor project
- [Bioconductor](http://www.bioconductor.org) is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language.
- Most Bioconductor components are distributed as R packages. The functional scope of Bioconductor packages includes the analysis of DNA microarray, sequence, flow, SNP, and other data.
## Project Goals
The broad goals of the Bioconductor project are:
- To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
- To facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. literature data from PubMed, annotation data from Entrez genes.
- To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
- To further scientific understanding by producing high-quality documentation and reproducible research.
- To train researchers on computational and statistical methods for the analysis of genomic data.
## Getting started
Before running this presentation, you will need to:
- set the current directory in the console (go to session -> Set Working Directory -> To Source File Location)
- run the following commands which take up time
- source("http://bioconductor.org/biocLite.R")
- biocLite()
- getSQLiteFile() Note: this will take some time
- getGEOSuppFiles("GSE29617", makeDirectory = TRUE, baseDir = "./")
- untar("./GSE29617/GSE29617_RAW.tar", exdir="./GSE29617", tar = Sys.getenv("TAR"))
Note that I have updated the code so that you may not have to do what's described above.
## Getting started
```{r}
source("http://bioconductor.org/biocLite.R")
# Install all core packages and update all installed packages
biocLite()
```
You can also install specific packages
```{r, eval=FALSE}
biocLite(c("GEOmetadb", "GEOquery"))
```
## The Gene Expression Omnibus (GEO)
The [Gene Expression Omnibus](http://www.ncbi.nlm.nih.gov/geo/) is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.
The three main goals of GEO are to:
- Provide a robust, versatile database in which to efficiently store high-throughput functional genomic data
- Offer simple submission procedures and formats that support complete and well-annotated data deposits from the research community
- Provide user-friendly mechanisms that allow users to query, locate, review and download studies and gene expression profiles of interest
## Getting data from GEO
Before getting data from GEO, we need to see what data we want. For that we can use the `GEOmetadb` package.
```{r}
library(GEOmetadb)
```
Remember that packages in Bioconductor are well documented with a vignette that can be access as follows:
```{r eval=FALSE}
vignette("GEOmetadb")
```
or if the package contains multiple vignettes or a vignette with a non-standard name
```{r eval=FALSE}
browseVignettes(package = "GEOmetadb")
```
## Finding the right data in GEO
Zhu, Y., Davis, S., Stephens, R., Meltzer, P. S., & Chen, Y. (2008). GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics (Oxford, England), 24(23), 2798–2800. doi:10.1093/bioinformatics/btn520
GEOmetadb uses a SQLite database to store all metadata associate with GEO.
```{r}
## This will download the entire database, so can be slow
if(!file.exists("GEOmetadb.sqlite"))
{
# Download database only if it's not done already
getSQLiteFile()
}
```
## Finding the right data in GEO
```{r}
geo_con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
dbListTables(geo_con)
```
```{r}
dbListFields(geo_con, 'gse')
```
## Finding a study
The basic record types in GEO include Platforms (GPL), Samples (GSM), Series (GSE) and DataSets (GDS)
```{r}
dbGetQuery(geo_con, "SELECT gse.ID, gse.title, gse.gse FROM gse WHERE gse.pubmed_id='21743478';")
```
## Finding a study
What samples were used?
```{r}
dbGetQuery(geo_con, "SELECT gse.gse, gsm.gsm, gsm.title FROM (gse JOIN gse_gsm ON gse.gse=gse_gsm.gse) j JOIN gsm ON j.gsm=gsm.gsm WHERE gse.pubmed_id='21743478' LIMIT 5;")
```
gse_gsm contains the gse number that is associated with the gsm number.
j is the name of a table that is created by joining gse and ges_gsm. Then j is joined with table gsm.
## Finding a study
What about raw data?
```{r}
res <- dbGetQuery(geo_con, "SELECT gsm.gsm, gsm.supplementary_file FROM (gse JOIN gse_gsm ON gse.gse=gse_gsm.gse) j JOIN gsm ON j.gsm=gsm.gsm WHERE gse.pubmed_id='21743478' LIMIT 5;")
head(res)
```
raw data is contained in the supplementary files, which are listed in the gsm file.
## Finding specific data
To get list of manufacturers:
```{r}
library(data.table)
manu <- data.table(dbGetQuery(geo_con, "SELECT manufacturer FROM gpl"))
manu[,list(length(manufacturer)), by=manufacturer]
```
## Finding specific data
To get supplementary file names ending with cel.gz from only manufacturer Affymetrix
```{r}
res <- dbGetQuery(geo_con, "SELECT gpl.bioc_package, gsm.title, gsm.series_id, gsm.gpl, gsm.supplementary_file FROM gsm JOIN gpl ON gsm.gpl=gpl.gpl WHERE gpl.manufacturer='Affymetrix' AND gsm.supplementary_file like '%CEL.gz';")
head(res)
```
## Finding specific data
From previous table:
- bioc_package = bioconductor package
- hu6800 = Affymetrix HuGeneFL Genome Array annotation data (chip hu6800)
- rgu34a = Affymetrix Rat Genome U34 Set annotation data (chip rgu34a)
- title = data set title or study title
For example BM_CD34-1a = bone marrow flow-sorted CD34+ cells (>95% purity) and has GSM sample number GSM575.
## Getting the data we want
Now that we have our GSE ID we can use the GEOquery to download the corresponding data as follows,
```{r query-GEO, cache = TRUE}
# Download the mapping information and processed data
# This returns a list of eSets
GSE29617_set <- getGEO("GSE29617", destdir = "Data/GEO/")[[1]]
```
which returns (a list of) an ExpressionSet (eSet).
## The eSet class
What is an eSet? An S4 class that tries to:
- Coordinate high through-put (e.g., gene expression) and phenotype data.
- Provide common data container for diverse Bioconductor packages.
```{r}
str(GSE29617_set,max.level=2)
```
str() is the command to get the internal structure of an R object.
An eSet contains the necessary "parts" to summarize an experiment.
## Classes and methods
**Everything in R is an OBJECT.**
- A class is the definition of an object.
- A method is a function that performs specific calculations on objects of a
specific class. Generic functions are used to determine the class of its
arguments and select the appropriate method. A generic function is a
function with a collection of methods.
- See ?Classes and ?Methods for more information.
## Classes and methods
```{r}
data(iris)
class(iris)
summary(iris)
```
## Classes and methods
There are two types of classes in R: S3 Classes (old style, informal) and S4 Classes - (new style, more rigorous and formal)
```{r}
# S3 class
head(methods(class="data.frame"))
# S4 class
showMethods(class="eSet")
```
## The eSet
You can get a sense of the defined methods for an `eSet` as follows:
```{r, eval=FALSE}
library(Biobase)
showMethods(class="eSet")
```
in particular, the following methods are rather convenient:
- assayData(obj); assayData(obj) `<-` value: access or assign assayData
- phenoData(obj); phenoData(obj) `<-` value: access or assign phenoData
- experimentData(obj); experimentData(obj) `<-` value: access or assign experimentData
- annotation(obj); annotation(obj) `<-` value: access or assign annotation
## The ExpressionSet subclass
Similar to the `eSet` class but tailored to gene expression, with an expression matrix that can be accessed with the `exprs` method.
```{r}
class(GSE29617_set)
exprs(GSE29617_set)[1:2,1:3]
```
also provides additional methods such as `fData`.
## The ExpressionSet subclass
`ExpressionSet` objects are meant to facilitate the adoption of MIAME standard. MIAME = "Minimum Information about a Microarray experiment". Alvis Brazma et. al. (2001) Nature Genetics
Unfortrunately, not all contributors will upload all the information.
```{r}
# Information about preprocessing
# Nothing in here!
preproc(GSE29617_set)
```
## The ExpressionSet subclass
```{r}
# A data.frame with number of rows equal to the number of samples
pData(GSE29617_set)[1:2,1:2]
# A data.frame with number of rows equal to the number of features/probes
fData(GSE29617_set)[1:2,1:2]
```
## The ExpressionSet subclass
So the `ExpressionSet` objects facilitate the encapsulation of everything that's needed to summarize and analyze an experiment. Specific elements can be access with the `@` operator but many classes have convenient accessor methods.
```{r}
fData(GSE29617_set)[1:2,1:2]
# Note that S4 classes can be nested!
GSE29617_set@featureData@data[1:2,1:2]
```
## What if you want the raw data?
GEO also provides access to raw data that can be downloaded with `GEOquery`.
```{r download-raw-data}
# Download all raw data. This should only be evaluated once
# Then the data would be stored locally in the data directory
# Make sure the directory exists
if(length(dir("Data/GEO/",pattern="GSE29617"))==0)
{
getGEOSuppFiles("GSE29617", makeDirectory = TRUE, baseDir = "./Data/GEO/")
untar("./Data/GEO/GSE29617/GSE29617_RAW.tar", exdir="./Data/GEO/GSE29617/",
tar=Sys.getenv("TAR"))
}
# untar downloaded data
```
## Starting from the raw data
Now that we have the Affymetrix raw data (CEL) files, we can apply some of the concepts we've discussed related to normalization and probe summary. We first need to load the appropriate package
```{r, eval=FALSE}
biocLite("affy")
```
```{r}
library(affy)
```
then we use the following commands
```{r }
# Read the CEL file and creates and AffyBatch
GSE29617_affyBatch <- ReadAffy(celfile.path="Data/GEO/GSE29617/")
# Normalize and summarize the data
GSE29617_set2 <- rma(GSE29617_affyBatch)
```
## Starting from the raw data
Let's check the results and compare to the expression matrix that was submitted to GEO
```{r}
exprs(GSE29617_set2)[1:2,1:2]
```
The rows are the features (i.e., probes). Columns are the samples.
## What are those probes?
```{r, eval=FALSE}
# We first need to install our annotation package
library(BiocInstaller)
# Note that you don't have to use source anymore!
biocLite("hthgu133a.db")
```
```{r}
library(hthgu133a.db)
probe_ids <- rownames(GSE29617_set2)
probe_data <- select(hthgu133a.db, keys = probe_ids, columns = "SYMBOL", keytype = "PROBEID")
probe_data[1,]
```
This didn't work very well, did it?
The problem is that the probe IDs in hthgu133a.db have a different naming scheme than those in GSE29617_set2. This is fixed on the next slide.
## What are those probes?
Let's fix this: Replace _PM with <empty> for the probe id names in GSE29617_set2
```{r, warning=TRUE}
probe_ids <- gsub("_PM","",rownames(GSE29617_set2))
probe_data <- select(hthgu133a.db, keys = probe_ids, columns = "SYMBOL", keytype = "PROBEID")
probe_data[1, ]
```
What's the warning? Some probes match up with multiple genes, therefore those probe IDs will have more than one record.
## What are those probes?
This gives us too many rows, what do we do? Concatenate the gene names so that there will be one row per probe ID.
```{r}
library(data.table)
probe_data_dt <- data.table(probe_data)
probe_data_dt_unique <- probe_data_dt[,list(SYMBOL=paste(SYMBOL,collapse=";")),by="PROBEID"]
probe_data_dt_unique[SYMBOL%like%";"]
```
## Completing our ExpressionSet
```{r}
annotaded_probes <- data.frame(probe_data_dt_unique)
rownames(annotaded_probes) <- rownames(GSE29617_set2)
fData(GSE29617_set2) <- annotaded_probes
head(fData(GSE29617_set2))
```
Now you're ready to get the data that you will have to analyze for your second homework.