-
Notifications
You must be signed in to change notification settings - Fork 295
/
Copy pathSequence_analysis.Rmd
536 lines (382 loc) · 16.7 KB
/
Sequence_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
---
title: 'Bioinformatics for Big Omics Data: Basics of sequence analysis and genomic interval manipulation in R'
author: "Raphael Gottardo"
date: "February 2, 2015"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```
```{r, cache=FALSE, echo=TRUE}
library(data.table)
library(ggplot2)
library(GenomicRanges)
library(IRanges)
library(reshape2)
library(GenomicAlignments)
```
## Outline
Here we will discuss some of the core functionality in `Bioconductor` for manipulating sequence data and genomic intervals in R.
You should read the following papers:
1. Morgan, M. et al. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25, 2607-2608 (2009).
2. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
3. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
## Some motivation
The process of gene regulation is a lot more complex than what we originally discussed.
It's not as simple as DNA → RNA → Proteins.
It's more like
Gene expression = f(Transcription factors, Splicing, miRNA, Nucleosome, Methylation, etc.)
Fortunately, Next Generation Sequencing (NGS) can help us to learn about many of these fundamental processes.
## Transcription factors
<img src="http://howardhughes.trinity.duke.edu/uploads/assets/DAILYILLINI.jpg" width=600>
## Alternative splicing
<img src="http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/alternative_splicing.gif" width=600>
## Next generation sequencing (NGS)
The sequencing revolution started with the Human Genome project. An international collaboration to generate a map of the human genome:
- Took years to complete using Sanger sequencing
- Today the same can be done much more rapidly, at a fraction of the cost using next generation sequencing
## Sequencing types
- [Single end sequencing](http://www.illumina.com/technology/single_read_sequencing_assay.ilmn)
Single-read sequencing involves sequencing DNA from only one end, and is the simplest type of sequencing.
- [Paired-end sequencing](http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn)
Paired-end sequencing allows users to choose the length of the insert and sequence either end of the insert, generating high-quality, alignable sequence data. Paired-end sequencing can detect rearrangements, including insertions and deletions (indels) and inversions. Note that this is different from [mate pair sequencing](http://www.illumina.com/technology/mate_pair_sequencing_assay.ilmn)
## NGS aligment
Next-generation sequencing generally produces short reads or short read pairs (e.g. $<200$ bases) as compared to long reads by Sanger sequencing, which cover ~1000 bases. We need to align these sequences to a reference genome (The exception being de-novo sequencing where we want to infer a new genome).
This is called aligning or mapping the reads against the reference sequence/genome. This is not an easy task:
- The short reads do not come with position information → need to find the corresponding region in the reference sequence.
- The reference sequence can be quite long (~3 billion bases for human) → computationally intensive task
- Reads are short → uncertainty with respect to where the reads can be aligned.
- We need to allow some mismatches and small structural variation (InDels) in our reads. For RNA-seq we need to allow for splice junctions.
- Sequencing is not perfect → biological variation and sequencing errors can be confounded.
## Aligners
There exist a multitude of aligners that can be used for aligning short reads to a reference genome such as: Bowtie, BWA, GSNAP, TopHat, SOAP, STAR, etc. The choice of the aligner is usually guided by the actual application (e.g. RNA-seq, de novo, etc).
More on this later.
## Common data formats
(IMAGES →) FASTA/FASTQ → SAM/BAM → VCF
All of these data formats can be read in R.
- The SAM/BAM format has emerged as the de facto standard format for short read alignments.
- SAM → plain-text version
- BAM → compressed binary version
- BAM files can also be used as an space-saving alternative to FASTQ files to store raw sequence data
- All current alignment software can generate SAM/BAM as an output format. BAM files can be indexed to improve data accession.
## Sequence manupulation using Biostrings
Here we will see how to manipulate sequence data in R using the `Biostrings` package
```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite(c("Biostrings","hgu133plus2probe"))
```
Now, we are ready to use the package
```{r}
library(Biostrings)
```
## The Biostrings package
`Biostrings` provides memory efficient string containers, string matching algorithms, and other utilities, for fast manipulation of large biological sequences or sets of sequences.
Most of the `Biostrings` functionalities are summarized [here](http://www.bioconductor.org/packages/2.12/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf).
## The XString class
The XString is in fact a virtual class and therefore cannot be instanciated. Only subclasses
(or subtypes) BString, DNAString, RNAString and AAString can. These classes are direct
extensions of the XString class (no additional slot).
```{r}
library(Biostrings)
b <- BString("I am a BString object")
b
length(b)
d <- DNAString("TTGAAAA-CTC-N")
d
length(d)
```
## The XString class
```{r}
d[3]
d[7:12]
d[]
b[length(b):1]
```
## The XStringViews class
An XStringViews object contains a set of views on the same XString object called the subject
string. Particularly useful for matches and alignments.
```{r}
v4 <- Views(d, start=3:0, end=5:8)
v4
length(v4)
start(v4)
end(v4)
```
## Summarizing SC genome
Let's look at the Scerevisiae genome:
You will need the following package:
```{r, eval=FALSE}
biocLite("BSgenome.Scerevisiae.UCSC.sacCer1")
```
and then load it
```{r}
library("BSgenome.Scerevisiae.UCSC.sacCer1")
```
here is a quick summary of the number of ACGT across the first chr1.
```{r}
alphabetFrequency(Scerevisiae[["chr1"]])/length(Scerevisiae[["chr1"]])
```
**Exercise:** Do the same with the human genome and let me know what you get.
## Finding patterns
```{r}
matchPattern(pattern = "GATAGA", subject = Scerevisiae[["chr1"]])
```
```{r}
countPattern(pattern = "GATAGA", subject = Scerevisiae[["chr1"]])
```
**Exercise:** Try to play with some of the options such as the number of mismatch and/or indels.
## Handling probe sequence information
Let's install the following packages
```{r eval=FALSE}
biocLite(c("Biostrings", "hgu133plus2probe", "ShortRead"))
```
and load the following two libraries
```{r}
library(Biostrings)
library(hgu133plus2probe)
```
We now have access to the `hgu133plus2probe` dataset containing our probe information.
## Handling probe sequence information
Let's create a `DNAStringSet`
```{r}
probes <- DNAStringSet(hgu133plus2probe)
```
Let's go back to our flu dataset:
```{r}
library(affy)
# Read the CEL file and creates and AffyBatch
GSE29617_affyBatch <- ReadAffy(celfile.path="Data/GEO/GSE29617/")
# Normalize and summarize the data
GSE29617_set2 <- rma(GSE29617_affyBatch)
```
## Handling probe sequence information
Let's calculate the GC content of each probe
```{r}
dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
# frequency for all probes
freq <- alphabetFrequency(probes)
# Compute the GC content
GC_count <- freq[,"G"]+freq[,"C"]
dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
setkey(dt_exprs, "probe_name")
setkey(dt_probes, "probe_name")
dt_merged <- dt_exprs[dt_probes]
dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
# This line is not needed if you're using the latest dev version of data.tabler
dt_merged_melt <- data.table(dt_merged_melt)
dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
```
## Handling probe sequence information
We now look at the effect of probe composition on observed intensities
```{r fig.height=3}
library(ggplot2)
library(reshape2)
ggplot(dt_merged_melt[variable=="GSM733942.CEL.gz"],aes(x=as.factor(GC_count), y=value))+geom_violin()+geom_point(data=dt_merged_melt_sum,aes(x=as.factor(GC_count), y=mean), col="red", size=6)+theme_minimal(base_size = 18)
```
What do you think about the GC content effect?
**Excercise:** Repeat the plot above with other sequence characteristics
## GenomicRanges
Bioconductor possesses an infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: `IRanges`, `GenomicRanges`, and `GenomicFeatures`. These packages provide scalable data structures for representing annotated ranges on the genome, and performing opperations on genomic intervals (e.g. overlaps, etc).
Most of the text and examples used here are taken from the various `*Ranges` Bioconductor package vignettes.
## IRanges
The IRanges package is designed to represent sequences, ranges representing indices along those sequences,
and data related to those ranges. IRanges makes use run-length encodings to provide increased performance.
For example, the sequence {1, 1, 1, 2, 3, 3} can be represented as values= {1, 2, 3}, run lengths = {3, 1, 2}.
```{r}
xVector <- sort(sample(1:100, replace = TRUE, size = 10000))
xRle <- Rle(xVector)
as.vector(object.size(xRle) / object.size(xVector))
identical(as.vector(xRle), xVector)
```
## IRanges
The `IRanges` class is a "simple" container where 2 integer vectors of the same length are used to store the start and width values.
```{r}
ir1 <- IRanges(start = 1:10, width = 10:1)
ir2 <- IRanges(start = 1:10, end = 11)
ir3 <- IRanges(end = 11, width = 10:1)
```
`IRanges` provides common accessor functions to retrieve relevant pieces of information.
```{r}
start(ir1)
end(ir1)
width(ir1)
mid(ir1)
names(ir1)
```
## RangesList
An extension of `List` that holds only `Ranges` objects. Useful for storing ranges over a set of spaces (e.g. chromosomes), each of which requires a separate Ranges object. As a Vector, RangesList may be annotated with its universe identifier (e.g. a genome) in which all of its spaces exist.
```{r}
range1 <- IRanges(start=c(1,2,3), end=c(5,2,8))
range2 <- IRanges(start=c(15,45,20,1), end=c(15,100,80,5))
named <- RangesList(one = range1, two = range2)
length(named) # 2
start(named) # same as start(c(range1, range2))
names(named) # "one" and "two"
start(named[["one"]])
start(named[[1]])
universe(named) <- "hg18"
universe(named)
```
## GRanges: Single Interval Range Features
`GRanges` are like `IRanges` but with a strand and a chromosome
```{r}
gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
score = 1:10,
GC = seq(1, 0, length=10))
seqlengths(gr)
seqnames(gr)
ranges(gr)
strand(gr)
mcols(gr)$score
```
## GRangesList: Multiple Interval Range Features
Whenever genomic features consist of multiple ranges that are grouped by a parent
feature, they can be represented as `GRangesList` object. Consider the simple example of the two transcript
`GRangesList` below created using the `GRangesList` constructor.
```{r}
gr1 <- GRanges(seqnames = "chr2", ranges = IRanges(3, 6), strand = "+", score = 5L, GC = 0.45)
gr2 <- GRanges(seqnames = c("chr1", "chr1"), ranges = IRanges(c(7,13), width = 3), strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5))
grl <- GRangesList("txA" = gr1, "txB" = gr2)
grl
# Unlist to GRanges
ul <- unlist(grl)
```
Most of the accessors that work for `GRanges` also work for `GRangesList`.
## GRanges API
For more information about the GRanges API look at the following image
![API](http://www.ploscompbiol.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.1003118.t001&representation=PNG_M)
extracted from Lawrence et al. (2013)
## Load a BAM file into a GAlignments object
```{r}
library(Rsamtools)
aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln1 <- readGAlignments(aln1_file)
aln1
```
## Load a BAM file into a GAlignments object
What if you want the actual sequences too?
```{r}
library(ShortRead)
aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln1 <- readGappedReads(aln1_file)
aln1
qseq(aln1)
```
## Converting to a GRanges
you can (almost) always convert a given object to a relevant class
```{r}
gr1 <- as(aln1,"GRanges")
```
## Reduce
Merge overlapping and adjacent ranges
```{r}
reduce(gr1)
```
## Disjoin
Ranges formed from union of endpoints
```{r}
disjoin(gr1)
```
## Coverage
```{r fig.height=5}
gr1_list <- split(gr1, seqnames(gr1))
cv <- coverage(gr1_list[[1]])
cv_vec <- as.vector(cv$seq1)
qplot(x=1:1575, y=cv_vec)+geom_smooth()+theme_minimal(base_size = 18)
```
## Finding overlaps
`findOverlaps` and `countOverlaps` produce a mapping and a
tabulation of interval overlaps, respectively.
```{r}
query <- IRanges(c(1, 5, 3, 4), width=c(2, 2, 4, 6))
subject <- IRanges(c(1, 3, 5, 6), width=c(4, 4, 5, 4))
findOverlaps(query, subject, type = "start")
countOverlaps(query, subject)
```
## Preparing data for RNA-seq analysis
Let's assume that we want to generate gene counts using the drosophilia genome.
You first need to install the required transcript dataset.
```{r, eval=FALSE}
biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
```
and load the package, which will create an object of class `TranscriptDB`.
```{r}
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
TxDb.Dmelanogaster.UCSC.dm3.ensGene
```
## Preparing data for RNA-seq analysis
As sample data we use pasillaBamSubset which contains both a single-end BAM (untreated1 chr4) and a paired-end
BAM (untreated3 chr4). Each le is a subset of chr4 from the "Pasilla" experiment. See ?pasillaBamSubset for details, but we first need to install the package
```{r, eval=FALSE}
biocLite("pasillaBamSubset")
```
and load the package
```{r}
library(pasillaBamSubset)
un1 <- untreated1_chr4() ## single-end records
```
## Preparing data for RNA-seq analysis
Let's also extract the exon ranges by genes as follows,
```{r}
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
```
we are now ready to count the number of sequences that fall within each gene
```{r}
se <- summarizeOverlaps(exbygene, un1, mode="IntersectionStrict")
se
```
which is an object of class `SummarizedExperiment`.
## Preparing data for RNA-seq analysis
We can easily extract the counts:
```{r}
head(table(assays(se)$counts))
```
The annotation is stored in the rowData slot:
```{r}
rowData(se)
```
## Alternative approaches
1. featureCounts: Liao, Y., Smyth, G. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
2. RSEM: Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).
## The Rsubread package
```{r}
# load the package
library("Rsubread")
library("R.utils")
```
The Rsubread package can be used to perform alignment and expression quantification.
## featureCounts
```{r echo=FALSE}
# We download an annotation file that's necessary here.
if(!file.exists("annotation.gtf"))
{
download.file("ftp://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz", destfile = "annotation.gtf.gz")
gunzip("annotation.gtf.gz")
}
```
```{r}
bam_file <- system.file("extdata", "untreated1_chr4.bam", package = "pasillaBamSubset")
fc_counts <- featureCounts(files=bam_file,annot.ext="annotation.gtf", isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
```
## featureCounts
```{r, fig.height=3, cache=FALSE}
count1 <- data.table(exon_id=rownames(assays(se)$counts), count_se=as.vector(assays(se)$counts))
count2 <- data.table(exon_id=rownames(fc_counts$counts), count_fc=as.vector(fc_counts$counts))
count_dt <- merge(count1, count2, by="exon_id")
qplot(log10(count_se+1), log10(count_fc+1), data=count_dt)
```
## RSEM
RSEM: Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).
Let's look at these lecture notes:
https://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf