Sequence_analysis.Rmd

---
title: 'Bioinformatics for Big Omics Data: Basics of sequence analysis and genomic interval manipulation in R'
author: "Raphael Gottardo"
date: "February 2, 2015"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    keep_md: yes
    smaller: yes
---

## Setting up some options

Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```

```{r, cache=FALSE, echo=TRUE}
library(data.table)
library(ggplot2)
library(GenomicRanges)
library(IRanges)
library(reshape2)
library(GenomicAlignments)
```


## Outline

Here we will discuss some of the core functionality in `Bioconductor` for manipulating sequence data and genomic intervals in R.
You should read the following papers:

1. Morgan, M. et al. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25, 2607-2608 (2009).

2. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).

3. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).

## Some motivation

The process of gene regulation is a lot more complex than what we originally discussed. 
It's not as simple as DNA &#8594; RNA &#8594; Proteins. 
It's more like

Gene expression = f(Transcription factors, Splicing, miRNA, Nucleosome, Methylation, etc.)

Fortunately, Next Generation Sequencing (NGS) can help us to learn about many of these fundamental processes. 

## Transcription factors

<img src="http://howardhughes.trinity.duke.edu/uploads/assets/DAILYILLINI.jpg" width=600>

## Alternative splicing

<img src="http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/alternative_splicing.gif" width=600>


## Next generation sequencing (NGS)

The sequencing revolution started with the Human Genome project. An international collaboration to generate a map of the human genome:

- Took years to complete using Sanger sequencing

- Today the same can be done much more rapidly, at a fraction of the cost using next generation sequencing


## Sequencing types

- [Single end sequencing](http://www.illumina.com/technology/single_read_sequencing_assay.ilmn)

Single-read sequencing involves sequencing DNA from only one end, and is the simplest type of sequencing.

- [Paired-end sequencing](http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn)

Paired-end sequencing allows users to choose the length of the insert and sequence either end of the insert, generating high-quality, alignable sequence data. Paired-end sequencing can detect rearrangements, including insertions and deletions (indels) and inversions. Note that this is different from [mate pair sequencing](http://www.illumina.com/technology/mate_pair_sequencing_assay.ilmn)

## NGS aligment

Next-generation sequencing generally produces short reads or short read pairs (e.g. $<200$ bases) as compared to long reads by Sanger sequencing, which cover ~1000 bases. We need to align these sequences to a reference genome (The exception being de-novo sequencing where we want to infer a new genome).

This is called aligning or mapping the reads against the reference sequence/genome. This is not an easy task:

- The short reads do not come with position information &#8594; need to find the corresponding region in the reference sequence.
- The reference sequence can be quite long (~3 billion bases for human) &#8594; computationally intensive task
- Reads are short &#8594; uncertainty with respect to where the reads can be aligned.
- We need to allow some mismatches and small structural variation (InDels) in our reads. For RNA-seq we need to allow for splice junctions. 
- Sequencing is not perfect &#8594; biological variation and sequencing errors can be confounded. 

## Aligners

There exist a multitude of aligners that can be used for aligning short reads to a reference genome such as: Bowtie, BWA, GSNAP, TopHat, SOAP, STAR, etc. The choice of the aligner is usually guided by the actual application (e.g. RNA-seq, de novo, etc).

More on this later.

## Common data formats

(IMAGES &#8594;) FASTA/FASTQ &#8594; SAM/BAM &#8594; VCF

All of these data formats can be read in R. 

- The SAM/BAM format has emerged as the de facto standard format for short read alignments. 
- SAM  &#8594; plain-text version
- BAM  &#8594; compressed binary version
- BAM files can also be used as an space-saving alternative to FASTQ files to store raw sequence data 
- All current alignment software can generate SAM/BAM as an output format. BAM files can be indexed to improve data accession. 


## Sequence manupulation using Biostrings

Here we will see how to manipulate sequence data in R using the `Biostrings` package

```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite(c("Biostrings","hgu133plus2probe"))
```

Now, we are ready to use the package

```{r}
library(Biostrings)
```

## The Biostrings package

`Biostrings` provides memory efficient string containers, string matching algorithms, and other utilities, for fast manipulation of large biological sequences or sets of sequences. 


Most of the `Biostrings` functionalities are summarized [here](http://www.bioconductor.org/packages/2.12/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf).

## The XString class

The XString is in fact a virtual class and therefore cannot be instanciated. Only subclasses
(or subtypes) BString, DNAString, RNAString and AAString can. These classes are direct
extensions of the XString class (no additional slot). 

```{r}
library(Biostrings)
b <- BString("I am a BString object")
b
length(b)
d <- DNAString("TTGAAAA-CTC-N")
d
length(d)
```

## The XString class

```{r}
d[3]
d[7:12]
d[]
b[length(b):1]
```

## The XStringViews class

An XStringViews object contains a set of views on the same XString object called the subject
string. Particularly useful for matches and alignments.

```{r}
v4 <- Views(d, start=3:0, end=5:8)
v4
length(v4)
start(v4)
end(v4)
```

## Summarizing SC genome

Let's look at the Scerevisiae genome:
You will need the following package:
```{r, eval=FALSE}
biocLite("BSgenome.Scerevisiae.UCSC.sacCer1")
```
and then load it
```{r}
library("BSgenome.Scerevisiae.UCSC.sacCer1")
```

here is a quick summary of the number of ACGT across the first chr1.

```{r}
alphabetFrequency(Scerevisiae[["chr1"]])/length(Scerevisiae[["chr1"]])
```


**Exercise:** Do the same with the human genome and let me know what you get.

## Finding patterns

```{r}
matchPattern(pattern = "GATAGA", subject = Scerevisiae[["chr1"]])
```

```{r}
countPattern(pattern = "GATAGA", subject = Scerevisiae[["chr1"]])
```


**Exercise:** Try to play with some of the options such as the number of mismatch and/or indels.

## Handling probe sequence information

Let's install the following packages
```{r eval=FALSE}
biocLite(c("Biostrings", "hgu133plus2probe", "ShortRead"))
```

and load the following two libraries
```{r}
library(Biostrings)
library(hgu133plus2probe)
```

We now have access to the `hgu133plus2probe` dataset containing our probe information.

## Handling probe sequence information

Let's create a `DNAStringSet`

```{r}
probes <- DNAStringSet(hgu133plus2probe)
```

Let's go back to our flu dataset:

```{r}
library(affy)
# Read the CEL file and creates and AffyBatch
GSE29617_affyBatch <- ReadAffy(celfile.path="Data/GEO/GSE29617/")
# Normalize and summarize the data
GSE29617_set2 <- rma(GSE29617_affyBatch)
```

## Handling probe sequence information

Let's calculate the GC content of each probe
```{r}
dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
# frequency for all probes
freq <- alphabetFrequency(probes)
# Compute the GC content
GC_count <- freq[,"G"]+freq[,"C"]
dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
setkey(dt_exprs, "probe_name")
setkey(dt_probes, "probe_name")
dt_merged <- dt_exprs[dt_probes]
dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
# This line is not needed if you're using the latest dev version of data.tabler
dt_merged_melt <- data.table(dt_merged_melt)
dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
```

## Handling probe sequence information

We now look at the effect of probe composition on observed intensities

```{r fig.height=3}
library(ggplot2)
library(reshape2)
ggplot(dt_merged_melt[variable=="GSM733942.CEL.gz"],aes(x=as.factor(GC_count), y=value))+geom_violin()+geom_point(data=dt_merged_melt_sum,aes(x=as.factor(GC_count), y=mean), col="red", size=6)+theme_minimal(base_size = 18)
```


What do you think about the GC content effect? 
**Excercise:** Repeat the plot above with other sequence characteristics

## GenomicRanges

Bioconductor possesses an infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: `IRanges`, `GenomicRanges`, and `GenomicFeatures`. These packages provide scalable data structures for representing annotated ranges on the genome, and performing opperations on genomic intervals (e.g. overlaps, etc).

Most of the text and examples used here are taken from the various `*Ranges` Bioconductor package vignettes.

## IRanges

The IRanges package is designed to represent sequences, ranges representing indices along those sequences,
and data related to those ranges. IRanges makes use run-length encodings to provide increased performance.
For example, the sequence {1, 1, 1, 2, 3, 3} can be represented as values= {1, 2, 3}, run lengths = {3, 1, 2}.

```{r}
xVector <- sort(sample(1:100, replace = TRUE, size = 10000))
xRle <- Rle(xVector)
as.vector(object.size(xRle) / object.size(xVector))
identical(as.vector(xRle), xVector)
```

## IRanges

The `IRanges` class is a "simple" container where 2 integer vectors of the same length are used to store the start and width values. 
```{r}
ir1 <- IRanges(start = 1:10, width = 10:1)
ir2 <- IRanges(start = 1:10, end = 11)
ir3 <- IRanges(end = 11, width = 10:1)
```

`IRanges` provides common accessor functions to retrieve relevant pieces of information. 

```{r}
start(ir1)
end(ir1)
width(ir1)
mid(ir1)
names(ir1)
```

## RangesList

An extension of `List` that holds only `Ranges` objects. Useful for storing ranges over a set of spaces (e.g. chromosomes), each of which requires a separate Ranges object. As a Vector, RangesList may be annotated with its universe identifier (e.g. a genome) in which all of its spaces exist.

```{r}
range1 <- IRanges(start=c(1,2,3), end=c(5,2,8))
range2 <- IRanges(start=c(15,45,20,1), end=c(15,100,80,5))
named <- RangesList(one = range1, two = range2)
length(named) # 2
start(named) # same as start(c(range1, range2))
names(named) # "one" and "two"
start(named[["one"]])
start(named[[1]])
universe(named) <- "hg18"
universe(named)
```

## GRanges: Single Interval Range Features

`GRanges` are like `IRanges` but with a strand and a chromosome

```{r}
gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
              ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
              strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
              score = 1:10,
              GC = seq(1, 0, length=10))
seqlengths(gr)
seqnames(gr)
ranges(gr)
strand(gr)
mcols(gr)$score
```

## GRangesList: Multiple Interval Range Features

Whenever genomic features consist of multiple ranges that are grouped by a parent
feature, they can be represented as `GRangesList` object. Consider the simple example of the two transcript
`GRangesList` below created using the `GRangesList` constructor.

```{r}
gr1 <- GRanges(seqnames = "chr2", ranges = IRanges(3, 6), strand = "+", score = 5L, GC = 0.45)
gr2 <- GRanges(seqnames = c("chr1", "chr1"), ranges = IRanges(c(7,13), width = 3), strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5))
grl <- GRangesList("txA" = gr1, "txB" = gr2)
grl
# Unlist to GRanges
ul <- unlist(grl)
```

Most of the accessors that work for `GRanges` also work for `GRangesList`.

## GRanges API

For more information about the GRanges API look at the following image
![API](http://www.ploscompbiol.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.1003118.t001&representation=PNG_M)

extracted from Lawrence et al. (2013)

## Load a BAM file into a GAlignments object

```{r}
library(Rsamtools)
aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln1 <- readGAlignments(aln1_file)
aln1
```

## Load a BAM file into a GAlignments object

What if you want the actual sequences too?

```{r}
library(ShortRead)
aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln1 <- readGappedReads(aln1_file)
aln1
qseq(aln1)
```

## Converting to a GRanges

you can (almost) always convert a given object to a relevant class
```{r}
gr1 <- as(aln1,"GRanges")
```

## Reduce

Merge overlapping and adjacent ranges
```{r}
reduce(gr1)
```

## Disjoin

Ranges formed from union of endpoints
```{r}
disjoin(gr1)
```


## Coverage

```{r fig.height=5}
gr1_list <- split(gr1, seqnames(gr1))
cv <- coverage(gr1_list[[1]])
cv_vec <- as.vector(cv$seq1)
qplot(x=1:1575, y=cv_vec)+geom_smooth()+theme_minimal(base_size = 18)
```


## Finding overlaps

`findOverlaps` and `countOverlaps` produce a mapping and a
tabulation of interval overlaps, respectively. 

```{r}
query <- IRanges(c(1, 5, 3, 4), width=c(2, 2, 4, 6))
subject <- IRanges(c(1, 3, 5, 6), width=c(4, 4, 5, 4))
findOverlaps(query, subject, type = "start")
countOverlaps(query, subject)
```

## Preparing data for RNA-seq analysis

Let's assume that we want to generate gene counts using the drosophilia genome.
You first need to install the required transcript dataset.
```{r, eval=FALSE}
biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
```
and load the package, which will create an object of class `TranscriptDB`.
```{r}
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
TxDb.Dmelanogaster.UCSC.dm3.ensGene
```


## Preparing data for RNA-seq analysis 

As sample data we use pasillaBamSubset which contains both a single-end BAM (untreated1 chr4) and a paired-end
BAM (untreated3 chr4). Each le is a subset of chr4 from the "Pasilla" experiment. See ?pasillaBamSubset for details, but we first need to install the package
```{r, eval=FALSE}
biocLite("pasillaBamSubset")
```
and load the package
```{r}
library(pasillaBamSubset)
un1 <- untreated1_chr4() ## single-end records
```


## Preparing data for RNA-seq analysis 

Let's also extract the exon ranges by genes as follows,
```{r}
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
```

we are now ready to count the number of sequences that fall within each gene
```{r}
se <- summarizeOverlaps(exbygene, un1, mode="IntersectionStrict")
se
```
which is an object of class `SummarizedExperiment`.

## Preparing data for RNA-seq analysis 

We can easily extract the counts:
```{r}
head(table(assays(se)$counts))
```

The annotation is stored in the rowData slot:
```{r}
rowData(se)
```

## Alternative approaches

1. featureCounts: Liao, Y., Smyth, G. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

2. RSEM: Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).

## The Rsubread package

```{r}
# load the package
library("Rsubread")
library("R.utils")
```

The Rsubread package can be used to perform alignment and expression quantification.

## featureCounts 

```{r echo=FALSE}
# We download an annotation file that's necessary here.
if(!file.exists("annotation.gtf"))
{
  download.file("ftp://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz", destfile = "annotation.gtf.gz")
  gunzip("annotation.gtf.gz")
}
```


```{r}
bam_file <- system.file("extdata", "untreated1_chr4.bam", package = "pasillaBamSubset")
fc_counts <- featureCounts(files=bam_file,annot.ext="annotation.gtf", isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
```

## featureCounts 

```{r, fig.height=3, cache=FALSE}
count1 <- data.table(exon_id=rownames(assays(se)$counts), count_se=as.vector(assays(se)$counts))
count2 <- data.table(exon_id=rownames(fc_counts$counts), count_fc=as.vector(fc_counts$counts))
count_dt <- merge(count1, count2, by="exon_id")
qplot(log10(count_se+1), log10(count_fc+1), data=count_dt)
```

## RSEM

RSEM: Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).

Let's look at these lecture notes:
https://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf