lectures/lecture16/Lecture16_GenomicData.Rmd

---
title: "MCB536 Lecture 16 (Part 1): Genomic Data Analysis in R"
author: "Gavin Ha"
date: "11/30/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# `GenomicRanges` Object to Store Genomic Data

Genomic data is often described using chromosomes and coordinates. A locus can be a single base position or a region that includes a start and end coordinate. In R, there is a Bioconductor package called `GenomicRanges` that stores this in a convenient structure for efficient querying using routine operations. `GRanges` object class is in which genomic data will be stored. We will demonstrate the most common operation, `findOverlaps`, to determine intersecting positions or regions in the genome. See https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html

An additional package called `plyranges` provides convenient syntax similar to that used in `tidyverse` to manipulate and apply operations on `Granges` objects. See https://www.bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html

In this tutorial, we will work with The Cancer Genome Atlas (TCGA) data for primary breast cancer patient samples. Specifically, these are segmentation data used for copy number alteration analysis. See Lecture 16: Slide 47.

## 0. Load the `GenomicRanges` Bioconductor package
```{r, message=FALSE}
#BiocManager::install(GenomicRanges")
library(GenomicRanges)
library(plyranges)
```

## 1. Create a GRanges object.
A `GRanges` object must contain an attribute called `seqnames` to represent chromosomes and `ranges` attribute to represent the `start` and `end` coordinates. The range is 1-index-based (as opposed to 0-index), The `start` and `end` can be the same value if it is a single base-pair.  
```{r}
myGRange <- GRanges(seqnames = "17",
                    ranges = IRanges(start = 37844393, end = 37844393))
```

Alternatively, using `plyranges` we can use familiar syntax to create the same `GRanges` object.
```{r}
myGRange <- data.frame(seqnames = "17", start = 37844393, end = 37844393) %>% as_granges()
```

## 2. Load Genomic Data From A File
There are numerous text file formats for representing genomic data and some of these were discussed in Lecture 16. Here, we will show you that a `GRanges` can be easily created from any text file that contains delimited columns specifying genomic coorindates.

### 2.1 SEG format
SEGment Data (http://software.broadinstitute.org/software/igv/SEG) format is tab-delimited and a flexible way to define any genomic data.

There are 4 required columns:

  1. Name
  2. Chromosome
  3. Start Coordinate
  4. End Coordinate

This is similar to the BED file format but with the additional requirement for *Name* as the first column.

### a. Load the SEG file containing the segments into a `data.frame` object.
```{r, message=FALSE}
segs <- read.table("BRCA.genome_wide_snp_6_broad_Level_3_scna.seg", header = TRUE)
```

Small processing of this file to correct a few legacy hacks. We need to change chromosome 23 to chromosome X.
```{r}
str(segs) # show the class type for each column
mode(segs$Chromosome) <- "character" # change the class of the chromosome to character
segs[segs$Chromosome == 23, "Chromosome"] <- "X"
```

### b. Convert the `data.frame` object into a `GRanges`. 
You can use the `as()` function, as long as the 3 required columns are present. It is also flexible how the columns are named. For example, the column can be `Start`, `start`, `Chr`, `chr`, `Chromosome`, `End`, `Stop`, etc.
```{r}
segs.gr <- as(segs, "GRanges")
segs.gr
```

Alternatively, using `plyranges`. Here, we need to rename the columns: `Chromosome`->`seqnames`, `Start`->`start`, `End`->`end`.
```{r}
colnames(segs)[2:4] <- c("seqnames", "start", "end")
segs.gr <- segs %>% as_granges()
```

## 3. Operations and features of GenomicRanges
Some of the most useful features of `GRanges` object is the fast and easy methods for determining overlaps between sets of ranges. Here, we will describe examples using some of the common functions.

### 3.1 Tiling the genome
Often we would like to *find* or *count* events overlapping regions in the genome. In an unbiased fashion, we could do this genome-wide by dividing the genome into tiles/windows/bins. 

We will use the `tileGenome()` for this task, which requires three arguments: length of the chromosomes, number of tiles and the size of each tile.

### a. We need the lengths of the chromosomes in the human genome.
We need to load human genome information for build `hg19`. Since there are non-standard chromosomes, we only want to keep the standard chromosomes using `keepStandardChromosomes`. Then, since our `segs` data uses `NCBI` chromosome naming convention (i.e. `1` instead of `chr1`), we need set the `seqlevelStyle`.
```{r}
seqinfo <- Seqinfo(genome = "hg19")
seqinfo <- keepStandardChromosomes(seqinfo) 
seqlevelsStyle(seqinfo) <- "NCBI"
seqinfo
```

### b. Split the genome into 500kb tiles or windows.
```{r}
slen <- seqlengths(seqinfo) # get the length of the chromosomes
tileWidth <- 500000 # tile size of 500kb
tiles <- tileGenome(seqlengths = slen, tilewidth = tileWidth,
                    cut.last.tile.in.chrom = TRUE)
tiles
```

### 3.2 Finding overlap of ranges
One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the overlaps of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size.

For this example, let's find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35500000-37000000`). 

### a. Find the tiled ranges for chromosome `17`, starting `35500000` and ending `37000000`.
```{r}
tiles.subset <- tiles[seqnames(tiles) == "17" & start(tiles) >= 35500000 & end(tiles) <= 37000000]
tiles.subset
```

Alternatively, using `plyranges` and the `filter` verb.
```{r}
tiles.subset <- tiles %>% filter(seqnames == "17" & start >= 35500000 & end <= 37000000)
```

### b. Find the overlap between `segs.gr` and `tiles.subset`.
Let's find the segments in `segs.gr` (`query`) that overlap our `tiles.subset` (`subject`).

`plyranges` provides convenient functions that can bypass having to deal with hits/indices and returns the overlapped regions. Here, we use the function `find_overlaps`. This function will return all of the ranges in the `query` that overlap with the `subject`.
```{r}
segs.overlap <- find_overlaps(segs.gr, tiles.subset)  # arguments: find_overlaps(query, subject)
segs.overlap[1:2] # show first 2 segments from segs.gr that overlapped tiles.subset
```


## Exercise 1: 

### a. Create a range for `11:69400000-69500000`.
```{r}
# GRanges()
```

### b. Find overlap between `11:69400000-69500000` and `segs.gr`.
```{r}
# find_overlaps() from plyranges
```

### c. What is the `Segment_Mean` for the 2nd segment that overlaps `11:69400000-69500000`?
```{r}
# index the 2nd segment in the result to (b)
```