-
Notifications
You must be signed in to change notification settings - Fork 18
/
README.Rmd
211 lines (165 loc) · 7.58 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# plyranges: fluent genomic data analysis <img id="plyranges_logo" src="man/figures/logo.png" align="right" width = "125" />
<!-- badges: start -->
[![R-CMD-check-bioc](https://github.com/tidyomics/plyranges/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/tidyomics/plyranges/actions?query=workflow%3AR-CMD-check-bioc)
[![BioC status](http://www.bioconductor.org/shields/build/release/bioc/plyranges.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/plyranges)
<!-- badges: end -->
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) provides a consistent interface for importing and wrangling
genomics data from a variety of sources. The package defines a grammar of
genomic data transformation based on `dplyr` and the Bioconductor packages
`IRanges`, `GenomicRanges`, and `rtracklayer`. It does this by providing a set
of verbs for developing analysis pipelines based on _Ranges_ objects that
represent genomic regions:
* Modify genomic regions with the `mutate()` and `stretch()` functions.
* Modify genomic regions while fixing the start/end/center coordinates with the `anchor_` family of functions.
* Sort genomic ranges with `arrange()`.
* Modify, subset, and aggregate genomic data with the `mutate()`,
`filter()`, and `summarise()`functions.
* Any of the above operations can be performed on partitions of the
data with `group_by()`.
* Find nearest neighbour genomic regions with the `join_nearest_` family
of functions.
* Find overlaps between ranges with the `join_overlaps_` family of functions.
* Merge all overlapping and adjacent genomic regions with `reduce_ranges()`.
* Merge the end points of all genomic regions with `disjoin_ranges()`.
* Import and write common genomic data formats with the `read_/write_` family
of functions.
For more details on the features of plyranges, read the
[vignette](https://tidyomics.github.io/plyranges/articles/an-introduction.html).
For a complete case-study on using plyranges to combine ATAC-seq and RNA-seq
results read the [*fluentGenomics*
workflow](https://tidyomics.github.io/fluentGenomics).
plyranges is part of the [tidyomics](https://github.com/tidyomics)
project, providing a `dplyr`-based interface for many types of
genomics datasets represented in Bioconductor.
# Installation
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) can be installed from the latest Bioconductor
release:
```{r, eval=FALSE}
# install.packages("BiocManager")
BiocManager::install("plyranges")
```
To install the development version from GitHub:
```{r, eval=FALSE}
BiocManager::install("tidyomics/plyranges")
```
# Quick overview
## About `Ranges`
`Ranges` objects can either represent sets of integers as `IRanges` (which have
start, end and width attributes) or represent genomic intervals (which have
additional attributes, sequence name, and strand) as `GRanges`. In addition,
both types of `Ranges` can store information about their intervals as metadata
columns (for example GC content over a genomic interval).
`Ranges` objects follow the tidy data principle: each row of a `Ranges` object
corresponds to an interval, while each column will represent a variable about
that interval, and generally each object will represent a single unit of
observation (like gene annotations).
We can construct a `IRanges` object from a `data.frame` with a `start` or
`width` using the `as_iranges()` method.
```{r, message=FALSE}
library(plyranges)
df <- data.frame(start = 1:5, width = 5)
as_iranges(df)
# alternatively with end
df <- data.frame(start = 1:5, end = 5:9)
as_iranges(df)
```
We can also construct a `GRanges` object in a similar manner. Note that a
`GRanges` object requires at least a seqnames column to be present in the
data.frame (but not necessarily a strand column).
```{r}
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
start = 1:5,
width = 5)
as_granges(df)
# strand can be specified with `+`, `*` (mising) and `-`
df$strand <- c("+", "+", "-", "-", "*")
as_granges(df)
```
# Example: finding GWAS hits that overlap known exons
Let's look at a more a realistic example (taken from HelloRanges vignette).
```{r, include=FALSE}
dir <- system.file(package = "HelloRangesData", "extdata/")
genome <- as_granges(read.delim(file.path(dir, "hg19.genome"),
header = FALSE),
seqnames = V1, start = 1L, width = V2)
gwas <- read_bed(file.path(dir, "gwas.bed"), genome_info = genome)
exons <- read_bed(file.path(dir, "exons.bed"), genome_info = genome)
```
Suppose we have two _GRanges_ objects: one containing coordinates of known
exons and another containing SNPs from a GWAS.
The first and last 5 exons are printed below, there are two additional columns
corresponding to the exon name, and a score.
We could check the number of exons per chromosome using `group_by` and
`summarise`.
```{r}
exons
exons %>%
group_by(seqnames) %>%
summarise(n = n())
```
Next we create a column representing the transcript_id with `mutate`:
```{r}
exons <- exons %>%
mutate(tx_id = sub("_exon.*", "", name))
```
To find all GWAS SNPs that overlap exons, we use `join_overlap_inner`. This
will create a new _GRanges_ with the coordinates of SNPs that overlap exons, as
well as metadata from both objects.
```{r}
olap <- join_overlap_inner(gwas, exons)
olap
```
For each SNP we can count the number of times it overlaps a transcript.
```{r}
olap %>%
group_by(name.x, tx_id) %>%
summarise(n = n())
```
We can also generate 2bp splice sites on either side of the exon using
`flank_left` and `flank_right`. We add a column indicating the side of flanking
for illustrative purposes. The `interweave` function pairs the left and right
ranges objects.
```{r}
left_ss <- flank_left(exons, 2L)
right_ss <- flank_right(exons, 2L)
all_ss <- interweave(left_ss, right_ss, .id = "side")
all_ss
```
# Learning more
- The [*fluentGenomics* workflow](https://sa-lee.github.io/fluentGenomics) package shows you how to combine differential expression genes and differential chromatin accessibility peaks using plyranges. It extends the [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love for using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html).
- The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed
walk through of using plyranges for coverage analysis.
- The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data.
# Citation
If you found `plyranges` useful for your work please cite our
[paper](http://dx.doi.org/10.1186/s13059-018-1597-8):
```
@ARTICLE{Lee2019,
title = "plyranges: a grammar of genomic data transformation",
author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
journal = "Genome Biol.",
volume = 20,
number = 1,
pages = "4",
month = jan,
year = 2019,
url = "http://dx.doi.org/10.1186/s13059-018-1597-8",
doi = "10.1186/s13059-018-1597-8",
pmc = "PMC6320618"
}
```
# Contributing
We welcome contributions from the R/Bioconductor community. We ask that
contributors follow the [code of conduct](.github/CODE_OF_CONDUCT.md) and the guide
outlined [here](.github/CONTRIBUTING.md).