This repository has been archived by the owner on May 10, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 16
/
README.Rmd
162 lines (106 loc) · 6.41 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "img/"
)
library(bindrcpp)
```
# umapr
[![Travis-CI Build Status](https://travis-ci.org/ropenscilabs/umapr.svg?branch=master)](https://travis-ci.org/ropenscilabs/umapr)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/juyeongkim/umapr?branch=master&svg=true)](https://ci.appveyor.com/project/juyeongkim/umapr)
[![codecov](https://codecov.io/gh/ropenscilabs/umapr/branch/master/graph/badge.svg)](https://codecov.io/gh/ropenscilabs/umapr)
`umapr` wraps the Python implementation of UMAP to make the algorithm accessible from within R. It uses the great [`reticulate`](https://cran.r-project.org/web/packages/reticulate/index.html) package.
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction algorithm. It is similar to t-SNE but computationally more efficient. UMAP was created by Leland McInnes and John Healy ([github](https://github.com/lmcinnes/umap), [arxiv](https://arxiv.org/abs/1802.03426)).
Recently, two new UMAP R packages have appeared. These new packages provide more features than `umapr` does and they are more actively developed. These packages are:
* [umap](https://github.com/tkonopka/umap), which provides the same Python wrapping function as `umapr` and also an R implementation, removing the need for the Python version to be installed. It is available on [CRAN](https://cran.r-project.org/web/packages/umap/index.html).
* [uwot](https://github.com/jlmelville/uwot), which also provides an R implementation, removing the need for the Python version to be installed.
## Contributors
[Angela Li](https://github.com/angela-li), [Ju Kim](https://github.com/juyeongkim), [Malisa Smith](https://github.com/malisas), [Sean Hughes](https://github.com/seaaan), [Ted Laderas](https://github.com/laderast)
`umapr` is a project that was first developed at [rOpenSci Unconf 2018](http://unconf18.ropensci.org).
## Installation
**First**, you will need to install `Python` and the `UMAP` package. Instruction available [here](https://github.com/lmcinnes/umap#installing).
Then, you can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("ropenscilabs/umapr")
```
## Basic use
Here is an example of running UMAP on the `iris` data set.
```{r message=FALSE, warning=FALSE, fig.width=7}
library(umapr)
library(tidyverse)
# select only numeric columns
df <- as.matrix(iris[ , 1:4])
# run UMAP algorithm
embedding <- umap(df)
```
`umap` returns a `data.frame` with two attached columns called "UMAP1" and "UMAP2". These columns represent the UMAP embeddings of the data, which are column-bound to the original data frame.
```{r}
# look at result
head(embedding)
# plot the result
embedding %>%
mutate(Species = iris$Species) %>%
ggplot(aes(UMAP1, UMAP2, color = Species)) + geom_point()
```
There is a function called `run_umap_shiny()` which will bring up a Shiny app for exploring different colors of the variables on the umap plots.
```{r eval=FALSE}
run_umap_shiny(embedding)
```
![Shiny App for Exploring Results](img/shiny.png)
## Function parameters
There are a few important parameters. These are fully described in the UMAP Python [documentation](https://github.com/lmcinnes/umap/blob/bf1c3e5c89ea393c9de10bd66c5e3d9bc30588ee/notebooks/UMAP%20usage%20and%20parameters.ipynb).
The `n_neighbor` argument can range from 2 to n-1 where n is the number of rows in the data.
```{r fig.width=7}
neighbors <- c(4, 8, 16, 32, 64, 128)
neighbors %>%
map_df(~umap(as.matrix(iris[,1:4]), n_neighbors = .x) %>%
mutate(Species = iris$Species, Neighbor = .x)) %>%
mutate(Neighbor = as.integer(Neighbor)) %>%
ggplot(aes(UMAP1, UMAP2, color = Species)) +
geom_point() +
facet_wrap(~ Neighbor, scales = "free")
```
The `min_dist` argument can range from 0 to 1.
```{r fig.width=7}
dists <- c(0.001, 0.01, 0.05, 0.1, 0.5, 0.99)
dists %>%
map_df(~umap(as.matrix(iris[,1:4]), min_dist = .x) %>%
mutate(Species = iris$Species, Distance = .x)) %>%
ggplot(aes(UMAP1, UMAP2, color = Species)) +
geom_point() +
facet_wrap(~ Distance, scales = "free")
```
The `distance` argument can be many different distance functions.
```{r fig.width=7}
dists <- c("euclidean", "manhattan", "canberra", "cosine", "hamming", "dice")
dists %>%
map_df(~umap(as.matrix(iris[,1:4]), metric = .x) %>%
mutate(Species = iris$Species, Metric = .x)) %>%
ggplot(aes(UMAP1, UMAP2, color = Species)) +
geom_point() +
facet_wrap(~ Metric, scales = "free")
```
## Comparison to t-SNE and principal components analysis
t-SNE and UMAP are both non-linear dimensionality reduction methods, in contrast to PCA. Because t-SNE is relatively slow, PCA is sometimes run first to reduce the dimensions of the data.
We compared UMAP to PCA and t-SNE alone, as well as to t-SNE run on data preprocessed with PCA. In each case, the data were subset to include only complete observations. The code to reproduce these findings are available in [`timings.R`](timings.R).
The first data set is the same iris data set used above (149 observations of 4 variables):
![t-SNE, PCA, and UMAP on iris](img/multiple_algorithms_iris.png)
Next we tried a cancer data set, made up of 699 observations of 10 variables:
![t-SNE, PCA, and UMAP on cancer](img/multiple_algorithms_cancer.png)
Third we tried a soybean data set. It is made up of 531 observations and 35 variables:
![t-SNE, PCA, and UMAP on soybeans](img/multiple_algorithms_bean.png)
Finally we used a large single-cell RNAsequencing data set, with 561 observations (cells) of 55186 variables (over 30 million elements)!
![t-SNE, PCA, and UMAP on rna](img/multiple_algorithms_rna.png)
PCA is orders of magnitude faster than t-SNE or UMAP (not shown). UMAP, though, is a substantial improvement over t-SNE both in terms of memory and time taken to run.
![Time to run t-SNE vs UMAP](img/multiple_algorithms_time.png)
![Memory to run t-SNE vs UMAP](img/multiple_algorithms_memory.png)
## Related projects
* [`umap`](https://github.com/tkonopka/umap): R implementation of UMAP
* [`seurat`](https://github.com/satijalab/seurat): R toolkit for single cell genomics
* [`smallvis`](https://github.com/jlmelville/smallvis): R package for dimensionality reduction of small datasets