---
title: "Seurat PBMC 3k tutorial using TENxPBMCData"
author: "Kevin Rue-Albrecht"
date: "`r BiocStyle::doc_date()`"
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE)
knitr::opts_chunk$set(cache=TRUE)
stopifnot({
    require(BiocStyle)
})
```

# Overview

In this example, we use count data for 2,700 peripheral blood mononuclear cells (PBMCs) obtained using the [10X Genomics](https://www.10xgenomics.com) platform, and process it following the [Guided Clustering Tutorial](https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html) of the `r CRANpkg("Seurat")` package.

# Getting the data

First, fetch the data as a `r Biocpkg("SingleCellExperiment")` object using the `r Biocpkg("TENxPBMCData")` package.
The first time the following code chunk is run, expect it to take longer, because it downloads the data from the web and caches it on the local machine; later runs should take only a few seconds, because the data set is then loaded from the local cache.

```{r, message=FALSE}
library(TENxPBMCData)
tenx_pbmc3k <- TENxPBMCData(dataset="pbmc3k")
colnames(tenx_pbmc3k) <- paste0("Cell", seq_len(ncol(tenx_pbmc3k)))
tenx_pbmc3k
```

# Preparing the data

Next, prepare a sparse count matrix that emulates the input used in the first section of the [Guided Clustering Tutorial](https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html).

```{r, message=FALSE}
library(Matrix)
pbmc.data <- as(counts(tenx_pbmc3k), "Matrix")
pbmc.data <- as(pbmc.data, "dgTMatrix")
```
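
As a minimal sketch of the kind of coercion performed above, the toy example below converts a small dense matrix to a sparse matrix in triplet form. The `toy_*` objects are illustrative and not part of the tutorial. The sketch targets the virtual class `"TsparseMatrix"`, on the assumption that this yields a `dgTMatrix` for a general numeric matrix.

```{r}
library(Matrix)
# Toy dense matrix: 2 genes (rows) x 3 cells (columns); values are made up.
toy_dense <- matrix(c(0, 2, 0, 0, 0, 1), nrow=2,
    dimnames=list(c("geneA", "geneB"), c("Cell1", "Cell2", "Cell3")))
# Compressed sparse column form, the default sparse storage of Matrix().
toy_sparse <- Matrix(toy_dense, sparse=TRUE)
# Triplet (row index, column index, value) form.
toy_triplet <- as(toy_sparse, "TsparseMatrix")
class(toy_triplet)
```

Only the storage format changes; the counts themselves are identical in all three objects.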

# Seurat - Guided Clustering Tutorial

From here on, follow the [Guided Clustering Tutorial](https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html) closely (code obtained on 2018-11-24); the few deviations are noted where they occur.

```{r, message=FALSE}
library(Seurat)
library(dplyr)
# Initialize the Seurat object with the raw (non-normalized data).  Keep all
# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
# least 200 detected genes
pbmc <- CreateSeuratObject(raw.data=pbmc.data, min.cells=3, min.genes=200, project="10X_PBMC")
pbmc
```
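
The two thresholds in the call above can be illustrated on a toy matrix. This is a sketch of the idea only, not Seurat's implementation: the order of the two filters (cells first, then genes) and the use of `>=` are assumptions, and `toy_min_cells` and `toy_min_genes` are hypothetical stand-ins for `min.cells` and `min.genes`.

```{r}
# Toy counts: 4 genes (rows) x 5 cells (columns); values are made up.
toy_counts <- matrix(
    c(1, 0, 0, 3,
      2, 1, 0, 0,
      0, 1, 0, 0,
      4, 2, 0, 1,
      0, 0, 0, 0),
    nrow=4,
    dimnames=list(paste0("gene", 1:4), paste0("Cell", 1:5)))
toy_min_cells <- 2  # hypothetical stand-in for min.cells
toy_min_genes <- 2  # hypothetical stand-in for min.genes
# Keep cells with at least toy_min_genes detected genes (assumed to run first).
keep_cells <- colSums(toy_counts > 0) >= toy_min_genes
toy_filtered <- toy_counts[, keep_cells, drop=FALSE]
# Keep genes detected in at least toy_min_cells of the remaining cells.
keep_genes <- rowSums(toy_filtered > 0) >= toy_min_cells
toy_filtered <- toy_filtered[keep_genes, , drop=FALSE]
dim(toy_filtered)
```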

```{r}
# The number of genes and UMIs (nGene and nUMI) are automatically calculated
# for every object by Seurat.  For non-UMI data, nUMI represents the sum of
# the non-normalized values within a cell.  We calculate the percentage of
# mitochondrial genes here and store it in percent.mito using AddMetaData.
# We use object@raw.data since this represents non-transformed and
# non-log-normalized counts.  The % of UMI mapping to MT-genes is a common
# scRNA-seq QC metric.
# Here the mitochondrial genes are selected by gene symbol (Symbol_TENx) and
# mapped to their ENSEMBL_ID before being matched to the row names.
mito.genes <- subset(rowData(tenx_pbmc3k), grepl("^MT-", Symbol_TENx), "ENSEMBL_ID", drop=TRUE)
mito.genes <- intersect(mito.genes, rownames(pbmc@raw.data))
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to
# stash QC stats
pbmc <- AddMetaData(object=pbmc, metadata=percent.mito, col.name="percent.mito")
VlnPlot(object=pbmc, features.plot=c("nGene", "nUMI", "percent.mito"), nCol=3)
```
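
The computation above can be checked by hand on a toy matrix. The sketch below uses made-up counts and gene names; note that the result is a fraction between 0 and 1 rather than a percentage, which is why the upper threshold applied to `percent.mito` later in this notebook is `0.05`.

```{r}
# Toy UMI counts: 3 genes (rows) x 3 cells (columns); values are made up.
toy_umi <- matrix(c(8, 2, 0,  5, 0, 5,  9, 0, 1), nrow=3,
    dimnames=list(c("geneA", "MT-ND1", "MT-CO1"), paste0("Cell", 1:3)))
# Select the mitochondrial genes by their "MT-" prefix.
toy_mito <- grep("^MT-", rownames(toy_umi), value=TRUE)
# Fraction of each cell's counts that map to mitochondrial genes.
toy_fraction <- colSums(toy_umi[toy_mito, , drop=FALSE]) / colSums(toy_umi)
toy_fraction
```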

```{r}
# GenePlot is typically used to visualize gene-gene relationships, but can
# be used for anything calculated by the object, i.e. columns in
# object@meta.data, PC scores etc.  Since there is a rare subset of cells
# with an outlier level of high mitochondrial percentage and also low UMI
# content, we filter these as well
par(mfrow=c(1, 2))
GenePlot(object=pbmc, gene1="nUMI", gene2="percent.mito")
GenePlot(object=pbmc, gene1="nUMI", gene2="nGene")
```

```{r}
# We filter out cells that have unique gene counts over 2,500 or less than
# 200.  Note that low.thresholds and high.thresholds are used to define a
# 'gate'.  -Inf and Inf should be used if you don't want a lower or upper
# threshold.
pbmc <- FilterCells(object=pbmc, subset.names=c("nGene", "percent.mito"), 
    low.thresholds=c(200, -Inf), high.thresholds=c(2500, 0.05))
```
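
The 'gate' described in the comment above can be sketched in base R on a toy metadata table. The values are made up, and the strict comparisons (`>` and `<`) are an assumption about how the thresholds are applied rather than a statement about Seurat's internals.

```{r}
# Toy per-cell metadata; values are made up.
toy_meta <- data.frame(
    nGene=c(150, 800, 2600, 1200),
    percent.mito=c(0.01, 0.02, 0.03, 0.08),
    row.names=paste0("Cell", 1:4))
toy_low <- c(nGene=200, percent.mito=-Inf)   # -Inf: no lower bound
toy_high <- c(nGene=2500, percent.mito=0.05)
# A cell passes only if every variable lies strictly inside its gate.
toy_pass <- rep(TRUE, nrow(toy_meta))
for (toy_var in names(toy_low)) {
    toy_pass <- toy_pass &
        toy_meta[[toy_var]] > toy_low[[toy_var]] &
        toy_meta[[toy_var]] < toy_high[[toy_var]]
}
rownames(toy_meta)[toy_pass]
```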

```{r}
pbmc <- NormalizeData(object=pbmc, normalization.method="LogNormalize", 
    scale.factor=10000)
```
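
The `LogNormalize` method divides the counts of each cell by the total counts for that cell, multiplies by the scale factor, and log-transforms the result with `log1p`. The toy example below is a base R sketch of that arithmetic on made-up counts, not the package code.

```{r}
# Toy raw counts: 3 genes (rows) x 2 cells (columns); values are made up.
toy_raw <- matrix(c(1, 3, 0,  10, 0, 10), nrow=3,
    dimnames=list(c("geneA", "geneB", "geneC"), c("Cell1", "Cell2")))
toy_scale_factor <- 10000
# Divide each column by its total, rescale, then apply log(1 + x).
toy_norm <- log1p(sweep(toy_raw, 2, colSums(toy_raw), "/") * toy_scale_factor)
toy_norm
```

After back-transforming with `expm1`, every cell sums to the scale factor, so cells with different sequencing depths become comparable.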

```{r}
pbmc <- FindVariableGenes(object=pbmc, mean.function=ExpMean, dispersion.function=LogVMR, 
    x.low.cutoff=0.0125, x.high.cutoff=3, y.cutoff=0.5)
```
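
The two statistics named in the call above can be illustrated on toy data. The functions `toy_exp_mean` and `toy_log_vmr` below are hypothetical re-implementations written for this sketch, assuming that both statistics are computed after transforming the log-normalized values back to the linear scale; they are not the package's own code. The binning and scaling of dispersions that happens before the cutoffs are applied is omitted.

```{r}
# Hypothetical helpers: mean and log variance-to-mean ratio on the linear scale.
toy_exp_mean <- function(x) log(mean(expm1(x)) + 1)
toy_log_vmr <- function(x) log(var(expm1(x)) / mean(expm1(x)))
# Two toy genes with the same average expression but different variability.
toy_log_expr <- log1p(rbind(
    stable=c(10, 10, 11, 9),
    variable=c(0, 40, 0, 0)))
toy_stats <- data.frame(
    mean=apply(toy_log_expr, 1, toy_exp_mean),
    dispersion=apply(toy_log_expr, 1, toy_log_vmr))
toy_stats
```

Both genes have the same mean, but the gene expressed in a single cell has a much larger dispersion, which is the property used to rank genes as variable.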

```{r}
length(x=pbmc@var.genes)
```

```{r}
pbmc <- ScaleData(object=pbmc, vars.to.regress=c("nUMI", "percent.mito"))
```

```{r}
pbmc <- RunPCA(object=pbmc, pc.genes=pbmc@var.genes, do.print=TRUE, pcs.print=1:5, 
    genes.print=5)
```

```{r}
# Examine and visualize PCA results a few different ways
PrintPCA(object=pbmc, pcs.print=1:5, genes.print=5, use.full=FALSE)
```

```{r}
VizPCA(object=pbmc, pcs.use=1:2)
```

```{r}
PCAPlot(object=pbmc, dim.1=1, dim.2=2)
```

```{r}
# ProjectPCA scores each gene in the dataset (including genes not included
# in the PCA) based on their correlation with the calculated components.
# Though we don't use this further here, it can be used to identify markers
# that are strongly correlated with cellular heterogeneity, but may not have
# passed through variable gene selection.  The results of the projected PCA
# can be explored by setting use.full=T in the functions above
pbmc <- ProjectPCA(object=pbmc, do.print=FALSE)
```

```{r}
PCHeatmap(object=pbmc, pc.use=1, cells.use=500, do.balanced=TRUE, label.columns=FALSE)
```

```{r}
PCHeatmap(object=pbmc, pc.use=1:12, cells.use=500, do.balanced=TRUE, 
    label.columns=FALSE, use.full=FALSE)
```

As a small deviation from the tutorial, skip the lengthy JackStraw computation: the chunk below is wrapped in `if (FALSE)` so that it is not evaluated.

```{r}
# NOTE: This process can take a long time for big datasets, comment out for
# expediency.  More approximate techniques such as those implemented in
# PCElbowPlot() can be used to reduce computation time
if (FALSE) {
    pbmc <- JackStraw(object=pbmc, num.replicate=100, display.progress=FALSE)
    JackStrawPlot(object=pbmc, PCs=1:12)
}
```

```{r}
PCElbowPlot(object=pbmc)
```

```{r}
# save.SNN=T saves the SNN so that the clustering algorithm can be rerun
# using the same graph but with a different resolution value (see docs for
# full details)
pbmc <- FindClusters(object=pbmc, reduction.type="pca", dims.use=1:10, 
    resolution=0.6, print.output=0, save.SNN=TRUE)
```

```{r}
PrintFindClustersParams(object=pbmc)
```

```{r}
pbmc <- RunTSNE(object=pbmc, dims.use=1:10, do.fast=TRUE)
```

```{r}
# note that you can set do.label=T to help label individual clusters
TSNEPlot(object=pbmc)
```

Save the `seurat` object.

```{r}
saveRDS(pbmc, file="pbmc3k_tutorial.rds")
```

Save the original `SingleCellExperiment` object, after:

- removing the cells excluded by quality metrics during the Seurat workflow
- adding the cluster assignments
- adding the PCA and t-SNE dimensionality reduction results


```{r}
tenx_pbmc3k <- tenx_pbmc3k[, pbmc@cell.names]
colData(tenx_pbmc3k)[["seurat.ident"]] <- as.factor(pbmc@ident)
reducedDim(tenx_pbmc3k, "PCA") <- pbmc@dr$pca@cell.embeddings
reducedDim(tenx_pbmc3k, "TSNE") <- pbmc@dr$tsne@cell.embeddings
colnames(tenx_pbmc3k) <- NULL # remove the dummy cell names before exporting
saveRDS(tenx_pbmc3k, file="pbmc3k_tutorial.sce.rds")
```

Export the cluster identity vector.
It will be included in the `r Githubpkg("kevinrue/hancock")` package, and mapped using the dummy cell names defined at the start of this notebook.

```{r}
ident <- pbmc@ident
names(ident) <- pbmc@cell.names
saveRDS(ident, "pbmc3k.ident.rds")
```

# See also

- [Applying Gene Signatures to the Seurat PBMC 3k Tutorial](1-proportion_signature.html)
- [Learning Gene Signatures from the Seurat PBMC 3k Tutorial](2-learn-signatures.html)

<button onclick="topFunction()" id="myBtn" title="Go to top">Back to top</button>