Integration of the human lung cell atlas with metacells using a supervised workflow

Integration of the human lung cell atlas with metacells using a supervised workflow

In this example we will work with the Human Cell Lung Atlas core HLCA gathering around 580,000 cells from 107 individuals distributed in 166 samples.

Taking advantage of the single-cell annotation of the original study we will build metacell for each cell type in each sample and guide the integration with the cell type label using STACAS.

Setting up the environment

Be sure to be in the MetacellAnalysisToolkit environment when you are running this Rmarkdown.

library(Seurat)

## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0

## Attaching SeuratObject

library(anndata)
library(SuperCell)
library(ggplot2)

wilcox.test <- "wilcox"
if(packageVersion("Seurat") >= 5) {
  options(Seurat.object.assay.version = "v4") 
  wilcox.test <- "wilcox_limma"
  print("you are using seurat v5 with assay option v4")}

color.celltypes  <- c('#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87',
                      '#E95C59', '#E59CC4', '#AB3282', '#23452F', '#BD956A', '#8C549C', '#585658',
                      '#9FA3A8', '#E0D4CA', '#5F3D69', '#58A4C3', "#b20000",'#E4C755', '#F7F398',
                      '#AA9A59', '#E63863', '#E39A35', '#C1E6F3', '#6778AE', '#91D0BE', '#B53E2B',
                      '#712820', '#DCC1DD', '#CCE0F5', '#CCC9E6', '#625D9E', '#68A180', '#3A6963',
                      '#968175')

Downloading the data (done in the previous example)

If you didn’t try the unsupervised example first and haven’t downloaded the data yet you can do it from cellxgene. Choose the .h5ad option after clicking on the download button for the core atlas (3 tissues, 584,944 cells).

You can use a bash command line of this form to download the data directly in the ./HLCA_data directory. You will have to update the link (obtained by clicking on download, .h5ad selection) as links are temporary.

Please note that this may take some time (~45 mins) as the file is quite large (5.6 GB).

#Uncomment to download the data in the ./HLCA_data/ directory after updating the link
#mkdir -p ./HLCA_data
#curl -o ./HLCA_data/local.h5ad "https://corpora-data-prod.s3.amazonaws.com/7bcad396-49c3-40d9-80c1-16d74e7b88bd/local.h5ad?AWSAccessKeyId=ASIATLYQ5N5XZ2V3CYXW&Signature=CI8hgXdSO2ewDXpP%2FCb7ouxW6R8%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEC0aCXVzLXdlc3QtMiJGMEQCIGoOTAGVxanApGEIeRVOL%2BRK7silMZiTtgLE%2BXguyjPjAiARoOLhXmQwzwHgme2Ll0OIZK0VIrBLaH3bSbFzRzBfuSrrAwh2EAEaDDIzMTQyNjg0NjU3NSIMbCRmBRpD%2BT0U5T8%2BKsgDcLw0fAhlIgdEjdOw%2FvUOo36uXvDClcBPXmosjNUDGVIYy67gprxvikZ%2FZHqtu%2BnodejEEIIxGJw2kv0l7dcjmGgP9IFLP6WBmsGekfI7kFCkFypmZtKXqggx9stp2K3MZCrsfcEcWttsV62c690lzdiQ4UI4lUqGqXq8C7Ah1RnxfXPQJsa3YKmHs39c3mX%2BHG5Nv4rydgzhkWE7qTkGxZvqV1cLuPMz2X78zBq5GXY0HTaGvGMgAzE5OcKbqF50sxmh0pE7PGmvz1wLYN8LB6YpMbD8qCXMdP7e4uBk2yjkK23m5m%2FrMVrCWEarSh5QqrzDR347XTg%2BkVDY301ygqy3GpCTq342sTKmUZH0PRhkliGyKvakNQU4QBy6meSQORvRX1WEhn0cRYPygyD9ugK2sDqtBl0JXUlEfqSDmE%2BXGDoRFGnKiTDSvnHhVgj64h4eTUcutZFdTILwMaYGEIl1ItElCptqvYS3rmrzdvAr5nSjx%2BnK9tKt6linyh%2Bau7zc6IfQSTzZoMut%2Fw1fOuCQ%2BQmxCaEyBXzfTTrx4%2FuxyiYAkPN0vLTtSvtuklZH7O1axMTQIonnFDsnKeVnUzl3ZEgdUbxhMLL20qoGOqYBdtJOXqTiQUDX4ZH0ReubHpog%2BorDorDJ0B08Edu6k36SwuSNu6Hv8MW%2BdWFVfqs0X%2Fx74oMs8yQC8T1gSG2HrlCfLoWIBep9lA9EHq4vUBhYB4mmJ7Fsc2MdhOtof%2BzrE8b1ILxU%2Fdeliek9Aqz0uBWcfJsEu%2FlHrC1sX4P5F8nytcLxvzCTGB43mPHeqB5DZaAKC%2FY8SmSa9CJ1Njfz8n%2FIuTLv8w%3D%3D&Expires=1700662555"

First we need to specify that we will work with the MATK conda environment for the anndata package relying on reticulate and for the MATK tool.

library(reticulate)
conda_env <-  conda_list()[reticulate::conda_list()$name == "MetacellAnalysisToolkit","python"]

Sys.setenv(RETICULATE_PYTHON = conda_env)

Splitting atlas by datasets

adata <- read_h5ad("./HLCA_data/local.h5ad",backed = "r")
adata$var_names <- adata$var$feature_name # We will use gene short name for downstream analyses
datasets <- unique(adata$obs$dat)

If you did not try the HLCA_core_atlas example first and haven’t divided the atlas in one h5ad file for each dataset you can do it with the following chunk.

# #Uncoment to split atlas by datasets 
# t0.split <- Sys.time()
# 

# 
# # If you are limited in time you can process on half of the datasets (uncomment th following line)
# # datasets <- datasets[1:7]
# 
# 
# print(dim(adata))
# 
# lapply(datasets,FUN =  function(x) {
#   dir.create(paste0("./HLCA_data/datasets/",x),recursive = T)
#   adata.dataset <- AnnData(X = adata[adata$obs$dataset == x]$raw$X,
#                            var = adata[adata$obs$dataset == x]$var,
#                            obs = adata[adata$obs$dataset == x]$obs)
#   #This will allow us to construct supervised metacell for each cell type in each sample later in the second example
#   adata.dataset$obs$ann <- as.character(adata.dataset$obs$ann_level_3)
#   # For cell without an annotation at the 3rd level we will use the second level of annotation
#   adata.dataset$obs$ann[adata.dataset$obs$ann_level_3 == 'None'] = as.character(adata.dataset$obs$ann_level_2[adata.dataset$obs$ann_level_3 == 'None'])
#   adata.dataset$obs$ann_sample <- paste0(adata.dataset$obs$ann,"_",adata.dataset$obs$sample)
# 
#   write_h5ad(adata.dataset,paste0("./HLCA_data/datasets/",x,"/sc_adata.h5ad"))
# }
# )
# 
# tf.split <- Sys.time()
# 
# tf.split - t0.split

remove(adata)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 3083518 164.7    5491658 293.3  5491658 293.3
## Vcells 5756289  44.0   31167873 237.8 36487015 278.4

Constructing supervised metacell

Sikkema et al made a remarkable job in finely annotating hundreds thousands of cells. Within the framework of this re-analysis, let’s now try to use this prior knowledge to obtain slightly better results using a supervised workflow.

We added previously a ann_sample column in the metadata of the single cell object. We now can use it to build metacell for each cell type in each sample.

If you are limited in memory you should still be able to process the samples by reducing the number of cores (e.g. -l 3) or by sequentially processing the samples (just remove the -l) in a slightly longer time.

This should take around 30 minutes.

for d in ./HLCA_data/datasets/*;
do ../cli/MATK -t SuperCell -i $d/sc_adata.h5ad -o $d/sup_mc -a ann_sample -l 12 -n 50 -f 2000 -k 30 -g 50 -s adata
done

Load metacell objects

We load the .h5ad objects and directly convert them in Seurat objects to benefit from all the functions of this framework.

datasets <- list.dirs("./HLCA_data/datasets/",full.names = F,recursive = F)

metacell.files <- sapply(datasets, FUN = function(x){paste0("./HLCA_data/datasets/",x,"/sup_mc/mc_adata.h5ad")})

metacell.objs <- lapply(X = metacell.files, function(X){
  adata <- read_h5ad(X)
  countMatrix <- Matrix::t(adata$X)
  colnames(countMatrix) <- adata$obs_names
  rownames(countMatrix) <- adata$var_names
  sobj <- Seurat::CreateSeuratObject(counts = countMatrix,meta.data = adata$obs)
  if(packageVersion("Seurat") >= 5) {sobj[["RNA"]] <- as(object = sobj[["RNA"]], Class = "Assay")}
  sobj <- RenameCells(sobj, add.cell.id = unique(sobj$sample)) # we give unique name to metacells
  return(sobj)
})

Merging objects and basic quality control

Given the single-cell metadata, the MATK tool automatically assign annotations to metacells and computes purities for all the categorical variables present in the metadata of the input single-cell object.

Thus, let’s check the purity of our metacells at different level of annotations, as well as their size (number of single cells they contain).

To do so we merge the object together and use Seurat VlnPlot function.

unintegrated.mc <- merge(metacell.objs[[1]],metacell.objs[-1])

VlnPlot(unintegrated.mc[,unintegrated.mc$ann_level_3 != "None"],features = c("size","ann_level_2_purity"),group.by = 'dataset',pt.size = 0.001,ncol=2)

## Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
## idents, : All cells have the same value of ann_level_2_purity.

VlnPlot(unintegrated.mc[,unintegrated.mc$ann_level_3 != "None"],features = c("ann_level_3_purity","ann_level_4_purity"),group.by = 'dataset',pt.size = 0.001,ncol=2)

## Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
## idents, : All cells have the same value of ann_level_3_purity.

We can also use box plots.

p_4 <- ggplot(unintegrated.mc@meta.data,aes(x=dataset,y=ann_level_4_purity,fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45)) + ggtitle("sup metacells level 4 purity") + NoLegend() + ylim(c(0,1))

p_finest <- ggplot(unintegrated.mc@meta.data,aes(x=dataset,y=ann_finest_level_purity,fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))  + ggtitle("sup metacells finest level purity") + NoLegend() + ylim(c(0,1))

p_4 + p_finest

Overall using supervised metacells construction we obtain pure metacell until the 3rd level of annotaion and improve metacell purities for finer levels compared to the unsupervised approach (see previous example).

meta.data.unsup <- readRDS("./HLCA_data/combined.mc.unsup.rds")@meta.data 

p_4_unsup <- ggplot(meta.data.unsup,aes(x=dataset,y=ann_level_4_purity,fill = dataset)) + geom_boxplot() +
    scale_x_discrete(guide = guide_axis(angle = 45)) + ggtitle("unsup metacells level 4 purity") + NoLegend() + ylim(c(0,1))

p_finest_unsup <- ggplot(meta.data.unsup,aes(x=dataset,y=ann_finest_level_purity,fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))  + ggtitle("unsup metacells finest level purity") + NoLegend() + ylim(c(0,1))

p_4_unsup | p_4

p_finest_unsup + p_finest

Unintegrated analysis

Let’s first do a standard dimensionality reduction without batch correction.

DefaultAssay(unintegrated.mc) <- "RNA"
unintegrated.mc <- NormalizeData(unintegrated.mc)
unintegrated.mc <- FindVariableFeatures(unintegrated.mc)
unintegrated.mc <- ScaleData(unintegrated.mc)
unintegrated.mc <- RunPCA(unintegrated.mc)
unintegrated.mc <- RunUMAP(unintegrated.mc,dims = 1:30)

umap.unintegrated.datasets <- DimPlot(unintegrated.mc,reduction = "umap",group.by = "dataset") + NoLegend() + ggtitle("unintegrated datasets")
umap.unintegrated.types <- DimPlot(unintegrated.mc,reduction = "umap",group.by = "ann_level_2",label = T,repel = T,cols = color.celltypes)+ NoLegend() + ggtitle("unintegrated cell types")

umap.unintegrated.datasets + umap.unintegrated.types

You can see on the plots that a batch effect is clearly present at the metacell level. Let’s correct it using a supervised approach.

STACAS integration

In the original study, datasets were integrated using SCANVI semi-supervised integration using partial annotation obtained for each dataset prior integration. Here in this second example we propose to use a similar approach in R using STACAS. We will use the “ann” labels we used to construct the metacells (3rd level of annotation if available for the cell, otherwise 2nd level).

To be noted that, as in the original study, we use the dataset rather than the donor as the batch parameter. See method section Data integration benchmarking of the original study for more details.

# Install package if needed
if (!requireNamespace("STACAS")) remotes::install_github("carmonalab/STACAS",upgrade = "never")

library(STACAS)

n.metacells <- sapply(metacell.objs,FUN = function(x){ncol(x)})
names(n.metacells) <- datasets
ref.names <- sort(n.metacells,decreasing = T)[1:5]
ref.index <- which(datasets %in% names(ref.names))


# normalize and identify variable features for each dataset independently
metacell.objs <- lapply(X = metacell.objs, FUN = function(x) {
  DefaultAssay(x) <- "RNA";
  x <- RenameCells(x, add.cell.id = unique(x$sample)) # we give unique name to metacells
  x <- NormalizeData(x)
  return(x)})
gc()


# Perform a supervised integration of the dataset using STACAS 
combined.mc <- Run.STACAS(object.list = metacell.objs, 
                          anchor.features = 2000, 
                          min.sample.size = 80,
                          k.weight = 80, #smallest dataset contains 86 metacells
                          cell.labels = "ann", # Note that by not you can use STACAS in its unsupervised mode
                          reference = ref.index, # the 5 biggest datasets are used as reference
                          dims = 1:30)

remove(metacell.objs) # We don't need the object list anymore
gc()

Check the obtained object

combined.mc

## An object of class Seurat 
## 30024 features across 12914 samples within 2 assays 
## Active assay: integrated (2000 features, 2000 variable features)
##  1 other assay present: RNA
##  1 dimensional reduction calculated: pca

We can verify that the sum of metacell sizes correspond to the original number of single-cells

sum(combined.mc$size)

## [1] 584944

STACAS directly returns a pca for the slot "integrated" that we can use to make a UMAP of the corrected data.

DefaultAssay(combined.mc) = "integrated"

combined.mc <- RunUMAP(combined.mc, dims = 1:30,reduction =  "pca",reduction.name = "umap")

Now we can make the plots and visually compare the results with the unintegrated analysis.

umap.stacas.datasets <- DimPlot(combined.mc,reduction = "umap",group.by = "dataset") + NoLegend() + ggtitle("integrated datasets")
umap.stacas.celltypes <- DimPlot(combined.mc,reduction = "umap",group.by = "ann_level_2",label = T,repel = T,cols = color.celltypes) + NoLegend() + ggtitle("integrated cell types")

umap.stacas.datasets + umap.stacas.celltypes + umap.unintegrated.datasets + umap.unintegrated.types

STACAS efficiently corrected the batch effect in the data while keeping the cell type separated.

We can navigate in the different annotation levels.

library(ggplot2)

DimPlot(combined.mc,group.by = "ann_level_1",reduction = "umap",label = T, repel = T,cols= color.celltypes) + NoLegend()

DimPlot(combined.mc,group.by = "ann_level_2",reduction = "umap",label = T, repel = T,cols= color.celltypes) + NoLegend()

DimPlot(combined.mc,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= color.celltypes) + NoLegend()

Comparison with unsupervised analysis

we can quickly visually compare these results with the unsupervised integration obtained with Seurat

combined.mc.unsup <- readRDS("./HLCA_data/combined.mc.unsup.rds")

combined.mc$ann_level_3 <- factor(combined.mc$ann_level_3)
matched.color.celltypes <- color.celltypes[1:length(levels(combined.mc$ann_level_3))]
names(matched.color.celltypes) <- levels(combined.mc$ann_level_3)

level3_sup <- DimPlot(combined.mc,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Sup workflow")
level3_unsup <- DimPlot(combined.mc.unsup,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Unsup workflow")
 
level3_sup + level3_unsup

Look at epithelial cells in particular

level3_sup <- DimPlot(combined.mc[,combined.mc$ann_level_1 == "Epithelial"],group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Sup workflow")
level3_unsup <- DimPlot(combined.mc.unsup[,combined.mc.unsup$ann_level_1 == "Epithelial"],group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Unsup workflow")
 
level3_sup + level3_unsup

Downstream analysis

You can try conduce the same downstream analyses as in the previous example (clustering, cell type abundances, DEG …).

Here to show you the interest of supervised workflow with pure metacell we can zoom on the smooth muscle sub types. Despite the low metacell number for each cell type these different subtypes are separated on the UMAP, especially the rare FAM83D+ smooth muscles that were discovered in the original study.

combined.mc$ann <- factor(combined.mc$ann)
color.celltypes.ann <- color.celltypes[c(1:length(levels(combined.mc$ann)))]
names(color.celltypes.ann) <- levels(combined.mc$ann)

DimPlot(combined.mc[,combined.mc$ann_level_2 == "Smooth muscle"],group.by = "ann",cols = color.celltypes.ann)

Using a DEG analysis we can check if we retrieve their markers. MYH11 and CNN1 genes are canonical smooth muscle markers while FAM83D was found uniquely and consistently expressed by this rare cell type in the original study

DefaultAssay(combined.mc) <- "RNA"
Idents(combined.mc) <- "ann"
markersSmoothMuscle <- FindMarkers(combined.mc,ident.1 = "Smooth muscle FAM83D+",only.pos = T, logfc.threshold = 0.25,test.use =  wilcox.test)

## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session

head(markersSmoothMuscle)

##               p_val avg_log2FC pct.1 pct.2     p_val_adj
## MYOCD 4.887974e-176  1.3879478 0.758 0.022 1.369806e-171
## NMRK2 1.092465e-129  0.4261093 0.273 0.003 3.061523e-125
## PLN   4.060884e-124  3.1234102 0.879 0.044 1.138022e-119
## HSPB3 7.779955e-121  1.0321301 0.545 0.016 2.180255e-116
## CASQ2 9.830136e-117  1.1149403 0.636 0.023 2.754797e-112
## ASB5  2.233460e-107  0.2845419 0.273 0.004 6.259049e-103

markersSmoothMuscle[c("MYH11","CNN1","FAM83D"),]

##               p_val avg_log2FC pct.1 pct.2    p_val_adj
## MYH11  2.456769e-32   4.250920 0.970 0.286 6.884850e-28
## CNN1   6.519506e-71   4.627323 0.970 0.106 1.827026e-66
## FAM83D 3.418146e-11   2.189558 0.636 0.284 9.579012e-07

# Many classical smooth muscles cells are not annotated at the 3rd level of annotation (labelled None)
VlnPlot(combined.mc,features = c("MYH11","CNN1","FAM83D"),group.by = "ann",ncol = 2,cols = color.celltypes.ann)

Conclusion

Taking advantage of the single cell annotation in a supervised workflow we could improve the precision of our metacell re-analysis. When cell annotations are given and of good quality, which is far from being the case every time, building metacells accordingly and use a supervised integration workflow should be preferred.

To be noted that we used an intermediary level of annotation to supervise our analysis, using a finer level for this data would have resulted in a longer time for metacell building. Plus, we would have obtained to few metacells per cell type in the different samples to be able to make an efficient supervised batch correction with STACAS.

To be more precise at the cost of computational efficiency one could also try to reduce the graining level of the analysis (using a graining level of 20 for instance),

To conclude, keep in mind that in one hand, for certain analysis such as very rare cell type analysis, we will struggle to achieve the same level of sensitivity with metacells compared to single-cells. On the other hand, you certainly won’t be able to analyze so many single-cells so easily, and you may not need extremely fine cell-type resolution for many analyses.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HLCA_core_atlas_supervised.md

HLCA_core_atlas_supervised.md

Integration of the human lung cell atlas with metacells using a supervised workflow

Setting up the environment

Downloading the data (done in the previous example)

Splitting atlas by datasets

Constructing supervised metacell

Load metacell objects

Merging objects and basic quality control

Unintegrated analysis

STACAS integration

Comparison with unsupervised analysis

Downstream analysis

Conclusion

Files

HLCA_core_atlas_supervised.md

Latest commit

History

HLCA_core_atlas_supervised.md

File metadata and controls

Integration of the human lung cell atlas with metacells using a supervised workflow

Setting up the environment

Downloading the data (done in the previous example)

Splitting atlas by datasets

Constructing supervised metacell

Load metacell objects

Merging objects and basic quality control

Unintegrated analysis

STACAS integration

Comparison with unsupervised analysis

Downstream analysis

Conclusion