1 Overview

In single-cell RNA sequencing (scRNA-seq) experiments, doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols (Zheng et al. 2017) involving thousands of cells. Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, they can incorrectly suggest the existence of intermediate populations or transitory states that not actually exist. Thus, it is desirable to remove doublet libraries so that they do not compromise interpretation of the results.

Several experimental strategies are available for doublet removal. One approach exploits natural genetic variation when pooling cells from multiple donor individuals (Kang et al. 2018). Doublets can be identified as libraries with allele combinations that do not exist in any single donor. Another approach is to mark a subset of cells (e.g., all cells from one sample) with an antibody conjugated to a different oligonucleotide (Stoeckius et al. 2017). Upon pooling, libraries that are observed to have different oligonucleotides are considered to be doublets and removed. These approaches can be highly effective but rely on experimental information that may not be available.

A more general approach is to infer doublets from the expression profiles alone (Dahlin et al. 2018). In this workflow, we will describe two purely computational approaches for detecting doublets from scRNA-seq data. The main difference between these two methods is whether or not they need cluster information beforehand. Both are implemented in the scran package from the open-source Bioconductor project (Huber et al. 2015). We will demonstrate the use of these methods on data from a droplet-based scRNA-seq study of the mouse mammary gland (Bach et al. 2017), available from NCBI GEO with the accession code GSE106273.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
base.path <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2834nnn/GSM2834500/suppl"
barcode.fname <- bfcrpath(bfc, file.path(base.path, 
    "GSM2834500%5FG%5F1%5Fbarcodes%2Etsv%2Egz"))
gene.fname <- bfcrpath(bfc, file.path(base.path,
    "GSM2834500%5FG%5F1%5Fgenes%2Etsv%2Egz"))
counts.fname <- bfcrpath(bfc, file.path(base.path,
    "GSM2834500%5FG%5F1%5Fmatrix%2Emtx%2Egz"))

2 Preparing the data

2.1 Reading in the counts

We create a SingleCellExperiment object from the count matrix. The files have been modified from the CellRanger output, so we have to manually load them in rather than using read10xCounts().

library(scater)
library(Matrix)
gene.info <- read.table(gene.fname, stringsAsFactors=FALSE)
colnames(gene.info) <- c("Ensembl", "Symbol")
sce <- SingleCellExperiment(
    list(counts=as(readMM(counts.fname), "dgCMatrix")), 
    rowData=gene.info, 
    colData=DataFrame(Barcode=readLines(barcode.fname))
)

We put some more meaningful information in the row and column names. Note the use of uniquifyFeatureNames() to generate unique row names from gene symbols.

rownames(sce) <- uniquifyFeatureNames(
    rowData(sce)$Ensembl, rowData(sce)$Symbol)
colnames(sce) <- sce$Barcode
sce
## class: SingleCellExperiment 
## dim: 27998 2915 
## metadata(0):
## assays(1): counts
## rownames(27998): Xkr4 Gm1992 ... Vmn2r122 CAAA01147332.1
## rowData names(2): Ensembl Symbol
## colnames(2915): AAACCTGAGGATGCGT-1 AAACCTGGTAGTAGTA-1 ...
##   TTTGTCATCCTTAATC-1 TTTGTCATCGAACGGA-1
## colData names(1): Barcode
## reducedDimNames(0):
## spikeNames(0):

We add some genomic location annotation for downstream use.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
chrloc <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keytype="GENEID", 
    keys=rowData(sce)$Ensembl, column="CDSCHROM")
rowData(sce)$Chr <- chrloc

2.2 Quality control

We compute quality control (QC) metrics using the calculateQCMetrics() function from the scater package (McCarthy et al. 2017).

is.mito <- rowData(sce)$Chr == "chrM"
summary(is.mito)
##    Mode   FALSE    TRUE    NA's 
## logical   21767      13    6218
sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(is.mito)))

We remove cells that are outliers for any of these metrics, as previously discussed. Note that some quality control was already performed by the authors of the original study, so relatively few cells are discarded here.

low.lib <- isOutlier(sce$total_counts, log=TRUE, nmads=3, type="lower")
low.nexprs <- isOutlier(sce$total_features_by_counts, log=TRUE, nmads=3, type="lower")
high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
discard <- low.lib | low.nexprs | high.mito
DataFrame(LowLib=sum(low.lib), LowNum=sum(low.nexprs), HighMito=sum(high.mito), 
    Discard=sum(discard), Kept=sum(!discard))
## DataFrame with 1 row and 5 columns
##      LowLib    LowNum  HighMito   Discard      Kept
##   <integer> <integer> <integer> <integer> <integer>
## 1         0         0       143       143      2772

We then subset the SingleCellExperiment object to remove these low-quality cells.

sce <- sce[,!discard]

2.3 Normalization for cell-specific biases

We apply the deconvolution method with pre-clustering (Lun, Bach, and Marioni 2016) to compute size factors for scaling normalization of cell-specific biases.

library(scran)
library(BiocSingular)
set.seed(1000)
clusters <- quickCluster(sce, BSPARAM=IrlbaParam())
table(clusters)
## clusters
##   1   2   3   4   5   6   7   8   9 
## 461 383 531 512 146 234 113 187 205
sce <- computeSumFactors(sce, clusters=clusters, min.mean=0.1) 
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2792  0.5237  0.7589  1.0000  1.2135 10.4998

We then compute log-normalized expression values for downstream use. This data set does not contain spike-in transcripts so separate normalization with computeSpikeFactors() is not required.

sce <- normalize(sce)
assayNames(sce)
## [1] "counts"    "logcounts"

2.4 Modelling and removing noise

As we have no spike-ins, we model technical noise using the makeTechTrend() function.

tech.trend <- makeTechTrend(x=sce)
fit <- trendVar(sce, use.spikes=FALSE)
plot(fit$mean, fit$var, pch=16, 
    xlab="Mean log-expression",
    ylab="Variance of log-expression")
curve(tech.trend(x), add=TRUE, col="red")
Variance of the log-expression values as a function of the mean log-expression in the mammary gland data set. Each point represents a gene, and the red line corresponds to Poisson variance.

Figure 1: Variance of the log-expression values as a function of the mean log-expression in the mammary gland data set. Each point represents a gene, and the red line corresponds to Poisson variance.

We use denoisePCA() to choose the number of principal components (PCs) to retain based on the technical noise per gene. We need to set the seed for reproducibility when BSPARAM=IrlbaParam(), due to the use of randomized methods from irlba.

set.seed(12345)
sce <- denoisePCA(sce, technical=tech.trend, BSPARAM=IrlbaParam())
ncol(reducedDim(sce))
## [1] 15

2.5 Clustering into subpopulations

We cluster cells into putative subpopulations using buildSNNGraph() (Xu and Su 2015). We use a higher k to increase connectivity and reduce the granularity of the clustering.

snn.gr <- buildSNNGraph(sce, use.dimred="PCA", k=25)
sce$Cluster <- factor(igraph::cluster_walktrap(snn.gr)$membership)
table(sce$Cluster)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 788 470 295 550  24  79  89 328  52  39  33  25

We visualize the clustering on a t-SNE plot (Van der Maaten and Hinton 2008). Figure 2 shows that there are a number of well-separated clusters as well as some more inter-related clusters.

set.seed(1000)
sce <- runTSNE(sce, use_dimred="PCA")
plotTSNE(sce, colour_by="Cluster")