1 Introduction

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for valid downstream analysis. However, existing methods such as removeBatchEffect() (Ritchie et al. 2015) assume that the composition of cell populations are either known or the same across batches. This workflow describes the application of an alternative strategy for batch correction based on the detection of mutual nearest neighbours (MNNs) (Haghverdi et al. 2018). The MNN approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches. We demonstrate its use on two human pancreas scRNA-seq datasets generated in separate studies.

2 Processing the different datasets

2.1 CEL-seq, GSE81076

2.1.1 Loading in the data

This dataset was generated by Grun et al. (2016) using the CEL-seq protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
grun.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
    "GSE81nnn/GSE81076/suppl/GSE81076%5FD2%5F3%5F7%5F10%5F17%2Etxt%2Egz"))

We first read the table into memory.

gse81076.df <- read.table(grun.fname, sep='\t', 
    header=TRUE, stringsAsFactors=FALSE, row.names=1)
dim(gse81076.df)
## [1] 20148  1728
head(rownames(gse81076.df))
## [1] "A1BG-AS1__chr19" "A1BG__chr19"     "A1CF__chr10"     "A2M-AS1__chr12" 
## [5] "A2ML1__chr12"    "A2MP1__chr12"
head(colnames(gse81076.df))
## [1] "D2ex_1" "D2ex_2" "D2ex_3" "D2ex_4" "D2ex_5" "D2ex_6"

Unfortunately, the data and metadata are all mixed together in this file. As a result, we need to manually extract the metadata from the column names.

donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse81076.df))
table(donor.names)
## donor.names
##    D101    D102  D10631     D17   D1713 D172444      D2      D3     D71 
##      96      96      96     288      96      96      96     480      96 
##     D72     D73     D74 
##      96      96      96
plate.id <- sub("^D[0-9]+(.*)_.*", "\\1", colnames(gse81076.df))
table(plate.id)
## plate.id
##      All1 All2 TGFB  en1  en2  en3  en4   ex 
##  864   96   96   96   96   96   96   96  192

Another irritating feature of this dataset is that gene symbols were supplied, rather than stable identifiers such as Ensembl. Stable identifiers are desirable as they facilitate reliable cross-referencing of gene identities between data sets1 The two data sets used in this workflow come from the same provider, who at least uses gene symbols consistently. So, technically, we could have skipped the conversion here. Nonetheless, we have chosen to perform it to demonstrate how one would deal with more heterogeneous data sources (e.g., mixtures of Ensembl identifiers and gene symbols, or multiple synonymous gene symbols).. We convert all row names to Ensembl identifiers, removing NA or duplicated entries (with the exception of spike-in transcripts).

gene.symb <- gsub("__chr.*$", "", rownames(gse81076.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE  TRUE 
## 20064    84
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse81076.df <- gse81076.df[keep,]
rownames(gse81076.df) <- gene.ids[keep]
summary(keep)
##    Mode   FALSE    TRUE 
## logical    2372   17776

We create a SingleCellExperiment object to store the counts and metadata together. This reduces the risk of book-keeping errors in later steps of the analysis. Note that we re-identify the spike-in rows, as the previous indices would have changed after the subsetting.

library(SingleCellExperiment)
sce.gse81076 <- SingleCellExperiment(list(counts=as.matrix(gse81076.df)),
    colData=DataFrame(Donor=donor.names, Plate=plate.id),
    rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse81076, "ERCC") <- grepl("^ERCC-", rownames(gse81076.df)) 
sce.gse81076  
## class: SingleCellExperiment 
## dim: 17776 1728 
## metadata(0):
## assays(1): counts
## rownames(17776): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
##   ENSG00000036549
## rowData names(1): Symbol
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC

2.1.2 Quality control and normalization

We compute quality control (QC) metrics for each cell (McCarthy et al. 2017) and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.

library(scater)
sce.gse81076 <- calculateQCMetrics(sce.gse81076, compact=TRUE)
QC <- sce.gse81076$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), 
    HighSpike=sum(high.spike, na.rm=TRUE))
##   LowLib LowNgenes HighSpike
## 1     55       130       388

Cells with extreme values for these QC metrics are presumed to be of low quality and are removed. A more thorough analysis would examine the distributions of these QC metrics beforehand, but we will skip that step for brevity here.

discard <- low.lib | low.genes | high.spike
sce.gse81076 <- sce.gse81076[,!discard]
summary(discard)
##    Mode   FALSE    TRUE 
## logical    1292     436

We compute size factors for the endogenous genes using the deconvolution method (Lun, Bach, and Marioni 2016). This is done with pre-clustering by quickCluster() to avoid pooling together very different cells.

library(scran)
library(BiocSingular)
set.seed(1000) # for irlba. 
clusters <- quickCluster(sce.gse81076, BSPARAM=IrlbaParam())
table(clusters)
## clusters
##   1   2   3   4   5   6   7 
## 106 196 273 299 153 163 102
sce.gse81076 <- computeSumFactors(sce.gse81076, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse81076))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003541 0.438547 0.791872 1.000000 1.290387 8.275084

We also compute size factors for the spike-in transcripts (Lun et al. 2017). Recall that we set general.use=FALSE to ensure that the spike-in size factors are only applied to the spike-in transcripts.

sce.gse81076 <- computeSpikeFactors(sce.gse81076, general.use=FALSE)
summary(sizeFactors(sce.gse81076, "ERCC"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01042 0.57782 0.88699 1.00000 1.27765 7.43998

We then compute normalized log-expression values for use in downstream analyses.

sce.gse81076 <- normalize(sce.gse81076)

2.1.3 Modelling variability

We identify highly variable genes (HVGs) using trendVar() and decomposeVar(), using the variances of spike-in transcripts to model technical noise. We set block= to ensure that uninteresting differences between plates or donors do not inflate the variance. The small discrepancy in the fitted trend in Figure 1 is caused by the fact that the trend is fitted robustly to the block-wise variances of the spike-ins, while the variances shown are averaged across blocks and not robust to outliers.

block <- paste0(sce.gse81076$Plate, "_", sce.gse81076$Donor)
fit <- trendVar(sce.gse81076, block=block, parametric=TRUE) 
dec <- decomposeVar(sce.gse81076, fit)

plot(dec$mean, dec$total, xlab="Mean log-expression", 
    ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse81076)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
Variance of normalized log-expression values for each gene in the GSE81076 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

Figure 1: Variance of normalized log-expression values for each gene in the GSE81076 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We order genes by decreasing biological component, revealing some usual suspects such as insulin and glucagon. We will be using this information later when performing feature selection prior to running mnnCorrect().

dec.gse81076 <- dec
dec.gse81076$Symbol <- rowData(sce.gse81076)$Symbol
dec.gse81076[order(dec.gse81076$bio, decreasing=TRUE),]
## DataFrame with 17776 rows and 7 columns
##                              mean             total                 bio
##                         <numeric>         <numeric>           <numeric>
## ENSG00000254647  2.84438696097646  6.37771955862257    5.93492163812922
## ENSG00000129965  1.89026042119725    6.029667887451    5.58238903990959
## ENSG00000115263  4.01989463994372  5.74235119909272    5.50960551193726
## ENSG00000118271  3.67399583064614  5.77781565535565    5.47705987083213
## ENSG00000115386  4.24266345753652  5.39869761580333    5.16308284971761
## ...                           ...               ...                 ...
## ERCC-00095      0.553795666013432 0.452348515513772 -0.0561557264658337
## ERCC-00092      0.637709106227571 0.500985818813633 -0.0642680848584971
## ERCC-00022      0.862960231263933 0.638946209543932 -0.0691817369593759
## ERCC-00076       1.07551171426781 0.700607404061928 -0.0836790103702701
## ERCC-00060       1.28686343117818 0.703486522627051 -0.0998254603379607
##                              tech   p.value       FDR      Symbol
##                         <numeric> <numeric> <numeric> <character>
## ENSG00000254647 0.442797920493351         0         0         INS
## ENSG00000129965 0.447278847541415         0         0    INS-IGF2
## ENSG00000115263 0.232745687155467         0         0         GCG
## ENSG00000118271 0.300755784523523         0         0         TTR
## ENSG00000115386 0.235614766085716         0         0       REG1A
## ...                           ...       ...       ...         ...
## ERCC-00095      0.508504241979606        NA        NA  ERCC-00095
## ERCC-00092       0.56525390367213        NA        NA  ERCC-00092
## ERCC-00022      0.708127946503308        NA        NA  ERCC-00022
## ERCC-00076      0.784286414432198        NA        NA  ERCC-00076
## ERCC-00060      0.803311982965012        NA        NA  ERCC-00060

2.2 CEL-seq2, GSE85241

2.2.1 Loading in the data

This dataset was generated by Muraro et al. (2016) using the CEL-seq2 protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above.

muraro.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
    "GSE85nnn/GSE85241/suppl",
    "GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz"))

We first read the table into memory.

gse85241.df <- read.table(muraro.fname, sep='\t', 
    header=TRUE, row.names=1, stringsAsFactors=FALSE)
dim(gse85241.df)
## [1] 19140  3072
head(rownames(gse85241.df))
## [1] "A1BG-AS1__chr19" "A1BG__chr19"     "A1CF__chr10"     "A2M-AS1__chr12" 
## [5] "A2ML1__chr12"    "A2M__chr12"
head(colnames(gse85241.df))
## [1] "D28.1_1" "D28.1_2" "D28.1_3" "D28.1_4" "D28.1_5" "D28.1_6"

We extract the metadata from the column names.

donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse85241.df))
table(donor.names)
## donor.names
## D28 D29 D30 D31 
## 768 768 768 768
plate.id <- sub("^D[0-9]+\\.([0-9]+)_.*", "\\1", colnames(gse85241.df))
table(plate.id)
## plate.id
##   1   2   3   4   5   6   7   8 
## 384 384 384 384 384 384 384 384

Yet again, gene symbols were supplied instead of Ensembl or Entrez identifiers. We convert all row names to Ensembl identifiers, removing NA or duplicated entries (with the exception of spike-in transcripts).

gene.symb <- gsub("__chr.*$", "", rownames(gse85241.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE  TRUE 
## 19059    81
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse85241.df <- gse85241.df[keep,]
rownames(gse85241.df) <- gene.ids[keep]
summary(keep)
##    Mode   FALSE    TRUE 
## logical    2223   16917

We create a SingleCellExperiment object to store the counts and metadata together.

sce.gse85241 <- SingleCellExperiment(list(counts=as.matrix(gse85241.df)),
    colData=DataFrame(Donor=donor.names, Plate=plate.id),
    rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse85241, "ERCC") <- grepl("^ERCC-", rownames(gse85241.df)) 
sce.gse85241  
## class: SingleCellExperiment 
## dim: 16917 3072 
## metadata(0):
## assays(1): counts
## rownames(16917): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
##   ENSG00000036549
## rowData names(1): Symbol
## colnames(3072): D28.1_1 D28.1_2 ... D30.8_95 D30.8_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC

2.2.2 Quality control and normalization

We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.

sce.gse85241 <- calculateQCMetrics(sce.gse85241, compact=TRUE)
QC <- sce.gse85241$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), 
    HighSpike=sum(high.spike, na.rm=TRUE))
##   LowLib LowNgenes HighSpike
## 1    577       669       695

Low-quality cells are defined as those with extreme values for these QC metrics and are removed.

discard <- low.lib | low.genes | high.spike
sce.gse85241 <- sce.gse85241[,!discard]
summary(discard)
##    Mode   FALSE    TRUE 
## logical    2346     726

We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values.

set.seed(1000)
clusters <- quickCluster(sce.gse85241, BSPARAM=IrlbaParam())
table(clusters)
## clusters
##   1   2   3   4   5   6   7   8   9 
## 257 393 224 316 226 426 129 203 172
sce.gse85241 <- computeSumFactors(sce.gse85241, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse85241))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0845  0.5328  0.8151  1.0000  1.2090 15.3318
sce.gse85241 <- computeSpikeFactors(sce.gse85241, general.use=FALSE)
summary(sizeFactors(sce.gse85241, "ERCC"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09295 0.61309 0.88902 1.00000 1.27519 4.04643
sce.gse85241 <- normalize(sce.gse85241)

2.2.3 Modelling variability

We fit a trend to the spike-in variances as previously described, allowing us to model the technical noise for each gene (Figure 2). Again, we set block= to ensure that uninteresting differences between plates or donors do not inflate the variance.

block <- paste0(sce.gse85241$Plate, "_", sce.gse85241$Donor)
fit <- trendVar(sce.gse85241, block=block, parametric=TRUE) 
dec <- decomposeVar(sce.gse85241, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression", 
    ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse85241)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)