1 Introduction

In the previous batch correction workflow, we merged two scRNA-seq data sets involving human pancreas cell populations. This workflow extends the previous example to describe how to perform a more complicated merge operation involving different levels of batch effects. Here, we will use pancreas data sets generated using Smart-based technologies (Segerstolpe et al. 2016; Lawlor et al. 2017), and merge them with the previous data sets generated using CEL-seq-based methods (Grun et al. 2016; Muraro et al. 2016). Our overall strategy is to use a hierarchical merge to remove batch effects within each technology, followed by the removal of batch effects between technologies.

2 Loading in the data

2.1 SMARTer, GSE86469

2.1.1 Reading in data

Here, we use data from the Lawlor et al. (2017) study of pancreatic islet cells from healthy and type II diabetic donors. This was generated using the SMARTer protocol on the Fluidigm C1 system. We download and cache the count matrix using the BiocFileCache package.

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)    
count.tab <- bfcrpath(bfc, file.path(
    "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE86nnn/GSE86469/suppl",
    "GSE86469_GEO.islet.single.cell.processed.data.RSEM.raw.expected.counts.csv.gz"
))

We read in the count matrix in sparse format using scater (McCarthy et al. 2017).

library(scater)
mat <- readSparseCounts(count.tab, sep=",", quote='"', row.names=1)
dim(mat)
## [1] 26616   638

We load in the metadata from NCBI GEO using the GEOquery package (Davis and Meltzer 2007).

library(GEOquery)
metadata <- pData(getGEO("GSE86469")[[1]])
metadata <- metadata[,c("title", "cell type:ch1", "islet unos id:ch1")]

rownames(metadata) <- metadata$title
metadata <- metadata[,-1]
colnames(metadata) <- c("CellType", "Donor")

stopifnot(identical(colnames(mat), rownames(metadata)))
head(metadata)
##                 CellType   Donor
## 10th_C10_S104 None/Other ACIW009
## 10th_C11_S96        Beta ACIW009
## 10th_C13_S61        Beta ACIW009
## 10th_C14_S53        Beta ACIW009
## 10th_C16_S105 None/Other ACIW009
## 10th_C17_S97        Beta ACIW009

Finally, we create a SingleCellExperiment object.

library(SingleCellExperiment)
sce.gse86469 <- SingleCellExperiment(list(counts=mat), colData=metadata)
isSpike(sce.gse86469, "ERCC") <- grep("^ERCC-", rownames(sce.gse86469))
sce.gse86469
## class: SingleCellExperiment 
## dim: 26616 638 
## metadata(0):
## assays(1): counts
## rownames(26616): ENSG00000229483 ENSG00000232849 ... ENSG00000251576
##   ENSG00000082898
## rowData names(0):
## colnames(638): 10th_C10_S104 10th_C11_S96 ... 9th-C96_S81 9th-C9_S13
## colData names(2): CellType Donor
## reducedDimNames(0):
## spikeNames(1): ERCC

2.1.2 Quality control and normalization

We remove low-quality cells based on outliers for various quality control metrics, such as the total library size and the number of expressed genes. This is similar to what was described previously. Note that this data does not contain any counts for spike-in transcripts, so the spike-in percentage is not used here.

sce.gse86469 <- calculateQCMetrics(sce.gse86469, compact=TRUE)
qc.mat <- cbind(
    NFeatures=isOutlier(sce.gse86469$scater_qc$all$total_features_by_counts, 
        log=TRUE, type="lower", nmads=3),
    LibSize=isOutlier(sce.gse86469$scater_qc$all$total_counts, 
        log=TRUE, type="lower", nmads=3)
)
colSums(qc.mat)
## NFeatures   LibSize 
##         0         3
discard <- rowMeans(qc.mat) > 0
sce.gse86469 <- sce.gse86469[,!discard]
summary(discard)
##    Mode   FALSE    TRUE 
## logical     635       3

We compute size factors with the deconvolution method from the scran package (Lun, Bach, and Marioni 2016). Pre-clustering is performed using quickCluster() to avoid pooling together very different cells. Note the use of IrlbaParam() from BiocSingular to speed up the PCA calculations.

library(scran)
library(BiocSingular)
clusters <- quickCluster(sce.gse86469, BSPARAM=IrlbaParam())
table(clusters)
## clusters
##   1   2   3 
## 242 270 123
sce.gse86469 <- computeSumFactors(sce.gse86469, clusters=clusters)
summary(sizeFactors(sce.gse86469))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2912  0.7694  0.9596  1.0000  1.2007  2.9185

There is no need for spike-in normalization, as there are actually no spike-in counts. We thus proceed directly to calculation of the log-transformed normalized expression values for endogenous genes.

# Ignore warnings due to no spike-in size factors.
suppressWarnings(sce.gse86469 <- normalize(sce.gse86469))

We save this object for use elsewhere.

saveRDS(sce.gse86469, file="gse86469_sce.rds")

2.1.3 Modelling variability

Given that no spike-ins are available, we fit a mean-dependent trend to the variances of the endogenous genes to model technical noise. This requires the assumptions that have been stated elsewhere.

fit <- trendVar(sce.gse86469, use.spikes=FALSE, 
    block=sce.gse86469$Donor, loess.args=list(span=0.05))
dec.gse86469 <- decomposeVar(sce.gse86469, fit)
dec.gse86469[order(dec.gse86469$bio, decreasing=TRUE),]
## DataFrame with 26616 rows and 6 columns
##                             mean             total               bio
##                        <numeric>         <numeric>         <numeric>
## ENSG00000115263 10.1536955344348  43.9282794110398  40.7827047057491
## ENSG00000254647 8.11583744343173  38.6406987550627  33.8414096768278
## ENSG00000169903  5.7360754103119  29.0826673417615  22.6937447312962
## ENSG00000118271 12.0056875945825  20.6222933766225  18.6428162898466
## ENSG00000197249 4.12395232568971  26.0384575981412   18.574407947614
## ...                          ...               ...               ...
## ENSG00000214105 3.57156880461368  1.47686769949045 -6.26100443295033
## ENSG00000144218 3.74525076548908  1.38625572839096   -6.295430365288
## ENSG00000261177 4.20932168304822  1.24423250040548 -6.35184676330201
## ENSG00000167995  4.4658056980619 0.975647920550666 -6.49349680189003
## ENSG00000141150 4.04259310145711  1.12290389575419 -6.54082049078115
##                             tech               p.value                   FDR
##                        <numeric>             <numeric>             <numeric>
## ENSG00000115263 3.14557470529066                     0                     0
## ENSG00000254647 4.79928907823488                     0                     0
## ENSG00000169903 6.38892261046531 2.41407654662433e-308 1.28506122729906e-304
## ENSG00000118271 1.97947708677594                     0                     0
## ENSG00000197249  7.4640496505272 5.76405676392468e-174  1.1801241140663e-170
## ...                          ...                   ...                   ...
## ENSG00000214105 7.73787213244079                     1                     1
## ENSG00000144218 7.68168609367896                     1                     1
## ENSG00000261177  7.5960792637075                     1                     1
## ENSG00000167995 7.46914472244069                     1                     1
## ENSG00000141150 7.66372438653534                     1                     1

Figure 1 shows the strong mean-variance relationship that is typical of read count data.

plot(fit$mean, fit$var, xlab="Mean log-expression",
    ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)