1 Overview

In this workflow, we examine a heterogeneous dataset from a study of cell types in the mouse brain (Zeisel et al. 2015). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system (Pollen et al. 2014) and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of unique molecular identifiers (UMIs) mapped to each gene. Our analysis aims to characterize the heterogeneity in transcriptional profiles across different cell types in this dataset.

2 Setting up the data

Public data often requires considerable pre-processing before it is suitable for further analysis. Fortunately, this work has already been done in the scRNAseq package, which conveniently provides the Zeisel et al. (2015) count data in the form of a SingleCellExperiment object.

library(scRNAseq)
sce <- ZeiselBrainData()
table(rowData(sce)$featureType)
## 
##       ERCC endogenous       mito     repeat 
##         57      19972         34       1072

We remove the repeat regions as these are not of interest. We will use the spike-in transcripts and the mitochondrial proportions later for quality control.

sce <- sce[rowData(sce)$featureType!="repeat",]

In this particular dataset, some genes are represented by multiple rows corresponding to alternative genomic locations. We sum the counts for all rows corresponding to a single gene for ease of interpretation, and create a new SingleCellExperiment with these aggregated counts.

library(scater)
raw.names <- sub("_loc[0-9]+$", "", rownames(sce))
table(table(raw.names))
## 
##     1     2     3     4     6     9    10 
## 19785    83    17     6     3     1     1
new.counts <- rowsum(counts(sce), raw.names, reorder=FALSE)
dim(new.counts)
## [1] 19896  3005
# Replacing the count matrix in 'sce'.
sce <- sce[match(rownames(new.counts), raw.names),]
rownames(sce) <- rownames(new.counts)
counts(sce) <- new.counts

We also determine the Ensembl identifier for each row.

library(org.Mm.eg.db)
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce), 
    keytype="SYMBOL", column="ENSEMBL")
rowData(sce)$ENSEMBL <- ensembl
sce
## class: SingleCellExperiment 
## dim: 19896 3005 
## metadata(0):
## assays(1): counts
## rownames(19896): Tspan12 Tshz1 ... ERCC-00170 ERCC-00171
## rowData names(2): featureType ENSEMBL
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(1): ERCC

3 Quality control on the cells

The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we compute some quality control metrics with scater (McCarthy et al. 2017) to check whether the remaining cells are satisfactory.

library(scater)
sce <- calculateQCMetrics(sce, feature_controls=list(
    Mt=rowData(sce)$featureType=="mito"))

We examine the distribution of the QC metrics across all cells (Figure 1). The library sizes here are at least one order of magnitude lower than observed in the 416B dataset. This is consistent with the use of UMI counts rather than read counts, as each transcript molecule can only produce one UMI count but can yield many reads after fragmentation. In addition, the spike-in proportions are more variable than observed in the 416B dataset. This may reflect a greater variability in the total amount of endogenous RNA per cell when many cell types are present.

par(mfrow=c(2,2), mar=c(5.1, 4.1, 0.1, 0.1))
hist(sce$total_counts/1e3, xlab="Library sizes (thousands)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features_by_counts, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_ERCC, xlab="ERCC proportion (%)",
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")
Histograms of QC metrics including the library sizes, number of expressed genes and proportion of UMIs assigned to spike-in transcripts or mitochondrial genes for all cells in the brain dataset.

Figure 1: Histograms of QC metrics including the library sizes, number of expressed genes and proportion of UMIs assigned to spike-in transcripts or mitochondrial genes for all cells in the brain dataset.

We remove small outliers for the library size and the number of expressed features, and large outliers for the spike-in proportions. Again, the presence of spike-in transcripts means that we do not have to use the mitochondrial proportions.

libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")

Removal of low-quality cells is then performed by combining the filters for all of the metrics. The majority of cells are retained, which suggests that the original quality control procedures were generally adequate.

sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), 
    BySpike=sum(spike.drop), Remaining=ncol(sce))
##   ByLibSize ByFeature BySpike Remaining
## 1         8         3       8      2989

We could improve our cell filtering procedure further by setting batch in isOutlier to one or more known factors, e.g., mouse/plate of origin. As previously mentioned, this would avoid inflation of the MAD and improve power to remove low-quality cells. However, for simplicity, we will not do this as sufficient quality control has already been performed.

4 Cell cycle classification

Application of cyclone (Scialdone et al. 2015) to the brain dataset suggests that most of the cells are in G1 phase (Figure 2). This requires the use of the Ensembl identifiers to match up with the pre-defined classifier.

library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))
assignments <- cyclone(sce, mm.pairs, gene.names=rowData(sce)$ENSEMBL)
table(assignments$phase)
## 
##   G1  G2M    S 
## 2981    7    1
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
Cell cycle phase scores from applying the pair-based classifier on the brain dataset, where each point represents a cell.

Figure 2: Cell cycle phase scores from applying the pair-based classifier on the brain dataset, where each point represents a cell.

However, the intepretation of this result requires some caution due to differences between the training and test datasets. The classifier was trained on C1 SMARTer data and accounts for the biases in that protocol. The brain dataset uses UMI counts, which has a different set of biases, e.g., 3’-end coverage only, no length bias, no amplification noise. Furthermore, many neuronal cell types are expected to lie in the G0 resting phase, which is distinct from the other phases of the cell cycle (Coller, Sang, and Roberts 2006). cyclone will generally assign such cells to the closest known phase in the training set, which would be G1.

5 Examining gene-level metrics

Figure 3 shows the most highly expressed genes across the cell population in the brain dataset. This is mostly occupied by spike-in transcripts, reflecting the use of spike-in concentrations that span the entire range of expression. There are also a number of constitutively expressed genes, as expected.

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotHighestExprs(sce, n=50) + fontsize