About This Document

Package name: simpleSingleCell
Built with: Bioconductor (R) 3.6 (3.4.2)
Last built: Fri, 15 Dec 2017 15:26:00 -0800
Last modified: Fri, 15 Dec 2017 14:47:50 -0800 (r132255)
Source package: ../simpleSingleCell_1.0.5.tar.gz
Windows binary: ../NOT_SUPPORTED
R script: part1.R

To install this workflow under Bioconductor 3.6, start R and enter:

source("http://bioconductor.org/workflows.R")
workflowInstall("simpleSingleCell")

Analyzing single-cell RNA-seq data with Bioconductor (read counts)

Aaron T. L. Lun [1], Davis J. McCarthy [2,3] and John C. Marioni [1,2,4]

[1] Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom
[2] EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
[3] St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria 3065, Australia
[4] Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

1 Overview

In this workflow, we use a relatively simple dataset (A. T. L. Lun et al. 2017) to introduce most of the concepts of scRNA-seq data analysis. This dataset contains two plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. Counts for all genes/transcripts in each cell were obtained from ArrayExpress using the accession number E-MTAB-5522.

2 Setting up the data

2.1 Loading in the count matrix

Our first task is to load the count matrices into memory. One matrix was generated for each plate of cells used in the study. In each matrix, each row represents an endogenous gene or a spike-in transcript, and each column represents a cell. The count in each entry of the matrix therefore represents the number of reads mapped to a particular gene/transcript in a particular cell.

plate1 <- read.delim("counts_Calero_20160113.tsv", 
    header=TRUE, row.names=1, check.names=FALSE)
plate2 <- read.delim("counts_Calero_20160325.tsv", 
    header=TRUE, row.names=1, check.names=FALSE)
gene.lengths <- plate1$Length # First column is the gene length.
plate1 <- as.matrix(plate1[,-1]) # Discarding gene length (as it is not a cell).
plate2 <- as.matrix(plate2[,-1])
rbind(Plate1=dim(plate1), Plate2=dim(plate2))
##         [,1] [,2]
## Plate1 46703   96
## Plate2 46703   96

We combine the two matrices into a single object for further processing. This is done after verifying that the genes are in the same order between the two matrices.

stopifnot(identical(rownames(plate1), rownames(plate2)))
all.counts <- cbind(plate1, plate2)
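
If the row names of the two matrices were not in the same order, the check above would fail. The following is a minimal sketch of one way to handle that case, using small toy matrices (p1, p2 and their values are hypothetical and not part of the 416B data). It assumes that both matrices contain the same set of genes and simply reorders the second matrix before combining.

```r
# Toy matrices standing in for two plates (hypothetical values).
p1 <- matrix(1:6, nrow=3,
    dimnames=list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))
p2 <- matrix(7:12, nrow=3,
    dimnames=list(c("geneC", "geneA", "geneB"), c("cell3", "cell4")))

# Both plates must contain the same genes, otherwise reordering is not enough.
stopifnot(setequal(rownames(p1), rownames(p2)))

# Reorder the rows of the second matrix to match the first.
p2 <- p2[rownames(p1), , drop=FALSE]
stopifnot(identical(rownames(p1), rownames(p2)))

# Combining is now safe, as each row refers to the same gene in both plates.
combined <- cbind(p1, p2)
dim(combined)
```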

For convenience, the count matrix is stored in a SingleCellExperiment object from the SingleCellExperiment package. This allows different types of row- and column-level metadata to be stored alongside the counts for synchronized manipulation throughout the workflow.

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=all.counts))
rowData(sce)$GeneLength <- gene.lengths
sce
## class: SingleCellExperiment 
## dim: 46703 192 
## metadata(0):
## assays(1): counts
## rownames(46703): ENSMUSG00000102693 ENSMUSG00000064842 ... SIRV7
##   CBFB-MYH11-mcherry
## rowData names(1): GeneLength
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

We identify the rows corresponding to ERCC spike-in transcripts from the row names. We store this information in the SingleCellExperiment object for future use. This is necessary as spike-ins require special treatment in downstream steps such as normalization.

isSpike(sce, "ERCC") <- grepl("^ERCC", rownames(sce))
summary(isSpike(sce, "ERCC"))
##    Mode   FALSE    TRUE 
## logical   46611      92

This dataset is slightly unusual in that it contains information from another set of spike-in transcripts, the Spike-In RNA Variants (SIRV) set. For simplicity, we will only use the ERCC spike-ins in this analysis. Thus, we must remove the rows corresponding to the SIRV transcripts prior to further analysis, which can be done simply by subsetting the SingleCellExperiment object.

is.sirv <- grepl("^SIRV", rownames(sce))
sce <- sce[!is.sirv,] 
summary(is.sirv)
##    Mode   FALSE    TRUE 
## logical   46696       7

Comments from Aaron:

  • Some feature-counting tools will report mapping statistics in the count matrix (e.g., the number of unaligned or unassigned reads). While these values can be useful for quality control, they would be misleading if treated as gene expression values. Thus, they should be removed (or at least moved to the colData) prior to further analyses.
  • Be wary of using the ^ERCC regular expression for human data where the row names of the count matrix are gene symbols. An ERCC gene family actually exists in human annotation, so this would result in incorrect identification of genes as spike-in transcripts. This problem can be avoided by publishing count matrices with standard identifiers (e.g., Ensembl, Entrez).
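
To illustrate the second comment, the sketch below applies the ^ERCC pattern to a small hypothetical vector of row names that mixes spike-in identifiers with human gene symbols. The stricter pattern is only one possible workaround. It assumes that spike-in transcripts are named in the usual "ERCC-" plus digits style, which should be checked against the actual row names.

```r
# Hypothetical row names: two spike-in transcripts, two human ERCC-family
# genes and one unrelated gene.
ids <- c("ERCC-00002", "ERCC-00130", "ERCC1", "ERCC6L", "ACTB")

# The loose pattern also picks up the human genes.
loose <- grepl("^ERCC", ids)

# A stricter pattern that requires the hyphen and a purely numeric suffix
# (an assumption about how the spike-ins are named).
strict <- grepl("^ERCC-[0-9]+$", ids)

data.frame(ID=ids, Loose=loose, Strict=strict)
```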

2.2 Incorporating cell-based annotation

We load in the metadata for each library/cell from the sdrf.txt file. It is important to check that the rows of the metadata table are in the same order as the columns of the count matrix. Otherwise, incorrect metadata will be assigned to each cell.

metadata <- read.delim("E-MTAB-5522.sdrf.txt", check.names=FALSE, header=TRUE)
m <- match(colnames(sce), metadata[["Source Name"]]) # Enforcing identical order.
stopifnot(all(!is.na(m))) # Checking that nothing's missing.
metadata <- metadata[m,]
head(colnames(metadata))
## [1] "Source Name"                "Comment[ENA_SAMPLE]"       
## [3] "Comment[BioSD_SAMPLE]"      "Characteristics[organism]" 
## [5] "Characteristics[cell line]" "Characteristics[cell type]"
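
The reordering step can be seen more easily on a toy example. This sketch uses a hypothetical metadata table (meta) and a hypothetical vector of cell names, and shows how match() both reorders the table and drops rows for cells that are absent from the count matrix.

```r
# Hypothetical column names of a count matrix.
cell.names <- c("cellB", "cellC", "cellA")

# Hypothetical metadata, in a different order and with an extra cell.
meta <- data.frame(Name=c("cellA", "cellB", "cellC", "cellD"),
    Plate=c("p1", "p1", "p2", "p2"), stringsAsFactors=FALSE)

# For each cell in the matrix, find the corresponding metadata row.
m <- match(cell.names, meta$Name)
stopifnot(!anyNA(m)) # Every cell must have metadata.

# Reorder (and subset) the metadata to follow the matrix columns.
meta <- meta[m, ]
stopifnot(identical(meta$Name, cell.names))
meta
```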

We only retain relevant metadata fields to avoid storing unnecessary information in the colData of the SingleCellExperiment object. In particular, we keep the plate of origin (i.e., block) and phenotype of each cell. The second field is relevant as all of the cells contain a CBFB-MYH11 oncogene, but the expression of this oncogene is only induced in a subset of the cells.

colData(sce)$Plate <- factor(metadata[["Factor Value[block]"]])
pheno <- factor(metadata[["Factor Value[phenotype]"]]) # Explicit factor, so that the levels can be renamed.
levels(pheno) <- c("induced", "control")
colData(sce)$Oncogene <- pheno
table(colData(sce)$Oncogene, colData(sce)$Plate)
##          
##           20160113 20160325
##   induced       48       48
##   control       48       48

2.3 Incorporating gene-based annotation

Feature-counting tools typically report genes in terms of standard identifiers from Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to the gene symbols which are more commonly used in the literature. Given the Ensembl identifiers, we obtain the corresponding gene symbols using annotation packages like org.Mm.eg.db.

library(org.Mm.eg.db)
symb <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="ENSEMBL", column="SYMBOL")
rowData(sce)$ENSEMBL <- rownames(sce)
rowData(sce)$SYMBOL <- symb
head(rowData(sce))
## DataFrame with 6 rows and 3 columns
##   GeneLength            ENSEMBL      SYMBOL
##    <integer>        <character> <character>
## 1       1070 ENSMUSG00000102693          NA
## 2        110 ENSMUSG00000064842          NA
## 3       6094 ENSMUSG00000051951        Xkr4
## 4        480 ENSMUSG00000102851          NA
## 5       2819 ENSMUSG00000103377          NA
## 6       2233 ENSMUSG00000104017          NA

It is often desirable to rename the row names of sce to the gene symbols, as these are easier to interpret. However, this requires some work to account for missing and duplicate symbols. The code below will replace missing symbols with the Ensembl identifier and concatenate duplicated symbols with the (unique) Ensembl identifiers.

new.names <- rowData(sce)$SYMBOL
missing.name <- is.na(new.names)
new.names[missing.name] <- rowData(sce)$ENSEMBL[missing.name]
dup.name <- new.names %in% new.names[duplicated(new.names)]
new.names[dup.name] <- paste0(new.names, "_", rowData(sce)$ENSEMBL)[dup.name]
rownames(sce) <- new.names
head(rownames(sce))
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "Xkr4"              
## [4] "ENSMUSG00000102851" "ENSMUSG00000103377" "ENSMUSG00000104017"
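
The same renaming logic can be checked on a small toy example, which makes the handling of missing and duplicated symbols easier to follow. The identifiers and symbols below are hypothetical.

```r
# Hypothetical identifiers and symbols: one missing symbol, one duplicated.
ensembl <- c("ENSG01", "ENSG02", "ENSG03", "ENSG04")
symbol <- c("Xkr4", NA, "Dup1", "Dup1")

new.names <- symbol

# Missing symbols fall back to the (unique) identifier.
missing.name <- is.na(new.names)
new.names[missing.name] <- ensembl[missing.name]

# Every copy of a duplicated symbol gets the identifier appended.
dup.name <- new.names %in% new.names[duplicated(new.names)]
new.names[dup.name] <- paste0(new.names, "_", ensembl)[dup.name]

new.names
```

Appending the identifier to all copies of a duplicated symbol, rather than only to the later ones, avoids giving one gene an arbitrary claim to the plain symbol.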

We also determine the chromosomal location for each gene using the TxDb.Mmusculus.UCSC.mm10.ensGene package. This will be useful later as several quality control metrics will be computed from rows corresponding to mitochondrial genes.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
location <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=rowData(sce)$ENSEMBL, 
    column="CDSCHROM", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="chrM")
##    Mode   FALSE    TRUE    NA's 
## logical   22428      13   24255

Alternatively, annotation from BioMart resources can be directly added to the object using the getBMFeatureAnnos function from scater. This may be more convenient than the approach shown above, but depends on an available internet connection to the BioMart databases.

3 Quality control on the cells

3.1 Defining the quality control metrics

Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics:

  • The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation.
  • The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
  • The proportion of reads mapped to spike-in transcripts is calculated relative to the library size for each cell. High proportions are indicative of poor-quality cells, where endogenous RNA has been lost during processing (e.g., due to cell lysis or RNA degradation). The same amount of spike-in RNA was added to each cell, so an enrichment in spike-in counts is symptomatic of loss of endogenous RNA.
  • In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can also be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.
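
The definitions in the list above can be written out directly in base R. The sketch below is only meant to make the definitions concrete, using a small hypothetical count matrix. It is not how calculateQCMetrics is implemented.

```r
# Hypothetical counts: three genes and one spike-in transcript, three cells.
counts <- matrix(c(10, 0, 5, 5,
                    0, 0, 2, 18,
                   20, 4, 0, 6),
    nrow=4, dimnames=list(c("gene1", "gene2", "gene3", "ERCC-00002"),
        c("cellA", "cellB", "cellC")))
is.spike <- grepl("^ERCC-", rownames(counts))

# Library size: total count across all features, including spike-ins.
lib.size <- colSums(counts)

# Number of expressed features: features with non-zero counts in each cell.
n.features <- colSums(counts > 0)

# Percentage of each cell's counts assigned to spike-in transcripts.
pct.spike <- 100 * colSums(counts[is.spike, , drop=FALSE]) / lib.size

data.frame(lib.size, n.features, pct.spike)
```

In this toy example, cellB has few expressed features and a very high spike-in percentage, which is the pattern expected for a cell that has lost most of its endogenous RNA.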

For each cell, we calculate these quality control metrics using the calculateQCMetrics function from the scater package. These are stored in the row- and column-wise metadata of the SingleCellExperiment for future reference.

library(scater)
sce <- calculateQCMetrics(sce, feature_controls=list(ERCC=isSpike(sce),
    Mt=which(rowData(sce)$CHR=="chrM")))
head(colnames(colData(sce)), 10)
##  [1] "Plate"                       "Oncogene"                   
##  [3] "total_features"              "log10_total_features"       
##  [5] "total_counts"                "log10_total_counts"         
##  [7] "pct_counts_top_50_features"  "pct_counts_top_100_features"
##  [9] "pct_counts_top_200_features" "pct_counts_top_500_features"

The distributions of these metrics are shown in Figure 1. The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions.

par(mfrow=c(2,2), mar=c(5.1, 4.1, 0.1, 0.1))
hist(sce$total_counts/1e6, xlab="Library size (millions)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_ERCC, xlab="ERCC proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")

Figure 1: Histograms of various QC metrics for all cells in the 416B data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.

It is also valuable to examine how the QC metrics behave with respect to each other (Figure 2). Generally, they will be in rough agreement, i.e., cells with low total counts will also have low numbers of expressed features and high ERCC/mitochondrial proportions. Clear discrepancies may correspond to technical differences between batches of cells (see below) or genuine biological differences in RNA content.

par(mfrow=c(1,3))
plot(sce$total_features, sce$total_counts/1e6, xlab="Number of expressed genes",
    ylab="Library size (millions)")
plot(sce$total_features, sce$pct_counts_ERCC, xlab="Number of expressed genes",
    ylab="ERCC proportion (%)")
plot(sce$total_features, sce$pct_counts_Mt, xlab="Number of expressed genes",
    ylab="Mitochondrial proportion (%)")

Figure 2: Behaviour of each QC metric compared to the total number of expressed features. Each point represents a cell in the 416B dataset.

3.2 Identifying outliers for each metric

Picking a threshold for these metrics is not straightforward as their absolute values depend on the experimental protocol. For example, sequencing to greater depth will lead to more reads and more expressed features, regardless of the quality of the cells. Similarly, using more spike-in RNA in the protocol will result in higher spike-in proportions. To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells, and identify cells that are outliers for the various QC metrics.

Outliers are defined based on the median absolute deviation (MAD) from the median value of each metric across all cells. We remove cells with log-library sizes that are more than 3 MADs below the median log-library size. A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median. We also remove cells where the log-transformed number of expressed genes is more than 3 MADs below the median value.

libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower", log=TRUE)
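
To make the outlier definition concrete, the sketch below implements a simplified lower-tail filter in base R. The function is_lower_outlier and the library sizes are hypothetical. The sketch assumes that the MAD is the scaled version returned by the base R mad() function, and it should not be read as the implementation of isOutlier.

```r
# Simplified lower-tail outlier detection (a sketch, not scater's code).
is_lower_outlier <- function(x, nmads=3, log=FALSE) {
    if (log) x <- log10(x) # Work on the log-scale if requested.
    med <- median(x)
    dev <- mad(x, center=med) # Scaled MAD (constant of 1.4826), an assumption.
    x < med - nmads * dev
}

# Hypothetical library sizes, with one obviously small library at the end.
lib.sizes <- c(9e5, 1e6, 1.1e6, 1.2e6, 9.5e5, 1.05e6, 2e4)
is_lower_outlier(lib.sizes, log=TRUE)
```

Only the last cell falls more than 3 MADs below the median log-library size, so it is the only one flagged.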

We identify outliers for the proportion-based metrics in a similar manner. Here, no transformation is required as we are identifying large outliers, for which the distinction should be fairly clear on the raw scale. (We do not use the mitochondrial proportions as we already have the spike-in proportions for this data set. This avoids potential problems with genuine differences in mitochondrial content between cell types that may confound outlier identification.)

spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")

Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality.

keep <- !(libsize.drop | feature.drop | spike.drop)
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=sum(keep))
##   ByLibSize ByFeature BySpike Remaining
## 1         4         0       1       188

At this point, we could apply keep to subset sce to only retain high-quality cells. However, the above approach ignores prior information about the cells, which can be incorporated to improve the filtering procedure. We describe why this is important and how it can be achieved in the following sections.

3.3 Assumptions of outlier identification

We have already mentioned the assumption that most cells are of high quality. This is usually reasonable, and can be experimentally supported in some situations by visually checking that the cells are intact (e.g., on the microwell plate or in the microfluidics system). Another assumption is that the QC metrics are independent of the biological state of each cell. This ensures that any outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses.

The second assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. Such cell types are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). Nonetheless, filtering based on outliers may not be appropriate in extreme cases where one cell type is very different from the others.

We can explore the effect of known biological factors on the QC metrics using the plotPhenoData function. Figure 3 demonstrates that induction of the CBFB-MYH11 oncogene results in some modest changes to the QC metric distributions. This suggests that we could improve our QC step by considering the condition of the cell during outlier identification. Analyzing all conditions together would unnecessarily inflate the MAD and compromise the removal of low-quality cells.

multiplot(
    plotPhenoData(sce, aes_string(y="total_counts", x="Oncogene")),
    plotPhenoData(sce, aes_string(y="total_features", x="Oncogene")),
    plotPhenoData(sce, aes_string(y="pct_counts_ERCC", x="Oncogene")),
    plotPhenoData(sce, aes_string(y="pct_counts_Mt", x="Oncogene")),
    cols=2)

Figure 3: Distribution of each QC metric for control and oncogene-induced cells in the 416B data set.

3.4 Blocking on known conditions

Systematic differences in the QC metrics can be handled to some extent using the batch argument in the isOutlier function. Setting batch to the plate of origin will identify outliers within each level of batch, using plate-specific median and MAD estimates. This is obviously useful for batch effects caused by known differences in experimental processing, e.g., sequencing at different depths or adding different amounts of spike-in RNA. We can also include known biological factors in batch, if those factors could result in systematically fewer expressed genes or lower RNA content.

blocking <- paste0(sce$Plate, sce$Oncogene)
libsize.drop2 <- isOutlier(sce$total_counts, nmads=3, type="lower", 
    log=TRUE, batch=blocking)
feature.drop2 <- isOutlier(sce$total_features, nmads=3, type="lower", 
    log=TRUE, batch=blocking)
spike.drop2 <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher", 
    batch=blocking)

keep2 <- !(libsize.drop2 | feature.drop2 | spike.drop2)
data.frame(ByLibSize=sum(libsize.drop2), ByFeature=sum(feature.drop2),
    BySpike=sum(spike.drop2), Remaining=sum(keep2))
##   ByLibSize ByFeature BySpike Remaining
## 1         5         4       6       183
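
The gain in power from blocking can be reproduced on a toy example. The sketch below compares a pooled filter with a per-batch filter on hypothetical log-library sizes from two batches sequenced to different depths. The helper functions are hypothetical and only approximate what the batch argument does, again assuming the scaled MAD from base R.

```r
# Pooled filter: one median and MAD for all cells (a sketch).
lower_outlier <- function(x, nmads=3) x < median(x) - nmads * mad(x)

# Blocked filter: a separate threshold for each level of 'batch'.
lower_outlier_by_batch <- function(x, batch, nmads=3) {
    threshold <- vapply(split(x, batch),
        function(v) median(v) - nmads * mad(v), numeric(1))
    unname(x < threshold[as.character(batch)])
}

# Hypothetical log10-library sizes: batch A is sequenced more deeply than
# batch B, and the sixth cell is a poor-quality cell from batch A.
log.lib <- c(6.0, 6.1, 5.9, 6.05, 5.95, 5.0,
             5.0, 5.1, 4.9, 5.05, 4.95, 4.95)
batch <- rep(c("A", "B"), each=6)

pooled <- lower_outlier(log.lib)
blocked <- lower_outlier_by_batch(log.lib, batch)
c(Pooled=sum(pooled), Blocked=sum(blocked))
```

In the pooled analysis, the difference between batches inflates the MAD and the poor-quality cell is missed. With blocking, it is compared only to the other cells in batch A and is flagged.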

The use of this blocking approach results in a small increase in the number of discarded cells. This is expected given that the variability within each level of batch is lower, resulting in more power to detect outliers. We then subset the SingleCellExperiment object to retain only the putative high-quality cells.

sce <- sce[,keep2]
dim(sce)
## [1] 46696   183

4 Classification of cell cycle phase

We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another.
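
The core idea of the pair-based classifier can be sketched in a few lines of base R. Everything below is hypothetical: the gene names, the marker pairs and the pair_score function are made up for illustration. The sketch only computes the fraction of marker pairs whose observed sign matches the sign expected for a phase, and any further processing of the scores by the actual method is not shown.

```r
# Hypothetical marker pairs for one phase: the first gene is expected to be
# more highly expressed than the second when a cell is in that phase.
pairs.G1 <- data.frame(first=c("g1", "g2", "g3"),
    second=c("g4", "g5", "g6"), stringsAsFactors=FALSE)

# Hypothetical expression values for a single cell.
expr <- c(g1=10, g2=8, g3=1, g4=2, g5=3, g6=5)

# Fraction of informative pairs with the expected sign.
pair_score <- function(expr, pairs) {
    diffs <- expr[pairs$first] - expr[pairs$second]
    informative <- diffs != 0 # Ties carry no sign information.
    mean(diffs[informative] > 0)
}

pair_score(expr, pairs.G1)
```

Because only the sign of each difference is used, the score is unaffected by any monotonic scaling of the expression values within a cell.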

This approach is implemented in the cyclone function from the scran package. The package contains a pre-trained set of marker pairs for mouse data, which we can load with the readRDS function. We use the Ensembl identifiers for each gene in our data set to match up with the names in the pre-trained set of gene pairs.

set.seed(100)
library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", 
    package="scran"))
assignments <- cyclone(sce, mm.pairs, gene.names=rowData(sce)$ENSEMBL)

The cyclone result for each cell in the 416B dataset is shown in Figure 4. Each cell is assigned a score for each phase, with a higher score corresponding to a higher probability that the cell is in that phase. We focus on the G1 and G2/M scores as these are the most informative for classification.

plot(assignments$score$G1, assignments$score$G2M, 
    xlab="G1 score", ylab="G2/M score", pch=16)

Figure 4: Cell cycle phase scores from applying the pair-based classifier on the 416B dataset. Each point represents a cell, plotted according to its scores for G1 and G2/M phases.

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. Here, the largest group of cells (99 of 183) is classified as being in G1 phase. We save these assignments into the SingleCellExperiment object for later use.

sce$phases <- assignments$phases
table(sce$phases)
## 
##  G1 G2M   S 
##  99  62  22
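
The classification rule described above can be written as a small function. This is a sketch of the rule as stated in the text, with hypothetical scores. The classify_phase function is not part of scran, and the handling of exact ties between the two scores (assigned to G1 here) is an assumption.

```r
# Assign a phase from the G1 and G2/M scores (a sketch of the stated rule).
classify_phase <- function(g1, g2m, threshold=0.5) {
    phase <- ifelse(g1 >= g2m, "G1", "G2M") # Ties go to G1 (assumption).
    phase[g1 <= threshold & g2m <= threshold] <- "S" # Neither score above 0.5.
    phase
}

# Hypothetical scores for four cells.
classify_phase(g1=c(0.9, 0.2, 0.3, 0.7), g2m=c(0.1, 0.8, 0.4, 0.6))
```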

Pre-trained classifiers are available in scran for human and mouse data. While the mouse classifier used here was trained on data from embryonic stem cells, it is still accurate for other cell types (Scialdone et al. 2015). This may be due to the conservation of the transcriptional program associated with the cell cycle (Bertoli, Skotheim, and Bruin 2013; Conboy et al. 2007). The pair-based method is also a non-parametric procedure that is robust to most technical differences between datasets.

Comments from Aaron:

  • To remove confounding effects due to cell cycle phase, we can filter the cells to only retain those in a particular phase (usually G1) for downstream analysis. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor. This protects against cell cycle effects without discarding information, and will be discussed later in more detail.
  • The classifier may not be accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol. In such cases, users can construct a custom classifier from their own training data using the sandbag function. This will also be necessary for other model organisms where pre-trained classifiers are not available.
  • Do not filter out low-abundance genes before applying cyclone. Even if a gene is not expressed in any cell, it may still be useful for classification if it is phase-specific. Its lack of expression relative to other genes will still yield informative pairs, and filtering them out would reduce power.

5 Examining gene-level expression metrics

5.1 Inspecting the most highly expressed genes

We examine the identities of the most highly expressed genes (Figure 5). This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment.

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotQC(sce, type = "highest-expression", n=50) + fontsize

Figure 5: Percentage of total counts assigned to the top 50 most highly-abundant features in the 416B dataset. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.

5.2 Filtering out low-abundance genes

Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference (Bourgon, Gentleman, and Huber 2010). These genes typically do not provide enough evidence to reject the null hypothesis during testing, yet they still increase the severity of the multiple testing correction. In addition, the discreteness of the counts may interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, low-abundance genes are often removed in RNA-seq analysis pipelines before the application of downstream methods.

The “optimal” choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness (e.g., for normalization) compared to that required for removing underpowered tests. For hypothesis testing, the filter statistic should also be independent of the test statistic under the null hypothesis. Thus, we (or the relevant function) will filter at each step as needed, rather than applying a single filter for the entire analysis.

Several metrics can be used to define low-abundance genes. The most obvious is the average count for each gene, computed across all cells in the data set. We calculate this using the calcAverage function, which also performs some adjustment for library size differences between cells. We typically observe a peak of moderately expressed genes following a plateau of lowly expressed genes (Figure 6).

ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey80", 
    xlab=expression(Log[10]~"average count"))

Figure 6: Histogram of log-average counts for all genes in the 416B dataset.

A minimum threshold can be applied to this value to filter out genes that are lowly expressed. The example below demonstrates how we could remove genes with average counts less than 1. The number of TRUE values in demo.keep corresponds to the number of retained rows/genes after filtering.

demo.keep <- ave.counts >= 1
filtered.sce <- sce[demo.keep,]
summary(demo.keep)
##    Mode   FALSE    TRUE 
## logical   33490   13206
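
The sketch below shows what a library size-adjusted average could look like on a toy matrix, along with the resulting filter. It assumes that the adjustment amounts to dividing each cell's counts by a size factor proportional to its library size and scaled to a mean of 1. This is an illustrative assumption and not a description of the internals of calcAverage.

```r
# Hypothetical counts for three genes in three cells.
counts <- matrix(c(0, 2, 8,
                   0, 4, 16,
                   1, 0, 29),
    nrow=3, dimnames=list(c("geneA", "geneB", "geneC"),
        c("cell1", "cell2", "cell3")))

# Size factors proportional to library size, centred at unity (assumption).
lib.size <- colSums(counts)
size.factors <- lib.size / mean(lib.size)

# Divide each column by its size factor, then average across cells.
adjusted <- sweep(counts, 2, size.factors, "/")
toy.ave <- rowMeans(adjusted)

# Keep genes with an average count of at least 1.
toy.keep <- toy.ave >= 1
toy.keep
```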

We also examine the number of cells that express each gene. This is closely related to the average count for most genes, as expression in many cells will result in a higher average (Figure 7). Genes expressed in very few cells are often uninteresting as they are driven by amplification artifacts (though they may also arise from rare populations). We could then remove genes that are expressed in fewer than n cells.

num.cells <- nexprs(sce, byrow=TRUE)
smoothScatter(log10(ave.counts), num.cells, ylab="Number of cells", 
    xlab=expression(Log[10]~"average count"))

Figure 7: The number of cells expressing each gene in the 416B data set, plotted against the log-average count. Intensity of colour corresponds to the number of genes at any given location.

As mentioned above, we will apply these filters at each step rather than applying them globally by subsetting sce. This ensures that the most appropriate filter is used in each application. Nonetheless, we remove genes that are not expressed in any cell to reduce computational work in downstream steps. Such genes provide no information and would be removed by any filtering strategy.

to.keep <- num.cells > 0
sce <- sce[to.keep,]
summary(to.keep)
##    Mode   FALSE    TRUE 
## logical   22833   23863

6 Normalization of cell-specific biases

6.1 Using the deconvolution method to deal with zero counts

Read counts are subject to differences in capture efficiency and sequencing depth between cells (Stegle, Teichmann, and Marioni 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, “size factors” are calculated that represent the extent to which counts should be scaled in each library.

Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsForMatrix function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014), or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (A. T. Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization.

sce <- computeSumFactors(sce)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3432  0.7287  0.9224  1.0000  1.1485  3.5348
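
The pooling and deconvolution idea can be illustrated with a deliberately simplified base R sketch on noise-free toy data. This is not the implementation used by computeSumFactors. The toy values, the ring-shaped sliding windows and the pool sizes of 3 and 5 are all assumptions made for this example.

```r
# Noise-free toy data (hypothetical): 50 genes, 8 cells with known biases.
true.sf <- c(0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 0.7, 1.1)
base.expr <- seq_len(50)                 # per-gene baseline abundance
toy.counts <- outer(base.expr, true.sf)  # count = abundance * cell bias

n.cells <- ncol(toy.counts)
reference <- rowMeans(toy.counts)        # average pseudo-cell

# Build pools from sliding windows of cells arranged on a ring.
pool.sizes <- c(3, 5)   # assumed pool sizes for this toy example
A <- NULL               # pool membership, one row per pool
b <- NULL               # pool-based size factor, one entry per pool
for (size in pool.sizes) {
    for (start in seq_len(n.cells)) {
        members <- ((start - 1 + seq_len(size) - 1) %% n.cells) + 1
        pooled <- rowSums(toy.counts[, members, drop=FALSE])
        A <- rbind(A, as.numeric(seq_len(n.cells) %in% members))
        b <- c(b, median(pooled / reference))
    }
}

# Each pool factor is the sum of its cells' factors, so solve A %*% sf = b
# by least squares to "deconvolve" the cell-based factors.
est.sf <- as.numeric(qr.solve(A, b))
est.sf <- est.sf / mean(est.sf)          # scale to a mean of 1
```

Because the toy data contain no noise or zeroes, est.sf recovers true.sf up to the scaling to unit mean. With real counts, the benefit comes from the pooled profiles having far fewer zeroes than any single cell.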

The size factors are well-correlated with the library sizes for all cells (Figure 8). This suggests that most of the systematic differences between cells are driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. We observe some evidence of this after oncogene induction, where the size factors after induction are systematically lower. This is consistent with composition biases (Robinson and Oshlack 2010) introduced by upregulation of genes after induction.

plot(sce$total_counts/1e6, sizeFactors(sce), log="xy",
    xlab="Library size (millions)", ylab="Size factor",
    col=c("red", "black")[sce$Oncogene], pch=16)
legend("bottomright", col=c("red", "black"), pch=16, cex=1.2,
    legend=levels(sce$Oncogene))

Figure 8: Size factors from deconvolution, plotted against library sizes for all cells in the 416B dataset. Axes are shown on a log-scale. Wild-type cells are shown in black and oncogene-induced cells are shown in red.

Comments from Aaron:

  • While the deconvolution approach is robust to the high frequency of zeroes in scRNA-seq data, it will eventually fail if too many counts are zero. This manifests as negative size factors, which are obviously nonsensical. To avoid this, the computeSumFactors function will automatically remove low-abundance genes prior to the calculation of size factors. Genes with an average count below a specified threshold (min.mean) are ignored. For read count data, the default value of 1 is usually satisfactory. For UMI data, counts are lower so a threshold of 0.1 is recommended.
  • Cell-based QC should always be performed prior to normalization, to remove cells with very low numbers of expressed genes. If this is not done, the computeSumFactors function may yield negative size factors for low-quality cells.
  • The sizes argument can be used to specify the pool sizes with which to compute the size factors. Using more sizes yields more precise estimates at the cost of some computational time and memory. In general, no pool size should be below 20 cells, to ensure that there are sufficient non-zero expression values in each pool. We also recommend that the total number of cells should be at least 100 for effective pooling.
  • For highly heterogeneous data sets, it is advisable to perform a rough clustering of the cells. This can be done with the quickCluster function and the results passed to computeSumFactors via the cluster argument. Cells in each cluster are normalized separately, and the size factors are rescaled to be comparable across clusters. This avoids the need to assume that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters. We demonstrate this approach later with a larger dataset, as there are not enough cells in the 416B dataset.
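
To see why low-abundance genes are ignored, consider a toy median-ratio calculation in base R. This is only a rough analogy: it shows an estimate collapsing to zero rather than the negative values that can arise in the deconvolution itself. All numbers are made up, and min.mean is reused here simply as a variable name for the abundance threshold.

```r
# Hypothetical average counts across all cells for 10 genes:
# seven low-abundance genes and three high-abundance genes.
ave.toy <- c(rep(0.2, 7), 10, 20, 40)

# Counts for one toy cell whose bias is 0.5 relative to the average.
cell.toy <- c(0, 0, 0, 1, 0, 0, 0, 5, 10, 20)

# Median ratio across all genes: dominated by the zero counts.
sf.all <- median(cell.toy / ave.toy)

# Median ratio after ignoring genes below the abundance threshold.
min.mean <- 1
keep <- ave.toy >= min.mean
sf.filtered <- median(cell.toy[keep] / ave.toy[keep])
```

Here sf.all is zero because most ratios are zero, while sf.filtered recovers the bias of 0.5 from the three high-abundance genes.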

6.2 Computing separate size factors for spike-in transcripts

Size factors computed from the counts for endogenous genes are usually not appropriate for normalizing the counts for spike-in transcripts. Consider an experiment without library quantification, i.e., the amount of cDNA from each library is not equalized prior to pooling and multiplexed sequencing. Here, cells containing more RNA have greater counts for endogenous genes and thus larger size factors to scale down those counts. However, the same amount of spike-in RNA is added to each cell during library preparation. This means that the counts for spike-in transcripts are not subject to the effects of RNA content. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification of expression. Similar reasoning applies in cases where library quantification is performed. For a constant total amount of cDNA, any increases in endogenous RNA content will suppress the coverage of spike-in transcripts. As a result, the bias in the spike-in counts will be opposite to that captured by the gene-based size factor.

To ensure normalization is performed correctly, we compute a separate set of size factors for the spike-in set. For each cell, the spike-in-specific size factor is defined as the total count across all transcripts in the spike-in set. This assumes that none of the spike-in transcripts are differentially expressed, which is reasonable given that the same amount and composition of spike-in RNA should have been added to each cell (A. T. L. Lun et al. 2017). (See below for a more detailed discussion on spike-in normalization.) These size factors are stored in a separate field of the SingleCellExperiment object by setting general.use=FALSE in computeSpikeFactors. This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.

sce <- computeSpikeFactors(sce, type="ERCC", general.use=FALSE)
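
The definition of the spike-in size factor can be written out in base R. The sketch below uses a made-up count matrix with two hypothetical spike-in rows. Scaling the totals to a mean of 1 is an assumption made here so that the factors are on a convenient scale.

```r
# Toy counts (hypothetical): 3 endogenous genes and 2 spike-in transcripts, 4 cells.
toy.counts <- rbind(
    gene1=c(10, 40, 20, 30),
    gene2=c(0, 8, 4, 6),
    gene3=c(5, 20, 10, 15),
    ERCC1=c(50, 100, 25, 25),
    ERCC2=c(10, 20, 5, 5)
)
toy.is.spike <- grepl("^ERCC", rownames(toy.counts))

# Spike-in size factor: total spike-in count per cell, scaled to a mean of 1.
spike.total <- colSums(toy.counts[toy.is.spike, , drop=FALSE])
spike.sf <- spike.total / mean(spike.total)
```

The endogenous rows play no part in the calculation, which is why these factors are kept separate from the gene-based ones.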

6.3 Applying the size factors to normalize gene expression

The count data are used to compute normalized log-expression values for use in downstream analyses. Each value is defined as the log2-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. Division by the size factor ensures that any cell-specific biases are removed. If spike-in-specific size factors are present in sce, they will be automatically applied to normalize the spike-in transcripts separately from the endogenous genes.

sce <- normalize(sce)
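
As a rough base R sketch of this transformation, the code below divides each count by the size factor of its cell, adds the prior count and takes the log2. It assumes that the prior count is added after the division. The toy counts, the size factors and the helper log.norm are all hypothetical.

```r
# Toy counts (hypothetical): one endogenous gene and one spike-in, 3 cells.
toy.counts <- rbind(
    gene1=c(0, 6, 2),
    ERCC1=c(3, 3, 12)
)
toy.is.spike <- c(FALSE, TRUE)
gene.sf <- c(0.5, 2, 1)    # assumed size factors for endogenous genes
spike.sf <- c(1, 1, 4)     # assumed size factors for spike-in transcripts

# Hypothetical helper: divide each column by its size factor,
# add the prior count and log2-transform.
log.norm <- function(x, sf, prior=1) {
    log2(sweep(x, 2, sf, "/") + prior)
}

# Endogenous genes and spike-ins are normalized with their own size factors.
toy.logcounts <- toy.counts
toy.logcounts[!toy.is.spike, ] <- log.norm(toy.counts[!toy.is.spike, , drop=FALSE], gene.sf)
toy.logcounts[toy.is.spike, ] <- log.norm(toy.counts[toy.is.spike, , drop=FALSE], spike.sf)
```

In this example the spike-in row becomes constant across cells after normalization, as expected when the same amount of spike-in RNA is added to every cell.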

The log-transformation is useful as it means that any differences in the values directly represent log2-fold changes in expression between cells. This is usually more relevant than the absolute differences in coverage, which need to be interpreted in the context of the overall abundance. The log-transformation also provides some measure of variance stabilization (Law et al. 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as a "logcounts" matrix in addition to the other assay elements.

7 Modelling the technical noise in gene expression

7.1 Fitting a trend to the spike-in variances

Variability in the observed expression values across genes can be driven by genuine biological heterogeneity or uninteresting technical noise. To distinguish between these two possibilities, we need to model the technical component of the variance of the expression values for each gene. We do so using the set of spike-in transcripts, which were added in the same quantity to each cell. Thus, the spike-in transcripts should exhibit no biological variability, i.e., any variance in their counts should be technical in origin.

We use the trendVar function to fit a mean-dependent trend to the variances of the log-expression values for the spike-in transcripts. We set design to block on the plate of origin for each cell, to ensure that technical differences between plates do not inflate the variances. Given the mean abundance of a gene, the fitted value of the trend is then used as an estimate of the technical component for that gene. The biological component of the variance is finally calculated by subtracting the technical component from the total variance of each gene with the decomposeVar function.

design <- model.matrix(~sce$Plate)
var.fit <- trendVar(sce, parametric=TRUE, design=design, span=0.3)
var.out <- decomposeVar(sce, var.fit)
head(var.out)
##                           mean      total         bio       tech   p.value
## ENSMUSG00000103377 0.008029604 0.01179812 -0.02378567 0.03558380 1.0000000
## ENSMUSG00000103147 0.034571996 0.07202590 -0.08118252 0.15320842 1.0000000
## ENSMUSG00000103161 0.005210840 0.00496927 -0.01812296 0.02309223 1.0000000
## ENSMUSG00000102331 0.018575647 0.03262176 -0.04969763 0.08231939 1.0000000
## ENSMUSG00000102948 0.059116545 0.08826678 -0.17371266 0.26197944 1.0000000
## Rp1                0.097464502 0.45637197  0.02445072 0.43192124 0.2865203
##                          FDR
## ENSMUSG00000103377 1.0000000
## ENSMUSG00000103147 1.0000000
## ENSMUSG00000103161 1.0000000
## ENSMUSG00000102331 1.0000000
## ENSMUSG00000102948 1.0000000
## Rp1                0.8056285
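
The decomposition itself is simple arithmetic once a trend is available. The base R sketch below fits a plain loess curve to noise-free toy "spike-in" variances and subtracts the fitted value from each gene's total variance. Plain loess is only a stand-in here, as trendVar's own fitting procedure differs. The function tech.fun, the span of 0.5 and all of the toy values are assumptions for illustration.

```r
# Hypothetical technical mean-variance relationship, used only to simulate toy data.
tech.fun <- function(m) m * exp(-m / 3)

# Noise-free "spike-in" means and variances following that relationship.
spike.mean <- seq(0, 10, length.out=41)
spike.var <- tech.fun(spike.mean)

# Fit a smooth mean-dependent trend to the spike-in variances.
fit <- loess(spike.var ~ spike.mean, span=0.5)

# Toy "genes": the third has extra variance on top of the technical component.
gene.mean <- c(1, 2.5, 4, 6, 8)
gene.total <- tech.fun(gene.mean) + c(0, 0, 1, 0, 0)

# Technical component = trend value at the gene's mean abundance;
# biological component = total variance minus technical component.
gene.tech <- unname(predict(fit, newdata=data.frame(spike.mean=gene.mean)))
gene.bio <- gene.total - gene.tech
```

Only the third toy gene ends up with a clearly positive biological component. The others are close to zero and may be slightly negative, which mirrors the negative bio values in the output above for genes whose total variance falls below the trend.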

We visually inspect the trend to confirm that it corresponds to the spike-in variances (Figure 9). The wave-like shape is typical of the mean-variance trend for log-expression values. A linear increase in the variance is observed as the mean increases from zero, as larger variances are possible when the counts increase. At very high abundances, the effect of sampling noise decreases due to the law of large numbers, resulting in a decrease in the variance.

plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
cur.spike <- isSpike(sce)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16)

Figure 9: Variance of normalized log-expression values for each gene in the 416B dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We check the distribution of expression values for the genes with the largest biological components. This ensures that the variance estimate is not driven by one or two outlier cells (Figure 10).

chosen.genes <- order(var.out$bio, decreasing=TRUE)[1:10]
plotExpression(sce, features=rownames(var.out)[chosen.genes]) + fontsize