1 Overview

Here, we describe some more detailed aspects of modelling variability in gene expression across the cell population. This includes the detection of significant highly variable genes (HVGs), as well as advanced modelling of the mean-variance trend in the presence of confounding factors.

2 Detecting highly variable genes

2.1 Overview

HVGs are defined as genes with biological components that are significantly greater than zero. These genes are interesting as they drive differences in the expression profiles between cells, and should be prioritized for further investigation. Formal detection of HVGs allows us to avoid genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation. This adds another level of statistical rigour to our previous analyses, in which we only modelled the technical component.

2.2 Setting up the data

2.2.1 Loading the dataset

To demonstrate, we use data from haematopoietic stem cells (HSCs) (Wilson et al. 2015), generated using the Smart-seq2 protocol (Picelli et al. 2014) with ERCC spike-ins. Counts were obtained from NCBI GEO as a supplementary file using the accession number GSE61533.

bfc <- BiocFileCache("raw_data", ask=FALSE)
wilson.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",

wilson.name2 <- "GSE61533_HTSEQ_count_results.xls"
gunzip(wilson.fname, destname=wilson.name2, remove=FALSE, overwrite=TRUE)

Our first task is to load the count matrix into memory. In this case, some work is required to retrieve the data from the Gzip-compressed Excel format.

all.counts <- read_excel(wilson.name2)
gene.names <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])
rownames(all.counts) <- gene.names

We store the results in a SingleCellExperiment object and identify the rows corresponding to the spike-ins based on the row names.

sce.hsc <- SingleCellExperiment(list(counts=all.counts))
## [1] 38498    96
is.spike <- grepl("^ERCC", rownames(sce.hsc))
isSpike(sce.hsc, "ERCC") <- is.spike
##    Mode   FALSE    TRUE 
## logical   38406      92

2.2.2 Quality control and normalization

For each cell, we calculate quality control metrics using the calculateQCMetrics function from scater (McCarthy et al. 2017) as previously described. We filter out HSCs that are outliers for any metric, under the assumption that these represent low-quality libraries.

sce.hsc <- calculateQCMetrics(sce.hsc)
libsize.drop <- isOutlier(sce.hsc$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce.hsc$total_features_by_counts, nmads=3, type="lower", log=TRUE)
spike.drop <- isOutlier(sce.hsc$pct_counts_ERCC, nmads=3, type="higher")
sce.hsc <- sce.hsc[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=ncol(sce.hsc))
##   ByLibSize ByFeature BySpike Remaining
## 1         2         2       3        92

We remove genes that are not expressed in any cell to reduce computational work in downstream steps.

to.keep <- nexprs(sce.hsc, byrow=TRUE) > 0
sce.hsc <- sce.hsc[to.keep,]
##    Mode   FALSE    TRUE 
## logical   17229   21269

We apply the deconvolution method (Lun, Bach, and Marioni 2016) to compute size factors for the endogenous genes (Lun, Bach, and Marioni 2016). Separate size factors for the spike-in transcripts are also calculated, as previously discussed. We then calculate log-transformed normalized expression values for further use.

sce.hsc <- computeSumFactors(sce.hsc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4075  0.8058  0.9604  1.0000  1.1503  2.0414
sce.hsc <- computeSpikeFactors(sce.hsc, type="ERCC", general.use=FALSE)
summary(sizeFactors(sce.hsc, "ERCC"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2562  0.6198  0.8623  1.0000  1.2122  3.0289
sce.hsc <- normalize(sce.hsc)

2.3 Testing for significantly positive biological components

We fit a mean-variance trend to the spike-in transcripts to quantify the technical component of the variance, as previously described. The biological component for each gene is defined as the difference between its total variance and the fitted value of the trend (Figure 1).

var.fit <- trendVar(sce.hsc, parametric=TRUE, loess.args=list(span=0.3))
var.out <- decomposeVar(sce.hsc, var.fit)
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
cur.spike <- isSpike(sce.hsc)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16)
Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

Figure 1: Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We define HVGs as those genes that have a biological component that is significantly greater than zero. We use a false discovery rate (FDR) of 5% after correcting for multiple testing with the Benjamini-Hochberg method.

hvg.out <- var.out[which(var.out$FDR <= 0.05),]
## [1] 511

We rank the results to focus on genes with larger biological components. This highlights an interesting aspect of the underlying hypothesis test, which is based on the ratio of the total variance to the expected technical variance. Ranking based on \(p\)-value tends to prioritize HVGs that are more likely to be true positives but, at the same time, less likely to be interesting. This is because the ratio can be very large for HVGs that have very low total variance and do not contribute much to the cell-cell heterogeneity.

hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),] 
write.table(file="hsc_hvg.tsv", hvg.out, sep="\t", quote=FALSE, col.names=NA)
## DataFrame with 6 rows and 6 columns
##                      mean            total              bio             tech
##                 <numeric>        <numeric>        <numeric>        <numeric>
## Fos      6.46229730082989 19.5774140760543 12.8221002879735 6.75531378808077
## Dusp1    6.82314467206793 15.6360357795109 10.1162290203638 5.51980675914714
## Rgs1     5.31345464503953 20.3107030102909 10.0119450815048  10.298757928786
## Ppp1r15a 6.66579727019324 14.5266651189811 8.47596930729373 6.05069581168739
## Ly6a     8.40354443058819   10.05833414214 8.05800100918705 2.00033313295293
## Egr1     6.71592505528811  13.857027821927 7.97752724423428 5.87950057769273
##                       p.value                  FDR
##                     <numeric>            <numeric>
## Fos      1.01066135594786e-18 7.14571267367002e-16
## Dusp1    7.24908882891304e-18 4.15568711216418e-15
## Rgs1      9.4342535672312e-08 1.10557984759415e-05
## Ppp1r15a 1.73432258509514e-12 4.90489551366039e-10
## Ly6a     2.99530600782523e-50  9.0762051045687e-47
## Egr1     5.70569022066076e-12 1.49411599099303e-09

We check the distribution of expression values for the genes with the largest biological components. We see that the variance estimate is not driven by one or two outlier cells (Figure 2).

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotExpression(sce.hsc, features=rownames(hvg.out)[1:10]) + fontsize