1 Overview

Here, we describe some of the more theoretical aspects of detecting differential expression (DE) from single-cell RNA sequencing (scRNA-seq) data. This includes the basis of blocking on uninteresting factors of variation in findMarkers(); the use of Wilcoxon rank sum tests in overlapExprs(); incorporating other DE analysis results with combineMarkers(); and some caveats on the interpretation of DE \(p\)-values in scRNA-seq contexts.

2 Blocking on uninteresting factors of variation

2.1 Using the block= argument

Previous workflows (here and here) used the block= argument in findMarkers() to account for uninteresting factors of variation. This will instruct findMarkers() to perform pairwise \(t\)-tests between clusters using only cells on the same level of the blocking factor. It will then combine \(p\)-values from different plates using Stouffer’s Z method to obtain a single \(p\)-value per gene.

library(SingleCellExperiment)
sce.pancreas <- readRDS("pancreas_data.rds") 

# Same code as in pancreas MNN correction workflow.
library(scran)
m.out <- findMarkers(sce.pancreas, sce.pancreas$Cluster, 
    block=sce.pancreas$batch, direction="up", assay.type="original") 
demo <- m.out[["1"]] 
as.data.frame(demo[demo$Top <= 5,1:3])
##          Top       p.value           FDR
## TMSB4X     1  0.000000e+00  0.000000e+00
## ANXA2      1 2.856151e-306 1.047279e-302
## KRT8       1 4.133937e-288 1.212649e-284
## TACSTD2    1 1.080810e-247 1.981529e-244
## CFTR       1 2.944056e-233 3.321574e-230
## SLC4A4     1 4.683639e-232 4.906781e-229
## KIAA1522   1 2.452923e-215 1.998724e-212
## IGFBP7     1 7.003013e-183 2.567830e-180
## VAMP8      1 2.517592e-173 6.838059e-171
## COL18A1    1 1.146431e-157 2.128444e-155
## CD24       2  0.000000e+00  0.000000e+00
## ANXA4      2 4.054275e-314 1.982135e-310
## TM4SF1     2 8.303170e-274 1.739751e-270
## BACE2      2 1.170391e-234 1.560557e-231
## SERPING1   2 2.126506e-220 2.079297e-217
## EPCAM      2 2.598941e-201 1.411803e-198
## DDR1       2 5.138729e-173 1.345888e-170
## ONECUT2    2 2.471746e-160 4.899067e-158
## ANXA3      2 3.836901e-139 5.024627e-137
## LSR        2 8.778561e-111 6.405729e-109
## S100A11    3 4.494620e-274 1.098710e-270
## CLDN7      3 1.302434e-207 7.347229e-205
## CDH1       3 9.071719e-182 3.167974e-179
## CTSH       3 1.513027e-174 4.351287e-172
## CLDN4      3 7.633508e-142 1.056233e-139
## CIB1       3  3.900792e-73  1.091850e-71
## LITAF      4 1.583840e-214 1.222641e-211
## KRT7       4 4.426852e-209 2.705360e-206
## PTPRF      4 8.732258e-180 2.784262e-177
## KRT19      4 2.534131e-172 6.520719e-170
## MYRF       4 5.739823e-170 1.426881e-167
## CLDN10     4 6.173173e-166 1.331499e-163
## RAB25      4 2.680846e-143 3.931997e-141
## UQCR10     4  4.585739e-71  1.192536e-69
## PMEPA1     5 2.501078e-214 1.834165e-211
## KRT18      5 4.318874e-174 1.195187e-171
## NFIB       5 4.083758e-169 9.819095e-167
## PROM1      5 7.321851e-113 5.564228e-111
## LMAN2      5  1.467657e-70  3.737175e-69

Intra-batch comparisons with block= are robust to difference in the log-fold changes or variance between batches. However, we need to assume that each pair of clusters is present in at least one batch. In scenarios where cells from two clusters never co-occur in the same batch, the comparison will be impossible and NAs will be reported in the output.

2.2 Using the design= argument

Another approach is to define a design matrix containing the batch of origin as the sole factor. findMarkers() will then fit a linear model to the log-expression values, similar to the use of limma for bulk RNA sequencing data (Ritchie et al. 2015). This handles situations where multiple batches contain unique clusters, as comparisons can be implicitly performed via shared cell types in each batch. There is also a slight increase in power when information is shared across clusters for variance estimation.

# Setting up the design matrix (we remove intercept for full rank
# in the final design matrix with the cluster-specific terms).
design <- model.matrix(~sce.pancreas$batch)
design <- design[,-1,drop=FALSE]

m.alt <- findMarkers(sce.pancreas, sce.pancreas$Cluster, 
    design=design, direction="up", assay.type="original")
demo <- m.alt[["1"]]
as.data.frame(demo[demo$Top <= 5,1:3])
##          Top       p.value           FDR
## CFTR       1  0.000000e+00  0.000000e+00
## ANXA4      1  0.000000e+00  0.000000e+00
## SLC4A4     2  0.000000e+00  0.000000e+00
## SERPING1   2  0.000000e+00  0.000000e+00
## CD24       2  0.000000e+00  0.000000e+00
## KRT8       2  0.000000e+00  0.000000e+00
## B2M        2 6.785276e-166 5.156458e-164
## SPP1       3  0.000000e+00  0.000000e+00
## KRT19      3  0.000000e+00  0.000000e+00
## ANXA2      3  0.000000e+00  0.000000e+00
## RPL15      3 2.004173e-153 1.348404e-151
## TACSTD2    4  0.000000e+00  0.000000e+00
## EPCAM      4 1.572563e-267 2.957024e-265
## TMBIM6     4 1.740368e-149 1.105021e-147
## PMEPA1     5  0.000000e+00  0.000000e+00
## ALDH1A3    5  0.000000e+00  0.000000e+00
## CD59       5  0.000000e+00  0.000000e+00

The use of a linear model makes a few some strong assumptions, necessitating some caution when interpreting the results. The batch effect across cell types is assumed to be homogeneous. If this is not the case, the variance will be inflated and the log-fold change estimates will be distorted. Variances are also assumed to be equal across groups, which is not true in general. In particular, the presence of clusters in which a gene is silent will shrink the residual variance towards zero, preventing the model from penalizing genes with high variance in other clusters. Thus, we generally recommend the use of block= where possible.

3 Alternative testing methods

3.1 Using the Wilcoxon rank sum test

The overlapExprs() function uses the Wilcoxon rank sum test to detect uneven mixing of the distributions of expression values between clusters. The effect size is reported as the “area under the curve”, i.e., the probability of randomly sampling one observation in one cluster that is greater than a random observation in another cluster. This prioritizes genes where there is clear separation between the distributions of expression values of different clusters. We demonstrate the use of overlapExprs() on the 416B data set from the previous workflow, detecting DE genes between clusters while blocking on the plate of origin.

sce.416b <- readRDS("416B_data.rds")
o.out <- overlapExprs(sce.416b, group=sce.416b$cluster, block=sce.416b$Plate)
head(o.out[["1"]]) # top DEGs for cluster 1 against the others.
## DataFrame with 6 rows and 7 columns
##                          Top              p.value                  FDR
##                    <integer>            <numeric>            <numeric>
## ENSMUSG00000107771         1 6.24804570972308e-22 1.48597271114344e-17
## Hbb-b1                     1 8.26717542905217e-21 6.74596321819895e-17
## ENSMUSG00000074026         1 4.93542635295853e-20 1.30421383280458e-16
## Oip5                       1 1.76459269663039e-15 1.83263354165767e-13
## ENSMUSG00000084220         2 5.70788175385249e-20 1.35750551751874e-16
## Eef1akmt1                  2 1.89928885228432e-17 5.23470093472323e-15
##                                 AUC.2                AUC.3
##                             <numeric>            <numeric>
## ENSMUSG00000107771 0.0436177036561899    0.306451612903226
## Hbb-b1             0.0301475304682489   0.0930875576036866
## ENSMUSG00000074026 0.0814624759461193 0.000921658986175133
## Oip5                0.235086593970494    0.223963133640553
## ENSMUSG00000084220 0.0490699166132136  0.00368663594470042
## Eef1akmt1          0.0647851186658114   0.0829493087557603
##                                 AUC.4              AUC.5
##                             <numeric>          <numeric>
## ENSMUSG00000107771  0.443514644351464   0.11605415860735
## Hbb-b1             0.0993723849372385  0.470019342359768
## ENSMUSG00000074026  0.178870292887029  0.131528046421663
## Oip5                0.381799163179916 0.0783365570599613
## ENSMUSG00000084220  0.326359832635983 0.0947775628626693
## Eef1akmt1          0.0899581589958159  0.421663442940039

Effect sizes close to zero indicate that the gene is downregulated, while effect sizes close to unity correspond to upregulation. The top DE genes all exhibit strong separation between cluster 1 and the others (Figure 1).

library(scater)
plotExpression(sce.416b, x="cluster", colour_by="Plate",
    features=head(rownames(o.out[[1]])))