1 Overview

In this workflow, we use a relatively simple dataset (Lun et al. 2017) to introduce most of the concepts of scRNA-seq data analysis. This dataset contains two plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences.

Counts for all genes/transcripts in each cell are available from ArrayExpress using the accession number E-MTAB-5522. We download both the count tables (in the “processed files”) as well as the metadata file using the BiocFileCache package. This saves the files to a local cache (raw_data) and avoids re-downloading them if they are already present.

bfc <- BiocFileCache("raw_data", ask = FALSE)
lun.zip <- bfcrpath(bfc, 
lun.sdrf <- bfcrpath(bfc, 
unzip(lun.zip, exdir=tempdir())

2 Setting up the data

2.1 Loading in the count matrix

Our first task is to load the count matrices into memory. One matrix was generated for each plate of cells used in the study. In each matrix, each row represents an endogenous gene or a spike-in transcript, and each column represents a cell. Subsequently, the count in each entry of the matrix represents the number of reads mapped to a particular gene/transcript in a particular cell.

plate1 <- read.delim(file.path(tempdir(), "counts_Calero_20160113.tsv"), 
    header=TRUE, row.names=1, check.names=FALSE)
plate2 <- read.delim(file.path(tempdir(), "counts_Calero_20160325.tsv"), 
    header=TRUE, row.names=1, check.names=FALSE)

gene.lengths <- plate1$Length # First column is the gene length.
plate1 <- as.matrix(plate1[,-1]) # Discarding gene length (as it is not a cell).
plate2 <- as.matrix(plate2[,-1])
rbind(Plate1=dim(plate1), Plate2=dim(plate2))
##         [,1] [,2]
## Plate1 46703   96
## Plate2 46703   96

We combine the two matrices into a single object for further processing. This is done after verifying that the genes are in the same order between the two matrices.

stopifnot(identical(rownames(plate1), rownames(plate2)))
all.counts <- cbind(plate1, plate2)

For convenience, the count matrix is stored in a SingleCellExperiment object from the SingleCellExperiment package. This allows different types of row- and column-level metadata to be stored alongside the counts for synchronized manipulation throughout the workflow.

sce <- SingleCellExperiment(list(counts=all.counts))
rowData(sce)$GeneLength <- gene.lengths
## class: SingleCellExperiment 
## dim: 46703 192 
## metadata(0):
## assays(1): counts
## rownames(46703): ENSMUSG00000102693 ENSMUSG00000064842 ... SIRV7
##   CBFB-MYH11-mcherry
## rowData names(1): GeneLength
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

We identify the rows corresponding to ERCC spike-in transcripts from the row names. We store this information in the SingleCellExperiment object for future use. This is necessary as spike-ins require special treatment in downstream steps such as normalization.

isSpike(sce, "ERCC") <- grepl("^ERCC", rownames(sce))
summary(isSpike(sce, "ERCC"))
##    Mode   FALSE    TRUE 
## logical   46611      92

This dataset is slightly unusual in that it contains information from another set of spike-in transcripts, the Spike-In RNA Variants (SIRV) set. For simplicity, we will only use the ERCC spike-ins in this analysis. Thus, we must remove the rows corresponding to the SIRV transcripts prior to further analysis, which can be done simply by subsetting the SingleCellExperiment object.

is.sirv <- grepl("^SIRV", rownames(sce))
sce <- sce[!is.sirv,] 
##    Mode   FALSE    TRUE 
## logical   46696       7

Comments from Aaron:

  • Some feature-counting tools will report mapping statistics in the count matrix (e.g., the number of unaligned or unassigned reads). While these values can be useful for quality control, they would be misleading if treated as gene expression values. Thus, they should be removed (or at least moved to the colData) prior to further analyses.
  • Be aware of using the ^ERCC regular expression for human data where the row names of the count matrix are gene symbols. An ERCC gene family actually exists in human annotation, so this would result in incorrect identification of genes as spike-in transcripts. This problem can be avoided by publishing count matrices with standard identifiers (e.g., Ensembl, Entrez).

2.2 Incorporating cell-based annotation

We load in the metadata for each library/cell from the sdrf.txt file. It is important to check that the rows of the metadata table are in the same order as the columns of the count matrix. Otherwise, incorrect metadata will be assigned to each cell.

metadata <- read.delim(lun.sdrf, check.names=FALSE, header=TRUE)
m <- match(colnames(sce), metadata[["Source Name"]]) # Enforcing identical order.
stopifnot(all(!is.na(m))) # Checking that nothing's missing.
metadata <- metadata[m,]
## [1] "Source Name"                "Comment[ENA_SAMPLE]"       
## [3] "Comment[BioSD_SAMPLE]"      "Characteristics[organism]" 
## [5] "Characteristics[cell line]" "Characteristics[cell type]"

We only retain relevant metadata fields to avoid storing unnecessary information in the colData of the SingleCellExperiment object. In particular, we keep the plate of origin (i.e., block) and phenotype of each cell. The second field is relevant as all of the cells contain a CBFB-MYH11 oncogene, but the expression of this oncogene is only induced in a subset of the cells.

colData(sce)$Plate <- factor(metadata[["Factor Value[block]"]])
pheno <- metadata[["Factor Value[phenotype]"]]
levels(pheno) <- c("induced", "control")
colData(sce)$Oncogene <- pheno
table(colData(sce)$Oncogene, colData(sce)$Plate)
##           20160113 20160325
##   induced       48       48
##   control       48       48

2.3 Incorporating gene-based annotation

Feature-counting tools typically report genes in terms of standard identifiers from Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to the gene symbols which are more commonly used in the literature. Given the Ensembl identifiers, we obtain the corresponding gene symbols using annotation packages like org.Mm.eg.db.

symb <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="ENSEMBL", column="SYMBOL")
rowData(sce)$ENSEMBL <- rownames(sce)
rowData(sce)$SYMBOL <- symb
## DataFrame with 6 rows and 3 columns
##                    GeneLength            ENSEMBL      SYMBOL
##                     <integer>        <character> <character>
## ENSMUSG00000102693       1070 ENSMUSG00000102693          NA
## ENSMUSG00000064842        110 ENSMUSG00000064842          NA
## ENSMUSG00000051951       6094 ENSMUSG00000051951        Xkr4
## ENSMUSG00000102851        480 ENSMUSG00000102851          NA
## ENSMUSG00000103377       2819 ENSMUSG00000103377          NA
## ENSMUSG00000104017       2233 ENSMUSG00000104017          NA

It is often desirable to rename the row names of sce to the gene symbols, as these are easier to interpret. However, this requires some work to account for missing and duplicate symbols. The code below will replace missing symbols with the Ensembl identifier and concatenate duplicated symbols with the (unique) Ensembl identifiers.

rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ENSEMBL, rowData(sce)$SYMBOL)
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "Xkr4"              
## [4] "ENSMUSG00000102851" "ENSMUSG00000103377" "ENSMUSG00000104017"

We also determine the chromosomal location for each gene using the TxDb.Mmusculus.UCSC.mm10.ensGene package. This will be useful later as several quality control metrics will be computed from rows corresponding to mitochondrial genes.

location <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=rowData(sce)$ENSEMBL, 
    column="CDSCHROM", keytype="GENEID")
rowData(sce)$CHR <- location
##    Mode   FALSE    TRUE    NA's 
## logical   22428      13   24255

Alternatively, annotation from BioMart resources can be directly added to the object using the getBMFeatureAnnos function from scater. This may be more convenient than the approach shown above, but depends on an available internet connection to the BioMart databases.

3 Quality control on the cells

3.1 Defining the quality control metrics

Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics:

  • The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation.
  • The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
  • The proportion of reads mapped to spike-in transcripts is calculated relative to the library size for each cell. High proportions are indicative of poor-quality cells, where endogenous RNA has been lost during processing (e.g., due to cell lysis or RNA degradation). The same amount of spike-in RNA to each cell, so an enrichment in spike-in counts is symptomatic of loss of endogenous RNA.
  • In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can also be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

For each cell, we calculate these quality control metrics using the calculateQCMetrics function from the scater package (McCarthy et al. 2017). These are stored in the row- and column-wise metadata of the SingleCellExperiment for future reference.

mito <- which(rowData(sce)$CHR=="chrM")
sce <- calculateQCMetrics(sce, feature_controls=list(Mt=mito))
head(colnames(colData(sce)), 10)
##  [1] "Plate"                          "Oncogene"                      
##  [3] "is_cell_control"                "total_features_by_counts"      
##  [5] "log10_total_features_by_counts" "total_counts"                  
##  [7] "log10_total_counts"             "pct_counts_in_top_50_features" 
##  [9] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"

The distributions of these metrics are shown in Figure 1, stratified by oncogene induction status and plate of origin. The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions. Such cells can interfere with downstream analyses, e.g., by forming distinct clusters that complicate interpretation of the results.

sce$PlateOnco <- paste0(sce$Oncogene, ".", sce$Plate)
    plotColData(sce, y="total_counts", x="PlateOnco"),
    plotColData(sce, y="total_features_by_counts", x="PlateOnco"),
    plotColData(sce, y="pct_counts_ERCC", x="PlateOnco"),
    plotColData(sce, y="pct_counts_Mt", x="PlateOnco"),