1 Workflow version information

R version: R version 3.6.0 (2019-04-26)

Bioconductor version: 3.10

Package: 1.9.8

2 Motivation

Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing (Stegle, Teichmann, and Marioni 2015). This can be done using microfluidics platforms like the Fluidigm C1 (Pollen et al. 2014), protocols based on microtiter plates like Smart-seq2 (Picelli et al. 2014), or droplet-based technologies like inDrop (Klein et al. 2015; Macosko et al. 2015). The number of reads mapped to each gene is then used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene (Islam et al. 2014). Count data are analyzed to detect highly variable genes (HVGs) that drive heterogeneity across cells in a population, to find correlations between genes and cellular phenotypes, or to identify new subpopulations via dimensionality reduction and clustering. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.

Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data are much noisier than bulk data (Brennecke et al. 2013; Marinov et al. 2014). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise during analysis, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population (Vallejos, Marioni, and Richardson 2015; Fan et al. 2016; Trapnell et al. 2014). This is simply not possible with bulk data, meaning that custom methods are required to perform these analyses.

3 scRNA-seq data analysis with Bioconductor

This package contains a set of computational workflows for basic analysis of scRNA-seq data, using software from the open-source Bioconductor project (Huber et al. 2015). The workflows start from a count matrix and describe a number of key steps for scRNA-seq data analysis, including:

  • quality control to remove problematic cells;
  • normalization of cell-specific biases, with and without spike-ins;
  • correction for batch effects;
  • cell cycle phase classification from gene expression data;
  • data exploration to identify putative subpopulations;
  • and finally, HVG and marker gene identification to prioritize interesting genes.

The application of these procedures will be demonstrated on several public scRNA-seq datasets involving immortalized myeloid progenitors, brain cells, haematopoietic stem cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms (Lun et al. 2017; Wilson et al. 2015; Zeisel et al. 2015; Islam et al. 2011; Buettner et al. 2015; Zheng et al. 2017). The aim is to provide a variety of modular usage examples that can be applied by readers to construct custom analysis pipelines for their own experiments.

See the simpleSingleCell landing page for links to individual workflows and for instructions on how to install the required packages. To cite any of these workflows, please refer to http://f1000research.com/articles/5-2122/v2 for instructions.

4 Obtaining a count matrix

All of these workflows start from a publicly available count matrix. For simplicity, we forego a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. For SMART-seq2 data (Picelli et al. 2014), quantification procedures developed for bulk RNA-seq are generally satisfactory (Love et al. 2015; Chen, Lun, and Smyth 2016). Users favouring an R-based approach to read alignment and counting might consider using the methods in the Rsubread package (Liao, Smyth, and Shi 2013, 2014).

Many other scRNA-seq protocols contain bespoke sequence structures that require careful processing:

  • Unique molecular identifiers (UMIs) (Islam et al. 2014) are widely used to mitigate the effects of amplification biases. Reads with the same UMI mapping to the same gene represent a single underlying transcript molecule and only increment the count of that gene by one. Processing of this data requires extraction of the UMI sequence from each read or read pair, and a method to collapse UMI-based duplicates into a single count (Smith, Heger, and Sudbery 2017).
  • Protocols may also use custom cell barcodes to improve multiplexing efficiency beyond that offered by the standard sequencing barcodes (e.g., from Illumina). This includes data generated from droplet-based experiments (Zheng et al. 2017) or from very high-throughput plate-based protocols like MARS-seq (Jaitin et al. 2014). Processing of this data usually requires a separate step to extract the barcode sequence from each read and to allocate the read to the correct per-cell sequencing library.

For these data sets, the scPipe package (Tian et al. 2018) provides an R-based processing pipeline for obtaining a count matrix.

If spike-in RNA was added, the sequences of the spike-in transcripts can be included as additional FASTA files during genome index building prior to alignment. Similarly, genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting.

5 Quick start

The code below contains a no-frills analysis of the Zeisel et al. (2015) data set, culminating in Figure 1. This is only provided as a demonstration of the overall structure of a basic scRNA-seq analysis - for each step, more sophisticated procedures will be presented in the following workflows. (While this simple analysis is unlikely to do too poorly if applied directly to a userโ€™s data set, it is still helpful to read the workflows to gain an understanding of the potential problems at each step and their solutions.)

# Preparing data:
sce <- ZeiselBrainData()
sce <- sce[!isSpike(sce),] # not using the spike-ins, for simplicity.

# Dimensionality reduction:
sce <- normalize(sce) # normalizing by library size for speed.
sce <- runPCA(sce, scale=FALSE) # auto-selects the most variable genes.
sce <- runTSNE(sce, use_dimred="PCA")

# Defining clusters and markers:
snn.gr <- buildSNNGraph(sce, use.dimred="PCA")
sce$cluster <- factor(igraph::cluster_walktrap(snn.gr)$membership)
markers <- findMarkers(sce, sce$cluster)
plotTSNE(sce, colour_by="cluster")