This document provides advice for users of early versions of scater who will need to transition from the use of the SCESet class to the SingleCellExperiment class.

As of July 2017, scater has switched from the SCESet class previously defined within the package to the more widely applicable SingleCellExperiment class. From Bioconductor 3.6 (October 2017), the release version of scater will use SingleCellExperiment.

SingleCellExperiment is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.

It should be straightforward to convert existing scripts based on SCESet objects to SingleCellExperiment objects, with key changes outlined immediately below.

1 Executive summary

Potential “gotchas”:

2 A note on terminology

In Bioconductor terminology we assay numerous “features” for a number of “samples”. Features, in the context of scater, correspond most commonly to genes or transcripts, but could be any general genomic or transcriptomic regions (e.g. exon) of interest for which we take measurements. Samples correspond to cells.

With the switch to using the SingleCellExperiment class, the terminology has become more general again. Now we have “rows” representing features and “cols” representing samples (cells). Thus, applying the rownames function returns the names of the features defined for a SingleCellExperiment object, which in typical scater usage would correspond to gene IDs. In much of what follows, it may be more intuitive to mentally replace “feature” with “gene” or “transcript” (depending on the context of the study) wherever “feature” appears.

In the scater context, “samples” refer to individual cells that we have assayed. This differs from common usage in other contexts, where “sample” usually refers to an individual subject, a biological replicate or similar. A “sample” in that sense may be referred to in scater as a “block”, in the more classical statistical sense. Within a “block” (e.g. an individual) we may have assayed numerous cells. Thus, the function colnames, when applied to a SingleCellExperiment object, returns the cell IDs.
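The row/column convention described above can be illustrated with a small sketch. The matrix values and the names toy_counts and toy_sce are made up for illustration and are not part of scater; only the SingleCellExperiment package is assumed to be installed.

```r
library(SingleCellExperiment)

# Toy count matrix (made-up values): 3 features (rows) by 4 cells (columns)
toy_counts <- matrix(
    c(0, 5, 2, 1, 0, 3, 7, 0, 4, 2, 1, 6), nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

rownames(toy_sce)  # feature (gene) IDs
colnames(toy_sce)  # cell IDs
dim(toy_sce)       # number of features, then number of cells
```

Here rownames gives the feature names and colnames gives the cell names, matching the “rows are features, cols are cells” convention.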

3 The SingleCellExperiment class and methods

In scater we organise single-cell expression data in objects of the SingleCellExperiment class. The class inherits from the Bioconductor SummarizedExperiment class, which provides a common interface across many Bioconductor packages. For more details about the features inherited from Bioconductor’s SummarizedExperiment class, type ?SummarizedExperiment at the R prompt.

The class only requires some “assay data” (i.e. expression values of some sort) as input. Most commonly, these will be “counts” (e.g. molecule or read counts) and/or log2-scale transformed counts.

Cell metadata can be supplied as a DataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.). Feature metadata can be supplied as a DataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as Ensembl ID, biotype, GC content, etc.

We can create a minimal SingleCellExperiment object as follows:

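library(scater)
## load the example datasets shipped with scater
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts))
example_sce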
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

The requirements for the SingleCellExperiment class (as with other S4 classes in R and Bioconductor) are strict. The idea is that strictness with generating a valid class object ensures that downstream methods applied to the class will work reliably.

Thus, if we supply colData and/or rowData when building an object, the expression value matrix must have the same number of columns as the colData DataFrame has rows, and it must have the same number of rows as the rowData DataFrame has rows. Row names of the colData object need to match the column names of the expression matrix, and row names of the rowData object need to match the row names of the expression matrix.
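As a rough illustration of this strictness, the sketch below passes cell metadata with too few rows and captures the resulting error. The object names (toy_counts, bad_cell_info, result) and values are hypothetical, and the exact wording of the error message may vary between package versions, so it is not shown here.

```r
library(SingleCellExperiment)

# Toy matrix: 3 features by 4 cells
toy_counts <- matrix(
    1:12, nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4))
)

# Hypothetical cell metadata with only 3 rows, although the matrix has 4 columns
bad_cell_info <- DataFrame(
    Treatment = c("a", "b", "a"),
    row.names = paste0("Cell_", 1:3)
)

# Construction is expected to fail; capture the message instead of stopping
result <- tryCatch(
    SingleCellExperiment(assays = list(counts = toy_counts),
                         colData = bad_cell_info),
    error = function(e) conditionMessage(e)
)
is.character(result)  # TRUE when the constructor rejected the input
```

Catching the error with tryCatch is only a convenience for demonstration. In a real script it is usually preferable to let the error stop execution so that mismatched metadata is fixed at the source.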

We can create a new SingleCellExperiment object with count data, cell metadata and gene metadata as follows.

gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)
example_sce
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(1): Gene
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

Typically, we will want both raw counts and log2-scale counts in our SingleCellExperiment object. It is straightforward to add log2-counts-per-million to an object containing counts.

We can use the normalise (or, if you prefer, normalize) function:

example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors

(This warning indicates that, because size factors for normalisation have not yet been defined, library sizes (total counts) are used instead. The same function can also perform more sophisticated size-factor normalisation once size factors have been calculated.)

Alternatively, we can use calculateCPM directly (with equivalent results):

logcounts(example_sce) <- log2(calculateCPM(example_sce, 
                                            use.size.factors = FALSE) + 1)

The log-scale count data is stored in the logcounts assay slot of a SingleCellExperiment object. The exprs getter/setter function also accesses this logcounts slot, to enable equivalent usage as in previous versions of scater.
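To make the arithmetic behind log2-counts-per-million concrete, here is a minimal sketch that computes it by hand with base R and stores it in the logcounts assay. It uses plain library sizes with no size factors and is not scater's implementation. The toy values and the names toy_counts, toy_sce, lib_sizes and cpm_mat are assumptions for illustration.

```r
library(SingleCellExperiment)

# Toy matrix (made-up values): 3 features by 2 cells
toy_counts <- matrix(
    c(0, 10, 90, 50, 0, 50), nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:2))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Counts-per-million: divide each column by its library size, scale by 1e6
lib_sizes <- colSums(counts(toy_sce))
cpm_mat <- sweep(counts(toy_sce), 2, lib_sizes, "/") * 1e6

# Store log2(CPM + 1) in the logcounts assay
logcounts(toy_sce) <- log2(cpm_mat + 1)

assayNames(toy_sce)             # "counts" and "logcounts"
logcounts(toy_sce)["Gene_1", ]  # a zero count stays at zero on the log scale
```

The offset of 1 keeps zero counts at zero after the log2 transformation, which is the same offset used in the calculateCPM example above.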

4 Subsetting, accessing and assigning data in a SingleCellExperiment object

We have accessor functions to access elements of the SingleCellExperiment object. Furthermore, subsetting SingleCellExperiment objects is straightforward and reliable, using the usual R [] notation, with rows representing features and columns representing cells.

counts(example_sce)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001        0      123        2        0        0        0
## Gene_0002      575       65        3     1561     2311      160
## Gene_0003        0        0        0        0     1213        0
exprs(example_sce)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004  Cell_005 Cell_006
## Gene_0001 0.000000 8.192430 1.828628  0.00000  0.000000 0.000000
## Gene_0002 9.033633 7.276677 2.271422 11.07878 10.103749 8.492693
## Gene_0003 0.000000 0.000000 0.000000  0.00000  9.174997 0.000000
assay(example_sce, "counts")[1:3, 1:6]

Similarly, we can assign a new (say, transformed) expression matrix to a SingleCellExperiment object using assay as follows:

assay(example_sce, "counts") <- counts(example_sce)
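The following sketch combines assignment and subsetting on a toy object: it stores a transformed matrix under a new assay name and then subsets by row index and column name. The assay name "sqrt_counts" and the objects toy_counts, toy_sce and sub_sce are arbitrary choices for illustration, not names used by scater.

```r
library(SingleCellExperiment)

# Toy matrix: 3 features by 4 cells, filled column-wise with 1..12
toy_counts <- matrix(
    1:12, nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Store a transformed matrix under a new, arbitrarily named assay
assay(toy_sce, "sqrt_counts") <- sqrt(counts(toy_sce))

# Subset to the first two features and the cells named Cell_2 and Cell_4
sub_sce <- toy_sce[1:2, c("Cell_2", "Cell_4")]

dim(sub_sce)
assayNames(sub_sce)
assay(sub_sce, "sqrt_counts")
```

Subsetting the object subsets every assay consistently, so the counts and sqrt_counts matrices in sub_sce cover the same features and cells.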

For convenience (and backwards compatibility), getters and setters are provided for exprs, tpm, cpm and fpkm, along with versions of these with the prefix “norm_”.

Handily, it is also easy to replace other data in slots of the SingleCellExperiment object using generic accessor and replacement functions.

gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
## replace rowData (previously featureData)
rowData(example_sce) <- gene_df
## replace colData (previously phenotype data)
colData(example_sce) <- DataFrame(sc_example_cell_info)
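The same replacement functions can be tried on a self-contained toy object. In the sketch below, the metadata columns (Biotype, Treatment) and their values are invented for illustration. It also shows that colData columns can be read with `$` and used to subset cells.

```r
library(SingleCellExperiment)

# Toy matrix: 3 features by 4 cells
toy_counts <- matrix(
    1:12, nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Hypothetical feature metadata: one row per feature, named to match rownames
rowData(toy_sce) <- DataFrame(
    Biotype = c("protein_coding", "lncRNA", "protein_coding"),
    row.names = rownames(toy_sce)
)

# Hypothetical cell metadata: one row per cell, named to match colnames
colData(toy_sce) <- DataFrame(
    Treatment = c("ctrl", "ctrl", "drug", "drug"),
    row.names = colnames(toy_sce)
)

toy_sce$Treatment  # colData columns are accessible with `$`

# Keep only the cells whose Treatment is "drug"
drug_sce <- toy_sce[, toy_sce$Treatment == "drug"]
colnames(drug_sce)
```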

After gaining familiarity with creating and manipulating SingleCellExperiment objects, see the other scater vignettes for guidance on using scater for quality control, data visualisation and more.