
This document provides advice for users of early versions of scater who will need to transition from the use of the SCESet class to the SingleCellExperiment class.

As of July 2017, scater has switched from the SCESet class previously defined within the package to the more widely applicable SingleCellExperiment class. From Bioconductor 3.6 (October 2017), the release version of scater will use SingleCellExperiment.

SingleCellExperiment is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.

It should be straightforward to convert existing scripts based on SCESet objects to SingleCellExperiment objects, with key changes outlined immediately below.

# 1 Executive summary

• The functions toSingleCellExperiment and updateSCESet (for backwards compatibility) can be used to convert an old SCESet object to a SingleCellExperiment object;
• Create a new SingleCellExperiment object with the function SingleCellExperiment (actually less fiddly than creating a new SCESet);
• scater functions have been refactored to take SingleCellExperiment objects, so once data is in a SingleCellExperiment object, the user experience is almost identical to that with the SCESet class.

Potential “gotchas”:

• Cell names can now be accessed/assigned with the colnames function (instead of sampleNames or cellNames for an SCESet object);
• Feature (gene/transcript) names should now be accessed/assigned with the rownames function (instead of featureNames);
• Cell metadata, stored as phenoData in an SCESet, corresponds to colData in a SingleCellExperiment object and is accessed/assigned with the colData function (this replaces the pData function);
• Individual cell-level variables can still be accessed with the $ operator (e.g. sce$total_counts);
• Feature metadata, stored as featureData in an SCESet, corresponds to rowData in a SingleCellExperiment object and is accessed/assigned with the rowData function (this replaces the fData function);
• plotScater, which produces a cumulative expression overview plot, replaces the generic plot function for SCESet objects.
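
The accessor changes listed above can be tried on a toy object. The following is a minimal sketch, assuming the SingleCellExperiment package is installed; the object toy_sce and the Treatment, Biotype and Batch columns are hypothetical names made up for illustration, not part of scater.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes x 4 cells
toy_counts <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4)))
toy_sce <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(Treatment = c("a", "a", "b", "b"),
                        row.names = colnames(toy_counts)),
    rowData = DataFrame(Biotype = c("x", "y", "x"),
                        row.names = rownames(toy_counts))
)

cell_ids  <- colnames(toy_sce)   # was sampleNames() / cellNames() for an SCESet
gene_ids  <- rownames(toy_sce)   # was featureNames()
cell_meta <- colData(toy_sce)    # was pData()
gene_meta <- rowData(toy_sce)    # was fData()
treatment <- toy_sce$Treatment   # $ still reaches cell-level variables

# Assignment goes through the same functions
colnames(toy_sce) <- paste0("C", 1:4)
toy_sce$Batch <- c(1, 1, 2, 2)   # adds a new column to colData
```

In an existing script, most of the migration should amount to swapping the old accessor names for the ones shown in the comments.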

# 2 A note on terminology

In Bioconductor terminology we assay numerous “features” for a number of “samples”. Features, in the context of scater, correspond most commonly to genes or transcripts, but could be any general genomic or transcriptomic regions (e.g. exon) of interest for which we take measurements. Samples correspond to cells.

With the switch to using the SingleCellExperiment class, the terminology has become more general again. Now we have “rows” representing features and “cols” representing samples (cells). Thus, applying the rownames function returns the names of the features defined for a SingleCellExperiment object, which in typical scater usage would correspond to gene IDs. In much of what follows, it may be more intuitive to mentally replace “feature” with “gene” or “transcript” (depending on the context of the study) wherever “feature” appears.

In the scater context, “samples” refer to individual cells that we have assayed. This differs from common usage of “sample” in other contexts, where we might usually use “sample” to refer to an individual subject, a biological replicate or similar. A “sample” in that other sense may be referred to in scater as a “block”, in the more classical statistical sense. Within a “block” (e.g. an individual) we may have assayed numerous cells. Thus, the function colnames, when applied to a SingleCellExperiment object, returns the cell IDs.
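
To make the distinction between cells and blocks concrete, here is a small sketch, assuming the SingleCellExperiment package is installed. The Individual column and the donor labels are hypothetical; any cell-level variable could play the role of the block.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes x 6 cells
toy_counts <- matrix(1:18, nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:6)))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Hypothetical block variable: the individual each cell was taken from
toy_sce$Individual <- rep(c("donor_A", "donor_B"), each = 3)

cell_ids <- colnames(toy_sce)                  # one ID per cell ("sample")
cells_per_block <- table(toy_sce$Individual)   # number of cells in each block
```

Here the columns of the object are the six cells, while the two individuals appear only as a column of cell metadata.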

# 3 The SingleCellExperiment class and methods

In scater we organise single-cell expression data in objects of the SingleCellExperiment class. The class inherits from the Bioconductor SummarizedExperiment class, which provides a common interface across many Bioconductor packages. For more details about other features inherited from Bioconductor’s SummarizedExperiment class, type ?SummarizedExperiment at the R prompt.

The class only requires some “assay data” (i.e. expression values of some sort) as input. Most commonly, these will be “counts” (e.g. molecule or read counts) and/or log2-scale transformed counts.

Cell metadata can be supplied as a DataFrame object, where rows are cells and columns are cell attributes (such as cell type, culture condition, day captured, etc.). Feature metadata can be supplied as a DataFrame object, where rows are features (e.g. genes) and columns are feature attributes (such as Ensembl ID, biotype, GC content, etc.).

We can create a minimal SingleCellExperiment object as follows:

library(scater)
data("sc_example_counts")
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts))
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

The requirements for the SingleCellExperiment class (as with other S4 classes in R and Bioconductor) are strict. The idea is that strictness with generating a valid class object ensures that downstream methods applied to the class will work reliably.

Thus, if we supply colData and/or rowData when building an object, the expression value matrix must have the same number of columns as the colData DataFrame has rows, and it must have the same number of rows as the rowData DataFrame has rows. Row names of the colData object need to match the column names of the expression matrix, and row names of the rowData object need to match row names of the expression matrix.
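
The dimension requirement can be checked on a toy example. This is a minimal sketch, assuming the SingleCellExperiment package is installed; toy_counts, good_cells and bad_cells are hypothetical names. It only covers a mismatch in the number of rows of colData and makes no claim about the exact error message.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes x 4 cells
toy_counts <- matrix(0L, nrow = 3, ncol = 4,
                     dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4)))

# colData with one row per cell: accepted
good_cells <- DataFrame(Treatment = c("a", "a", "b", "b"),
                        row.names = colnames(toy_counts))
ok <- SingleCellExperiment(assays = list(counts = toy_counts),
                           colData = good_cells)

# colData with only three rows for four cells: the constructor refuses
bad_cells <- good_cells[1:3, , drop = FALSE]
result <- tryCatch(
    SingleCellExperiment(assays = list(counts = toy_counts),
                         colData = bad_cells),
    error = function(e) e
)
is_rejected <- inherits(result, "error")
```

Catching the error with tryCatch keeps the script running, so the failure can be inspected rather than halting the session.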

We can create a new SingleCellExperiment object with count data, cell metadata and gene metadata as follows.

data("sc_example_cell_info")
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), colData = sc_example_cell_info, rowData = gene_df)
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(1): Gene
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

Typically, we will want both raw counts and log2-scale counts in our SingleCellExperiment object. It is straightforward to add log2-counts-per-million to an object containing counts. We can use the normalise (or, if you prefer, normalize) function:

example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors

(The warning tells us that, because size factors for normalisation have not yet been defined, library sizes (total counts) are used instead. This function can also be used for more sophisticated size-factor normalisation once size factors have been calculated.)

Or we can use calculateCPM directly (with equivalent results):

logcounts(example_sce) <- log2(calculateCPM(example_sce, use.size.factors = FALSE) + 1)

The log-scale count data are stored in the logcounts assay slot of a SingleCellExperiment object. The exprs getter/setter function also accesses this logcounts slot, so that it can be used in the same way as in previous versions of scater.

# 4 Subsetting, accessing and assigning data in a SingleCellExperiment object

We have accessor functions to access elements of the SingleCellExperiment object. Furthermore, subsetting SingleCellExperiment objects is straightforward and reliable, using the usual R [] notation, with rows representing features and columns representing cells.

• counts(object): returns the matrix of read counts. If no counts are defined for the object, then the counts matrix slot is simply NULL.

counts(example_sce)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001        0      123        2        0        0        0
## Gene_0002      575       65        3     1561     2311      160
## Gene_0003        0        0        0        0     1213        0

• exprs(object): returns the matrix of (log-counts) expression values, in fact accessing the logcounts slot of the object (synonym for logcounts). Typically these should be log2(counts-per-million) values or log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course. The package will generally assume that these are the values to use for expression.

exprs(example_sce)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004  Cell_005 Cell_006
## Gene_0001 0.000000 8.192430 1.828628  0.00000  0.000000 0.000000
## Gene_0002 9.033633 7.276677 2.271422 11.07878 10.103749 8.492693
## Gene_0003 0.000000 0.000000 0.000000  0.00000  9.174997 0.000000

• Generically, we can access any assay data from the object with the assay function. We simply supply the function with the SingleCellExperiment object and the name of the desired expression matrix:

assay(example_sce, "counts")[1:3, 1:6]

Similarly, we can assign a new (say, transformed) expression matrix to a SingleCellExperiment object using assay as follows:

assay(example_sce, "counts") <- counts(example_sce)

For convenience (and backwards compatibility), getters and setters are provided for the following: exprs, tpm, cpm, fpkm (and versions of these with the prefix “norm_”).

Handily, it is also easy to replace other data in slots of the SingleCellExperiment object using generic accessor and replacement functions.

gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
## replace rowData (previously featureData)
rowData(example_sce) <- gene_df
## replace colData (previously phenotype data)
colData(example_sce) <- DataFrame(sc_example_cell_info)
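
The log2 counts-per-million calculation and the [] subsetting described in this section can also be worked through by hand on a toy object. This is a rough sketch, assuming the SingleCellExperiment package is installed and taking CPM to mean counts divided by library size in millions (library sizes standing in for size factors). The names toy_sce, cpm_mat and Treatment are hypothetical.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes x 4 cells
toy_counts <- matrix(c(0, 10, 90,  5, 5, 0,  20, 0, 20,  1, 2, 7), nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:4)))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))
toy_sce$Treatment <- c("a", "a", "b", "b")

# log2(CPM + 1) by hand: divide each column by its library size in millions
lib_sizes <- colSums(counts(toy_sce))
cpm_mat <- sweep(counts(toy_sce), 2, lib_sizes / 1e6, "/")
logcounts(toy_sce) <- log2(cpm_mat + 1)

# Subset features by name (rows) and cells by a metadata variable (columns)
sub_sce <- toy_sce[c("Gene_1", "Gene_3"), toy_sce$Treatment == "b"]
dim(sub_sce)
```

Subsetting carries every assay and the matching rows of colData and rowData along with it, so the pieces of the object stay in step.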

After gaining familiarity with creating and manipulating SingleCellExperiment objects, see the other scater vignettes for guidance on using scater for quality control, data visualisation and more.