Contents

Splatter logo

Splatter logo

Welcome to Splatter! Splatter is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to Splatter’s functionality.

1 Installation

Splatter can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("splatter")

To install the most recent development version from Github use:

BiocManager::install("Oshlack/splatter", dependencies = TRUE,
         build_vignettes = TRUE)

2 Quickstart

Assuming you already have a matrix of count data similar to that you wish to simulate there are two simple steps to creating a simulated data set with Splatter. Here is an example using the example dataset in the scater package:

# Load package
library(splatter)

# Load example data
library(scater)
data("sc_example_counts")
# Estimate parameters from example data
params <- splatEstimate(sc_example_counts)
# Simulate data using estimated parameters
sim <- splatSimulate(params)
## Getting parameters...
## Creating simulation object...
## Simulating library sizes...
## Simulating gene means...
## Simulating BCV...
## Simulating counts...
## Simulating dropout (if needed)...
## Done!

These steps will be explained in detail in the following sections but briefly the first step takes a dataset and estimates simulation parameters from it and the second step takes those parameters and simulates a new dataset.

3 The Splat simulation

Before we look at how we estimate parameters let’s first look at how Splatter simulates data and what those parameters are. We use the term ‘Splat’ to refer to the Splatter’s own simulation and differentiate it from the package itself. The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution and the Biological Coefficient of Variation is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution) that makes it easier to match to a given dataset.

Splat can also simulate differential expression between groups of different types of cells or differentiation paths between different cells types where expression changes in a continuous way. These are described further in the simulating counts section.

4 The SplatParams object

All the parameters for the Splat simulation are stored in a SplatParams object. Let’s create a new one and see what it looks like.

params <- newSplatParams()
params
## A Params object of class SplatParams 
## Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
## 
## Global: 
## (Genes)  (Cells)   [Seed] 
##   10000      100   114527 
## 
## 28 additional parameters 
## 
## Batches: 
##     [Batches]  [Batch Cells]     [Location]        [Scale] 
##             1            100            0.1            0.1 
## 
## Mean: 
##  (Rate)  (Shape) 
##     0.3      0.6 
## 
## Library size: 
## (Location)     (Scale)      (Norm) 
##         11         0.2       FALSE 
## 
## Exprs outliers: 
## (Probability)     (Location)        (Scale) 
##          0.05              4            0.5 
## 
## Groups: 
##      [Groups]  [Group Probs] 
##             1              1 
## 
## Diff expr: 
## [Probability]    [Down Prob]     [Location]        [Scale] 
##           0.1            0.5            0.1            0.4 
## 
## BCV: 
## (Common Disp)          (DoF) 
##           0.1             60 
## 
## Dropout: 
##     [Type]  (Midpoint)     (Shape) 
##       none           0          -1 
## 
## Paths: 
##         [From]         [Steps]          [Skew]    [Non-linear] 
##              0             100             0.5             0.1 
## [Sigma Factor] 
##            0.8

As well as telling us what type of object we have (“A Params object of class SplatParams”) and showing us the values of the parameter this output gives us some extra information. We can see which parameters can be estimated by the splatEstimate function (those in parentheses), which can’t be estimated (those in brackets) and which have been changed from their default values (those in ALL CAPS). For more details about the parameters of the Splat simulation refer to the Splat parameters vignette.

4.1 Getting and setting

If we want to look at a particular parameter, for example the number of genes to simulate, we can extract it using the getParam function:

getParam(params, "nGenes")
## [1] 10000

Alternatively, to give a parameter a new value we can use the setParam function:

params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
## [1] 5000

If we want to extract multiple parameters (as a list) or set multiple parameters we can use the getParams or setParams functions:

# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
## $nGenes
## [1] 8000
## 
## $mean.rate
## [1] 0.5
## 
## $mean.shape
## [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
## A Params object of class SplatParams 
## Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
## 
## Global: 
## (GENES)  (Cells)   [Seed] 
##    8000      100   114527 
## 
## 28 additional parameters 
## 
## Batches: 
##     [Batches]  [Batch Cells]     [Location]        [Scale] 
##             1            100            0.1            0.1 
## 
## Mean: 
##  (RATE)  (SHAPE) 
##     0.5      0.5 
## 
## Library size: 
## (Location)     (Scale)      (Norm) 
##         11         0.2       FALSE 
## 
## Exprs outliers: 
## (Probability)     (Location)        (Scale) 
##          0.05              4            0.5 
## 
## Groups: 
##      [Groups]  [Group Probs] 
##             1              1 
## 
## Diff expr: 
## [PROBABILITY]    [Down Prob]     [Location]        [Scale] 
##           0.2            0.5            0.1            0.4 
## 
## BCV: 
## (Common Disp)          (DoF) 
##           0.1             60 
## 
## Dropout: 
##     [Type]  (Midpoint)     (Shape) 
##       none           0          -1 
## 
## Paths: 
##         [From]         [Steps]          [Skew]    [Non-linear] 
##              0             100             0.5             0.1 
## [Sigma Factor] 
##            0.8

The parameters with have changed are now shown in ALL CAPS to indicate that they been changed form the default.

We can also set parameters directly when we call newSplatParams:

params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
getParams(params, c("lib.loc", "lib.scale"))
## $lib.loc
## [1] 12
## 
## $lib.scale
## [1] 0.6

5 Estimating parameters

Splat allows you to estimate many of it’s parameters from a data set containing counts using the splatEstimate function.

# Check that sc_example counts is an integer matrix
class(sc_example_counts)
## [1] "matrix"
typeof(sc_example_counts)
## [1] "integer"
# Check the dimensions, each row is a gene, each column is a cell
dim(sc_example_counts)
## [1] 2000   40
# Show the first few entries
sc_example_counts[1:5, 1:5]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
## Gene_0001        0      123        2        0        0
## Gene_0002      575       65        3     1561     2311
## Gene_0003        0        0        0        0     1213
## Gene_0004        0        1        0        0        0
## Gene_0005        0        0       11        0        0
params <- splatEstimate(sc_example_counts)

Here we estimated parameters from a counts matrix but splatEstimate can also take a SingleCellExperiment object. The estimation process has the following steps:

  1. Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
  2. Library size parameters are estimated by fitting a log-normal distribution to the library sizes.
  3. Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
  4. BCV parameters are estimated using the estimateDisp function from the edgeR package.
  5. Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.

For more details of the estimation procedures see ?splatEstimate.

6 Simulating counts

Once we have a set of parameters we are happy with we can use splatSimulate to simulate counts. If we want to make small adjustments to the parameters we can provide them as additional arguments, alternatively if we don’t supply any parameters the defaults will be used:

sim <- splatSimulate(params, nGenes = 1000)
## Getting parameters...
## Creating simulation object...
## Simulating library sizes...
## Simulating gene means...
## Simulating BCV...
## Simulating counts...
## Simulating dropout (if needed)...
## Done!
sim
## class: SingleCellExperiment 
## dim: 1000 40 
## metadata(1): Params
## assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
## rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
## rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
## colnames(40): Cell1 Cell2 ... Cell39 Cell40
## colData names(3): Cell Batch ExpLibSize
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

Looking at the output of splatSimulate we can see that sim is SingleCellExperiment object with 1000 features (genes) and 40 samples (cells). The main part of this object is a features by samples matrix containing the simulated counts (accessed using counts), although it can also hold other expression measures such as FPKM or TPM. Additionally a SingleCellExperiment contains phenotype information about each cell (accessed using colData) and feature information about each gene (accessed using rowData). Splatter uses these slots, as well as assays, to store information about the intermediate values of the simulation.

# Access the counts
counts(sim)[1:5, 1:5]
##       Cell1 Cell2 Cell3 Cell4 Cell5
## Gene1     0     0     0     0     0
## Gene2  7391     0    98     3   437
## Gene3     0   283    21     0     0
## Gene4     0     0     2     0     0
## Gene5     1     0     0     0     0
# Information about genes
head(rowData(sim))
## DataFrame with 6 rows and 4 columns
##           Gene      BaseGeneMean OutlierFactor          GeneMean
##       <factor>         <numeric>     <numeric>         <numeric>
## Gene1    Gene1 0.144399326768841             1 0.144399326768841
## Gene2    Gene2  479.829867069926             1  479.829867069926
## Gene3    Gene3  218.164519873479             1  218.164519873479
## Gene4    Gene4  3.18934736017947             1  3.18934736017947
## Gene5    Gene5  26.5049670122472             1  26.5049670122472
## Gene6    Gene6  152.121261695714             1  152.121261695714
# Information about cells
head(colData(sim))
## DataFrame with 6 rows and 3 columns
##           Cell       Batch       ExpLibSize
##       <factor> <character>        <numeric>
## Cell1    Cell1      Batch1 509604.619698246
## Cell2    Cell2      Batch1 476526.862535086
## Cell3    Cell3      Batch1 174314.794343492
## Cell4    Cell4      Batch1 475552.976691851
## Cell5    Cell5      Batch1 202039.330544699
## Cell6    Cell6      Batch1 272163.090583856
# Gene by cell matrices
names(assays(sim))
## [1] "BatchCellMeans" "BaseCellMeans"  "BCV"            "CellMeans"     
## [5] "TrueCounts"     "counts"
# Example of cell means matrix
assays(sim)$CellMeans[1:5, 1:5]
##              Cell1        Cell2        Cell3        Cell4        Cell5
## Gene1 4.881823e-08 3.053771e-15 3.209655e-25 2.298869e-10 3.309099e-01
## Gene2 7.281759e+03 9.494515e-10 9.398240e+01 7.642241e-01 4.639846e+02
## Gene3 6.056596e-05 2.810254e+02 1.834295e+01 7.059391e-02 1.398545e-04
## Gene4 1.261653e-08 2.204427e-02 1.801516e+00 2.761605e-07 5.706076e-09
## Gene5 1.494638e+00 5.804280e-04 2.240843e-04 4.678670e-42 8.762725e-08

An additional (big) advantage of outputting a SingleCellExperiment is that we get immediate access to other analysis packages, such as the plotting functions in scater. For example we can make a PCA plot:

# Use scater to calculate logcounts
sim <- normalize(sim)
## Warning in .local(object, ...): using library sizes as size factors
## Warning: 'centreSizeFactors' is deprecated.
## See help("Deprecated")
# Plot PCA
plotPCA(sim)
## Warning: call 'runPCA' explicitly to compute results

(NOTE: Your values and plots may look different as the simulation is random and produces different results each time it is run.)

For more details about the SingleCellExperiment object refer to the [vignette] SCE-vignette. For information about what you can do with scater refer to the scater documentation and vignette.

The splatSimulate function outputs the following additional information about the simulation:

Values that have been added by Splatter are named using UpperCamelCase to separate them from the underscore_naming used by scater and other packages. For more information on the simulation see ?splatSimulate.

6.1 Simulating groups

So far we have only simulated a single population of cells but often we are interested in investigating a mixed population of cells and looking to see what cell types are present or what differences there are between them. Splatter is able to simulate these situations by changing the method argument Here we are going to simulate two groups, by specifying the group.prob parameter and setting the method parameter to "groups":

(NOTE: We have also set the verbose argument to FALSE to stop Splatter printing progress messages.)

sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
                            verbose = FALSE)
sim.groups <- normalize(sim.groups)
## Warning in .local(object, ...): using library sizes as size factors
## Warning: 'centreSizeFactors' is deprecated.
## See help("Deprecated")
plotPCA(sim.groups, colour_by = "Group")
## Warning: call 'runPCA' explicitly to compute results