Contents

# For analysis of scRNAseq data
library(aggregateBioVar)
library(SummarizedExperiment, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
library(DESeq2, quietly = TRUE)

# For data transformation and visualization
library(magrittr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(ggtext, quietly = TRUE)

1 Introduction

Single cell RNA sequencing (scRNA-seq) studies allow gene expression quantification at the level of individual cells, and these studies introduce multiple layers of biological complexity. These include variations in gene expression between cell states within a sample (e.g., T cells versus macrophages), between samples within a population (e.g., biological or technical replicates), and between populations (e.g., healthy versus diseased individuals). Because many early scRNA-seq studies involved analysis of only a single sample, many bioinformatics tools operate on the first layer, comparing gene expression between cells within a sample. This software is aimed at organizing scRNA-seq data to permit analysis in the latter two layers, comparing gene expression between samples and between populations. An example is given with an implementation of differential gene expression analysis between populations. From scRNA-seq data stored as a SingleCellExperiment (Lun and Risso, 2020) object with pre-defined cell states, aggregateBioVar() stratifies data as a list of SummarizedExperiment (Morgan et al., 2020) objects, a standard Bioconductor data structure for downstream analysis of RNA-seq data.

2 Case Study: Small Airway Epithelium in Cystic Fibrosis

To illustrate the utility of biological replication for scRNA-seq sequencing experiments, consider a set of single cell data from porcine small airway epithelium. In this study, small airway (< 2 mm) tissue samples were collected from newborn pigs (Sus scrofa) to investigate gene expression patterns and cellular composition in a cystic fibrosis phenotype. Single cell sequencing samples were prepared using a 10X Genomics Chromium controller and sequenced on an Illumina HiSeq4000. Data obtained from seven individuals include both non-CF (CFTR+/+; genotype WT; n=4) and CFTR-knockout subjects expressing a cystic fibrosis phenotype (CFTR-/-; genotype CFTRKO; n=3). Cell types were determined following a standard scRNA-seq pipeline using Seurat (Stuart et al., 2019), including cell count normalization, scaling, determination of highly variable genes, dimension reduction via principal components analysis, and shared nearest neighbor clustering. Both unsupervised marker detection (via Seurat::FindMarkers()) and a list of known marker genes were used to annotate cell types. The full data set has been uploaded to the Gene Expression Omnibus as accession number GSE150211.

2.1 The small_airway Dataset

A subset of 1339 genes from 2722 cells assigned as secretory, endothelial, and immune cell types are available in the small_airway data set. The data are formatted as a SingleCellExperiment class S4 object (Amezquita et al., 2020), an extension of the RangedSummarizedExperiment class from the SummarizedExperiment package.

small_airway
#> class: SingleCellExperiment 
#> dim: 1339 2722 
#> metadata(0):
#> assays(1): counts
#> rownames(1339): MPC1 PRKN ... OTOP1 UNC80
#> rowData names(0):
#> colnames(2722): SWT1_AAAGAACAGACATAAC SWT1_AAAGGATTCTCCGAAA ...
#>   SCF3_TTTGGAGGTAAGGCTG SCF3_TTTGGTTCAATAGTAG
#> colData names(6): orig.ident nCount_RNA ... Region celltype
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

The primary data in SingleCellExperiment objects are stored in the assays slot. Here, a single assay counts contains gene counts from the single cell sequencing data. Each column of the assay count matrix represents a cell and each row a feature (e.g., gene). Assay slot data can be obtained by SummarizedExperiment::assay(), indicating theSingleCellExperiment object and name of the assay slot (here, "counts"). In the special case of the assay being named "counts", the data can be accessed with SingleCellExperiment::counts().

assays(small_airway)
#> List of length 1
#> names(1): counts

# Dimensions of gene-by-cell count matrix
dim(counts(small_airway))
#> [1] 1339 2722

# Access dgCMatrix with gene counts
counts(small_airway)[1:5, 1:30]
#> 5 x 30 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 30 column names 'SWT1_AAAGAACAGACATAAC', 'SWT1_AAAGGATTCTCCGAAA', 'SWT1_AAAGGTATCCACGTAA' ... ]]
#>                                                                       
#> MPC1    . 1 2 8 15 4 . 1 7 3 . 3 1 4 3 1 7 8 4 2 2 1  3 2 4 . 3 2 3  2
#> PRKN    . . . .  . . . . . . . . . . . . . . . . . .  . . . . . . .  .
#> SLC22A3 . . . .  . . . . . . . . . . . . . . . . . .  . . . . . . .  .
#> CNKSR3  . . 1 .  1 . . . . . . . . . . . 1 . 1 . . .  2 . . . . . .  3
#> UTRN    2 2 . 4  6 1 . . 1 . 3 6 1 4 1 . 9 2 1 8 1 1 12 5 4 1 2 5 1 14

2.2 Cell Metadata

SingleCellExperiment objects may also include column metadata with additional information annotating individual cells. Here, the metadata variable orig.ident identifies the biological sample of the cell while celltype indicates the assigned cellular identity. Of the 7 individual subjects, 3 are CF (SCF1, SCF2, SCF3) and 4 are non-CF (SWT1, SWT2, SWT3, SWT4). Genotype indicates the sample genotype, one of WT or CFTRKO for non-CF and CF subjects, respectively. Column metadata from SingleCellExperiment objects can be accessed with the $ operator, where the length of a metadata column variable is equal to the number of columns (i.e., cells) in the feature count matrix from the assays slot.

# Subject values
table(small_airway$orig.ident)
#> 
#> SCF1 SCF2 SCF3 SWT1 SWT2 SWT3 SWT4 
#>  450  283  758  324  294  258  355

# Cell type values
table(small_airway$celltype)
#> 
#> Endothelial cell      Immune cell   Secretory cell 
#>              915              585             1222

# Subject genotype
table(small_airway$Genotype)
#> 
#> CFTRKO     WT 
#>   1491   1231

The experiment metadata are included as an S4 DataFrame object. To access the full column metadata from SingleCellExperiment objects, use colData() from the SummarizedExperiment package. Here, metadata include 6 variables for each of 2722 cells. In addition to the biological sample identifier, cell type, and genotype, metadata include total unique molecular identifiers (UMIs) and number of detected features.

colData(small_airway)
#> DataFrame with 2722 rows and 6 columns
#>                        orig.ident nCount_RNA nFeature_RNA    Genotype
#>                       <character>  <numeric>    <integer> <character>
#> SWT1_AAAGAACAGACATAAC        SWT1        594          169          WT
#> SWT1_AAAGGATTCTCCGAAA        SWT1       1133          301          WT
#> SWT1_AAAGGTATCCACGTAA        SWT1       1204          281          WT
#> SWT1_AACACACCAATCCTTT        SWT1       1998          365          WT
#> SWT1_AACCACAGTTACTCAG        SWT1       3268          427          WT
#> ...                           ...        ...          ...         ...
#> SCF3_TTTGGAGAGCAGTAAT        SCF3       1953          301      CFTRKO
#> SCF3_TTTGGAGAGCGTCTGC        SCF3        215           95      CFTRKO
#> SCF3_TTTGGAGCATATACCG        SCF3       1154          234      CFTRKO
#> SCF3_TTTGGAGGTAAGGCTG        SCF3        703          187      CFTRKO
#> SCF3_TTTGGTTCAATAGTAG        SCF3       2319          360      CFTRKO
#>                            Region       celltype
#>                       <character>       <factor>
#> SWT1_AAAGAACAGACATAAC       Small Immune cell   
#> SWT1_AAAGGATTCTCCGAAA       Small Secretory cell
#> SWT1_AAAGGTATCCACGTAA       Small Secretory cell
#> SWT1_AACACACCAATCCTTT       Small Secretory cell
#> SWT1_AACCACAGTTACTCAG       Small Secretory cell
#> ...                           ...            ...
#> SCF3_TTTGGAGAGCAGTAAT       Small Secretory cell
#> SCF3_TTTGGAGAGCGTCTGC       Small Secretory cell
#> SCF3_TTTGGAGCATATACCG       Small Secretory cell
#> SCF3_TTTGGAGGTAAGGCTG       Small Secretory cell
#> SCF3_TTTGGTTCAATAGTAG       Small Secretory cell

2.3 Aggregating Gene Counts

The main functionality of this package involves two generalizable operations:

  1. Aggregate within-subject gene counts
  2. Summarize metadata to retain inter-subject variation

A wrapper function aggregateBioVar() abstracts away these operations and applies them on a by-cell type basis. For input, a SingleCellExperiment object containing gene counts should contain metadata variables for the subject by which to aggregate cells (e.g., biological sample) and the assigned cell types.

2.3.1 Gene-by-subject Count Matrix

The first operation involves summing all gene counts by subject. For each gene, counts from all cells within each subject are combined. A gene-by-cell count matrix is converted into a gene-by-subject count matrix.

countsBySubject(scExp = small_airway, subjectVar = "orig.ident")
#> DataFrame with 1339 rows and 7 columns
#>              SWT1      SWT2      SWT3      SCF1      SCF2      SWT4      SCF3
#>         <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> MPC1          977      1039      1053      1100       889       712      2907
#> PRKN            2         0        13         4         3         3        18
#> SLC22A3        13         4        19        10        18         1        13
#> CNKSR3        170       172       400       245       242       249       324
#> UTRN         1322      1060      1355      1617       730       979      1338
#> ...           ...       ...       ...       ...       ...       ...       ...
#> CNNM1           0         0         0         0         0         0         0
#> GABRQ           0         0         0         0         0         0         1
#> GIF             0         0         0         0         0         0         0
#> OTOP1           0         0         0         0         0         0         0
#> UNC80           0         0         0         0         0         0         0

2.3.2 Subject Metadata

The second operation removes metadata variables with intrasubject variation. This effectively retains inter-subject metadata and eliminates variables with intrasubject (i.e., intercellular) variation (e.g., feature or gene counts by cell). This summarized metadata is used for modeling a differential expression design matrix.

subjectMetaData(scExp = small_airway, subjectVar = "orig.ident")
#> DataFrame with 7 rows and 3 columns
#>       orig.ident    Genotype      Region
#>      <character> <character> <character>
#> SWT1        SWT1          WT       Small
#> SWT2        SWT2          WT       Small
#> SWT3        SWT3          WT       Small
#> SCF1        SCF1      CFTRKO       Small
#> SCF2        SCF2      CFTRKO       Small
#> SWT4        SWT4          WT       Small
#> SCF3        SCF3      CFTRKO       Small

2.3.3 Return SummarizedExperiment

Both the gene count aggregation and metadata collation steps are combined in summarizedCounts(). This function returns a SummarizedExperiment object with the gene-by-subject count matrix in the assays slot, and the summarized inter-subject metadata as colData. Notice the column names now correspond to the subject level, replacing the cellular barcodes in the SingleCellExperiment following aggregation of gene counts from within-subject cells.

summarizedCounts(scExp = small_airway, subjectVar = "orig.ident")
#> class: SummarizedExperiment 
#> dim: 1339 7 
#> metadata(0):
#> assays(1): counts
#> rownames(1339): MPC1 PRKN ... OTOP1 UNC80
#> rowData names(0):
#> colnames(7): SWT1 SWT2 ... SWT4 SCF3
#> colData names(3): orig.ident Genotype Region

2.4 aggregateBioVar()

These operations are applied to each cell type subset with aggregateBioVar(). The full SingleCellExperiment object is subset by cell type (e.g., secretory, endothelial, and immune cell), the gene-by-subject aggregate count matrix and collated metadata are tabulated, and a SummarizedExperiment object for that cell type is constructed. A list of SummarizedExperiment objects output by summarizedCounts() is the returned to the user. The first element contains the aggregate SummarizedExperiment across all cells, and subsequent list elements correspond to the cell type indicated by the metadata variable cellVar:

aggregateBioVar(scExp = small_airway,
                subjectVar = "orig.ident", cellVar = "celltype")
#> Coercing metadata variable to character: celltype
#> $AllCells
#> class: SummarizedExperiment 
#> dim: 1339 7 
#> metadata(0):
#> assays(1): counts
#> rownames(1339): MPC1 PRKN ... OTOP1 UNC80
#> rowData names(0):
#> colnames(7): SWT1 SWT2 ... SWT4 SCF3
#> colData names(3): orig.ident Genotype Region
#> 
#> $`Immune cell`
#> class: SummarizedExperiment 
#> dim: 1339 7 
#> metadata(0):
#> assays(1): counts
#> rownames(1339): MPC1 PRKN ... OTOP1 UNC80
#> rowData names(0):
#> colnames(7): SWT1 SWT2 ... SWT4 SCF3
#> colData names(4): orig.ident Genotype Region celltype
#> 
#> $`Secretory cell`
#> class: SummarizedExperiment 
#> dim: 1339 7 
#> metadata(0):
#> assays(1): counts
#> rownames(1339): MPC1 PRKN ... OTOP1 UNC80
#> rowData names(0):
#> colnames(7): SWT1 SWT2 ... SWT4 SCF3
#> colData names(4): orig.ident Genotype Region celltype
#> 
#> $`Endothelial cell`
#> class: SummarizedExperiment 
#> dim: 1339 7 
#> metadata(0):
#> assays(1): counts
#> rownames(1339): MPC1 PRKN ... OTOP1 UNC80
#> rowData names(0):
#> colnames(7): SWT1 SWT2 ... SWT4 SCF3
#> colData names(4): orig.ident Genotype Region celltype

3 Application to Differential Gene Expression (DGE)

In this case, we want to test for differential expression between non-CF and CF pigs in the Secretory cell subset. To do so, aggregateBioVar() is run on the SingleCellExperiment object by indicated the metadata variables representing the subject-level (subjectVar) and assigned cell type (cellVar). If multiple assays are included in the input scExp object, the first assay slot is used.

# Perform aggregation of counts and metadata by subject and cell type.
aggregate_counts <-
    aggregateBioVar(
        scExp = small_airway,
        subjectVar = "orig.ident", cellVar = "celltype"
    )
#> Coercing metadata variable to character: celltype

3.1 Exploratory Data Analysis

To visualize the gene-by-subject count aggregation, consider a function to calculate log2 counts per million cells and display a heatmap of normalized expression using pheatmap (Kolde, 2019). RColorBrewer (Neuwirth, 2014) and viridis (Garnier, 2018) are used to generate discrete and continuous color scales, respectively.

#' Single-cell Counts `pheatmap`
#'
#' @param sumExp `SummarizedExperiment` or `SingleCellExperiment` object
#'   with individual cell or aggregate counts by-subject.
#' @param logSample Subset of log2 values to include for clustering.
#' @param ... Forwarding arguments to pheatmap
#' @inheritParams aggregateBioVar
#'
scPHeatmap <- function(sumExp, subjectVar, gtVar, logSample = 1:100, ...) {
    orderSumExp <- sumExp[, order(sumExp[[subjectVar]])]
    sumExpCounts <- as.matrix(
        SummarizedExperiment::assay(orderSumExp, "counts")
    )
    logcpm <- log2(
        1e6*t(t(sumExpCounts) / colSums(sumExpCounts)) + 1
    )
    annotations <- data.frame(
        Genotype = orderSumExp[[gtVar]],
        Subject = orderSumExp[[subjectVar]]
    )
    rownames(annotations) <- colnames(orderSumExp)

    singleCellpHeatmap <- pheatmap::pheatmap(
        mat = logcpm[logSample, ], annotation_col = annotations,
        cluster_cols = FALSE, show_rownames = FALSE, show_colnames = FALSE,
        scale = "none", ...
    )
    return(singleCellpHeatmap)
}

Without aggregation, bulk RNA-seq methods for differential expression analysis would be applied at the cell level (here, secretory cells; Figure 1).

# Subset `SingleCellExperiment` secretory cells.
sumExp <- small_airway[, small_airway$celltype == "Secretory cell"]

# List of annotation color specifications for pheatmap.
ann_colors <- list(
    Genotype = c(CFTRKO = "red", WT = "black"),
    Subject = c(RColorBrewer::brewer.pal(7, "Accent"))
)
ann_names <- unique(sumExp[["orig.ident"]])
names(ann_colors$Subject) <- ann_names[order(ann_names)]

# Heatmap of log2 expression across all cells.
scPHeatmap(
    sumExp = sumExp, logSample = 1:100,
    subjectVar = "orig.ident", gtVar = "Genotype",
    color = viridis::viridis(75), annotation_colors = ann_colors,
    treeheight_row = 0, treeheight_col = 0
)
Gene-by-cell Count Matrix. Heatmap of *secretory cell* gene expression with log~2~ counts per million cells. Includes 1222 cells from 7 subjects with genotypes `WT` (n=4, cells=406) and `CFTRKO` (n=3, cells=816).

Figure 1: Gene-by-cell Count Matrix
Heatmap of secretory cell gene expression with log2 counts per million cells. Includes 1222 cells from 7 subjects with genotypes WT (n=4, cells=406) and CFTRKO (n=3, cells=816).


Summation of gene counts across all cells creates a “pseudo-bulk” data set on which a subject-level test of differential expression is applied (Figure 2).

# List of `SummarizedExperiment` objects with aggregate subject counts.
scExp <-
    aggregateBioVar(
        scExp = small_airway,
        subjectVar = "orig.ident", cellVar = "celltype"
    )
#> Coercing metadata variable to character: celltype

# Heatmap of log2 expression from aggregate gene-by-subject count matrix.
scPHeatmap(
    sumExp = aggregate_counts$`Secretory cell`, logSample = 1:100,
    subjectVar = "orig.ident", gtVar = "Genotype",
    color = viridis::viridis(75), annotation_colors = ann_colors,
    treeheight_row = 0, treeheight_col = 0
)
Gene-by-subject Count Matrix. Heatmap of gene counts aggregated by subject with `aggregateBioVar()`.The `SummarizedExperiment` object with the *Secretory cells* subset contains gene counts summed by subject. Aggregate gene-by-subject counts are used as input for bulk RNA-seq tools.

Figure 2: Gene-by-subject Count Matrix
Heatmap of gene counts aggregated by subject with aggregateBioVar().The SummarizedExperiment object with the Secretory cells subset contains gene counts summed by subject. Aggregate gene-by-subject counts are used as input for bulk RNA-seq tools.

3.2 DGE with DESeq2

To run DESeq2 (Love et al., 2014), a DESeqDataSet object can be constructed using DESeqDataSetFromMatrix(). Here, the aggregate counts and subject metadata from the secretory cell subset are modeled by the variable Genotype. Differential expression analysis is performed with DESeq and a results table is extracted by results() to obtain log2 fold changes with p-values and adjusted p-values.

subj_dds_dataset <-
    DESeqDataSetFromMatrix(
        countData = assay(aggregate_counts$`Secretory cell`, "counts"),
        colData = colData(aggregate_counts$`Secretory cell`),
        design = ~ Genotype
    )
#> converting counts to integer mode
#> Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
#> design formula are characters, converting to factors

subj_dds <- DESeq(subj_dds_dataset)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing

subj_dds_results <-
    results(subj_dds, contrast = c("Genotype", "WT", "CFTRKO"))

For comparison of differential expression with and without aggregation of gene-by-subject counts, a subset of all secretory cells is used to construct a DESeqDataSet and analysis of differential expression is repeated.

cells_secretory <-
    small_airway[, which(
        as.character(small_airway$celltype) == "Secretory cell")]
cells_secretory$Genotype <- as.factor(cells_secretory$Genotype)

cell_dds_dataset <-
    DESeqDataSetFromMatrix(
        countData = assay(cells_secretory, "counts"),
        colData = colData(cells_secretory),
        design = ~ Genotype
    )
#> converting counts to integer mode

cell_dds <- DESeq(cell_dds_dataset)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> -- note: fitType='parametric', but the dispersion trend was not well captured by the
#>    function: y = a/x + b, and a local regression fit was automatically substituted.
#>    specify fitType='local' or 'mean' to avoid this message next time.
#> final dispersion estimates
#> fitting model and testing
#> -- replacing outliers and refitting for 2 genes
#> -- DESeq argument 'minReplicatesForReplace' = 7 
#> -- original counts are preserved in counts(dds)
#> estimating dispersions
#> fitting model and testing

cell_dds_results <-
    results(cell_dds, contrast = c("Genotype", "WT", "CFTRKO"))

Add a new variable with log10-transformed adjusted P-values.

subj_dds_transf <- as.data.frame(subj_dds_results) %>%
    bind_cols(feature = rownames(subj_dds_results)) %>%
    mutate(log_padj = - log(.data$padj, base = 10))

cell_dds_transf <- as.data.frame(cell_dds_results) %>%
    bind_cols(feature = rownames(cell_dds_results)) %>%
    mutate(log_padj = - log(.data$padj, base = 10))

3.3 Results

DGE is summarized by volcano plot ggplot (Wickham and Chang et al., 2020) to show cell-level (Figure 3A) and subject-level tests (Figure 3B). Aggregation of gene counts by subject reduced the number of genes with both an adjusted p-value < 0.05 and an absolute log2 fold change > 1 from 65 genes to 2 (Figure 3).

# Function to add theme for ggplots of DESeq2 results.
deseq_themes <- function() {
    list(
        theme_classic(),
        lims(x = c(-4, 5), y = c(0, 80)),
        labs(
            x = "log<sub>2</sub> (fold change)",
            y = "-log<sub>10</sub> (p<sub>adj</sub>)"
        ),
        ggplot2::theme(
            axis.title.x = ggtext::element_markdown(),
            axis.title.y = ggtext::element_markdown())
    )
}

# Build ggplots to visualize subject-level differential expression in scRNA-seq
ggplot_full <- ggplot(data = cell_dds_transf) +
    geom_point(aes(x = log2FoldChange, y = log_padj), na.rm = TRUE) +
    geom_point(
        data = filter(
            .data = cell_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange, y = log_padj), color = "red"
    ) +
    deseq_themes()

ggplot_subj <- ggplot(data = subj_dds_transf) +
    geom_point(aes(x = log2FoldChange, y = log_padj), na.rm = TRUE) +
    geom_point(
        data = filter(
            .data = subj_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange, y = log_padj), color = "red"
    ) +
    geom_label(
        data = filter(
            .data = subj_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange + 0.5, y = log_padj + 5, label = feature)
    ) +
    deseq_themes()

cowplot::plot_grid(ggplot_full, ggplot_subj, ncol = 2, labels = c("A", "B"))