This lab demonstrates how to use the iSEE package to create and configure interactive applications for the exploration of various types of genomics data sets (e.g., bulk and single-cell RNA-seq, CyTOF, gene expression microarray).
We use example data from the TENxPBMCData package. This package provides an R / Bioconductor resource for representing and manipulating different single-cell RNA-seq data sets profiling peripheral blood mononuclear cells (PBMC) generated by 10X Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets).
The man page for the TENxPBMCData()
function gives an idea of the datasets that are available from this package. It can be opened with the following command.
Here, we use the "pbmc3k"
dataset, which contains gene expression profiles for 2,700 single peripheral blood mononuclear cells. The first time this dataset is loaded, this command downloads the dataset to a local cache, which takes some time, depending on the speed of your internet connection. Subsequent times, the same command loads the dataset directly from the local cache.
At this point we can inspect the dataset in the console.
sce
#> class: SingleCellExperiment
#> dim: 32738 2700
#> metadata(0):
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ...
#> ENSG00000215616 ENSG00000215611
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> spikeNames(0):
The dataset is provided as an object of the SingleCellExperiment
class. In particular, this summary view indicates that the following pieces of information are available:
"counts"
NULL
"Barcode"
) and the donor identifier ("Individual"
).Note that a SingleCellExperiment
object (or, more generally, any SummarizedExperiment
object) like this one already contains sufficient information to launch an interactive application instance to visualize the available data and metadata, using the iSEE()
function.
For the purpose of this workshop, we first apply some preprocessing to the SingleCellExperiment
object, in order to populate it with more information that can be visualized with iSEE
. The steps below roughly follow those outlined in the simpleSingleCell Bioconductor workflow.
We start by adding column names to the object, and use gene symbols instead of Ensembl IDs as row names. In the case where multiple Ensembl identifiers correspond to the same gene symbol, the scater::uniquifyFeatureNames
function concatenates the Ensembl ID and the gene symbol in order to generate unique feature names.
library(scater)
colnames(sce) <- paste0("Cell", seq_len(ncol(sce)))
rownames(sce) <- scater::uniquifyFeatureNames(
ID = rowData(sce)$ENSEMBL_ID,
names = rowData(sce)$Symbol_TENx
)
head(rownames(sce))
#> [1] "MIR1302-10" "FAM138A" "OR4F5" "RP11-34P13.7"
#> [5] "RP11-34P13.8" "AL627309.1"
Next, we use the scater package to calculate gene- and cell-level quality metrics. These metrics are added as columns to the rowData
and colData
slots of the SingleCellExperiment
object, respectively.
MT <- rownames(sce)[grep("^MT-", rownames(sce))]
sce <- scater::calculateQCMetrics(object = sce,
feature_controls = list(MT = MT))
sce
#> class: SingleCellExperiment
#> dim: 32738 2700
#> metadata(0):
#> assays(1): counts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(11): ENSEMBL_ID Symbol_TENx ... total_counts
#> log10_total_counts
#> colnames(2700): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(47): Sample Barcode ...
#> pct_counts_in_top_200_features_MT
#> pct_counts_in_top_500_features_MT
#> reducedDimNames(0):
#> spikeNames(0):
We filter out a few cells with a large fraction of the counts coming from mitochondrial genes, since these may be damaged cells. Notice the reduced number of columns in the dataset below.
(sce <- sce[, sce$pct_counts_MT < 5])
#> class: SingleCellExperiment
#> dim: 32738 2643
#> metadata(0):
#> assays(1): counts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(11): ENSEMBL_ID Symbol_TENx ... total_counts
#> log10_total_counts
#> colnames(2643): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(47): Sample Barcode ...
#> pct_counts_in_top_200_features_MT
#> pct_counts_in_top_500_features_MT
#> reducedDimNames(0):
#> spikeNames(0):
Next, we calculate size factors and normalized and log-transformed expression values, using the scran and scater packages. Note that it is typically recommended to pre-cluster the cells before computing the size factors, as follows:
# set.seed(1000)
# clusters <- scran::quickCluster(sce, BSPARAM = IrlbaParam())
# sce <- scran::computeSumFactors(sce, cluster = clusters, min.mean = 0.1)
However, for time reasons, we will skip the pre-clustering step in this workshop.
library(scran)
sce <- scran::computeSumFactors(sce, min.mean = 0.1)
summary(sizeFactors(sce))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.1164 0.7309 0.9004 1.0000 1.1635 9.8580
sce <- scater::normalize(sce)
In order to extract the most informative genes, we first model the mean-variance trend and decompose the variance into biological and technical components.
logcounts(sce) <- as.matrix(logcounts(sce))
new.trend <- scran::makeTechTrend(x = sce)
fit <- scran::trendVar(sce, use.spikes = FALSE, loess.args = list(span = 0.05))
fit$trend <- new.trend
dec <- scran::decomposeVar(fit = fit)
top.dec <- dec[order(dec$bio, decreasing = TRUE), ]
head(top.dec)
#> DataFrame with 6 rows and 6 columns
#> mean total bio
#> <numeric> <numeric> <numeric>
#> LYZ 1.63949764898989 3.92208536906663 3.27404036212034
#> S100A9 1.07322748956473 3.36899492417173 2.73849817385183
#> FTL 3.67845532888203 2.88498930976467 2.662250196795
#> HLA-DRA 1.56936865352987 3.16577098114059 2.51202239465566
#> CD74 2.23652914180461 2.75839122768224 2.21465618657833
#> FTH1 3.52182996359555 2.20828385314013 1.95849254164614
#> tech p.value FDR
#> <numeric> <numeric> <numeric>
#> LYZ 0.648045006946287 0 0
#> S100A9 0.630496750319901 0 0
#> FTL 0.22273911296967 0 0
#> HLA-DRA 0.653748586484937 0 0
#> CD74 0.543735041103904 0 0
#> FTH1 0.249791311493993 0 0
Next, we apply Principal Components Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) to generate low-dimensional representations of the cells in our data set. These low-dimensional representations are added to the reducedDim
slot of the SingleCellExperiment
object.
library(BiocSingular)
set.seed(1000)
sce <- scran::denoisePCA(sce, technical = new.trend, BSPARAM = IrlbaParam())
ncol(reducedDim(sce, "PCA"))
#> [1] 32
set.seed(1000)
sce <- scater::runTSNE(sce, use_dimred = "PCA", perplexity = 30)
sce
#> class: SingleCellExperiment
#> dim: 32738 2643
#> metadata(1): log.exprs.offset
#> assays(2): counts logcounts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(11): ENSEMBL_ID Symbol_TENx ... total_counts
#> log10_total_counts
#> colnames(2643): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(47): Sample Barcode ...
#> pct_counts_in_top_200_features_MT
#> pct_counts_in_top_500_features_MT
#> reducedDimNames(2): PCA TSNE
#> spikeNames(0):
Finally, we cluster the cells using a graph-based algorithm, and find ‘marker genes’ for each cluster as the genes that are significantly upregulated in the cluster compared to each of the other inferred clusters. The adjusted p-values from this test, for each cluster, are added to the rowData
slot of the object.
snn.gr <- scran::buildSNNGraph(sce, use.dimred = "PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12
#> 197 692 161 31 150 331 162 339 137 422 10 11
markers <- scran::findMarkers(sce, clusters = sce$Cluster,
direction = "up", pval.type = "all")
for (i in names(markers)) {
rowData(sce)[, paste0("FDR_cluster", i)] <-
markers[[i]]$FDR[match(rownames(sce),
rownames(markers[[i]]))]
}
sce
#> class: SingleCellExperiment
#> dim: 32738 2643
#> metadata(1): log.exprs.offset
#> assays(2): counts logcounts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(23): ENSEMBL_ID Symbol_TENx ... FDR_cluster11
#> FDR_cluster12
#> colnames(2643): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(48): Sample Barcode ...
#> pct_counts_in_top_500_features_MT Cluster
#> reducedDimNames(2): PCA TSNE
#> spikeNames(0):
This concludes the preparation of the data. We have now a SingleCellExperiment
object that contains different types of abundance values, representations in reduced dimensions, as well as a range of row (feature) and column (cell) metadata.
We can launch an iSEE instance for exploring this data set using the iSEE()
function:
The main argument to the iSEE()
function is a SummarizedExperiment
object, or an object of any class extending SummarizedExperiment
(such as SingleCellExperiment
, in this case). No other restrictions are made on the type of data stored in the object, and iSEE can be used for interactive visualization of many different types of data. It is also worth noting that for various types of data, Bioconductor packages provides functionality for directly importing quantifications generated by external software packages into a SummarizedExperiment
object. For example, the DropletUtils package can read quantifications from the 10x Genomics CellRanger pipeline for single-cell RNA-seq data, and the tximeta package can be used to read data from transcript quantification pipelines into a SummarizedExperiment
object.
This section provides an overview of the graphical interface of iSEE applications. To follow along, make sure that you have launched an iSEE instance as described in the code block at the end of the previous section. Note that in the default configuration, the panels do not look exactly like the ones shown in the screenshots below. For example, data points are not immediately colored, and the default annotation variables displayed by each panel may differ. Later sections of this workshop demonstrate how to modify the content of the panels and how they are displayed.
Note that for simplicity, we typically refer to a SummarizedExperiment
in this workshop; however, iSEE works seamlessly for objects of any class extending SummarizedExperiment
as well (e.g., SingleCellExperiment
, DESeqDataSet
). That said, some types of panels – such as the Reduced dimension plot
– are only available for objects that contain a reducedDim
slot (in particular, SingleCellExperiment
objects); the basic SummarizedExperiment
class does not contain this slot. In this workshop, we refer to the rows of the SummarizedExperiment
object as ‘features’ (these can be genes, transcripts, genomic regions, etc) and to the columns as ‘samples’ (which, in our example data set, are single cells).
The iSEE user interface consists of a number of panels, each displaying the data provided in the SummarizedExperiment
from a specific perspective. There are 8 standard panel types; 6 plot panels and 2 table panels, all showcased in the figure below. In the default configuration, one panel of each type is included when launching the iSEE user interface. However, users are free to rearrange, resize, remove or add panels freely, as described in a later section. We provide a brief overview of each panel type in the following subsections.
In addition to the 8 standard panel types, user can create custom panels (both plots and tables). The creation and configuration of custom panels is discussed later, in a dedicated section of this workshop.
The reduced dimension plot can display any reduced dimension representation that is present in the reducedDim
slot of the SingleCellExperiment
object.
Note that this slot is not defined for the base SummarizedExperiment
class, in which case the user interface does not allow the inclusion of panels of this type.