1 Installation and loading dependencies

To install this package, start R and enter (un-commented):

# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("CytoMDS")

We now load the packages needed in the current vignette:


2 Introduction

The CytoMDS package implements low dimensional visualization of cytometry samples, in order to visually assess distances between them. This, in turn, can greatly help the user to identify quality issues like batch effects or outlier samples, and/or check the presence of potential sample clusters that might align with the experimental design.

The CytoMDS algorithm combines, on the one hand, the concept of Earth Mover’s Distance (EMD) (Orlova et al. 2016), a.k.a. Wasserstein metric and, on the other hand, the metric Multi Dimensional Scaling (MDS) algorithm for the low dimensional projection (Leeuw and Mair 2009).

Besides projection itself, the package also provides some diagnostic tools for both checking the quality of the MDS projection, as well as interpreting the axes of the projection (see below sections).

3 Illustrative dataset

The illustrative dataset that is used throughout this vignette is a mass cytometry (CyTOF) dataset from (Bodenmiller et al. 2012), and provided in the Bioconductor HDCytoData data package (Weber and Soneson (2019)).

This dataset consists of 16 paired samples (8 times 2) of peripheral blood cells from healthy individuals. Among each sample pair, one sample - the reference - was left un-stimulated, while the other sample was stimulated with B cell receptor / Fc receptor cross-linker (BCR-XL).

This public dataset is known to contain a strong differential expression signal between the two conditions (stimulated vs un-stimulated) and as been used in recent work to benchmark differential analysis algorithms ((Weber et al. 2019)) and to design mass cytometry data analysis pipelines ((Nowicka et al. 2017)).

In the CytoMDSpackage, as in the current vignette, matrices of cytometry events intensities, corresponding to one sample, are stored as flowCore::flowFrame (Ellis et al. 2023) objects. Samples of a particular cytometry dataset are then stored as a flowCore::flowSet object, which is a collection of flowFrame’s, i.e. one flowFrame per sample. Therefore, we load the flowSet version of the BodenMiller2012 dataset, obtained from the HDCytoData package.

BCRXL_fs <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## loading from cache
## Warning in updateObjectFromSlots(object, ..., verbose = verbose): dropping
## slot(s) 'colnames' from object = 'flowSet'
## A flowSet with 16 experiments.
## column names(39): Time Cell_length ... sample_id population_id

In regular flowSet’s, the experimental design information is typically
stored in the phenoData slot, and this is also the way CytoMDS expects to get its input. However, HDCytoData has chosen to store the experimental design information in a slightly different way, hence the need to convert the data as follows:

phenoData <- flowCore::pData(BCRXL_fs)
additionalPhenoData <- 
phenoData <- cbind(phenoData, additionalPhenoData)

flowCore::pData(BCRXL_fs) <- phenoData

We also select channels/markers that are biologically relevant, i.e. both the cell type and cell state markers, and store them for further use. We discard the typical housekeeping markers that are founds in flowFrames like time and Cell_length, etc. In total, these mass cytometry samples contain intensities for 24 biologically relevant markers.

markerInfo <- keyword(BCRXL_fs[[1]], "MARKER_INFO")$MARKER_INFO
chClass <- markerInfo$marker_class

## chClass
##  none  type state 
##    11    10    14
chLabels <- markerInfo$channel_name[chClass != "none"]
(chMarkers <- markerInfo$marker_name[chClass != "none"])
##  [1] "CD3"    "CD45"   "pNFkB"  "pp38"   "CD4"    "CD20"   "CD33"   "pStat5"
##  [9] "CD123"  "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3" "CD14"   "pSlp76"
## [17] "pBtk"   "pPlcg2" "pErk"   "pLat"   "IgM"    "pS6"    "HLA-DR" "CD7"

The first step consists in scale transforming the raw data. Indeed, distances between samples make more sense with scaled transformed signal, in which distributional differences are more readable and usable for downstream analysis.

Here, since we are dealing with mass cytometry samples, we use the classical arcsinh() transformation with 5 as co-factor, as described elsewhere ((Nowicka et al. 2017)).

trans <- arcsinhTransform(
    a = 0, 
    b = 1/5, 
    c = 0)

BCRXL_fs_trans <- transform(
    transformList(chLabels, trans))

4 Calculating distances between samples

We can now calculate pairwise Earth Mover’s Distances (EMD) between all samples of our dataset.

This is done by calling the pairwiseEMDDist() method The simplest way to
use this method is by providing directly a flowCore::flowSet, containing all samples, as input parameter. Note that, for heavy datasets that include a lot of samples, this can create memory issues. To handle this case, CytoMDS provides other ways to call the pairwiseEMDDist() function (see ‘Handling heavy datasets’ section).

Using the channels argument, it is possible to restrict the EMD calculation to some of the channels. Here we simply pass as input the biologically relevant markers selected in the previous section.

pwDist <- pairwiseEMDDist(
    channels = chMarkers,
    verbose = FALSE

The calculated distance is a symmetric square matrix, with as many rows (columns) as input samples (extract shown here below for the scale-transformed Bodenmiller2012 dataset).

round(pwDist[1:10, 1:10], 2)
##        1     2     3     4     5     6     7     8     9    10
## 1   0.00 10.46  4.30 11.18  6.39 12.69  7.11 11.85  5.88 10.13
## 2  10.46  0.00 10.91  3.16 11.45  6.06 13.17 10.61  9.61  6.08
## 3   4.30 10.91  0.00 10.72  7.44 13.17  7.47 12.84  5.84 10.00
## 4  11.18  3.16 10.72  0.00 12.10  5.97 12.66  9.63 10.35  6.72
## 5   6.39 11.45  7.44 12.10  0.00  9.19  4.45 10.19  5.28  9.96
## 6  12.69  6.06 13.17  5.97  9.19  0.00 10.56  7.23 11.61  7.74
## 7   7.11 13.17  7.47 12.66  4.45 10.56  0.00  7.24  8.89 11.87
## 8  11.85 10.61 12.84  9.63 10.19  7.23  7.24  0.00 14.50 11.68
## 9   5.88  9.61  5.84 10.35  5.28 11.61  8.89 14.50  0.00  7.10
## 10 10.13  6.08 10.00  6.72  9.96  7.74 11.87 11.68  7.10  0.00

One relevant way to visualize this distance matrix is to draw the histogram of pairwise distances, as shown in the below plot.

distVec <- pwDist[upper.tri(pwDist)]
distVecDF <- data.frame(dist = distVec)
pHist <- ggplot(distVecDF, mapping = aes(x=dist)) + 
    geom_histogram(fill = "darkgrey", col = "black", bins = 15) + 
    theme_bw() + ggtitle("EMD distances for Bodenmiller2012 dataset")

5 Metric Multidimensional scaling

5.1 Calculating the MDS projection

Once the pairwise distance matrix has been calculated, computing the Multi Dimensional Scaling (MDS) projection is done by calling the computeMetricMDS() function. In its simplest form, only the distance matrix needs to be passed to the function. In that case, the number of dimensions to use in the MDS is automatically set in order to reach a specific value for a projection quality indicator, i.e. the target pseudo R square, which in turn is set by default to 0.95 (see Quality of projection - diagnostic tools section).

Note that the Smacof algorithm (Leeuw and Mair 2009), used to compute the MDS projection, is stochastic, so it is sensitive to the ‘seed’ used. Therefore, in cases where reproducible results from one run to another is required, it is advised to set the seed argument to a fixed value.

The returned value of the computeMetricMDS() function is an object of the MDS class. This object can be queried to get e.g. the number of dimensions that was effectively used, or the obtained pseudo RSquare, as shown in the following code chunk:

mdsObj <- computeMetricMDS(pwDist, seed = 0)
## MDS object containing MDS projection (using Smacof algorithm)  data:
## Nb of dimensions:  4 
## Nb of points:  16 
## Stress:  0.040668 
## Pseudo RSquare:  0.981489 
## Goodness of fit:  0.998346

5.2 Plotting the MDS projection

Plotting the obtained MDS projection is done using ggplotSampleMDS(). If no phenoData parameter is passed, then, by default, numbers are used as labels, and samples are represented as black dots.


However, by providing a ‘phenoData’ dataframe to the ggplotSampleMDS() function, the corresponding variables can be used for highlighting sample points with different colours and/or shapes. Here below, the previous plot is enhanced with red and blue colours, distinguishing samples from the two conditions. Also, we have added more meaningful labels to each data point, corresponding to patient id’s.

On this plot, one can clearly see a clear separation between samples of the two different conditions. These 2 sample groups are different along the x axis, which corresponds to the first projection direction, explaining 46.69 % of the variability contained in the MDS projection with 4 dimensions (as indicated in the subtitle). This clear separation between 2 condition clusters highlights the strong biological signal differentiating these two groups of samples. Further in this vignette, we shall try to assign a user interpretation to this x axis direction (see ‘Aid to interpreting projection axes’ section).

p12 <- ggplotSampleMDS(
    pData = phenoData,
    projectionAxes = c(1,2),
    pDataForColour = "group_id",
    pDataForLabel = "patient_id"