Contents

0.0.1 Abstract

Genome-wide data are used to stratify patients into classes using class discovery algorithms. However, we have observed systematic bias in current state-of-the-art methods, which arises from not considering reference distributions when selecting the number of classes (K). As a solution, we developed a consensus clustering-based algorithm with a hypothesis testing framework, called Monte Carlo consensus clustering (M3C). M3C uses a multi-core enabled Monte Carlo simulation to generate null distributions along the range of K, which are used to calculate p values to select its value. P values beyond the limits of the simulation are estimated using a beta distribution. M3C can quantify structural relationships between clusters and uses spectral clustering to deal with non-Gaussian and imbalanced structures.

0.0.2 Prerequisites

M3C recommended spec:

A relatively new and fast multi-core computer or cluster.

M3C requires:

A matrix or data frame of normalised expression data (e.g. microarray or RNA-seq, but also epigenetic or protein data) where columns are samples and rows are features. For RNA-seq data, VST- or rlog-transformed count data, log(CPM), log(TPM), and log(RPKM) are all acceptable forms of normalisation.

The data should be filtered to remove features with no or very low signal, then filtered by variance to reduce dimensionality (unsupervised) or by p value from a statistical test (supervised).
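As an illustrative sketch of this pre-processing step (the thresholds and the choice of 2000 features are arbitrary assumptions, not part of M3C), unsupervised variance filtering in base R might look like:

```r
# illustrative pre-processing sketch; cut-offs are arbitrary assumptions
# mydata: matrix of normalised expression, rows = features, columns = samples

# remove features with no or very low signal (e.g. near-zero mean expression)
keep <- rowMeans(mydata) > 1
filtered <- mydata[keep, , drop = FALSE]

# unsupervised filter: keep the most variable features (here, up to 2000)
vars <- apply(filtered, 1, var)
top  <- order(vars, decreasing = TRUE)[seq_len(min(2000, nrow(filtered)))]
filtered <- filtered[top, , drop = FALSE]
```

For a supervised filter, the same indexing pattern applies, but features would instead be ranked by p values from a statistical test.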

M3C also accepts optionally:

An annotation data frame, where every row is a patient/sample and the columns hold meta-data, e.g. age, sex, etc. M3C will automatically rearrange your annotation to match the clustering output and add the consensus cluster grouping to it. This is done to speed up subsequent analyses. Note, this only works if the sample IDs (the column names of the data) match a column called “ID” in the annotation data frame.
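For illustration, a minimal annotation data frame could be structured as below; only the “ID” column is required by M3C, and the age and sex columns here are made-up covariates:

```r
# hypothetical annotation data frame; "ID" must match the sample column
# names of the expression matrix, other columns are arbitrary meta-data
annot <- data.frame(
  ID  = colnames(mydata),                              # sample identifiers
  age = sample(40:80, ncol(mydata), replace = TRUE),   # made-up covariate
  sex = sample(c("M", "F"), ncol(mydata), replace = TRUE)
)
```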

0.0.3 Example workflow

The M3C package contains the GBM (glioblastoma multiforme) cancer microarray dataset for testing. There is an accepted cluster solution of K = 4. First we load the package, which also loads the GBM data.

library(M3C)
library(NMF) # loading for aheatmap plotting function
library(gplots) # loading this for nice colour scale
library(ggsci) # more cool colours

# now we have loaded the mydata and desx objects (with the package automatically)
# mydata is the expression data for GBM
# desx is the annotation for this data

0.0.4 Exploratory data analysis

This is an important checking step prior to running M3C. It is not only outliers that should be checked for; the underlying assumptions of PAM and K-means also matter. Spectral clustering has more flexible assumptions (although it is considerably slower), but all three methods require removal of extreme outliers. It is sensible to visually inspect a PCA plot (of samples) prior to clustering to verify:

  1. Clusters are approximately Gaussian (normally distributed), i.e. not severely elongated or non-linear
  2. That one cluster is not clearly a lot smaller than another cluster
  3. That extreme outliers have been removed

You can read more about the assumptions of K means here: (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)

In the PCA plot below we can see there are no obvious non-linear structures, no evidence of cluster imbalance, and no extreme outliers, so we can proceed with running M3C on this dataset with PAM as the inner algorithm. If points 1 and 2 clearly did not hold, spectral clustering should be used instead, but this is normally not a problem.

PCA1 <- pca(mydata)