0.0.1 Abstract

A central task in genomic data analyses for stratified medicine is class discovery which is accomplished through clustering. However, an unresolved problem with current clustering algorithms is they do not test the null hypothesis and derive p values. To solve this, we developed a novel hypothesis testing framework that uses consensus clustering called Monte Carlo Consensus Clustering (M3C). M3C use a multi-core enabled Monte Carlo simulation to generate a distribution of stability scores for each number of clusters using null datasets with the same gene-gene correlation structure as the real one. These distributions are used to derive p values and a beta distribution is fitted to the data to cheaply estimate p values beyond the limits of the simulation. M3C improves accuracy, allows rejection of the null hypothesis, removes systematic bias, and uses p values to make class number decisions. We believe M3C deals with a major pitfall in current automated class discovery tools.

0.0.2 Prerequisites

M3C recommended spec:

A relatively new and fast multi-core computer or cluster.

M3C requires:

A matrix or data frame of normalised genomic expression data (e.g. microarray or RNA-seq) where columns equal samples and rows equal features. For RNA-seq data, VST or rlog transformed count data, log(CPM), log(TPM), and log(RPKM), are all acceptable forms of normalisation.

The data should be filtered to remove features with no or very low signal, and filtered using variance to reduce dimensionality (unsupervised), or p value from a statistical test (supervised).

M3C also accepts optionally:

Annotation data frame, where every row is a patient/sample and columns refer to meta-data, e.g. age, sex, etc. M3C will automatically rearrange your annotation to match the clustering output and add the consensus cluster grouping to it. This is done to speed up subsequent analyses. Note, this only works if the IDs (column names in data) match a column called “ID” in the annotation data frame.

0.0.3 Example workflow

The M3C package contains the GBM cancer microarray dataset for testing. There is an accepted cluster solution of 4. First we load the package which also loads the GBM data.

library(NMF) # loading for aheatmap plotting function
library(gplots) # loading this for nice colour scale
library(ggsci) # more cool colours

# now we have loaded the mydata and desx objects (with the package automatically)
# mydata is the expression data for GBM
# desx is the annotation for this data

0.0.4 Running M3C

In our example, we run the algorithm using the default settings (100x monte carlo iterations and 100x inner replications). We have found the results generally stable using these parameters, although they may be increased or the algorithm may simply be run a few extra times to test stability. Plots from the tool and an .csv file with the numerical outputs may be printed into the working directory (by adding printres = TRUE). We will set the seed in this example, incase you wish to repeat our results exactly (seed = 123). We will add the annotation file for streamlined downstream analyses (des = desx).

We recommend saving the workspace after M3C if you are working with a large dataset because the runtimes can be quite long. M3C uses by default PAM with Euclidean distance in the consensus clustering loop because we have found this runs fast with better results. The reference method that generates the null datasets is best left to the default setting which is ‘reverse-PCA’, this maintains the correlation structure of the data using the principle component loadings. We will set the removeplots option to TRUE in this example to remove plots from the vignette, normally this is FALSE by default.

res <- M3C(mydata, cores=1, seed = 123, des = desx, removeplots = TRUE)

The scores and p values are contained within the res$scores object. So we can see the RCSI reaches a maxima at K = 4 (RSCI=0.33), the monte carlo p value supports this decision (p=0.033). This means we can reject the null hypothesis that K = 1 for this dataset because we have achieved significance versus a dataset with no clusters (but with an identical gene-gene correlation structure). For p values that can extend beyond the lower limits imposed by the monte carlo simulation we can estimate parameters from the simulation to generate a beta distribution, the BETA_P in this case study is 0.033. We may of course want to take a look at other significant or near significant clustering solutions and how they relate to our variables of investigation.

## 1  2 0.6734694 0.5773796 -0.15394262   0.66336634 0.69721054 0.1566361
## 2  3 0.4473469 0.5410857  0.19024326   0.18811881 0.19923640 0.7006313
## 3  4 0.3404082 0.4711755  0.32508528   0.03960396 0.03313314 1.4797374
## 4  5 0.3200000 0.4018041  0.22764361   0.08910891 0.07848793 1.1051971
## 5  6 0.3102041 0.3474204  0.11330519   0.23762376 0.22889997 0.6403543
## 6  7 0.2840816 0.3031673  0.06502332   0.31683168 0.33674564 0.4726980
## 7  8 0.2653061 0.2700408  0.01768878   0.45544554 0.46514316 0.3324134
## 8  9 0.2465306 0.2413469 -0.02125070   0.56435644 0.56967334 0.2443741
## 9 10 0.2130612 0.2190531  0.02773443   0.44554455 0.44629390 0.3503790

Now we will take a look at some of the plots M3C generates.

This is a CDF plot of the GBM data we feed into the algorithm. We are looking for the value of K with the flattest curve and this can be quantified using the PAC metric (Șenbabaoğlu et al., 2014). In the CDF and following PAC plot we can see the overfitting effect of consensus clustering where as K increases so does the apparent stability of the results for any given dataset, this we correct for by using a reference.