The ClusterSignificance package provides tools to assess whether “class clusters” in dimensionality-reduced data representations have a separation different from that of permuted data. The term “class clusters” here refers to clusters of points representing known classes in the data. This is particularly useful for determining whether a subset of the variables, e.g. genes in a specific pathway, can alone separate samples into these established classes. The evaluation is accomplished in a three-stage process: *Projection*, *Separation classification*, and *Score permutation*. To be able to compare class cluster separations, we give each one a score based on its separation. First, all data points are projected onto a line (*Projection*), after which the best separation for two groups at a time is identified and scored (*Separation classification*). Finally, to get a p-value for the separation, we compare the separation score for the real data to the separation scores for permuted data (*Score permutation*).

The package includes 2 different methods for accomplishing the Projection step outlined above: *Mean line projection* (**Mlp**) and *Principal curve projection* (**Pcp**). Here we first outline the assumptions made by the ClusterSignificance method, followed by a description of what the three stages of ClusterSignificance have in common regardless of which Projection method is used. This is followed by an example of the Mlp and Pcp methods and a description of the unique features of each. Furthermore, we provide an example where ClusterSignificance is used to examine the separation between 2 class clusters downstream of a principal component analysis (PCA).

The ClusterSignificance package operates based on three simple assumptions shown below:

The dimensionality reduction method was sufficiently capable of detecting the dissimilarities, characterised by features in high-dimensional space, that correspond to the class separations under investigation.

The obtained principal curve accurately depicts the data with respect to the class separations under consideration.

Under the null hypothesis, the joint distribution remains invariant under all rearrangements of the subscripts.

The inputs to the Projection methods are a matrix of the data representation after dimensionality reduction and the class labels. (Dimensionality reduction is not strictly necessary; this is discussed further in the common questions.) The input matrix should be constructed with data points (or samples) in the rows and dimensions (or principal components) in the columns. For the method to know which rows belong to the same class, the class argument must be specified. The class argument should be a character vector without NAs. The projection step uses either the Mlp or Pcp method to project all data points onto a line, after which they are moved into a single dimension.
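As a minimal sketch of the input layout described above, the following base R code builds a small matrix and a matching class vector and checks them. The object names `mat` and `classes` and the toy values are assumptions made for illustration only; they are not part of the package.

```r
## Hypothetical toy input: 6 samples (rows) in 2 reduced dimensions (columns).
set.seed(1)
mat <- matrix(rnorm(12), nrow = 6, ncol = 2,
              dimnames = list(NULL, c("dim1", "dim2")))

## One class label per row, as a character vector without NAs.
classes <- rep(c("class1", "class2"), each = 3)

## Basic sanity checks before handing the data to a projection method.
stopifnot(
    is.matrix(mat),
    is.character(classes),
    !anyNA(classes),
    length(classes) == nrow(mat)
)
```

The checks mirror the requirements stated in the text: one label per row, labels stored as characters, and no missing labels.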

Now that all of the points are in one dimension, we can easily define a perpendicular line that best separates the two classes. This is accomplished by using the user-defined classes to calculate the sensitivity and specificity at each possible separation of the projected data. The score for each possible separation is then calculated with the formula below, which measures the complement of the Euclidean distance from each separation's position in ROC space to the point of perfect classification (sensitivity = specificity = 1):

\[score = 1 - \sqrt{(1-specificity)^2 + (1-sensitivity)^2}\]
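The score formula can be illustrated with a short base R sketch. The function `separation_scores()` below is hypothetical and not part of ClusterSignificance; it assumes that candidate separations lie midway between neighbouring projected values and that points above a separation are assigned to the class named in `positive`. The package's internal choices may differ.

```r
## Hypothetical helper: score every candidate separation of 1-D projected data.
separation_scores <- function(x, cls, positive) {
    stopifnot(length(x) == length(cls), length(unique(cls)) == 2)
    xs <- sort(unique(x))
    ## Candidate separations: midpoints between neighbouring projected values.
    cuts <- (head(xs, -1) + tail(xs, -1)) / 2
    is_pos <- cls == positive
    ## Sensitivity: fraction of the positive class lying above the separation.
    sens <- vapply(cuts, function(ct) mean(x[is_pos] > ct), numeric(1))
    ## Specificity: fraction of the other class lying at or below it.
    spec <- vapply(cuts, function(ct) mean(x[!is_pos] <= ct), numeric(1))
    score <- 1 - sqrt((1 - spec)^2 + (1 - sens)^2)
    data.frame(cut = cuts, sensitivity = sens, specificity = spec,
               score = score)
}

## Toy projected data with two perfectly separated classes.
res <- separation_scores(
    x = c(1, 2, 3, 4, 5, 6),
    cls = rep(c("c1", "c2"), each = 3),
    positive = "c2"
)
res[which.max(res$score), ]
```

In this toy case the separation at 3.5 classifies every point correctly, so sensitivity and specificity are both 1 and the score reaches its maximum of 1.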

In addition to the separation scores, the Separation classification stage also outputs the area under the curve (AUC) for each class separation under investigation. AUC serves as an intuitive measure of the class separation in the projected space.
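For intuition about AUC as a measure of separation, here is a base R sketch that computes it from ranks (the Mann-Whitney formulation). The function `auc_rank()` is a hypothetical illustration; it is an assumption that this matches how the package computes AUC, and the package may use a different procedure.

```r
## Hypothetical helper: rank-based AUC for 1-D projected data.
## Equals the probability that a random point of the positive class lies
## above a random point of the other class (ties count as one half).
auc_rank <- function(x, cls, positive) {
    is_pos <- cls == positive
    n_pos <- sum(is_pos)
    n_neg <- sum(!is_pos)
    r <- rank(x)  # ties receive average ranks
    (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

## Perfectly separated classes.
auc_perfect <- auc_rank(c(1, 2, 3, 4), c("a", "a", "b", "b"), "b")

## Partially overlapping classes.
auc_overlap <- auc_rank(c(1, 3, 2, 4), c("a", "a", "b", "b"), "b")
```

With no overlap every positive point lies above every other point and the AUC is 1; in the second example 3 of the 4 cross-class pairs are correctly ordered, so the AUC is 0.75.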

The null hypothesis of the permutation stage is that the class labels are independent of the features; thus, rejection of the null indicates a dependence between these variables and, consequently, a separation between classes. Permutation is performed by randomly sampling the input matrix, with replacement, and assigning the data points to the groups. The projection and classification steps are then run on the sampled matrix. The maximum separation score is recorded and the next permutation is performed. The p-value can subsequently be calculated with the following formula:

\[p.value = \frac{count\ of\ permuted\ scores \geq real\ score}{iterations}\]

If none of the permuted scores are greater than or equal to the real score, the p-value is instead calculated as:

\[p.value < 10^{-\log_{10}(iterations)}\]
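The resampling loop and the two p-value formulas above can be sketched in base R as follows. Everything here is illustrative: `permute_scores()`, `permutation_p()` and `mean_gap()` are hypothetical names, whole rows are resampled (one possible reading of the description above, not necessarily the package's exact procedure), and `mean_gap()` is a deliberately simple stand-in for the projection and classification steps.

```r
## Resample rows with replacement while the label vector stays fixed,
## which breaks the link between data points and classes, then score
## each resampled matrix with a user-supplied function.
permute_scores <- function(mat, cls, score_fun, iterations = 100) {
    vapply(seq_len(iterations), function(i) {
        idx <- sample(nrow(mat), replace = TRUE)
        score_fun(mat[idx, , drop = FALSE], cls)
    }, numeric(1))
}

## p-value following the two formulas above.
permutation_p <- function(real_score, permuted_scores) {
    iterations <- length(permuted_scores)
    n_extreme <- sum(permuted_scores >= real_score)
    if (n_extreme == 0) {
        ## No permuted score reached the real one: report an upper bound.
        list(p = 10^(-log10(iterations)), is_upper_bound = TRUE)
    } else {
        list(p = n_extreme / iterations, is_upper_bound = FALSE)
    }
}

## Stand-in score (NOT the package's score): absolute difference between
## the class means in the first dimension.
mean_gap <- function(m, cls) {
    groups <- unique(cls)
    abs(mean(m[cls == groups[1], 1]) - mean(m[cls == groups[2], 1]))
}

## Toy data: two classes shifted apart in the first dimension.
set.seed(1)
mat <- cbind(c(rnorm(10, mean = 0), rnorm(10, mean = 3)), rnorm(20))
cls <- rep(c("a", "b"), each = 10)

real <- mean_gap(mat, cls)
perm <- permute_scores(mat, cls, mean_gap, iterations = 200)
result <- permutation_p(real, perm)
```

Note that the bound in the second formula equals 1 / iterations, so the smallest reportable p-value is set entirely by the number of iterations.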

Because the score permutation stage will typically be a Monte Carlo test rather than an exact permutation test, the p-value is actually a p-value estimate with a confidence interval. A discussion concerning this and the number of permutations necessary can be found here.
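To illustrate why a Monte Carlo p-value carries a confidence interval, the sketch below treats the number of permuted scores that reach the real score as a binomial count and uses `binom.test()` from base R's stats package to obtain an exact (Clopper-Pearson) interval. The counts are made up for illustration, and this is one common way to compute such an interval, not necessarily the one used by the package.

```r
## Hypothetical outcome: 12 of 1000 permuted scores reached the real score.
n_extreme <- 12
iterations <- 1000

## Point estimate of the p-value.
p_estimate <- n_extreme / iterations

## Exact 95% binomial confidence interval for the estimate.
ci <- binom.test(n_extreme, iterations, conf.level = 0.95)$conf.int

## With the same proportion but 10 times more iterations, the interval narrows.
ci_more <- binom.test(n_extreme * 10, iterations * 10)$conf.int
```

Comparing `ci` with `ci_more` shows how the number of iterations controls the precision of the estimate, which matters most when the p-value lies close to the chosen significance threshold.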

As previously mentioned, there are two projection methods provided with the ClusterSignificance package: *Principal curve projection* (**Pcp**) and *Mean line projection* (**Mlp**). These are outlined below together with the situations where each can be used.

The Pcp method is suitable in situations where more than 2 dimensions need to be considered simultaneously (although Pcp also works with only 2 dimensions) and/or more than 2 class separations should be analysed. Its limitations are that it will not work with fewer than 5 data points per class and that it requires more than 5 unique values per dimension. The Pcp method uses the principal_curve function from the princurve package to fit a principal curve to the data points using all the dimensions input by the user. A principal curve can be described as a “smooth curve that passes through the middle of the data in an orthogonal sense”. All data points are then projected onto the principal curve and transferred into one dimension for scoring.
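The principal curve step can be tried in isolation with the princurve package. The sketch below assumes princurve version 2.0 or later is installed and uses made-up toy data. It shows only the curve fit and the projection; how ClusterSignificance calls `principal_curve()` internally, and with which arguments, is not shown here.

```r
## Assumes the princurve package (version 2.0 or later) is installed.
library(princurve)

## Hypothetical toy data: 60 points scattered around a curved path
## in 3 dimensions.
set.seed(42)
path <- seq(0, 1, length.out = 60)
toy <- cbind(
    dim1 = path + rnorm(60, sd = 0.05),
    dim2 = path^2 + rnorm(60, sd = 0.05),
    dim3 = sin(pi * path) + rnorm(60, sd = 0.05)
)

## Fit a principal curve through the middle of the point cloud.
fit <- principal_curve(toy)

## fit$s holds each point's projection onto the curve; fit$lambda holds
## its arc-length position along the curve, i.e. a single dimension.
projected <- fit$s
one_dim <- fit$lambda
```

The vector `one_dim` is the kind of one-dimensional representation on which candidate separations can then be scored.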

The Mlp method is suitable for comparing clusters in a maximum of 2 dimensions when the separation between a maximum of 2 classes will be evaluated at a time. Briefly, Mlp works by drawing a regression line through the means of the two classes and then projecting all points onto that line. To move the points into one dimension, they are then rotated onto the x-axis. A detailed description and graphical representation of the individual steps can be seen in the Mlp example below. It should be noted that, although the Mlp method will work with as few as 2 data points per class, it may not be advisable to perform an analysis with such a small amount of data.
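The geometry of the Mlp steps can be sketched in a few lines of base R. The function `mlp_sketch()` is hypothetical and is not the package's implementation. It assumes exactly 2 dimensions and 2 classes, and it combines the projection and the rotation into one step by computing each point's signed position along the line through the two class means.

```r
## Hypothetical helper: project 2-D points onto the line through the two
## class means and return their 1-D positions along that line.
mlp_sketch <- function(mat, cls) {
    stopifnot(ncol(mat) == 2, length(unique(cls)) == 2)
    groups <- unique(cls)
    m1 <- colMeans(mat[cls == groups[1], , drop = FALSE])
    m2 <- colMeans(mat[cls == groups[2], , drop = FALSE])
    ## Unit direction vector of the line through both class means.
    d <- (m2 - m1) / sqrt(sum((m2 - m1)^2))
    ## Signed distance of each orthogonal projection from the first mean,
    ## measured along the line; this is the coordinate the point would
    ## have after the line is rotated onto the x-axis.
    as.vector(sweep(mat, 2, m1) %*% d)
}

## Toy data: class means at (1, 0) and (5, 3), which lie 5 units apart.
toy <- rbind(c(0, 0), c(2, 0), c(4, 3), c(6, 3))
toy_classes <- c("a", "a", "b", "b")
one_dim <- mlp_sketch(toy, toy_classes)
```

In this toy example the four points land at -0.8, 0.8, 4.2 and 5.8 on the line, so the two classes stay on opposite sides of any separation placed between 0.8 and 4.2.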

`library(ClusterSignificance)`

The ClusterSignificance package includes 2 small example data matrices, one used for demonstrating the Mlp method and one used for demonstrating the Pcp method. The Mlp demo matrix has 2 dimensions, whereas the Pcp demo matrix has 3. In addition, the class variable has been added as the rownames of each matrix, with the Mlp demo matrix having 2 classes and the Pcp demo matrix having 3 classes. Finally, colnames indicating the dimension number were also added to each matrix. Neither colnames nor rownames are needed for the input matrix to ClusterSignificance functions; we have added them here only for clarity.

`data(mlpMatrix)`

| | dim1 | dim2 |
|---|---|---|
| class1 | 0.42 | 0.50 |
| class1 | 0.86 | 0.85 |
| class1 | 0.57 | 0.42 |

| classes | Freq |
|---|---|
| class1 | 20 |
| class2 | 20 |

`data(pcpMatrix)`

| | dim1 | dim2 | dim3 |
|---|---|---|---|
| class1 | 0.17 | 0.82 | 0.72 |
| class1 | 0.81 | 0.06 | 0.82 |
| class1 | 0.38 | 0.80 | 0.82 |

| classes | Freq |
|---|---|
| class1 | 20 |
| class2 | 20 |
| class3 | 20 |

To provide readers with an easy-to-follow example, we generated *in silico* data representative of output from a dimensionality reduction algorithm. The demonstration data was simulated so that one cluster is visibly separated from the other two clusters, whereas those two clusters largely overlap each other. Therefore, we can expect ClusterSignificance to find 2 of the separations to be significant, whereas the third will be insignificant.

```
## Create the group variable.
classes <- rownames(pcpMatrix)
## Run Pcp and plot.
prj <- pcp(pcpMatrix, classes)
plot(prj)
```