High-throughput, non-targeted technologies such as transcriptomics, proteomics and metabolomics are widely used to discover molecules that efficiently discriminate between biological or clinical conditions of interest (e.g., disease vs control states). Powerful machine learning approaches such as Partial Least Squares Discriminant Analysis (PLS-DA), Random Forest (RF) and Support Vector Machines (SVM) have been shown to achieve high levels of prediction accuracy. Feature selection, i.e., the selection of the few features (the molecular signature) of highest discriminating value, is a critical step in building a robust and relevant classifier (Guyon and Elisseeff 2003): first, dimension reduction is useful to limit the risk of overfitting and increase the prediction stability of the model; second, interpretation of the molecular signature is facilitated; third, when developing a diagnostic product, a restricted list is required for the subsequent validation steps (Rifai, Gillette, and Carr 2006).
Since the comprehensive analysis of all combinations of features is not computationally tractable, several selection techniques have been described, including filter (e.g., p-value thresholding), wrapper (e.g., recursive feature elimination), and embedded (e.g., sparse PLS) approaches (Saeys, Inza, and Larranaga 2007). The major challenge for such methods is to be fast and to extract restricted, stable molecular signatures that still provide high classifier performance (Gromski et al. 2014; Determan 2015).
The biosigner package implements a new wrapper feature selection algorithm:
1. the dataset is split into training and testing subsets (by bootstrapping, controlling class proportion),
2. a model is trained on the training set and balanced accuracy is evaluated on the test set,
3. the features are ranked according to their importance in the model,
4. the relevant feature subset at level f is found by a binary search: a feature subset is considered relevant if and only if, when randomly permuting the intensities of other features in the test subsets, the proportion of increased or equal prediction accuracies is lower than a defined threshold f,
5. the dataset is restricted to the selected features and steps 1 to 4 are repeated until the selected list of features is stable.
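The first steps can be sketched in base R on simulated data. This is only an illustration under stated assumptions, not the package's implementation: the data, the nearest-centroid classifier, and the helper names (`predictCentroid`, `balancedAccuracy`, `permProportion`) are all hypothetical. The sketch shows a class-stratified bootstrap split, the balanced accuracy on the out-of-bag samples, and how a proportion of "increased or equal accuracies" can be computed after shuffling a chosen set of columns in the test subset. Which columns biosigner shuffles, and how its binary search proceeds, are not reproduced here.

```r
set.seed(1)

# Simulated two-class data (hypothetical): 60 samples x 6 features,
# where only f1 and f2 differ between classes
n <- 60
y <- factor(rep(c("T1", "T2"), times = c(20, 40)))
x <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("f", 1:6)))
x[y == "T2", 1:2] <- x[y == "T2", 1:2] + 3

# Step 1: bootstrap within each class to preserve class proportions;
# samples that are never drawn (out-of-bag) form the test subset
trainVi <- unlist(lapply(split(seq_len(n), y),
                         function(idxVi) sample(idxVi, replace = TRUE)))
testVi <- setdiff(seq_len(n), trainVi)

# Stand-in classifier (assumption): nearest class centroid
centroidMN <- do.call(rbind,
                      lapply(split(as.data.frame(x[trainVi, ]), y[trainVi]),
                             colMeans))
predictCentroid <- function(newMN) {
  # squared Euclidean distance of each sample to each class centroid
  distMN <- sapply(rownames(centroidMN), function(k)
    rowSums(sweep(newMN, 2, centroidMN[k, ])^2))
  factor(colnames(distMN)[max.col(-distMN, ties.method = "first")],
         levels = levels(y))
}

# Step 2: balanced accuracy = mean of the per-class recalls on the test subset
balancedAccuracy <- function(predFc) {
  mean(sapply(levels(y), function(k) mean(predFc[y[testVi] == k] == k)))
}
refN <- balancedAccuracy(predictCentroid(x[testVi, ]))

# Permutation check: shuffle the chosen columns across test samples and
# return the proportion of accuracies that are >= the unpermuted accuracy
permProportion <- function(permVc, permI = 200) {
  permVn <- replicate(permI, {
    permMN <- x[testVi, ]
    permMN[, permVc] <- permMN[sample(nrow(permMN)), permVc]
    balancedAccuracy(predictCentroid(permMN))
  })
  mean(permVn >= refN)
}

signalN <- permProportion(c("f1", "f2"))  # informative columns shuffled
noiseN <- permProportion(c("f5", "f6"))   # uninformative columns shuffled
c(signal = signalN, noise = noiseN)
```

Shuffling informative columns should rarely leave the accuracy unchanged, whereas shuffling uninformative ones often does; comparing such a proportion to a threshold is the kind of decision described in step 4.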
Three binary classifiers have been included in biosigner, namely PLS-DA, RF and SVM, as the performance of each machine learning approach may vary depending on the structure of the dataset (Determan 2015). The algorithm returns the tier of each feature for the selected classifier(s): tier S corresponds to the final signature, i.e., features which have been found significant in all the selection steps; features with tier A have been found significant in all but the last selection, and so on for tiers B to D. Tier E groups the features discarded in all earlier rounds of selection.
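The tier definition can be illustrated with a small base R sketch. It assumes a hypothetical record (`roundLs`) of the features retained after each selection round, and a hypothetical helper (`assignTiers`); it mirrors the description above rather than the package's internal code.

```r
# Hypothetical record of the features retained after each of 6 selection rounds
# (round 1 keeps m1..m6, round 2 keeps m1..m5, ..., round 6 keeps m1 only)
roundLs <- lapply(6:1, function(k) paste0("m", seq_len(k)))
allVc <- paste0("m", 1:8)  # all input features (m7 and m8 never retained)

assignTiers <- function(roundLs, allVc) {
  tierVc <- c("S", "A", "B", "C", "D", "E")
  # number of rounds each feature passed
  passedVi <- vapply(allVc, function(featC)
    sum(vapply(roundLs, function(keptVc) featC %in% keptVc, logical(1))),
    integer(1))
  # number of final rounds missed: 0 -> S, 1 -> A, ..., 5 or more -> E
  missedVi <- length(roundLs) - passedVi
  setNames(tierVc[pmin(missedVi, 5L) + 1L], allVc)
}

tierVc <- assignTiers(roundLs, allVc)
tierVc
```

In this toy example, m1 passes every round (tier S), m2 misses only the last one (tier A), and the features discarded earliest, or never retained, fall into tier E.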
As for a classical classification algorithm, the biosign method takes as input the x samples times features data frame (or matrix) of intensities, and the y factor (or character vector) of class labels (note that only binary classification is currently available). It returns the signatures (signatureLs: selected feature names) and the trained models (modelLs) for each of the selected classifiers. The plot method for biosign objects displays the individual boxplots of the selected features. Finally, the predict method applies the trained classifier(s) to new datasets.
The algorithm has been successfully applied to transcriptomics and metabolomics data (Rinaudo et al. 2016; see also the Hands-on section below).
We first load the biosigner package:
library(biosigner)
We then use the diaplasma metabolomics dataset (Rinaudo et al. 2016), which results from the analysis of plasma samples from 69 diabetic patients by reversed-phase liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS; Orbitrap Exactive) in the negative ionization mode. The raw data were pre-processed with XCMS and CAMERA (5,501 features), corrected for signal drift, log10 transformed, and annotated with an in-house spectral database. The patients’ age, body mass index, and diabetic type are recorded (Rinaudo et al. 2016).
We attach diaplasma to the search path and display a summary of the content of the dataMatrix, sampleMetadata and variableMetadata with the view function from the (imported) ropls package:
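data(diaplasma)          # load the dataset shipped with biosigner
attach(diaplasma)        # expose dataMatrix, sampleMetadata, variableMetadata
library(ropls)
ropls::view(dataMatrix)  # summary of the intensity matrix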
##         dim  class    mode typeof   size NAs min mean median max
##  69 x 5,501 matrix numeric double 3.3 Mb   0   0  4.2    4.4 8.2
##           m096.009t01.6    m096.922t00.8 ...    m995.603t10.2    m995.613t10.2
## DIA001 2.98126177377087 6.08172882312848 ... 3.93442594703862 3.96424920154706
## DIA002                0 6.13671997362279 ... 3.74201112636229 3.78128422428722
## ...                 ...              ... ...              ...              ...
## DIA077                0 6.12515971273103 ... 4.55458598372024 4.57310800324247
## DIA078 4.69123816772499   6.134420482337 ...  4.1816445335704 4.20696191303494
ropls::view(sampleMetadata, standardizeL = TRUE)
##    type     age     bmi
##  factor numeric numeric
##  nRow nCol size NAs
##    69    3 0 Mb   0
##        type age  bmi
## DIA001   T2  70 31.6
## DIA002   T2  67   28
## ...     ... ...  ...
## DIA077   T2  50   27
## DIA078   T2  65   29
## 1 data.frame 'factor' column(s) converted to 'numeric' for plotting.
## Standardization of the columns for plotting.
ropls::view(variableMetadata, standardizeL = TRUE)
##    mzmed   rtmed ... pcgroup     spiDb
##  numeric numeric ... numeric character
##   nRow nCol   size NAs
##  5,501    6 0.8 Mb   0
##                     mzmed       rtmed ... pcgroup
## m096.009t01.6 96.00899361 93.92633015 ...    1984
## m096.922t00.8 96.92192011 48.93274877 ...       4
## ...                   ...         ... ...     ...
## m995.603t10.2 995.6030195 613.4388762 ...    7160
## m995.613t10.2 995.6134422 613.4446705 ...    7161
##                                            spiDb
## m096.009t01.6 N-Acetyl-L-aspartic acid_HMDB00812
## m096.922t00.8
## ...                                          ...
## m995.603t10.2
## m995.613t10.2
## 3 data.frame 'character' column(s) converted to 'numeric' for plotting.
## Standardization of the columns for plotting.
We see that the diaplasma list contains three objects:
dataMatrix: 69 samples x 5,501 matrix of numeric type containing the intensity profiles (log10 transformed),
sampleMetadata: a 69 x 3 data frame, with the patients’
type: diabetic type, factor
age: numeric
bmi: body mass index, numeric
variableMetadata: a 5,501 x 8 data frame, with the median m/z (‘mzmed’, numeric) and the median retention time in seconds (‘rtmed’, numeric) from XCMS, the ‘isotopes’ (character), ‘adduct’ (character) and ‘pcgroups’ (numeric) annotations from CAMERA, the names of the compounds matching by m/z and RT in an in-house spectral database of commercial metabolites (‘name_hmdb’, character), and the p-values resulting from the non-parametric hypothesis testing of difference in medians between types (‘type_wilcox_fdr’, numeric), and of correlation with age (‘age_spearman_fdr’, numeric) and body mass index (‘bmi_spearman_fdr’, numeric), all corrected for multiple testing (False Discovery Rate).
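For readers who wish to compute comparable statistics on their own data, the sketch below shows one way to obtain per-feature Wilcoxon and Spearman p-values with a False Discovery Rate correction in base R. The data are simulated, the variable names are hypothetical, and the Benjamini-Hochberg method is an assumption (the exact procedure used to build the diaplasma columns is not specified here).

```r
set.seed(42)

# Simulated stand-ins (hypothetical) for the intensities and the covariates
n <- 40
typeFc <- factor(rep(c("T1", "T2"), each = n / 2))
ageVn <- round(runif(n, 20, 80))
xMN <- matrix(rnorm(n * 50), nrow = n,
              dimnames = list(NULL, paste0("v", 1:50)))
# Make the first 5 features differ between the two types
xMN[typeFc == "T2", 1:5] <- xMN[typeFc == "T2", 1:5] + 2

# Per-feature Wilcoxon rank-sum test between the two types
typePvalVn <- apply(xMN, 2, function(v)
  wilcox.test(v[typeFc == "T1"], v[typeFc == "T2"])$p.value)

# Per-feature Spearman correlation test with age
# (exact = FALSE because the rounded ages contain ties)
agePvalVn <- apply(xMN, 2, function(v)
  cor.test(v, ageVn, method = "spearman", exact = FALSE)$p.value)

# Multiple testing correction (Benjamini-Hochberg, an assumption)
typeFdrVn <- p.adjust(typePvalVn, method = "BH")
ageFdrVn <- p.adjust(agePvalVn, method = "BH")

# Features associated with type at a 5% FDR
names(which(typeFdrVn < 0.05))
```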
We can observe that the 3 clinical covariates (diabetic type, age, and bmi) are strongly associated:
with(sampleMetadata, plot(age, bmi, cex = 1.5, col = ifelse(type == "T1", "blue", "red"), pch = 16))
legend("topleft", cex = 1.5, legend = paste0("T", 1:2), text.col = c("blue", "red"))
Figure 1: age, body mass index (bmi), and diabetic type of the patients from the diaplasma cohort.
Let us look for signatures of type in the diaplasma dataset by using the
biosign method. To speed up computations in this demo vignette, we restrict
the number of features (from 5,501 to about 500) and the number of bootstraps (5
instead of 50 [default]); the selection on the whole dataset, 50 bootstraps, and
the 3 classifiers, takes around 10 min.
featureSelVl <- variableMetadata[, "mzmed"] >= 450 & variableMetadata[, "mzmed"] < 500
sum(featureSelVl)
## [1] 533
dataMatrix <- dataMatrix[, featureSelVl]
variableMetadata <- variableMetadata[featureSelVl, ]
diaSign <- biosigner::biosign(dataMatrix, sampleMetadata[, "type"], bootI = 5)
## Significant features from 'S' groups:
##               plsda randomforest svm
## m495.261t08.7 "C"   "A"          "S"
## m497.284t08.1 "S"   "S"          "E"
## m497.275t08.1 "A"   "S"          "E"
## m471.241t07.6 "B"   "S"          "E"
## Accuracy:
##      plsda randomforest   svm
## Full 0.797        0.835 0.824
## AS   0.823        0.845 0.708
## S    0.825        0.858 0.708
Figure 2: Relevant signatures for the PLS-DA, Random Forest, and SVM classifiers extracted from the diaplasma dataset. The S tier corresponds to the final metabolite signature, i.e., metabolites which passed through all the selection steps.
The arguments are:
x: the numerical matrix (or data frame) of intensities (samples as rows, variables as columns),
y: the factor (or character) specifying the sample labels from the 2 classes,
methodVc: the classifier(s) to be used; here, the default all value means that all classifiers available (plsda, randomforest, and svm) are selected,
bootI: the number of bootstraps is set to 5 to speed up computations when generating this vignette; we however recommend keeping the default value of 50 for your analyses (otherwise signatures may be less stable).
The set.seed argument ensures that the results from this vignette can be reproduced exactly; by choosing alternative seeds (and the default bootI = 50), similar signatures are obtained, showing the stability of the selection.
If some features from the x matrix/data frame contain missing values (NA), these features will be removed prior to modeling with Random Forest and SVM (in contrast, the NIPALS algorithm from PLS-DA can handle missing values).
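To check in advance which features would be affected by the missing-value rule, the columns containing at least one NA can be identified in base R. The matrix below is a toy example with hypothetical names; how biosigner performs the removal internally is not shown here.

```r
# Toy intensity matrix (3 samples x 4 features) with NAs in v2 and v4
xMN <- matrix(c(1, 2, 3,
                NA, 5, 6,
                7, 8, 9,
                10, NA, 12),
              nrow = 3,
              dimnames = list(paste0("s", 1:3), paste0("v", 1:4)))

# Features without any missing value
completeVl <- colSums(is.na(xMN)) == 0

# Restrict the matrix to the complete features (drop = FALSE keeps a matrix
# even if a single column remains)
xCompleteMN <- xMN[, completeVl, drop = FALSE]
colnames(xCompleteMN)
```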
The resulting signatures for the 3 selected classifiers are both printed and plotted as tiers from S, A, up to E by decreasing relevance. The (S) tier corresponds to the final signature, i.e. features which passed through all the backward selection steps. In contrast, features from the other tiers were discarded during the last (A) or previous (B to E) selection rounds.
Note that the tierMaxC = ‘A’ argument in the print and plot methods can be used to view the features from the larger S+A signatures (especially when no S features have been found, or when the performance of the S model is much lower than that of the S+A model).
The performances of the model built with the input dataset (balanced accuracy: mean of the sensitivity and specificity), and of the models restricted to the S or S+A signatures, are shown. We see that with signatures of 1 to 5 S features (i.e., less than 1% of the input), the 3 classifiers achieve good performances (for PLS-DA and Random Forest, even higher than the full models). Furthermore, reducing the number of features decreases the risk of building non-significant models (i.e., models which do not perform significantly better than those built after randomly permuting the labels). The signatures from the 3 classifiers have some distinct features, which highlights the interest of comparing various machine learning approaches.
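The balanced accuracy mentioned above can be written in a few lines of base R. The helper name `balancedAccuracy` and the toy labels are hypothetical; the example is only meant to show why this metric is preferable to plain accuracy when classes are imbalanced.

```r
# Balanced accuracy: mean of the sensitivity and the specificity
balancedAccuracy <- function(trueFc, predFc, positiveC = levels(trueFc)[2]) {
  sensitivityN <- mean(predFc[trueFc == positiveC] == positiveC)
  specificityN <- mean(predFc[trueFc != positiveC] != positiveC)
  (sensitivityN + specificityN) / 2
}

# Hypothetical imbalanced example: 10 T1 and 40 T2 samples,
# with a classifier that labels every sample as T2
trueFc <- factor(rep(c("T1", "T2"), times = c(10, 40)))
predFc <- factor(rep("T2", 50), levels = levels(trueFc))

plainN <- mean(predFc == trueFc)              # plain accuracy
balancedN <- balancedAccuracy(trueFc, predFc) # balanced accuracy
c(plain = plainN, balanced = balancedN)
```

Here the trivial classifier reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, since its sensitivity is 1 and its specificity is 0.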
The individual boxplots of the features from the complete signature can be visualized with:
biosigner::plot(diaSign, typeC = "boxplot")