biosigner 1.34.0
High-throughput, non-targeted technologies such as transcriptomics, proteomics and metabolomics are widely used to discover molecules that efficiently discriminate between biological or clinical conditions of interest (e.g., disease vs. control states). Powerful machine learning approaches such as Partial Least Squares Discriminant Analysis (PLS-DA), Random Forest (RF) and Support Vector Machines (SVM) have been shown to achieve high levels of prediction accuracy. Feature selection, i.e., the selection of the few features (the molecular signature) that are of highest discriminating value, is a critical step in building a robust and relevant classifier (Guyon and Elisseeff 2003): first, dimension reduction is useful to limit the risk of overfitting and to increase the prediction stability of the model; second, interpretation of the molecular signature is facilitated; third, when a diagnostic product is being developed, a restricted list is required for the subsequent validation steps (Rifai, Gillette, and Carr 2006).
Since the comprehensive analysis of all combinations of features is not computationally tractable, several selection techniques have been described, including filter (e.g., p-value thresholding), wrapper (e.g., recursive feature elimination), and embedded (e.g., sparse PLS) approaches (Saeys, Inza, and Larranaga 2007). The major challenge for such methods is to be fast while extracting restricted and stable molecular signatures that still provide high classifier performance (Gromski et al. 2014; Determan 2015).
The biosigner package implements a new wrapper feature selection algorithm:
1. the dataset is split into training and testing subsets (by bootstrapping, controlling class proportion),
2. a model is trained on the training set and balanced accuracy is evaluated on the test set,
3. the features are ranked according to their importance in the model,
4. the relevant feature subset at level f is found by a binary search: a feature subset is considered relevant if and only if, when randomly permuting the intensities of other features in the test subsets, the proportion of increased or equal prediction accuracies is lower than a defined threshold f,
5. the dataset is restricted to the selected features and steps 1 to 4 are repeated until the selected list of features is stable.
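The building blocks of one selection round can be illustrated with a small base R sketch. It is not the package's implementation: the simulated data, the nearest-centroid classifier (standing in for PLS-DA, RF or SVM), the centroid-difference importance score, and every function name are assumptions made for illustration. Only the permutation check is shown, not the binary search; the sketch computes the proportion of permuted test accuracies that are greater than or equal to the reference accuracy, which is the quantity compared with the threshold f.

```r
set.seed(123)

## Toy data (assumption): 60 samples x 8 features; only f1 and f2 carry class information
n <- 60
y <- factor(rep(c("control", "disease"), each = n / 2))
x <- matrix(rnorm(n * 8), nrow = n, dimnames = list(NULL, paste0("f", 1:8)))
x[y == "disease", c("f1", "f2")] <- x[y == "disease", c("f1", "f2")] + 3

## Bootstrap within each class (keeps class proportions);
## samples that are never drawn form the test subset
train_idx <- unlist(lapply(split(seq_len(n), y),
                           function(i) sample(i, length(i), replace = TRUE)))
test_idx <- setdiff(seq_len(n), train_idx)

## Stand-in classifier: one row of feature means per class
fit_centroids <- function(x, y) {
  do.call(rbind, lapply(split(as.data.frame(x), y), colMeans))
}

## Assign each sample to the class with the closest centroid
predict_centroids <- function(centroids, x) {
  d <- sapply(rownames(centroids), function(cl)
    rowSums(sweep(x, 2, centroids[cl, ])^2))
  factor(colnames(d)[max.col(-d, ties.method = "first")],
         levels = rownames(centroids))
}

## Balanced accuracy: mean of the per-class proportions of correct predictions
balanced_accuracy <- function(truth, pred) {
  mean(vapply(levels(truth),
              function(cl) mean(pred[truth == cl] == cl), numeric(1)))
}

## Train on the bootstrap sample, evaluate on the test subset
centroids <- fit_centroids(x[train_idx, ], y[train_idx])
ref_acc <- balanced_accuracy(y[test_idx],
                             predict_centroids(centroids, x[test_idx, ]))

## Rank the features by a simple importance score (absolute centroid difference)
importance <- abs(centroids["disease", ] - centroids["control", ])
ranking <- names(sort(importance, decreasing = TRUE))

## Proportion of permutations whose test accuracy is >= the reference accuracy
## when the columns in `cols` are shuffled across the test samples
prop_not_worse <- function(cols, n_perm = 200) {
  mean(replicate(n_perm, {
    x_perm <- x[test_idx, ]
    x_perm[, cols] <- apply(x_perm[, cols, drop = FALSE], 2, sample)
    balanced_accuracy(y[test_idx],
                      predict_centroids(centroids, x_perm)) >= ref_acc
  }))
}

f <- 0.05
p_top <- prop_not_worse(ranking[1:2])  # shuffle the two top-ranked features
p_low <- prop_not_worse(ranking[7:8])  # shuffle the two bottom-ranked features
c(top = p_top, low = p_low, threshold = f)
```

Shuffling features that carry class information should rarely leave the accuracy unchanged, so the corresponding proportion is expected to fall below f, whereas shuffling uninformative features should have little effect on the predictions.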
Three binary classifiers have been included in biosigner, namely
PLS-DA, RF and SVM, as the performance of each
machine learning approach may vary depending on the structure of the
dataset (Determan 2015). The algorithm returns the tier of each
feature for the selected classifier(s): tier S corresponds to the
final signature, i.e., features which have been found significant in
all the selection steps; features with tier A have been found
significant in all but the last selection, and so on for tiers B to
D. Tier E groups together the features from all earlier rounds of selection.
As for a classical classification algorithm, the biosign method takes
as input the x samples times features data frame (or matrix) of
intensities, and the y factor (or character vector) of class labels
(note that only binary classification is currently available). It
returns the signature (signatureLs: selected feature names) and the
trained model (modelLs) for each of the selected classifiers. The
plot method for biosign objects makes it possible to visualize the
individual boxplots of the selected features. Finally, the predict
method applies the trained classifier(s) to new datasets.
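A minimal sketch of this workflow on simulated data is given below. The simulated matrix, the held-out split, and the object names are assumptions made for illustration; the argument values (a single classifier, five bootstraps, graphics and console reports switched off) are chosen only to keep the run short and are not recommended settings. The selected features will depend on the random seed and on the installed package version, so no output is shown.

```r
library(biosigner)

set.seed(123)

## Simulated intensities (assumption, not package data): 60 samples x 20 features,
## with only the first three features differing between the two classes
n <- 60
x <- matrix(rnorm(n * 20, mean = 5), nrow = n,
            dimnames = list(paste0("s", seq_len(n)), paste0("feat", 1:20)))
y <- factor(rep(c("control", "disease"), each = n / 2))
x[y == "disease", 1:3] <- x[y == "disease", 1:3] + 2

## Hold out every third sample to play the role of a new dataset
new_idx <- seq(3, n, by = 3)

## Feature selection with a single classifier and few bootstraps
sig <- biosign(x[-new_idx, ], y[-new_idx],
               methodVc = "randomforest", bootI = 5,
               fig.pdfC = "none", info.txtC = "none")

## Selected feature names for each classifier
getSignatureLs(sig)

## Predictions of the trained classifier(s) for the held-out samples
pred <- predict(sig, newdata = x[new_idx, ])
```

Calling plot on the returned object would then display the selected features, provided that a non-empty signature was found.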
The algorithm has been successfully applied to transcriptomics and metabolomics data (Rinaudo et al. 2016; see also the Hands-on section below).
We first load the biosigner package:
library(biosigner)
We then use the diaplasma metabolomics dataset (Rinaudo et al. 2016),
which results from the analysis of plasma samples from 69 diabetic
patients by reversed-phase liquid chromatography coupled
to high-resolution mass spectrometry (LC-HRMS; Orbitrap Exactive) in
the negative ionization mode. The raw data were pre-processed with XCMS
and CAMERA (5,501 features), corrected for signal drift, log10
transformed, and annotated with an in-house spectral database. The
patients' age, body mass index, and diabetic type are
recorded (Rinaudo et al. 2016).
data(diaplasma)
We attach diaplasma to the search path and display a summary of the
content of the dataMatrix, sampleMetadata and variableMetadata
with the view function from the (imported) ropls package:
attach(diaplasma)
library(ropls)
ropls::view(dataMatrix)
## dim class mode typeof size NAs min mean median max
## 69 x 5,501 matrix numeric double 3.3 Mb 0 0 4.2 4.4 8.2
## m096.009t01.6 m096.922t00.8 ... m995.603t10.2 m995.613t10.2
## DIA001 2.98126177377087 6.08172882312848 ... 3.93442594703862 3.96424920154706
## DIA002 0 6.13671997362279 ... 3.74201112636229 3.78128422428722
## ... ... ... ... ... ...
## DIA077 0 6.12515971273103 ... 4.55458598372024 4.57310800324247
## DIA078 4.69123816772499 6.134420482337 ... 4.1816445335704 4.20696191303494
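The same three tables can also be reached without attach, by indexing the diaplasma list directly, which avoids masking objects that already exist in the workspace. The short sketch below relies only on the element names given above and on the dimensions reported in the summary; it is an alternative access pattern, not a step of the original analysis.

```r
library(biosigner)

data(diaplasma)

## Names of the tables stored in the list
names(diaplasma)

## Samples x features matrix of intensities
dim(diaplasma$dataMatrix)

## Structure of the sample and variable annotations
str(diaplasma$sampleMetadata)
str(diaplasma$variableMetadata, list.len = 5)
```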