1 Introduction

Among the main challenges in mass spectrometric metabolomic analysis is the high-throughput analysis of metabolic features, their fast detection and annotation. By contrast to the screening of known, previously characterized, metabolic features in these data, the putative annotation of unknown features is often cumbersome and requires a lot of manual work, hindering the biological information retrieval of these data. High-resolution mass spectrometric data is often very rich in information content and metabolic conversions, and reactions can be derived from structural properties of features (Breitling et al. 2006). In addition to that, statistical associations between features (based on their intensity values) can be a valuable resource to find co-synthesized or co-regulated metabolites, which are synthesized in the same biosynthetic pathways. Given that an analysis tool within the R framework is still lacking that is integrating the two features of mass spectrometric information commonly acquired with mass spectrometers (m/z and intensity values), I developed MetNet to close this gap. The MetNet package comprises functionalities to infer network topologies from high-resolution mass spectrometry data. MetNet combines information from both structural data (differences in m/z values of features) and statistical associations (intensity values of features per sample) to propose putative metabolic networks that can be used for further exploration.

The idea of using high-resolution mass spectrometry data for network construction was first proposed in Breitling et al. (2006) and followed soon afterwards by a Cytoscape plugin, MetaNetter (Jourdan et al. 2007), that is based on the inference of metabolic networks on molecular weight differences and correlation (Pearson correlation and partial correlation).

Inspired by the paper of Marbach et al. (2012) different algorithms for network were implemented in MetNet to account for biases that are inherent in these statistical methods, followed by the calculation of a consensus adjacency matrix using the differently computed individual adjacency matrices.

The two main functionalities of the package include the creation of adjacency matrices from structural properties, based on losses/addition of functional groups defined by the user, and statistical associations. Currently, the following statistical models are implemented to infer a statistical adjacency matrix: Least absolute shrinkage and selection operator (LASSO, L1-norm regression, (Tibshirani 1994)), Random Forest (Breiman 2001), Pearson and Spearman correlation (including partial and semipartial correlation, see Steuer (2006) for a discussion on correlation-based metabolic networks), correlation based on Gaussian Graphical Models (GGM, see Krumsiek et al. (2011);Benedetti et al. (2020) for the advantages of using GGM instead of Pearson and partial pearson correlation), context likelihood of relatedness (CLR, (Faith et al. 2007)), the algorithm for the reconstruction of accurate cellular networks (ARACNE, (Margolin et al. 2006)) and constraint-based structure learning (Bayes, (Scutari 2010)). Since all of these methods have advantages and disadvantages, the user has the possibility to select several of these methods, compute adjacency matrices from these models and create a consensus matrix from the different statistical frameworks.

After creating the statistical and structural adjacency matrices these two matrices can be combined to form a consensus matrix that has information from both structural and statistical properties of the data. This can be followed by network analyses (e.g. calculation of topological parameters), integration with other data sources (e.g. genomic information or transcriptomic data) and/or visualization.

Central to MetNet is the AdjacencyMatrix class, derived from the SummarizedExperiment S4 class. The AdjacencyMatrix host the adjacency matrices creates during the different steps within the assays slot. They will furthermore store information on the type of the AdjacencyMatrix, i.e. if it was derived from structural or statistical properties or if it used the combined information from these layers (combine). It also stores information if the information was thresholded, e.g. by applying the rtCorrection or threshold function. Furthermore, the AdjacencyMatrix object stores information on if the graphs are directed or undirected (within the directed slot).

Questions and bugs

MetNet is currently under active development. If you discover any bugs, typos or develop ideas of improving MetNet feel free to raise an issue via Github or send a mail to the developer.

2 Prepare the environment and load the data

To install MetNet enter the following to the R console

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MetNet")

Before starting with the analysis, load the MetNet package. This will also load the required packages glmnet, stabs, GENIE3, mpmi, parmigene, Hmisc, ppcor and bnlearn that are needed for functions in the statistical adjacency matrix inference.

library(MetNet)

The data format that is compatible with the MetNet framework is a xcms/CAMERA output-like \(m~\times~n\) matrix, where columns denote the different samples \(n\) and where \(m\) features are present. In such a matrix, information about the masses of the features and quantitative information of the features (intensity or concentration values) are needed. The information about the m/z values has to be stored in a vector of length \(\vert m \vert\) in the column "mz".

MetNet does not impose any requirements for data normalization, filtering, etc. However, the user has to make sure that the data is properly preprocessed. These include division by internal standard, log2 or vsn transformation, noise filtering, removal of features that do not represent mass features/metabolites, removal of isotopes, etc.

We will load here the object x_test that contains m/z values (in the column "mz"), together with the corresponding retention time (in the column "rt") and intensity values. We will use here the object x_test for guidance through the workflow of MetNet.

data("x_test", package = "MetNet")
x_test <- as.matrix(x_test)

3 Creating the structural adjacency

The function structural will create an AdjacencyMatrix object of type structural containing the adjacency matrices based on structural properties (m/z values) of the features.

The function expects a matrix with a column "mz" that contains the mass information of a feature (typically the m/z value). Furthermore, structural takes a data.frame object as argument transformation with the colnames "mass" and additional columns (e.g. "group", "formula" or "rt"). structural looks for transformations (in the sense of additions/losses of functional groups mediated by biochemical, enzymatic reactions) in the data using the mass information.

Following the work of Breitling et al. (2006) and Jourdan et al. (2007), molecular weight difference wX is defined by \(w_X = \vert w_A - w_B \vert\)

where wA is the molecular weight of substrate A, and wB is the molecular weight of product B (typically, m/z values will be used as a proxy for the molecular weight since the molecular weight is not directly derivable from mass spectrometric data). As exemplified in Jourdan et al. (2007), specific enzymatic reactions refer to specific changes in the molecular weight, e.g. carboxylation reactions will result in a mass difference of 43.98983 (molecular weight of CO2) between metabolic features.

The search space for these transformation is adjustable by the transformation argument in structural allowing to look for specific enzymatic transformations. Hereby, structural will take into account the ppm value, to adjust for inaccuracies in m/z values due to technical reasons according to the formula

\[ppm = \frac{m_{exp} - m_{calc}}{m_{exp}} \cdot 10^{-6}\]

with mexp the experimentally determined m/z value and mcalc the calculated accurate mass of a molecule. Within the function, a lower and upper range is calculated depending on the supplied ppm value, differences between the m/z feature values are calculated and matched against the "mass"es of the transformation argument. If any of the additions/losses defined in transformation is found in the data, it will be reported as an (unweighted) connection in the assay "binary" of the returned AdjacencyMatrix object.

Together with this assay, additional character adjacency matrices can be written to the assay slot of the AdjacencyMatri object. E.g. we can write the type of connection/transformation (derived e.g. from the column "group" in the transformation object) as a character matrix to the assay "group" by setting var = "group".

Before creating the structural AdjacencyMatrix, one must define the search space, i.e. the transformation that will be looked for in the mass spectrometric data, by creating here the transformations object.

## define the search space for biochemical transformation 
transformations <- rbind(
    c("Hydroxylation (-H)", "O", 15.9949146221, "-"),
    c("Malonyl group (-H2O)", "C3H2O3", 86.0003939305, "+"),
    c("D-ribose (-H2O) (ribosylation)", "C5H8O4", 132.0422587452, "-"),
    c("C6H10O6", "C6H10O6", 178.0477380536, "-"),
    c("Rhamnose (-H20)", "C6H10O4", 146.057910, "-"),
    c("Monosaccharide (-H2O)", "C6H10O5", 162.0528234315, "-"),
    c("Disaccharide (-H2O) #1", "C12H20O10", 324.105649, "-"),
    c("Disaccharide (-H2O) #2", "C12H20O11", 340.1005614851, "-"),
    c("Trisaccharide (-H2O)", "C18H30O15", 486.1584702945, "-"),
    c("Glucuronic acid (-H2O)", "C6H8O6", 176.0320879894, "?"),
    c("coumaroyl (-H2O)", "C9H6O2", 146.0367794368, "?"),
    c("feruloyl (-H2O)", "C9H6O2OCH2", 176.0473441231, "?"),
    c("sinapoyl (-H2O)", "C9H6O2OCH2OCH2", 206.0579088094, "?"),
    c("putrescine to spermidine (+C3H7N)", "C3H7N", 57.0578492299, "?"))

## convert to data frame
transformations <- data.frame(
    group = transformations[, 1],
    formula = transformations[, 2],
    mass = as.numeric(transformations[, 3]),
    rt = transformations[, 4])

The function structural will then check for those m/z differences that are stored in the column "mass" in the object transformations. To create the AdjacencyMatrix object derived from these structural information we enter

struct_adj <- structural(x = x_test, transformation = transformations, 
    var = c("group", "formula", "mass"), ppm = 10)

in the R console.

As we set var = c("group", "formula", "mass"), the AdjacencyMatrix object will contain the assays "group", "formula", and "mass" that store the character adjacency matrices with the information defined in the columns of transformations.

3.1 Advanced topic: Creating a directed structural graph

By default, the structural AdjacencyMatrix object and the contained adjacency matrices are undirected (the argument in structural is set to directed = FALSE by default; i.e. the matrices are symmetric). MetNet, however, also allows to include the information on the directionality of the transformation (e.g. to distinguish between additions and losses). This behaviour can be specified by setting directed = TRUE:

struct_adj_dir <- structural(x = x_test, transformation = transformations, 
    var = c("group", "formula", "mass"), ppm = 10, directed = TRUE)

In the following we will visualize the results from the undirected and directed structural network.

We will set the mode of the igraph object to "directed" in both cases to make the distinction between the returned outputs of structural for setting directed = FALSE and directed = TRUE. Alternatively, we could also set the mode for the first igraph object (using the undirected output of structural) to "undirected" which results in an igraph object where the directionality of the edges is not retained.

g_undirected <- igraph::graph_from_adjacency_matrix(
    assay(struct_adj, "binary"), mode = "directed", weighted = NULL)
plot(g_undirected, edge.width = 1, edge.arrow.size = 0.5,
     vertex.label.cex = 0.5, edge.color = "grey")