Discovery, Identification, and Screening of Lipids and Oxylipins in HPLC-MS Datasets Using LOBSTAHS

James Collins

2018-04-30

Introduction

This document describes the purpose and use of the R-package LOBSTAHS: Lipid and Oxylipin Biomarker Screening through Adduct Hierarchy Sequences. The package is described in Collins et al. 2016.1 In the sections below, LOBSTAHS package functions are applied to a model dataset using example code. LOBSTAHS requires the additional packages xcms2 and CAMERA3; the model dataset, consisting of lipid data from cultures of the marine diatom Phaeodactylum tricornutum, is contained in the R data package PtH2O2lipids.

Purpose

LOBSTAHS contains several functions to help scientists discover and identify lipid and oxidized lipid biomarkers in HPLC-MS data that have been pre-processed with the popular R packages xcms and CAMERA. First, LOBSTAHS uses exact mass to make initial compound assignments from a set of customizable onboard databases. LOBSTAHS allows users to easily generate custom databases containing entries for a wide range of lipids, oxidized lipids, and oxylipins; these are created using an automated, iterative computational approach based on structural criteria specified by the user in simple text tables/spreadsheets. After database identification, a series of orthogonal screening criteria are applied to refine and winnow the list of assignments. A basic workflow based on xcms, CAMERA, and LOBSTAHS is illustrated in the schematic. Each step in the figure is described in detail in the following paragraphs.

Installing the LOBSTAHS package

Install current production version

Users can the current production version of LOBSTAHS from Bioconductor by following the directions here (under “Installation”). The Bioconductor installation function will prompt you to install the latest versions of some other packages on which LOBSTAHS depends.

Note: At this point, you should verify that you have, in fact, installed the most current version of Bioconductor. You may be required to install the latest version of R itself before you can obtain the latest version of Bioc. If you fail to ensure you are running the latest version of Bioconductor at this point, you won’t necessarily be able to install the most up-to-date version of LOBSTAHS since the Bioc installer will only provide you will the most recent versions of packages attached to that version of Bioconductor. (If you are confused by any of this, check out this help section. Or, just install the current development version directly from GitHub using the directions, below.) Once you are satisfied that you’re running the latest version of Bioconductor, you can then install LOBSTAHS:

Users are also encouraged to download the PtH2O2lipids data package, which can be used for familiarization with the software.

Install “no warranties” development version with latest features

Following these directions, you will install the latest version of the software from the files present in this GitHub repository. Some features may be unstable.

Install RTools:

For windows: Download and install RTools from http://cran.r-project.org/bin/windows/Rtools/

For Unix: Install the R-development-packages (r-devel or r-base-dev)

Install packages needed for installation from Github:
Install LOBSTAHS:

The build_vignettes = TRUE argument is required if rendering of the full vignette is desired (recommended). install_github() does not render vignettes by default.

Install ‘PtH2O2lipids,’ containing example data & precursor xsAnnotate object:

“Operating instructions,” part 1: Pre-processing

Acquisition of HPLC-MS data suitable for LOBSTAHS

Data for LOBSTAHS should be acquired using a mass spectrometer with sufficiently high mass accuracy and resolution. Suitable MS acquisition platforms with which LOBSTAHS has been tested include Fourier transform ion cyclotron resonance (FT-ICR) and Orbitrap instruments. A quadrupole time-of-flight (Q-TOF) instrument could also be used if it possessed sufficient mass accuracy (i.e., < 5-7 ppm). While the software will accept any HPLC-MS data as input, insufficient mass accuracy and resolution will produce results with large numbers of database matches for each feature that cannot be distinguished from each other.

File conversion

After acquisition, each data file should be converted from manufacturer to open source (.mzXML) format in centroid (not profile) mode. If data were acquired on the instrument using ion mode switching, the data from each ionization mode (i.e., polarity) must also be extracted into a separate file. The msConvert tool (part of the ProteoWizard toolbox) can be used to accomplish these tasks. msConvert commands can be executed with either the provided GUI or at the command line; however, conversion of manufacuter file formats can only be accomplished using the Windows installation. An R script (Exactive_full_scan_process_ms1+.r) for batch conversion and extraction of data files acquired on a Thermo Exactive Orbitrap instrument is provided as part of the Van Mooy Lab Lipidomics Toolbox. This vignette is not intended as a comprehensive guide to mass spectrometer file conversion; users are encouraged to digest the msConvert documentation. However, assuming a hypothetical data file “Exactive_data.raw” was acquired on an Thermo Orbitrap instrument with ion mode switching, the following code could be within R used to convert and then extract positive and negative mode scans to separate files. (Code presumes you’ve already installed msConvert.)

Initial file conversion; saves converted file to a directory “mzXML_ms1_two_mode”:

Pre-processing with xcms

After all data files in a particular dataset have been converted and extracted into files of like polarity, the R-package xcms can then be used to perform feature detection, retention time correction, and peak grouping. While the paragraphs below contain basic instructions for preparation of data, this vignette is not intended as ae manual or guide to the complex world of mass spectrometer data processing in xcms and CAMERA; users should acquaint themselves with the manuals (here and here) and very helpful vignettes (here and here) for the two packages.

First, data files should be given intuitive file names (containing, e.g., a sample ID along with information on experimental timepoint and treatment) and placed into a directory structure according to an index variable; the xcms vignette LC/MS Preprocessing and Analysis with xcms contains a detailed explanation. Feature detection, retention time correction, and peak grouping should then be performed. Values of parameters for xcms functions can be obtained from several sources:

The R script prepOrbidata.R (part of the Van Mooy Lab Lipidomics Toolbox) contains code for complete preparation in xcms of the PtH2O2lipids (or similar) dataset. The script allows the user to apply parameter values assembled from the literature or obtain them from IPO optimization of a subset of samples. The final parameter values used in Collins et al. 20161 for analysis of the PtH2O2lipids dataset – obtained using HPLC-ESI-MS on an Exactive Orbitrap instrument – are given in Table S5 of the electronic supplement. The settings and values listed in the table could be used as a starting point for analysis of similar data.

The xcmsSet produced using these settings from the P. tricornutum data is stored in the PtH2O2lipids package.

Whatever settings are used in xcms, the processed data should contain high-quality features which have been aligned across samples; the quality of the processed data should be verified by individual inspection of a subset of features. At the conclusion of processing with xcms, the user should have a single xcmsSet object containing the dataset.

Final pre-processing with CAMERA

In the final pre-processing step, the R-package CAMERA should be used to (1) aggregate peak groups in the xcmsSet object into pseudospectra and (2) identify features in the data representing likely isotope peaks. Use of CAMERA to aggregate the peak groups into pseudospectra is critical because LOBSTAHS applies its various orthogonal screening criteria to only those peak groups within each pseudospectrum. CAMERA should not be used to eliminate adduct ions since the adduct ion hierarchy screening function in LOBSTAHS presumes that all adduct ions for a given analyte will be present in the dataset. We can create an xsAnnotate object from the P. tricornutum data by applying the wrapper function annotate() to the xcmsSet we created in the previous step:

library(xcms)
library(CAMERA)
library(LOBSTAHS)

# first, a necessary workaround to avoid a import error; see
# https://support.bioconductor.org/p/69414/

imports = parent.env(getNamespace("CAMERA"))
unlockBinding("groups", imports)
imports[["groups"]] = xcms::groups
lockBinding("groups", imports)

# create annotated xset using wrapper annotate(), allowing us to perform all
# CAMERA tasks at once

xsA = annotate(ptH2O2lipids$xsAnnotate@xcmsSet,
                   
               quick=FALSE,
               sample=NA, # use all samples
               nSlaves=1, # set to number of available cores or processors if 
                          # > 1
                   
               # group FWHM settings (defaults)
                   
               sigma=6,
               perfwhm=0.6,
                   
               # groupCorr settings (defaults)
                   
               cor_eic_th=0.75,
               graphMethod="hcs",
               pval=0.05,
               calcCiS=TRUE,
               calcIso=TRUE,
               calcCaS=FALSE,
                   
               # findIsotopes settings
                   
               maxcharge=4,
               maxiso=4,
               minfrac=0.5,
                   
               # adduct annotation settings
                   
               psg_list=NULL,
               rules=NULL,
               polarity="positive", # the PtH2O2lipids xcmsSet contains
                                    # positive-mode data
               multiplier=3,
               max_peaks=100,
                   
               # common to multiple tasks
                   
               intval="into",
               ppm=2.5,
               mzabs=0.0015
                   
               )
#> Start grouping after retention time.
#> Created 113 pseudospectra.
#> Generating peak matrix!
#> Run isotope peak annotation
#>  % finished: 10  20  30  40  50  60  70  80  90  100  
#> Found isotopes: 5692 
#> Start grouping after correlation.
#> Generating EIC's .. 
#> 
#> Calculating peak correlations in 113 Groups... 
#>  % finished: 10  20  30  40  50  60  70  80  90  100  
#> 
#> Calculating isotope assignments in 113 Groups... 
#>  % finished: 10  20  30  40  50  60  70  80  90  100  
#> Calculating graph cross linking in 113 Groups... 
#>  % finished: 10  20  30  40  50  60  70  80  90  100  
#> New number of ps-groups:  5080 
#> xsAnnotate has now 5080 groups, instead of 113 
#> Generating peak matrix for peak annotation!
#> 
#> Calculating possible adducts in 5080 Groups... 
#>  % finished: 10  20  30  40  50  60  70  80  90  100 

We now have an xsAnnotate object “xsA” to which we will next apply the screening and identification functions of LOBSTAHS. In the annotate() call, we set quick = FALSE because we want to run groupCorr(). This will also cause CAMERA to perform internal adduct annotation. While we will perform our own adduct annotation later with LOBSTAHS, allowing CAMERA to identify its own adducts doesn’t hurt, particularly if it leads to the creation of better pseudospectra.

“Operating instructions,” part 2: The LOBSTAHS databases

Before screening a dataset with LOBSTAHS, we must first decide whether to use one of two default databases, or to generate our own from the templates provided (instructions below). LOBSTAHS databases contain a mixture of in silico and empirical data for the different adduct ions of a wide range of intact polar diacylglycerols (IP-DAG), triacylglycerols (TAG), polyunsaturated aldehydes (PUAs), free fatty acids (FFA), and common photosynthetic pigments. In addition, the latest LOBSTAHS release includes support for lyso lipids under an “IP_MAG” species class. Functionality for other lipid classes is added regularly. The default databases (as of November 8, 2017) include 25,741 and 21,063 unique compounds that can be identifed in positive and negative ionization mode data, respectively. The databases can be easily customized (see below) if we wish to identify additional lipids in new lipid classes. LOBSTAHS databases are contained in LOBdbase objects that are generated or accessed by various package functions. Each LOBdbase is specific to a particular polarity (i.e., ion mode); when evaluating positive-mode data, we use a positive mode database (and vice versa).

A note about molecular diversity in the current databases

By virtue of the origin of LOBSTAHS — the study of lipids in marine algae and bacteria — most of the lipids currently in the database are of marine microbial origin. While algae and marine bacteria have many lipids in common with humans and other mammals, there are many classes which are not currently represented in the databases. If you find yourself adding new lipids to the default databases because the existing coverage is insufficient in your research area, consider sharing your additions with others via a pull request or by opening an issue on the package GitHub page. Some new lipid classes (e.g., sphingolipids and most non-glycerolipids) may not be amenable to in silico simulation in the same easy fashion as the existing lipid classes for which functionality already exists in the package code. In these cases, users should open a new issue request for new functionality or — in the spirit of open-source collaboration — modify the code in your own Git fork and then request incorporation of your new feature so others may also discover new lipids!

Accessing the default databases

We can access the default databases to examine their scope:

## 
## A positive polarity "LOBdbase" object.
## 
## Contains entries for 133273 possible adduct ions of 25741 unique parent compounds.
## 
## Parent lipid types ( 22 ): astaxanthin, bile_salt, DNPPE, fungalGSL, hapCER, hapGSL, hGSL, IP_DAG, IP_MAG, PDMS, pigment, plastoquinone_9, plastoquinone_9OH, plastoquinone_9OH2, PUA, reduced_scytonemin, scytonemin, sGSL, sterol, TAG, ubiquinone, vGSL 
## IP-DAG classes ( 14 ): WaxEster, DGCC, DGDG, DGTS_DGTA, DAG, PA, PE, PG, PC, SQDG, MGDG, BLL, PDPT, S_DGCC 
## IP-MAG classes ( 12 ): CoprostanolEsters, CholesterolEsters, LDGCC, LDGDG, LDGTS_DGTA, MAG, LPA, LPE, LPG, LPC, LSQDG, LMGDG 
## Photosynthetic pigments ( 22 ): Chl_a, 19prime_but_fuco, 19prime_hex_fuco, Allox, Alpha_carotene, Beta_carotenes, Chl_b, Chl_c2, Chl_c3, Chlide_a, Croco, Dd_Ddc, Dt, Echin, Fuco, Lut, Neox_Nos, Peri, Pheophytin_a, Pras, Viol, Zeax 
## Adducts ( 9 ): [M+2Na-H]+, [M+2Na+Cl]+, [M+C2H3Na2O2]+, [M+C4H10N3]+, [M+H]+, [M+K]+, [M+Na]+, [M+NH4]+, [M+NH4+ACN]+ 
## 
## m/z range: 97.0647914-2016.54119204 
## 
## Ranges of chemical parameters represented in molecules with acyl moieties:
## 
## Total number of acyl carbon atoms: 6-78 
## Total number of acyl carbon-carbon double bonds: 0-18 
## Number of additional oxygen atoms: 0-4 
## 
## Memory usage: 11.4 MB
## 
## A negative polarity "LOBdbase" object.
## 
## Contains entries for 97007 possible adduct ions of 21063 unique parent compounds.
## 
## Parent lipid types ( 16 ): bile_salt, DNPPE, FFA, fungalGSL, hapCER, hapGSL, hGSL, IP_DAG, IP_MAG, pigment, plastoquinone_9, PUA, scytonemin, sGSL, ubiquinone, vGSL 
## IP-DAG classes ( 13 ): DGCC, DGDG, DGTS_DGTA, DAG, PA, PE, PG, PC, SQDG, MGDG, BLL, PDPT, S_DGCC 
## IP-MAG classes ( 10 ): LDGCC, LDGDG, LDGTS_DGTA, MAG, LPA, LPE, LPG, LPC, LSQDG, LMGDG 
## Photosynthetic pigments ( 22 ): Chl_a, 19prime_but_fuco, 19prime_hex_fuco, Allox, Alpha_carotene, Beta_carotenes, Chl_b, Chl_c2, Chl_c3, Chlide_a, Croco, Dd_Ddc, Dt, Echin, Fuco, Lut, Neox_Nos, Peri, Pheophytin_a, Pras, Viol, Zeax 
## Adducts ( 11 ): [M-H]-, [M+2NaAc+Cl]-, [M+3Ac+2Na]-, [M+Cl]-, [M+HAc-H]-, [M+K-2H]-, [M+Na-2H]-, [M+Na+Cl-H]-, [M+NaAc-H]-, [M+NaAc+Cl]-, [M+NaAc+HAc-H]- 
## 
## m/z range: 95.05023848-1459.92498456 
## 
## Ranges of chemical parameters represented in molecules with acyl moieties:
## 
## Total number of acyl carbon atoms: 6-52 
## Total number of acyl carbon-carbon double bonds: 0-12 
## Number of additional oxygen atoms: 0-4 
## 
## Memory usage: 8.55 MB

Concept of operation: Using input tables

Though it might not be evident from examination of these defaults, LOBSTAHS generates databases based on values of parameters defined in simple tables. The values in these tables define the molecular properties of each individual compound or lipid class for which database entries will be created. Database generation is accomplished with the function generateLOBdbase(). Default values for the various inputs can be viewed by loading the default tables into the R workspace (directions below); the default input data are also given in Tables 1 and 2 of Collins et al. 2016.1 Package versions ≥ 1.1.2 contain adduct hierarchy and retention time window data for lipid classes BLL, PDPT, vGSL, sGSL, hGSL, hapGSL, and hapCER, which are described in Fulton et al. 2014 and Hunter et al. 2015.6. By modifying the default values in each table using provided templates, users can generate custom databases that include additional lipid classes. Customization directions are provided in the following section.

Users can load the default versions of the four input tables into the R workspace (and subsequently view them) at any time:

Introduction to the four input tables

Custom databases can be created in LOBSTAHS by modifying or adding to the default values in the four tables. We can generate custom databases in three steps:

Each of the four tables is introduced below, followed by instructions for adding new lipids and lipid classes to the LOBSTAHS databases.

We’ve also created a flow chart to help you navigate the database customization process; the flow chart is designed to complement — not replace — the detailed instructions below. Download a copy of the flow chart: as .pdf | as .svg

The componentCompTable

The first, most important table is the componentCompTable, which defines the base elemental compositions of each molecule or lipid class to be considered when databases are created. If we wish to add additional molecules or classes of molecules to the default set of compounds in the database, we’ll first need to add an additional entry or entries to this table.

For each lipid class or compound specified in the componentCompTable, the field DB_gen_compound_type contains one of five values:

  • “DB_acyl_iteration”
  • “DB_unique_species”
  • “basic_component”
  • “adduct_pos”
  • “adduct_neg”

DB_gen_compound_type is the field which determines how a particular compound, lipid class, or molecular fragment will be used and evaluated by the software. The last three compound types (“basic_component,” “adduct_pos,” and “adduct_neg”) are reserved for definition of basic components such as acteonitrile or acetate and for definition of adduct ion types; new entries of these types should only be created in the componentCompTable when a new adduct or basic component is to be specified.

The other two compound types in the componentCompTable (“DB_acyl_iteration” and “DB_unique_species”) are used to define the way generateLOBdbase creates database entries for different lipids and lipid classes. generateLOBdbase can create two kinds of database entries:

  1. When DB_gen_compound_type is set to “DB_unique_species”, generateLOBdbase() will create database entries only for adduct ions of the individual compound specified. (This is the simpler of the two cases.) We use this type for pigments and other lipids that do not have acyl groups, or in cases when we do not wish to consider any possible variation in acyl properties. In the case where DB_gen_compound_type = “DB_unique_species”, the exact mass of the complete (neutral) molecule is specified in the component definitions table. The photosynthetic pigments currently in the database provide an excellent template for creation of new compounds of the “DB_unique_species” type.

  2. When DB_gen_compound_type is set to “DB_acyl_iteration”, generateLOBdbase will automatically create multiple database entries for adduct ions of multiple molecular species within a lipid class based on the ranges of acyl properties and oxidation states specified in two other user-editable tables, acylRanges and oxyRanges. In the case of lipid classes for which DB_gen_compound_type = “DB_acyl_iteration,” the compound table is used to define the exact mass of a “base fragment” for the lipid class. Proper specification of the base fragment is explained below. Using the fragment as a starting point, generateLOBdbase creates multiple entries for molecules in the lipid class by iterative addition of various combinations of fatty acids.

The acylRanges and oxyRanges tables

acylRanges specifies the ranges of acyl properties to be considered for each lipid class when DB_gen_compound_type is set to “DB_acyl_iteration”. These properties include total acyl carbon chain length (i.e., the total number of carbon atoms in the fatty acids that make up TAG, PUA, IP-DAG, and IP-MAG) and degree of acyl carbon chain unsaturation (i.e., the number of possible carbon-carbon fatty acid double bonds) considered for each lipid class.

oxyRanges specifies the range of possible oxidation states (i.e., the number of additional oxygen atoms to be considered on each compound) we wish to consider for each of these compounds

The adductHierarchies table

A fourth table, adductHierarchies, contains empirical data on the relative abundances of various adduct ions formed by the compounds that belong to each lipid class. This table is also user-editable, but any additions or changes should first be confirmed by empirical analysis. Note that data should be provided in adductHierarchies for all lipid classes or compounds, regardless of the DB_gen_compound_type, above. In addition, an adduct hierarchy must be specified in the adductHierarchies table for each compound or compound class specified in the Adduct_hierarchy_lookup_class field of the componentCompTable.

Modification of table data to create new database entries

What modifications we make to the table(s) will depend on our goal. For example, if we wish only to expand (or constrain) the range of molecular properties for which database entries are to be created within an existing lipid class (i.e., of DB_gen_compound_type = “DB_acyl_iteration”), we’ll only have to modify existing values in the acylRanges and/or oxyRanges tables. However, if we wish to add new molecule(s) or lipid class(es) to the database(s), we’ll first need to define the new molecule/class in both the componentCompTable ( LOBSTAHS_componentCompTable.xlsx) and adductHierarchies table ( LOBSTAHS_adductHierarchies.xlsx). This is relatively trivial; don’t be discouraged! We will consider two cases:

First scenario: Defining single compounds

If the new entry is a single molecule (e.g., a new pigment), then new entries in the componentCompTable and adductHierarchies table (and, if desired, in the rt.windows table; see below) will be sufficient.

  1. Specify the DB_gen_compound_type

    In this instance, we should set “DB_unique_species” = DB_gen_compound_type when defining the species in the componentCompTable.

  2. Define neutal mass of molecule

    The entire exact mass of the neutral compound should be defined in the componentCompTable; we do not define the mass of a charged adduct ion, since adduct ion masses will be calculated automatically by the software.

  3. Define adduct hierarchies

    We’ll then need to define appropriate adduct hierarchies for the compound in the adductHierarchies table, with compound or class names that match the Adduct_hierarchy_lookup_class field in the componentCompTable. The same adduct hierarchy can be used for multiple compounds independently defined using the “DB_unique_species” type so long as the same adduct class name is used.

  4. Specify retention time data, if desired

    Retention time window data for the compound can also be added (if desired) to the rt.windows table (see below).

Second scenario: Defining entire lipid classes for which an iterative simulation is to be performed

Alternatively, if we wish to add an entire class of lipid for which multiple database entries will be created based on ranges of acyl properties and oxidation states specified in acylRanges ( LOBSTAHS_acylRanges.xlsx) and oxyRanges ( LOBSTAHS_oxyRanges.xlsx), we have to take just a bit more time to define the class and its acyl properties.

  1. Specify the DB_gen_compound_type

    First, in the componentCompTable, we should set DB_gen_compound_type = “DB_acyl_iteration”.

  2. Determine and define the base fragment

    We then have to define the mass of the “base fragment” upon which the generateLOBdbase() function will perform iteration. The base fragment should include the glycerol backbone, any carboxylic oxygen atoms in the fatty acid(s), and (if applicable), the entire lipid headgroup. In the case of IP-DAG and IP-MAG, the base fragment will include the entire polar headgroup, the glycerol backbone, and both carboxylic oxygen atoms in the fatty acid(s). In the case of TAG, the base fragment is defined as the glycerol backbone plus the carboxylic oxygen atoms on each of the three fatty acids. The base fragments for any new lipid classes for which the user desires evaluation of a range of acyl properties should be similarly defined.

    An example is given for phosphatidylcholine (PC), one of the eight basic IP-DAG classes that has been included in LOBSTAHS since its first release. We begin with an intact molecule, in this case PC 18:1, 18:1 (which would appear in a LOBSTAHS database as PC 36:2). Any intact PC species could be used as a starting point since our concern here is with the polar (headgroup) end of the molecule, which does not change under the simulation performed by generateLOBdbase():

    This intact neutral molecule has the chemical formula C44H84NO8P and an exact mass of 785.59346 amu. The C-1 carbon atom in both fatty acids is shown explicitly for ease of illustration. We define the base fragment for PC (and all other IP-DAG) as the polar headgroup, glycerol backbone, and both the carboxylic oxygen atoms (in red):

    In this case, the base fragment (shown in red) has the chemical formula C8H18NO8P and a mass of 287.07700 amu. One “easy” way to determine the formula (and therefore, the exact mass) of the base fragment is to use the fragmentation feature in an application such as ChemDraw. First, we homolytically cleave the bond(s) between the C-1 and C-2 carbon atom(s) in each acyl chain on which iteration is to be performed, giving one electron to each atom: