This workflow requires R version >= 3.4 and the following packages:
xcms
: for pre-processing of metabolomics data. A package version >= 2.99.0 is required which, if not yet in Bioconductor, can be installed from github using devtools::install_github("sneumann/xcms", ref = "xcms3")
. The package depends on mzR
for data import and uses classes from the MSnbase
package.faahKO
: provides the data set processed here.RColorBrewer
: for color definitions.doParallel
: to set-up parallel processing. xcms
uses parallel processing with the BiocParallel
package in most algorithms. While BiocParallel
’s default parallel processing setup (i.e. using multicore
on Unix systems and snow
for Windows) works on most systems this seems to be problematic/erratic on recent macOS versions. Pre-registering and setting up the parallel processing first using the doParallel
package and using this in BiocParallel
seems to avoid such problems.## Style
library(BiocStyle)
## Data processing
library(xcms)
## Toy data set for data processing
library(faahKO)
## For color definitions
library(RColorBrewer)
## For a working parallel processing setup on Mac systems
library(doParallel)
Parallel processing setup. As detailed above we’re using the doParallel
setup with a pre-defined set of cluster nodes.
## Set up the parallel processing using doParallel.
registerDoParallel(4)
## Register this setup for BiocParallel and make it default.
register(DoparParam(), default = TRUE)
This workflow illustrates the use of Bioconductor packages for the pre-processing of LCMS-based metabolomics data sets. This pre-processing aims to identify the features possibly corresponding to metabolites and to quantify them. While other package exist too (such as yamss
[1]), this workshop focuses on xcms
[2] and uses the functionality of its new user interface.
The topics covered in this document are:
The result from this metabolomics pre-processing is a table of intensities with rows corresponding to features (ions with a unique mass-to-charge ratio and retention time) and columns to samples.
The workflow does not cover (yet) data normalization strategies or identification of metabolites (i.e. mapping of detected features to corresponding metabolites). Also, the workflow focuses on gas/liquid chromatography (GC/LC) mass spectrometry (MS) which combines the physical separation capabilities of the chromatography with the mass analysis capabilities of the MS. Chromatography is usually preferred in mass spectrometry experiments with complex samples because the added dimension (the retention time rt) helps to discriminate between analytes with identical mass-to-charge ratios (mz).
For more details and description of the general metabolomics data workflow see [3][4].
In metabolomics data analysis, peaks are identified along the retention time axis. These peaks are referred to as chromatographic peaks in this document to distinguish them from mass peaks (i.e. peaks within a spectrum along the mz dimension). The term feature refers to individual ions with a unique mass-to-charge ratio (m/z, or mz) and a unique retention time (rt). Such features are defined during the correspondence step in which the detected chromatographic peaks are matched between (and within) samples.
A variety of different, vendor-specific, file formats exist for proteomics/metabolomics. This data files have to be converted first into one of the file formats supported by the mzR
package, i.e. the open mzML or the netCDF file format. For metabolomics experiments it is suggested to load the data using the readMSData2
function (from the MSnbase
package [5]) into an OnDiskMSnExp
object. Such objects hold only general spectra data (such as number of spectra in a file or the retention time of each spectrum) in memory while the full raw data (the mz and intensity values per spectrum) is accessed on demand. This keeps the memory footprint small and enables thus analyses also of large scale metabolomics experiments (see help pages and vignettes of the MSnbase
package for more details on the OnDiskMSnExp
object). The new user interface of the xcms
package, that’s used in this document, makes extensive use of objects from the MSnbase
package and hence reuses all of their functionality.
Below we read the raw data files (in netCDF format) from the faahKO
data package. This package provides a subset of the data from [6] examining the metabolic consequences of the knock-out of the fatty acid amide hydrolase (Faah) gene in mice. The data subset comprises spinal cord samples from 6 knock-out and 6 wild type mice. Each file contains centroided data from an LC-MS experiment acquired in positive ion mode in a mz range from 200-600 and retention time range from 2500-4500 seconds.
## Get the file names
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
full.names = TRUE)
## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that can be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = s_groups, stringsAsFactors = FALSE)
## Read the data.
faahKO_raw <- readMSData2(cdf_files, pdata = new("NAnnotatedDataFrame", pheno))
The data is organized by spectrum, i.e. for each retention time we have a Spectrum1
object containing the mz and intensity duplets measured by the mass spec. Below we extract one of the spectra and evaluate the data it contains. Individual spectra can be extracted using [[
, which causes the MS data for the particular spectrum to be imported from the original data file.
## Access the 3rd spectrum in the data set.
spctr <- faahKO_raw[[3]]
## Get the retention time when the spectrum was measured
rtime(spctr)
## [1] 2501.38
## Access the mz of the spectrum
head(mz(spctr))
## [1] 201.0 202.1 203.1 204.1 205.1 205.8
## And the associated intensities
head(intensity(spctr))
## [1] 1083 672 1211 569 1244 452
## Optionally plot the Spectrum by plotting the mz values on the x- and
## the associated intensities on the y-axis.
We can use various accessor functions to extract information from the OnDiskMSnExp
object, such as rtime
to get the retention time for each spectrum. Many of these methods directly access information stored in the object’s fData
(corresponding to the spectrum headers in the mzML/netCDF files) and are thus very fast. The mz
, intensity
and spectra
methods on the other hand read the original data on the fly and are thus slower.
It is also important to note that the spectrum data within the object is not organized by sample and data is always returned as a one-dimensional vector. The association between a spectrum and the file from which it originates is provided by the fromFile
method which returns an integer vector with the index of the file from which the spectrum was extracted.
Various filter methods allow a fast and simple sub-setting of the full experiment. In the example below we create a total ion chromatogram (TIC) using the filterFile
method to subset the object to data from a certain file. The total ion current per spectrum is extracted with the tic
method. The TIC plots the sum of all measured intensities for a given retention time (i.e. a spectrum) against the retention time.
## Define the sample colors
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")
## Subset the full raw data by file and plot it.
tmp <- filterFile(faahKO_raw, file = 1)
plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
col = paste0(sample_colors[pData(tmp)$sample_group], 80), type = "l")
for (i in 2:length(fileNames(faahKO_raw))) {
tmp <- filterFile(faahKO_raw, file = i)
points(rtime(tmp), tic(tmp), type = "l",
col = paste0(sample_colors[pData(tmp)$sample_group], 80))
}
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)
The TIC or the base peak chromatogram (BPC, maximum signal per spectrum against its retention time) are useful plots to get a first general overview of an experiment and can also be used for quality control purposes, e.g. to spot problematic samples. Plotting the distribution of the total ion currents (tic
) or the base peak intensities (bpi
) per file using boxplots can also be used for quality assessment.
Exercise 1: alternative approach to plot the TIC: use the tic
and fromFile
methods instead.
Solution:
## The tic returns a vector, one value for each spectrum in the experiment. The
## values are not organized by sample/file
head(tic(faahKO_raw))
## The fromFile method returns the index from the file the spectrum derives
head(fromFile(faahKO_raw))
## Extract the total ion current and retention times and split them by file.
tics <- split(tic(faahKO_raw), f = fromFile(faahKO_raw))
rts <- split(rtime(faahKO_raw), f = fromFile(faahKO_raw))
## Define the color for each sample
cols <- paste0(sample_colors[pData(faahKO_raw)$sample_group], 80)
## initialize plot
plot(3, 3, pch = NA, xlim = range(rts), ylim = range(tics), main = "TIC",
xlab = "retention time", ylab = "intensity")
tmp <- mapply(rts, tics, cols, FUN = function(x, y, col) {
points(x = x, y = y, col = col, type = "l")
})
In most mzML and netCDF files the MS data is organized by spectrum (i.e. intensity values by their corresponding mz value) and, as detailed above, also the OnDiskMSnExp
object returns data by spectrum. In LC-MS metabolomics, however, peak detection is performed (for small slices along the mz dimension) in the time dimension and hence orthogonally to the spectrum data. To extract intensity data by retention time, xcms
defines the extractChromatograms
method and the Chromatogram
class. Below we create the base peak chromatogram (BPC, maximum signal per spectrum against its retention time). Usually we could use the bpi
method similarly to the tic
method, but the present netCDF files do not provide the base peak intensities in the spectrum header information. We thus have to create the BPC using the extractChromatogram
method that loads the full spectrum data from all files and aggregates the intensities per spectrum. The result is returned as a list
of Chromatogram
objects, one for each file. This is relatively fast for the present files (also because data is read in parallel) but can be slow with larger, higher resolution, MS experiments.
## Extract chromatograms for the full mz and rt range. By specifying
## aggregationFun = "max" we extract the maximum intensity per spectrum and
## get hence base peak chromatograms
chrs <- extractChromatograms(faahKO_raw, aggregationFun = "max")
## Plot the chromatograms
plotChromatogram(
chrs,
col = paste0(sample_colors[pData(faahKO_raw)$sample_group], 80))
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)
The BPC are similar between individual samples, but seem to be shifted in retention time dimension. This shift will be corrected in the alignment/retention time adjustment step.
Next we visualize the chromatogram for specific ions, i.e. for a small mz range and/or retention time window to inspect what type of chromatographic peaks have to be identified in the present LC-MS experiment.
## Extract the chromatogram for one mz value and a given rt range
chrs <- extractChromatograms(faahKO_raw, mz = 335, rt = c(2700, 2900))
plotChromatogram(chrs,
col = paste0(sample_colors[pData(faahKO_raw)$sample_group], 80))
The chromatographic peaks are about 40-50 seconds wide in this experiment. Note that not in all spectra (for all retention times) a signal was measured for the given mz range. The lines are thus not continuous in the plot above.
For the maximal intensity measured of the chromatographic peak we can also extract the corresponding spectrum in a file. Below we extract such spectrum for the first file and plot it.
## Subsetting the original object to the given retention time range and file,
## this returns an OnDiskMSnExp referencing to a single spectrum.
subs <- filterFile(filterRt(faahKO_raw, rt = c(2779, 2781)), file = 1)
## Extract the Spectrum
spctr <- spectra(subs)[[1]]
plot(mz(spctr), intensity(spctr), type = "h", xlab = "mz", ylab = "intensity")
points(x = 335, y = -10000, pch = 2)
Apparently there are many mass peaks present at the specific retention time, most of them larger than the one of our example chromatographic peak.
The first task in the pre-processing of LC-MS metabolomics data is the detection of peaks in the retention time dimension (i.e. chromatographic peaks) for MS data slices along the mz dimension. The most commonly used algorithm is centWave [7] that performs a relatively robust peak detection. Peak detection can be performed on OnDiskMSnExp
objects using the findChromPeaks
method providing in addition an algorithm-specific parameter class, such as an CentWaveParam
for centWave based peak detection, or MatchedFilterParam
for peak detection using the matched filter algorithm [2].
Below we use the default parameters for the peak detection (which is however never a good idea in LC-MS data pre-processing because peak shape and MS data are highly dependent on the experimental setup). The peak detection is carried out in parallel for each file.
## Create the parameter object for centWave
cwp <- CentWaveParam(noise = 200)
faahKO <- findChromPeaks(faahKO_raw, param = cwp)
faahKO
## MSn experiment data ("XCMSnExp")
## Object size in memory: 3.34 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 15336
## MSn retention times: 41:41 - 74:60 minutes
## - - - Processing information - - -
## Data loaded [Wed Jun 14 09:45:26 2017]
## Filter: select MS level(s) 1 [Wed Jun 14 09:45:37 2017]
## MSnbase version: 2.3.4
## - - - Meta data - - -
## phenoData
## rowNames: 1 2 ... 12 (12 total)
## varLabels: sample_name sample_group
## varMetadata: labelDescription
## Loaded from:
## [1] ko15.CDF... [12] wt22.CDF
## Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
## featureNames: X0001.1 X0001.10 ... X1278.9 (15336 total)
## fvarLabels: fileIdx spIdx ... spectrum (25 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: centWave
## 21704 peaks identified in 12 samples.
## On average 1809 chromatographic peaks per sample.
The result from the peak detection is an XCMSnExp
object, which is an extension to the OnDiskMSnExp
object. While being a container for metabolomics pre-processing results, this object inherits the OnDiskMSnExp
’s ability to access the raw data files. Below we access the results from this analysis step using the chromPeaks
method.
head(chromPeaks(faahKO))
## mz mzmin mzmax rt rtmin rtmax into intb
## [1,] 425.9 425.9 425.9 2520.158 2510.768 2527.982 9999.769 9984.120
## [2,] 464.3 464.3 464.3 2518.593 2504.508 2532.677 32103.270 32082.926
## [3,] 499.1 499.1 499.1 2524.852 2520.158 2527.982 4979.194 4904.709
## [4,] 572.7 572.7 572.7 2524.852 2520.158 2527.982 2727.446 2721.187
## [5,] 579.8 579.8 579.8 2524.852 2520.158 2527.982 2450.477 2444.218
## [6,] 453.2 453.2 453.2 2506.073 2501.378 2527.982 1007408.973 1007380.804
## maxo sn sample is_filled
## [1,] 741 740 1 0
## [2,] 1993 1992 1 0
## [3,] 883 13 1 0
## [4,] 559 558 1 0
## [5,] 468 467 1 0
## [6,] 38152 38151 1 0
Each line in the matrix
represents a chromatographic peak identified in one sample. The index of the file in which the peak was detected is given in column "sample"
while the definition of the peak is provided in columns "mzmin"
, "mzmax"
, "rtmin"
and "rtmax"
and the peaks intensities in columns "into"
(integrated peak signal) and "maxo"
(maximum signal at the peak’s apex).
The XCMSnExp
object keeps also track of all performed processing steps storing also the employed parameter classes and guaranteeing hence full reproducibility. This information can be accessed with the processHistory
method that returns a list
of processing steps. Below we use this method to extract the parameter class used for the chromatographic peak detection.
## Getting the first process history step, in our case the chromatographic
## peak detection.
ph <- processHistory(faahKO)[[1]]
ph
## Object of class "XProcessHistory"
## type: Peak detection
## date: Wed Jun 14 09:45:37 2017
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: CentWaveParam
## Extracting the Parameter class employed
processParam(ph)
## Object of class: CentWaveParam
## Parameters:
## ppm: 25
## peakwidth: 20, 50
## snthresh: 10
## prefilter: 3, 100
## mzCenterFun: wMean
## integrate: 1
## mzdiff: -0.001
## fitgauss: FALSE
## noise: 200
## verboseColumns: FALSE
## roiList length: 0
## firstBaselineCheck TRUE
## roiScales length: 0
Whether peak detection was successful is hard to tell. The numbers of detected peaks can provide some first information (Is the number much lower than expected? Are there files with considerably fewer peaks?). Also summaries of the rt and mz widths of identified peaks might be informative. Plotting the raw data and visually inspecting the detected peaks represents however one of the best options to estimate peak detection performance. This is in most cases done on a handful of known compounds or internal control compounds added to each sample. The new user interface facilitates extraction of full, or small slices of the MS data and enables an easy access to the original (or processed) data at any stage. Performance is guaranteed by making use of the indexing capabilities of mzML and netCDF files reading only sub-sets of the data where possible. The getEIC
method from the old xcms
user interface provided similar functionality but loaded the full data with each call. Also, not the original values were returned, but intensities from the profile matrix which contained intensities binned in equidistant slices along the mz dimension.
Below we plot the chromatogram for a mass-to-charge ratio of mz = 335
(and a retention time window from 2700 to 2900 seconds) and highlight also all identified chromatographic peaks in that region.
## Extract the chromatogram for one mz value and a given rt range
chrs <- extractChromatograms(faahKO, mz = 335, rt = c(2700, 2900))
plotChromatogram(chrs,
col = paste0(sample_colors[pData(faahKO)$sample_group], 80))
highlightChromPeaks(
faahKO, rt = c(2700, 2900), mz = 335,
border = paste0(sample_colors[pData(faahKO)$sample_group], 40))
Over and above the peak detection seemed to be OK although in some samples no peaks were identified, mostly due to low (and/or sparse) signal intensities.
The chromPeaks
method allows also to retrieve peaks for a specific mz
or rt
range. This enables to evaluate whether and how many chromatographic peaks have been detected for a certain mz-rt region. Below we extract all peaks identified in the above mz-rt region.
## Extract detected peaks for a mz-rt region. The parameter ppm allows to
## extend the mz range slightly
chromPeaks(faahKO, mz = 335, rt = c(2700, 2900), ppm = 10)
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## [1,] 335 335 335 2781.505 2761.160 2809.674 412134.3 382797.5 16856 22
## [2,] 335 335 335 2786.199 2764.290 2812.803 1496244.2 1462919.0 58736 78
## [3,] 335 335 335 2786.201 2764.291 2812.805 211297.2 195961.1 8158 17
## [4,] 335 335 335 2783.069 2762.725 2812.804 285228.1 271885.1 9154 21
## [5,] 335 335 335 2786.199 2764.290 2812.803 932645.2 920464.6 35856 86
## [6,] 335 335 335 2787.765 2765.855 2820.629 1636669.0 1605742.7 54640 73
## [7,] 335 335 335 2784.635 2759.595 2809.674 643673.2 631323.7 20672 53
## [8,] 335 335 335 2792.461 2768.987 2823.760 876585.5 847887.3 27200 39
## sample is_filled
## [1,] 1 0
## [2,] 2 0
## [3,] 3 0
## [4,] 4 0
## [5,] 8 0
## [6,] 9 0
## [7,] 10 0
## [8,] 11 0
As we have already seen above, a peak was detected in most samples.
To emphasize the need to adapt the peak detection algorithm setting to each setup/experiment we load an mzML file from a completely different experimental setup and perform a centWave peak detection using default settings.
## Load one file from a different setup.
fl <- paste0("./data/","250516_POOL_N_POS_28.mzML.gz")
raw_data <- readMSData2(fl)
## Run peak detection using default CentWave.
proc_data <- findChromPeaks(raw_data, param = CentWaveParam())
proc_data
## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.36 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 1721
## MSn retention times: 0:0 - 8:0 minutes
## - - - Processing information - - -
## Data loaded [Wed Jun 14 09:46:27 2017]
## Filter: select MS level(s) 1 [Wed Jun 14 09:46:28 2017]
## MSnbase version: 2.3.4
## - - - Meta data - - -
## phenoData
## rowNames: 250516_POOL_N_POS_28.mzML.gz
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## 250516_POOL_N_POS_28.mzML.gz
## protocolData: none
## featureData
## featureNames: X0001.1 X0002.1 ... X1721.1 (1721 total)
## fvarLabels: fileIdx spIdx ... spectrum (26 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: centWave
## 153 peaks identified in 1 samples.
## On average 153 chromatographic peaks per sample.
The number of detected peaks is very low, much lower than expected.
From the setup it is known that some compounds should be present/detected in the sample. One of these is glycine with an expected mz of 76.03969968
. Allowing a ppm of 20 we extract all identified peaks at about the expected mz.
mz_glyc <- 76.03969968
## Extract chromatographic peaks matching the mz of glycine, allowing
## a 20ppm deviation.
pks <- chromPeaks(proc_data, mz = mz_glyc, ppm = 20)
pks
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn sample is_filled
Not a single peak was detected in the expected region. Next we extract and plot the corresponding ion chromatogram to evaluate what signal is present in the region.
## Extend the mz range by 10 ppm on both sides.
mzr <- c(mz_glyc - mz_glyc * 10 / 1e6, mz_glyc + mz_glyc * 10 / 1e6)
## Extract the ion chromatogram for glycine
eic_glyc <- extractChromatograms(proc_data, mz = mzr, rt = c(165, 180))
## Plot the chromatogram
plotChromatogram(eic_glyc, rt = c(165, 180))
There is signal at the expected mz/rt, but why was this peak not detected?
Exercise 2: inspecting the chromatographic peak for glycine, how could you improve the centWave peak detection settings? Run peak detection with the modified settings and evaluate the results.
Solution: the chromatographic peaks are too narrow to be detected using the default settings. Adjust the peakwidth
parameter to represent the expected range of peak widths.
## Default centWave settings
CentWaveParam()
## The rt width of the peak is much smaller than the default 20-50 seconds.
## Adapt the peakwidth parameter and re-run the peak detection
cwp <- CentWaveParam(peakwidth = c(2, 10))
proc_data <- findChromPeaks(raw_data, param = cwp)
## Numer of detected peaks:
nrow(chromPeaks(proc_data))
## Average rt width
mean(chromPeaks(proc_data)[, "rtmax"] - chromPeaks(proc_data)[, "rtmin"])
## Do we find a glycine peak?
chromPeaks(proc_data, mz = mz_glyc, ppm = 20)
## Yes we do, and at the expected rt.
## plot the data and highlight the peak.
plotChromatogram(eic_glyc)
highlightChromPeaks(proc_data, mz = mz_glyc, rt = c(165, 180), ppm = 20)
## Peak is eventually even a little too broad.
The IPO
Bioconductor package [8] provides functionality for an automatic tuning of xcms
peak detection parameters and is thus a good starting point to automatically tune parameters for a specific metabolomics setup/experiment. Visual inspection of identified peaks is however crucial to guarantee proper peak detection.
The time at which analytes elute in the chromatography can vary between samples (and even compounds). Such a difference was already observable in the extracted ion chromatogram plot shown as an example in the previous section. The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.
A plethora of alignment algorithms exist (see [9]), with some of them being implemented also in xcms
. The method to perform the alignment/retention time correction in xcms
is adjustRtime
which uses different alignment algorithms depending on the provided parameter class. In the example below we use the obiwarp method [10] to align the samples. We use a binSize = 0.6
which creates warping functions in mz bins of 0.6. Also here it is advisable to modify the settings for each experiment and evaluate if retention time correction did align internal controls or known compounds properly.
## Define the parameters to the obiwarp method
owp <- ObiwarpParam(binSize = 0.6)
faahKO <- adjustRtime(faahKO, param = owp)
The result from the adjustRtime
call is the same XCMSnExp
object containing in addition the adjusted retention times. The hasAdjustedRtime
method can be used to evaluate if the object contains adjusted retention times that can be extracted using the adjustedRtime
method. If an XCMSnExp
object contains alignment results, the rtime
method does also by default return the adjusted retention times. Raw retention times can then be extracted by passing adjusted = FALSE
to the rtime
method. Note also that by passing the argument bySample = TRUE
, the rtime
and adjustedRtime
methods allow to extract retention time grouped by samples.
Below we simply test these methods to get a feeling of the type of result objects are returned.
## Do we have adjusted retention times?
hasAdjustedRtime(faahKO)
## [1] TRUE
## Get adjusted retention times for the first 6 spectra
head(rtime(faahKO))
## X0001.1 X0001.10 X0001.11 X0001.12 X0001.2 X0001.3
## 2501.378 2501.378 2501.378 2501.378 2501.378 2501.378
## And the raw retention times
head(rtime(faahKO, adjusted = FALSE))
## X0001.1 X0001.10 X0001.11 X0001.12 X0001.2 X0001.3
## 2501.378 2501.378 2501.380 2501.378 2501.378 2501.379
## By default the methods return again values per spectra. We can however pass
## bySample = TRUE and the result is returned as a list of numeric vectors,
## each list element representing the retention times for all spectra from one
## file
length(rtime(faahKO, bySample = TRUE))
## [1] 12
To evaluate the impact of the alignment we plot a base peak chromatogram before and after retention time correction as well as the deviation between raw and adjusted retention times.
## Extract BPC for each file; this reads all data from the original files.
chrs <- extractChromatograms(faahKO, aggregationFun = "max")
## To plot the BPC with the raw retention times we have to extract the
## intensities from the Chromatogram objects and extract the raw rt from
## the XCMSnExp with rtime(faahKO, adjusted = FALSE)
## Note that using bySample = TRUE the method returns the retention times split
## by sample.
rt_raw <- rtime(faahKO, adjusted = FALSE, bySample = TRUE)
## Extract the (base peak) intensities of the chromatograms
ints <- lapply(chrs, intensity)
## Preparing the plot
par(mfrow = c(3, 1), mar = c(0.5, 4, 1, 0.5))
## Plot first the base peak chromatogram with the raw retention times.
plot(3, 3, pch = NA, xlab = "", ylab = "base peak intensity", xaxt = "n",
main = "before adjustment", xlim = range(rt_raw), ylim = range(ints))
cols <- paste0(sample_colors[pData(faahKO)$sample_group], 80)
tmp <- mapply(rt_raw, ints, cols, FUN = function(x, y, col) {
points(x, y, col = col, type = "l")
})
## Plot the base peak chromatograms with the adjusted retention times.
plotChromatogram(chrs, main = "after adjustment", col = cols, xaxt = "n")
## Plot the difference between adjusted and raw adjustment.
par(mar = c(4, 4, 0.5, 0.5))
plotAdjustedRtime(faahKO, col = cols)
The retention time adjustment did align most of the base peaks across samples. Between 3600 and 3800 seconds the alignment was however less optimal showing also the strongest retention time adjustment.
It is also important to note that the alignment step adjusts also the reported retention times for the detected chromatographic peaks. If we were not happy with the results from the alignment step we could also drop these results using the dropAdjustedRtime
method in which case the raw retention times are restored (also for the detected chromatographic peaks).
Exercise 3: plot the chromatographic peak for mz = 335
and rt = c(2700, 2900)
before and after retention time correction. Hint: for the peaks before alignment, extract the chromatograms either from the raw faahKO_raw
object or use the dropAdjustedRtime
method to restore raw retention times.
Solution:
## To extract the chromatogram before retention time adjustment we could
## drop the retention time adjustment results:
chrs_raw <- extractChromatograms(dropAdjustedRtime(faahKO),
rt = c(2700, 2900), mz = 335)
## Or, more simpler, just pass the adjustedRtime = FALSE parameter to the method.
chrs_raw <- extractChromatograms(faahKO, adjustedRtime = FALSE,
rt = c(2700, 2900), mz = 335)
## Define the color for each sample
cols <- paste0(sample_colors[pData(faahKO)$sample_group], 80)
par(mfrow = c(2, 1))
plotChromatogram(chrs_raw, col = cols)
## Extract the chromatograms after adjustment.
chrs_adj <- extractChromatograms(faahKO, rt = c(2700, 2900), mz = 335)
plotChromatogram(chrs_adj, col = cols)
## Alignment is not perfect for this paricular peak.
An alternative to the obiwarp method is the peakGroups method that performs the alignment using so called hook peaks that are present in most samples. The retention times are then adjusted by aligning these peaks and interpolating in the region between them. Such peaks are usually being identified using the peak density correspondence method by enforcing the chromatographic peaks to be present in most samples, could however also be defined manually.
Note that all alignment methods do depend on the assumption that the samples are similar (obiwarp: have similar chromatograms and peak groups: have metabolites present in all samples that can be used as hook peaks).
limma
package on the log2 transformed (and normalized) intensities to identify features that are different between groups.CAMERA
package might be a good starting point.1. Myint L, Kleensang A, Zhao L, Hartung T, Hansen KD: Joint bounding of peaks across samples improves differential analysis in mass spectrometry-based metabolomics. Analytical chemistry 2017:acs.analchem.6b04719.
2. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78:779–787.
3. Sugimoto M, Kawakami M, Robert M, Soga T, Tomita M: Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis. Current bioinformatics 2012, 7:96–108.
4. Smith R, Mathis AD, Ventura D, Prince JT: Proteomics, lipidomics, metabolomics: a mass spectrometry tutorial from a computer scientist’s point of view. BMC Bioinformatics 2014, 15 Suppl 7(Suppl 7):S9.
5. Gatto L, Lilley KS: MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 2012, 28:288–289.
6. Saghatelian A, Trauger SA, Want EJ, Hawkins EG, Siuzdak G, Cravatt BF: Assignment of endogenous substrates to enzymes by global metabolite profiling. Biochemistry 2004, 43:14332–14339.
7. Tautenhahn R, Böttcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 2008, 9:504.
8. Libiseller G, Dvorzak M, Kleb U, Gander E, Eisenberg T, Madeo F, Neumann S, Trausinger G, Sinner F, Pieber T, Magnes C: IPO: a tool for automated optimization of XCMS parameters. BMC Bioinformatics 2015, 16:118.
9. Smith R, Ventura D, Prince JT: LC-MS alignment in theory and practice: a comprehensive algorithmic review. Briefings in bioinformatics 2013, 16:bbt080–117.
10. Prince JT, Marcotte EM: Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry 2006, 78:6140–6152.