1 Introduction

The mzR package aims at providing a common, low-level interface to several mass spectrometry data formats, namely mzData (Orchard et al. 2007), mzXML (Pedrioli et al. 2004) and mzML (Martens et al. 2010) for raw data, and mzIdentML (A. R. Jones et al. 2012) for identification data, somewhat similar to the Bioconductor package affyio for Affymetrix raw data. No processing is done in mzR; this is left to packages such as xcms (C. A. Smith et al. 2006; Tautenhahn et al. 2008) or MSnbase (L. Gatto and Lilley 2012). These packages also provide more convenient, high-level interfaces to raw and identification data.

Most importantly, access to the data should be fast and memory efficient. This is made possible by allowing on-disk random file access, i.e. retrieving specific data of interest without having to sequentially browse the full content or load the entire data into memory.

The actual work of reading and parsing the data files is handled by the included C/C++ libraries or backends. The mzRramp RAMP parser, written at the Institute for Systems Biology (ISB), is a fast and lightweight parser in pure C. Later, it gained support for the mzData format. The C++ reference implementation for mzML is the proteowizard library (Kessner et al. 2008) (pwiz for short), which in turn makes use of the boost C++ library (http://www.boost.org/). RAMP is able to access mzML files by calling pwiz methods. More recently, proteowizard (http://proteowizard.sourceforge.net/) (M. C. Chambers et al. 2012) has been fully integrated using the mzRpwiz backend for raw data, and is now the default option. The mzRnetCDF backend provides support for CDF-based formats. Finally, the mzRident backend is available to access identification data (mzIdentML) through pwiz.
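As a minimal sketch of how a backend is selected, the snippet below opens one of the example files shipped with the msdata package and requests the pwiz backend explicitly. It assumes that the installed mzR version accepts "pwiz" as a value for the backend argument of openMSfile; when the argument is omitted, the backend is chosen by the package.

```r
library(mzR)
library(msdata)

## Example mzXML file shipped with msdata (the same file is used in the
## Example section of this document).
mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                     package = "msdata")

## Assumption: backend = "pwiz" is accepted by the installed mzR version.
ms <- openMSfile(mzxml, backend = "pwiz")

## The class of the returned object reflects the backend in use.
backend_class <- class(ms)
print(backend_class)

## Release the file handle when done.
close(ms)
```

The object returned by openMSfile is only a handle to the file on disk; spectra are read when they are requested.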

The mzR package is in essence a collection of wrappers to the C++ code, and benefits from the C++ interface provided through the Rcpp package (Eddelbuettel and François 2011).

2 Mass spectrometry raw data

All the mass spectrometry file formats are organized similarly, where a set of metadata nodes about the run is followed by a list of spectra with the actual masses and intensities. In addition, each of these spectra has its own set of metadata, such as the retention time and acquisition parameters.

2.1 Spectral data access

Access to the spectral data is done via the peaks function. The return value is a list of two-column mass-to-charge and intensity matrices or a single matrix if one spectrum is queried.
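The difference between querying one and several spectra can be sketched as follows, using the threonine example file from the msdata package. The variable names are illustrative only.

```r
library(mzR)
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                     package = "msdata")
ms <- openMSfile(mzxml)

## A single scan index returns one two-column matrix
## (column 1: m/z, column 2: intensity).
one <- peaks(ms, 1)
dim(one)

## Several scan indices return a list with one matrix per spectrum.
several <- peaks(ms, 1:3)
length(several)
sapply(several, nrow)

close(ms)
```

Only the requested scans are read from disk, so extracting a handful of spectra from a large file does not require loading the whole run.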

2.2 Chromatogram access

Access to the chromatogram(s) is done using the chromatogram (or chromatograms) function, which returns one data.frame (or a list of data.frames). See ?chromatogram for details. This functionality is only available with the pwiz backend.
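A short sketch of chromatogram access is given below. It assumes that the msdata package provides the proteomics() helper and an MRM example file matching the pattern "MRM-standmix-5.mzML.gz"; any mzML file containing chromatograms could be substituted.

```r
library(mzR)
library(msdata)

## Assumption: msdata ships an MRM example file matching this pattern.
f <- proteomics(full.names = TRUE, pattern = "MRM-standmix-5.mzML.gz")
x <- openMSfile(f)

## Number of chromatograms stored in the file.
n_chrom <- nChrom(x)

## One chromatogram, returned as a data.frame.
chr1 <- chromatogram(x, 1L)
head(chr1)

## All chromatograms, returned as a list of data.frames.
all_chr <- chromatogram(x)
length(all_chr)

close(x)
```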

2.3 Identification result access

The main access to identification results is done via psms, score and modifications. psms and score return detailed information on each peptide-spectrum match (PSM) and its scores. modifications returns the details of each modification found in the peptides.
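The three accessors can be tried on an mzIdentML file as sketched below. The example assumes that the msdata package ships the file "mzid/Tandem.mzid.gz"; the exact columns returned depend on the search engine that produced the file.

```r
library(mzR)
library(msdata)

## Assumption: msdata ships this X!Tandem mzIdentML example file.
mzid <- system.file("mzid", "Tandem.mzid.gz", package = "msdata")
id <- openIDfile(mzid)

## One row per peptide-spectrum match.
p <- psms(id)
head(p)

## Search-engine scores associated with the PSMs.
s <- score(id)
head(s)

## Modifications found on the identified peptides.
m <- modifications(id)
head(m)
```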

2.4 Metadata access

Run metadata is available via several functions such as instrumentInfo() or runInfo(). Individual fields can be accessed via dedicated functions, for example detector().

Spectrum metadata is available via header(), which will return a list (for single scans) or a data frame with information such as the basePeakMZ, peaksCount, … or, for higher-order MS, the msLevel and precursor information.

Identification metadata is available via mzidInfo(), which will return a list with information such as the software, ModificationSearched, enzymes, SpectraSource and other information for this identification result.

The availability of this metadata cannot always be guaranteed, and depends on the MS software that converted the data.

3 Example

3.1 mzXML/mzML/mzData files

The following short example reads data from a mass spectrometry file. First, open the file.

library(mzR)
## Loading required package: Rcpp
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", 
                     package = "msdata")
aa <- openMSfile(mzxml) 

We can obtain different kinds of header information.

runInfo(aa)
## $scanCount
## [1] 55
## 
## $lowMz
## [1] 50.0036
## 
## $highMz
## [1] 298.673
## 
## $dStartTime
## [1] 0.3485
## 
## $dEndTime
## [1] 390.027
## 
## $msLevels
## [1] 1 2 3 4
## 
## $startTimeStamp
## [1] NA
instrumentInfo(aa)
## $manufacturer
## [1] "Thermo Scientific"
## 
## $model
## [1] "LTQ Orbitrap"
## 
## $ionisation
## [1] "electrospray ionization"
## 
## $analyzer
## [1] "fourier transform ion cyclotron resonance mass spectrometer"
## 
## $detector
## [1] "unknown"
## 
## $software
## [1] "Xcalibur software 2.2 SP1"
## 
## $sample
## [1] ""
## 
## $source
## [1] ""
header(aa,1)
## $seqNum
## [1] 1
## 
## $acquisitionNum
## [1] 1
## 
## $msLevel
## [1] 1
## 
## $polarity
## [1] 1
## 
## $peaksCount
## [1] 684
## 
## $totIonCurrent
## [1] 341427000
## 
## $retentionTime
## [1] 0.3485
## 
## $basePeakMZ
## [1] 120.066
## 
## $basePeakIntensity
## [1] 211860000
## 
## $collisionEnergy
## [1] 0
## 
## $ionisationEnergy
## [1] 0
## 
## $lowMZ
## [1] 50.3254
## 
## $highMZ
## [1] 298.673
## 
## $precursorScanNum
## [1] 0
## 
## $precursorMZ
## [1] 0
## 
## $precursorCharge
## [1] 0
## 
## $precursorIntensity
## [1] 0
## 
## $mergedScan
## [1] 0
## 
## $mergedResultScanNum
## [1] 0
## 
## $mergedResultStartScanNum
## [1] 0
## 
## $mergedResultEndScanNum
## [1] 0
## 
## $injectionTime
## [1] 0
## 
## $spectrumId
## [1] "controllerType=0 controllerNumber=1 scan=1"

Read a single spectrum from the file.

pl <- peaks(aa,10)
peaksCount(aa,10)
## [1] 317
head(pl)
##          [,1]     [,2]
## [1,] 50.08176 6984.858
## [2,] 50.62267 7719.419
## [3,] 50.70530 7185.290
## [4,] 50.73298 7509.140
## [5,] 50.83848 9366.624
## [6,] 50.88303 8012.808
plot(pl[,1], pl[,2], type="h", lwd=1)
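
The header information can also be used to select spectra before reading them. The following sketch, which reopens the same file so that it can be run on its own, retrieves the header of all scans, tabulates the MS levels and reads only the MS2 spectra. The variable names are illustrative.

```r
library(mzR)
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                     package = "msdata")
aa <- openMSfile(mzxml)

## Without a scan index, header() returns one row per scan.
hd <- header(aa)
table(hd$msLevel)

## Sequence numbers of the MS2 scans.
ms2 <- hd$seqNum[hd$msLevel == 2]

## Read only those spectra; wrap a single matrix in a list so that the
## result has the same shape whatever the number of MS2 scans.
ms2_peaks <- peaks(aa, ms2)
if (is.matrix(ms2_peaks)) ms2_peaks <- list(ms2_peaks)

## Precursor information for the selected scans.
hd[hd$msLevel == 2, c("seqNum", "retentionTime", "precursorMZ")]

close(aa)
```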