---
title: "R / Bioconductor packages for Proteomics"
author: "[Laurent Gatto](http://proteome.sysbiol.cam.ac.uk/lgatto/)"
output:
  html_document:
    theme: united	
    toc: true
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Bioc2014}
-->

**Abstract** This workshop will give delegates the opportunity to
discover and try some of the recent R / Bioconductor developments for
proteomics. Topics covered will including support for open
community-driven formats for raw data and identification results,
packages for peptide-spectrum matching, quantitative proteomics, mass
spectrometry (MS) and quantitation data processing, and
visualisation. The workshop material will be a self-contained
vignette/workflow including example data.

This short tutorial is part of the `ProteomicsBioc2014Workshop`
package (version `r packageVersion("ProteomicsBioc2014Workshop")`),
available at
https://github.com/ComputationalProteomicsUnit/ProteomicsBioc2014Workshop.

```{r, env, message=FALSE, echo=FALSE, warning=FALSE}
library("knitr")
opts_knit$set(error = FALSE)

library("RforProteomics")
library("mzR")
library("mzID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("BiocInstaller")
library("rTANDEM")
library("shinyTANDEM")
```

## Introduction

```{r, pk, echo=FALSE, warning=FALSE, cache=TRUE}
biocv <- as.character(biocVersion())
pp <- proteomicsPackages(biocv)
msp <- massSpectrometryPackages(biocv)
msdp <- massSpectrometryDataPackages(biocv)
```

In Bioconductor version `r biocv`, there are respectively `r nrow(pp)`
proteomics and `r nrow(msp)` mass spectrometry software packages and
`r nrow(msdp)` mass spectrometry experiment packages. These respective
packages can be extracted with the `proteomicsPackages()`,
`massSpectrometryPackages()` and `massSpectrometryDataPackages()` and
explored interactively.


```{r, pp, eval=FALSE}
library("RforProteomics")
pp <- proteomicsPackages(biocv)
display(pp)
```

## Mass spectrometry data

```{r, datatab, results='asis', echo=FALSE}

datatab <-
    data.frame(Type = c("raw", "identification", "quantitation",
                   "peak lists", "other"),
               Format = c("mzML, mzXML, netCDF, mzData",
                   "mzIdentML", "mzQuantML", "mgf", "mzTab"),
               Package = c(
                   "[`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read)",
                   "[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) (read)",
                   "",
                   "[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write)", 
                   "[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write)"))
kable(datatab)

```

### Getting data from proteomics repositories

Contemporary MS-based proteomics data is disseminated through the
[ProteomeXchange](http://www.proteomexchange.org/) infrastructure,
which centrally coordinates submission, storage and dissemination
through multiple data repositories, such as the
[PRIDE](https://www.ebi.ac.uk/pride/archive/) data base at the EBI for
MS/MS experiments, [PASSEL](http://www.peptideatlas.org/passel/) at
the ISB for SRM data and the
[MassIVE](http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp)
resource. The
[`rpx`](http://www.bioconductor.org/packages/release/bioc/html/rpx.html)
is an interface to ProteomeXchange and provides a basic and unified
access to PX data.

```{r, rpx}
library("rpx")
pxannounced()
```

```{r, pxd}
px <- PXDataset("PXD000001")
px
pxfiles(px)

```

Other metadata for the `px` dataset:

```{r, pxvar, eval=FALSE}
pxtax(px)
pxurl(px)
pxref(px)
```

Data files can then be downloaded with the `pxget` function as
illustrated below. Alternatively, the file is available on the
workshop's Amazon virtual machine in `/data/Proteomics/data/`.

```{r, pxget}
mzf <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
mzf <- file.path("/data/Proteomics/data", mzf)
if (!file.exists(mzf))
    mzf <- pxget(px, pxfiles(px)[6])
mzf
```

### Handling raw MS data

The `mzR` package provides an interface to the
[proteowizard](http://proteowizard.sourceforge.net/) code base,
the legacy RAMP is a non-sequential parser and other C/C++ code to
access various raw data files, such as `mzML`, `mzXML`, `netCDF`, and
`mzData`. The data is accessed on-disk, i.e it does not get loaded
entirely in memory by default. The three main functions are
`openMSfile` to create a file handle to a raw data file, `header` to
extract metadata about the spectra contained in the file and `peaks`
to extract one or multiple spectra of interest. Other functions such
as `instrumentInfo`, or `runInfo` can be used to gather general
information about a run.

```{r, rawms}
library("mzR")
ms <- openMSfile(mzf)
ms
```

```{r, hd}
hd <- header(ms)
dim(hd)
names(hd)
```

#### Exercise

Extract the index of the MS2 spectrum with the highest base peak
intensity and plot its spectrum. Is the data centroided or in profile
mode?

```{r, ex_raw, echo=FALSE, eval=FALSE, fig.align='center'}
hd2 <- hd[hd$msLevel == 2, ]
i <- which.max(hd2$basePeakIntensity)
hd2[i, ]
pi <- peaks(ms, hd2[i, 1])
plot(pi, type = "h")
mz <- hd2[i, "basePeakMZ"]
plot(pi, type = "h", xlim = c(mz-0.5, mz+0.5))

pj <- peaks(ms, 100)
plot(pj, type = "l")
plot(pj, type = "l", xlim = c(536,540))
```

### Handling identification data

The `RforProteomics` package distributes a small identification result
file (see
`?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid`) that we
load and parse using infrastructure from the 
[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html)
package.

```{r, id, cache=TRUE}
library("mzID")
(f <- dir(system.file("extdata", package = "ProteomicsBioc2014Workshop"),
         pattern = "mzid", full.names=TRUE))
id <- mzID(f)
id
```

Various data can be extracted from the `mzID` object, using one the
accessor functions such as `database`, `scans`, `peptides`, ... The
object can also be converted into a `data.frame` using the `flatten`
function.

#### Exercise

Is there a relation between the length of a protein and the number of
identified peptides, conditioned by the (average) e-value of the
identifications?

```{r, ex_id, echo = FALSE, eval=FALSE}
fid <- flatten(id)
x <- by(fid, fid$accession, function(x)
    c(unique(x$length),
      length(unique(x$pepseq)),
      mean(x$'ms-gf:specevalue')))
x <- data.frame(do.call(rbind, x))
colnames(x) <- c("plength", "npep", "eval")
x$bins <- cut(x$eval, summary(x$eval))
library("lattice")
xyplot(plength ~ npep | bins, data = x)
```

### MS/MS database search

While searches are generally performed using third-party software
independently of R or can be started from R using a `system` call, the
[`rTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/rTANDEM.html)
package allows one to execute such searches using the X!Tandem
engine. The
[`shinyTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/shinyTANDEM.html)
provides a interactive interface to explore the search results.

```{r, rtandem, eval=FALSE}
library("rTANDEM")
?rtandem
library("shinyTANDEM")
?shinyTANDEM
```


## High-level data interface

The above sections introduced low-level interfaces to raw and
identification results. The
[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html)
package provides abstractions for raw data through the `MSnExp` class
and containers for quantification data via the `MSnSet` class. Both
store the actual assay data (spectra or quantitation matrix) and
sample and feature metadata, accessed with `spectra` (or the
`[`, `[[` operators) or `exprs`, `pData` and `fData`.

The figure below give a schematics of an `MSnSet` instance and the
relation between the assay data and the respective feature and sample
metadata.


```{r, msnset, echo=FALSE, fig.width = 5, fig.height = 7, fig.align='center'}
plot(NA, xlim = c(0, 5), ylim = c(0, 10), axes=FALSE, xlab = NA, ylab = NA)
rect(0, 0, 3, 1.9)
rect(0, 2, 3, 10)
rect(3.05, 2, 5, 10)

segments(seq(0, 3, length.out = 7),
         rep(0, 7),
         seq(0, 3, length.out = 7),
         rep(10, 7),
         lty = "dotted")

segments(rep(0, 50),
         seq(2, 10, length.out = 50),
         rep(5, 100),
         seq(2, 10, length.out = 50),
         lty = "dotted")

text(1.5, 1, "sample metadata", cex = 1.5)
text(1.5, 6, "assay data", cex = 1.5)
text(4, 6, "feature\nmetadata", cex = 1.5)
```

Another useful slot is `processingData`, accessed with `processingData(.)`, that records all the processing that objects have undergone since their creation (see examples below).

The `readMSData` will parse the raw data, extract the MS2 spectra and
construct an MS experiment file.

```{r, msnbase}
library("MSnbase")
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzXML$")
quantFile
msexp <- readMSData(quantFile, verbose=FALSE)
msexp
```

The identification results stemming from the same raw data file can
then be used to add PSM matches.

```{r, addid}
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzid$")
identFile
msexp <- addIdentificationData(msexp, identFile)
fData(msexp)
```

```{r, specplot, fig.align='center'}
msexp[[1]]
plot(msexp[[1]], full=TRUE)
as(msexp[[1]], "data.frame")[100:105, ]
```

### Quantitative proteomics

There are a wide range of proteomics quantitation techniques that can
broadly be classified as labelled vs. label-free, depending whether
the features are labelled prior the MS acquisition and the MS level at
which quantitation is inferred, namely MS1 or MS2. 

```{r, quanttab, echo=FALSE, results='asis'}

qtb <- matrix(c("XIC", "Counting", "SILAC, 15N", "iTRAQ, TMT"),
              nrow = 2, ncol = 2)
dimnames(qtb) <- list(
    'MS level' = c("MS1", "MS2"),
    'Quantitation' = c("Label-free", "Labelled"))

kable(qtb)

```

In terms of raw data quantitation, most efforts have been devoted to
MS2-level quantitation. Label-free XIC quantitation has however been
addressed in the frame of metabolomics data processing by the
[`xcms`](http://bioconductor.org/packages/release/bioc/html/xcms.html)
infrastructure. 


An `MSnExp` is converted to an `MSnSet` by the `quantitation`
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy
(defined by the `iTRAQ4` parameter; other tags are available).

```{r, itraq4plot, fig.align='center'}
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
```

```{r, quantitraq}
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose=FALSE)
exprs(msset)
processingData(msset)
```

Other MS2 quantitation methods available in `quantify` include the
(normalised) spectral index `SI` and (normalised) spectral abundance
factor `SAF` or simply a simple count method.

```{r, lfms2}
exprs(si <- quantify(msexp, method = "SIn"))     
exprs(saf <- quantify(msexp, method = "NSAF"))
```

Note that spectra that have not been assigned any peptide (`NA`) or
that match non-unique peptides (`npsm > 1`) are discarded in the
counting process.

**See also** The
[`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html)
package supports quantitation from centroided `mgf` peak lists or its
own tab-separated files that can be generated from Mascot and Phenyx
vendor files.

### Importing third-party data

The PSI `mzTab` file format is aimed at providing a simpler (than XML
formats) and more accessible file format to the wider community. It is
composed of a key-value metadata section and peptide/protein/small
molecule tabular sections. 

```{r, mztab}
mztf <- pxget(px, pxfiles(px)[2])
(mzt <- readMzTabData(mztf, what = "PEP"))
```

It is also possible to import arbitrary spreadsheets as `MSnSet`
objects into R with the `readMSnSet2` function. The main 2 arguments
of the function are (1) a text-based spreadsheet and (2) column names
of indices that identify the quantitation data.

```{r, readmsnset2}
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
           full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
head(fData(res))
```

## Data processing and analysis

### Processing and normalisation

Each different types of quantitative data will require their own
pre-processing and normalisation steps. Both `isobar` and `MSnbase`
allow to correct for isobaric tag impurities normalise the
quantitative data.

```{r, pure}
data(itraqdata)
qnt <- quantify(itraqdata, method = "trap",
                reporters = iTRAQ4, verbose = FALSE)
impurities <- matrix(c(0.929,0.059,0.002,0.000,
                       0.020,0.923,0.056,0.001,
                       0.000,0.030,0.924,0.045,
                       0.000,0.001,0.040,0.923),
                     nrow=4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt.crct <- purityCorrect(qnt, impurities)
processingData(qnt.crct)
```


```{r, pureplot, fig.align='center'}

plot0 <- function(x, y, main = "") {
    old.par <- par(no.readonly = TRUE)
    on.exit(par(old.par))
    par(mar = c(4, 4, 1, 1))
    par(mfrow = c(2, 2))
    sx <- sampleNames(x)
    sy <- sampleNames(y)
    for (i in seq_len(ncol(x))) {
        plot(exprs(x)[, i], exprs(y)[, i], log = "xy",
             xlab = sx[i], ylab = sy[i])
        grid()
    }
}

plot0(qnt, qnt.crct)
```

Various normalisation methods can be applied the `MSnSet` instances
using the `normalise` method: variance stabilisation (`vsn`), quantile
(`quantiles`), median or mean centring (`center.media` or
`center.mean`), ...

```{r, norm, fig.align='center'}
qnt.crct.nrm <- normalise(qnt.crct,"quantiles")
plot0(qnt, qnt.crct.nrm)
```

The `combineFeatures` method combines spectra/peptides quantitation
values into protein data. The grouping is defined by the `groupBy`
parameter, which is generally taken from the feature metadata (protein
accessions, for example).

```{r, comb}
## arbitraty grouping
g <- factor(c(rep(1, 25), rep(2, 15), rep(3, 15)))
prt <- combineFeatures(qnt.crct.nrm, groupBy = g, fun = "sum")
processingData(prt)
```

Finally, proteomics data analysis is generally hampered by missing
values. Missing data imputation is a sensitive operation whose success
will be guided by many factors, such as degree and (non-)random nature
of the missingness. Missing value in `MSnSet` instances can be
filtered out and imputed using the `filterNA` and `impute` functions.

```{r, impute, fig.align='center'}
set.seed(1)
qnt0 <- qnt
exprs(qnt0)[sample(prod(dim(qnt0)), 10)] <- NA
table(is.na(qnt0))
qnt00 <- filterNA(qnt0)
dim(qnt00)
qnt.imp <- impute(qnt0)
plot0(qnt, qnt.imp)
```

#### Exercise

The `mzt` instance created from the `mzTab` file has the following is
a TMT 6-plex with the following design:

   In this TMT 6-plex experiment, four exogenous proteins were spiked
   into an equimolar *Erwinia carotovora* lysate with varying
   proportions in each channel of quantitation; yeast enolase (ENO) at
   10:5:2.5:1:2.5:10, bovine serum albumin (BSA) at 1:2.5:5:10:5:1,
   rabbit glycogen phosphorylase (PHO) at 2:2:2:2:1:1 and bovin
   cytochrome C (CYT) at 1:1:1:1:1:2. Proteins were then digested,
   differentially labelled with TMT reagents, fractionated by reverse
   phase nanoflow UPLC (nanoACQUITY, Waters), and analysed on an LTQ
   Orbitrap Velos mass spectrometer (Thermo Scientic).

Explore the `mzt` data using some of the illustrated functions. The
heatmap and MAplot (see `MAplot` function), taken from the
[`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html)
vignette, have been produced using the same data.

![heatmap](figure/heatmap.png)
![maplot](figure/maplot.png)

### Statistical analysis

R in general and Bioconductor in particular are well suited for the
statistical analysis of data. Several packages provide dedicated
resources for proteomics data:

- [`MSstats`](http://www.bioconductor.org/packages/release/bioc/html/MSstats.html):
  A set of tools for statistical relative protein significance
  analysis in DDA, SRM and DIA experiments.

- [`msmsTest`](http://www.bioconductor.org/packages/release/bioc/html/msmsTests.html):
  Statistical tests for label-free LC-MS/MS data by spectral counts,
  to discover differentially expressed proteins between two biological
  conditions. Three tests are available: Poisson GLM regression,
  quasi-likelihood GLM regression, and the negative binomial of the
  edgeR package.

<!-- ```{r, msmstest} -->
<!-- library(msmsTests) -->
<!-- data(msms.dataset) -->
<!-- msms.dataset -->
<!-- e <- pp.msms.data(msms.dataset) -->
<!-- e -->
     
<!-- null.f <- "y~batch" -->
<!-- alt.f <- "y~treat+batch" -->
<!-- div <- apply(exprs(e),2,sum) -->
<!-- res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat") -->
     
<!-- head(res) -->
<!-- ``` -->

- [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html)
  also provides dedicated infrastructure for the statistical analysis of isobaric data. 

### Machine learning

The
[`MLInterfaces`](http://www.bioconductor.org/packages/release/bioc/html/MLInterfaces.html)
package provides a unified interface to a wide range of machine
learning algorithms. Initially developed for microarray and
`ExpressionSet` instances, the
[`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html)
package enables application of these algorithms to `MSnSet` data.

```{r, ml}
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans
```

```{r, clust, fig.align='center'}
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl
plot(kcl, exprs(dunkley2006))
```

```{r, clust2, fig.align='center'}
hcl <- MLearn( ~ ., data = t(dunkley2006), hclustI(distFun =  dist, cutParm = list(k = 4)))
hcl
plot(hcl, exprs(t(dunkley2006)))
```


## Annotation

All the
[Bioconductor annotation infrastructure](http://bioconductor.org/help/workflows/annotation/annotation/),
such as
[`biomaRt`](http://bioconductor.org/packages/release/bioc/html/biomaRt.html),
[`GO.db`](http://www.bioconductor.org/packages/release/data/annotation/html/GO.db.html),
organism specific annotations, .. are directly relevant to the
analysis of proteomics data. Some proteomics-centred annotations such
as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI
2.5) or Protein Modifications are available through the
[`rols`](http://www.bioconductor.org/packages/release/bioc/html/rols.html). Data
from the [Human Protein Atlas](http://www.proteinatlas.org/) is
available via the
[`hpar`](http://www.bioconductor.org/packages/release/bioc/html/hpar.html)
package.

## Other relevant packages/pipelines

- Analysis of post translational modification with
  [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html).
- Analysis of label-free data from a Synapt G2 (including ion
  mobility) with
  [`synapter`](http://www.bioconductor.org/packages/release/bioc/html/synapter.html).
- Analysis of spatial proteomics data with
  [`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html).
- Analysis of MALDI data with the
  [`MALDIquant`](http://cran.r-project.org/web/packages/MALDIquant/index.html)
  package.
- Access to the Proteomics Standard Initiative Common QUery InterfaCe
  with the
  [`PSICQUIC`](http://www.bioconductor.org/packages/release/bioc/html/PSICQUIC.html)
  package.

Additional relevant packages are described in the
[`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html)
vignette. 

## Session information

```{r, si, echo=FALSE}
print(sessionInfo(), local = FALSE)
```