`MSnSet`

s. There is also features for quality control and visualisation of results. Please see other vignettes for convergence and other methodology.
bandle 1.2.2

In this vignette we use a real-life biological use-case to demonstrate how to analyse mass-spectrometry based proteomics data using the Bayesian ANalysis of Differential Localisation Experiments (BANDLE) method.

As mentioned in “Vignette 1: Getting Started with BANDLE” data from mass
spectrometry based proteomics methods most commonly yield a matrix of
measurements where we have proteins/peptides/peptide spectrum matches
(PSMs) along the rows, and samples/fractions along the columns. To use `bandle`

the data must be stored as a `MSnSet`

, as implemented in the Bioconductor
*MSnbase* package. Please see the relevant vignettes in
*MSnbase* for constructing these data containers.

The data used in this vignette has been published in Mulvey et al. (2021) and is currently
stored as `MSnSet`

instances in the the *pRolocdata* package. We
will load it in the next section.

In this workflow we analyse the data produced by Mulvey et al. (2021). In this experiment triplicate hyperLOPIT experiments (Mulvey et al. 2017) were conducted on THP-1 human leukaemia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).

In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples. Please note to adhere to Bioconductor vignette build times we only load 2 of the 3 replicates for each condition to demonstrate the BANDLE workflow.

```
library("pRolocdata")
data("thpLOPIT_unstimulated_rep1_mulvey2021")
data("thpLOPIT_unstimulated_rep3_mulvey2021")
data("thpLOPIT_lps_rep1_mulvey2021")
data("thpLOPIT_lps_rep3_mulvey2021")
```

By typing the names of the datasets we get a `MSnSet`

data summary. For
example,

`thpLOPIT_unstimulated_rep1_mulvey2021`

```
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5107 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ...
## unstim_rep1_set2_131_F24 (20 total)
## varLabels: Tag Treatment ... Fraction (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total)
## fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:48 2021.
## Normalised to sum of intensities.
## MSnbase version: 2.14.2
```

`thpLOPIT_lps_rep1_mulvey2021`

```
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4879 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ...
## lps_rep1_set2_131_F25 (20 total)
## varLabels: Tag Treatment ... Fraction (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total)
## fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:57 2021.
## Normalised to sum of intensities.
## MSnbase version: 2.14.2
```

We see that the datasets `thpLOPIT_unstimulated_rep1_mulvey2021`

and
`thpLOPIT_lps_rep1_mulvey2021`

contain 5107 and 4879 proteins respectively,
across 20 TMT channels. The data is accessed through different slots of the
`MSnSet`

(see `str(thpLOPIT_unstimulated_rep1_mulvey2021)`

for all available
slots). The 3 main slots which are used most frequently are those that contain
the quantitation data, the features i.e. PSM/peptide/protein information and the
sample information, and these can be accessed using the functions `exprs`

,
`fData`

, and `pData`

, respectively.

First, let us load the `bandle`

package along with some other R packages needed
for visualisation and data manipulation,

```
library("bandle")
library("pheatmap")
library("viridis")
library("dplyr")
library("ggplot2")
```

To run `bandle`

there are a few minimal requirements that the data must fulfill.

- the same number of channels across conditions and replicates
- the same proteins across conditions and replicates
- data must be a
`list`

of`MSnSet`

instances

If we use the `dim`

function we see that the datasets we have loaded have the
same number of channels but a different number of proteins per experiment.

`dim(thpLOPIT_unstimulated_rep1_mulvey2021)`

`## [1] 5107 20`

`dim(thpLOPIT_unstimulated_rep3_mulvey2021)`

`## [1] 5733 20`

`dim(thpLOPIT_lps_rep1_mulvey2021)`

`## [1] 4879 20`

`dim(thpLOPIT_lps_rep3_mulvey2021)`

`## [1] 5848 20`

We use the function `commonFeatureNames`

to extract proteins that are common
across all replicates. This function has a nice side effect which is that it
also wraps the data into a `list`

, ready for input into `bandle`

.

```
thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021, ## unstimulated rep
thpLOPIT_unstimulated_rep3_mulvey2021, ## unstimulated rep
thpLOPIT_lps_rep1_mulvey2021, ## 12h-LPS rep
thpLOPIT_lps_rep3_mulvey2021)) ## 12h-LPS rep
```

`## 3727 features in common`

We now have our list of `MSnSet`

s ready for `bandle`

with 3727 proteins common
across all 4 replicates/conditions.

`thplopit`

`## Instance of class 'MSnSetList' containig 4 objects.`

We can visualise the data using the `plot2D`

function from `pRoloc`

```
## create a character vector of title names for the plots
plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep",
"12h-LPS 1st rep", "12h-LPS 2nd rep")
## Let's set the stock colours of the classes to plot to be transparent
setStockcol(NULL)
setStockcol(paste0(getStockcol(), "90"))
## plot the data
par(mfrow = c(2,2))
for (i in seq(thplopit))
plot2D(thplopit[[i]], main = plot_id[i])
addLegend(thplopit[[4]], where = "topleft", cex = .75)
```

By default the `plot2D`

uses principal components analysis (PCA)
for the data transformation. Other options such as t-SNE, kernal
PCA etc. are also available, see `?plot2D`

and the `method`

argument.
PCA sometimes will randomly flip the axis, because the eigenvectors
only need to satisfy \(||v|| = 1\), which allows a sign flip. You will
notice this is the case for the 3rd plot. If desired you can flip
the axis/change the sign of the PCs by specifying any of the arguments
`mirrorX`

, `mirrorY`

, `axsSwitch`

to TRUE when you call `plot2D`

.

`bandle`

: fitting GPs and setting the priorsAs mentioned in the first vignette, `bandle`

uses a complex model to analyse the
data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior
distribution of parameters and latent variables from which statistics of
interest can be computed. Again, here we only run a few iterations for brevity
but typically one needs to run thousands of iterations to ensure convergence, as
well as multiple parallel chains.

First, we need to fit non-parametric regression functions to the markers
profiles. We use the `fitGPmaternPC`

function using the default penalised
complexity priors (see `?fitGP`

), which work well.

`gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))`

We apply the `fitGPmaternPC`

function on to each dataset by using `lapply`

over
the `thplopit`

list of data. The posterior predictive means, standard deviations
and MAP hyperparamters for the GP are returned. If desired we can visualise the
predictives overlaid onto the marker profiles of the data by using the `plotGPmatern`

function.

The prior needs to form a `K*3`

matrix (where `K`

is the number of subcellular
classes in the data),

`(mrkCl <- getMarkerClasses(thplopit[[1]], fcol = "markers"))`

```
## [1] "40S/60S Ribosome" "Chromatin" "Cytosol"
## [4] "Endoplasmic Reticulum" "Golgi Apparatus" "Lysosome"
## [7] "Mitochondria" "Nucleolus" "Nucleus"
## [10] "Peroxisome" "Plasma Membrane"
```

So for this data we require a `11*3`

matrix. Three columns are needed which
represent the hyperparameters length-scale, amplitude, variance. We have found
that the `matrix(c(10, 60, 250), nrow = 1)`

worked well for the smaller datasets
with a few hundred proteins, as tested in Crook et al. (2021). Here, we found that
`matrix(c(1, 60, 100)`

worked well. This is a bigger dataset with several
thousand proteins and many more subcellular classes. This was visually assessed
by passing these values and visualising the GP fit using the `plotGPmatern`

function. Generally, (1) increasing the lengthscale parameter (the first column
of the hyppar matrix) increases the spread of the covariance i.e. the similarity
between points, (2) increasing the amplitude parameter (the second column of the
hyppar matrix) increases the maximum value of the covariance and lastly (3)
decreasing the variance (third column of the hyppar matrix) reduces the
smoothness of the function to allow for local variations. We strongly recommend
users start with the recommended parameters and change and assess them as
necessary for their dataset by visually evaluating the fit of the GPs using the
`plotGPmatern`

function.

```
K <- length(mrkCl)
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(1, 60, 100),
each = K), ncol = 3)
head(pc_prior)
```

```
## [,1] [,2] [,3]
## [1,] 1 60 100
## [2,] 1 60 100
## [3,] 1 60 100
## [4,] 1 60 100
## [5,] 1 60 100
## [6,] 1 60 100
```

Now we have generated these complexity priors we can pass them as an
argument to the `fitGPmaternPC`

function. For example,

```
gpParams <- lapply(thplopit,
function(x) fitGPmaternPC(x, hyppar = pc_prior))
```

By plotting the predictives using the `plotGPmatern`

function we see that
the distributions and fit looks sensible for each class so we will proceed with
setting the prior on the weights.

```
par(mfrow = c(4, 3))
plotGPmatern(thplopit[[1]], gpParams[[1]])
```