1 Introduction

In this vignette we use a real-life biological use-case to demonstrate how to analyse mass-spectrometry based proteomics data using the Bayesian ANalysis of Differential Localisation Experiments (BANDLE) method.

2 The data

As mentioned in “Vignette 1: Getting Started with BANDLE” data from mass spectrometry based proteomics methods most commonly yield a matrix of measurements where we have proteins/peptides/peptide spectrum matches (PSMs) along the rows, and samples/fractions along the columns. To use bandle the data must be stored as a MSnSet, as implemented in the Bioconductor MSnbase package. Please see the relevant vignettes in MSnbase for constructing these data containers.

The data used in this vignette has been published in Mulvey et al. (2021) and is currently stored as MSnSet instances in the the pRolocdata package. We will load it in the next section.

2.1 Spatialtemporal proteomic profiling of a THP-1 cell line

In this workflow we analyse the data produced by Mulvey et al. (2021). In this experiment triplicate hyperLOPIT experiments (Mulvey et al. 2017) were conducted on THP-1 human leukaemia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).

In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples. Please note to adhere to Bioconductor vignette build times we only load 2 of the 3 replicates for each condition to demonstrate the BANDLE workflow.


By typing the names of the datasets we get a MSnSet data summary. For example,

## MSnSet (storageMode: lockedEnvironment)
## assayData: 5107 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ...
##     unstim_rep1_set2_131_F24 (20 total)
##   varLabels: Tag Treatment ... Fraction (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total)
##   fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:48 2021. 
## Normalised to sum of intensities. 
##  MSnbase version: 2.14.2
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4879 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ...
##     lps_rep1_set2_131_F25 (20 total)
##   varLabels: Tag Treatment ... Fraction (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total)
##   fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:57 2021. 
## Normalised to sum of intensities. 
##  MSnbase version: 2.14.2

We see that the datasets thpLOPIT_unstimulated_rep1_mulvey2021 and thpLOPIT_lps_rep1_mulvey2021 contain 5107 and 4879 proteins respectively, across 20 TMT channels. The data is accessed through different slots of the MSnSet (see str(thpLOPIT_unstimulated_rep1_mulvey2021) for all available slots). The 3 main slots which are used most frequently are those that contain the quantitation data, the features i.e. PSM/peptide/protein information and the sample information, and these can be accessed using the functions exprs, fData, and pData, respectively.

2.2 Preparing the data

First, let us load the bandle package along with some other R packages needed for visualisation and data manipulation,


To run bandle there are a few minimal requirements that the data must fulfill.

  • the same number of channels across conditions and replicates
  • the same proteins across conditions and replicates
  • data must be a list of MSnSet instances

If we use the dim function we see that the datasets we have loaded have the same number of channels but a different number of proteins per experiment.

## [1] 5107   20
## [1] 5733   20
## [1] 4879   20
## [1] 5848   20

We use the function commonFeatureNames to extract proteins that are common across all replicates. This function has a nice side effect which is that it also wraps the data into a list, ready for input into bandle.

thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021,  ## unstimulated rep
                                 thpLOPIT_unstimulated_rep3_mulvey2021,  ## unstimulated rep
                                 thpLOPIT_lps_rep1_mulvey2021,           ## 12h-LPS rep
                                 thpLOPIT_lps_rep3_mulvey2021))          ## 12h-LPS rep
## 3727 features in common

We now have our list of MSnSets ready for bandle with 3727 proteins common across all 4 replicates/conditions.

## Instance of class 'MSnSetList' containig 4 objects.

We can visualise the data using the plot2D function from pRoloc

## create a character vector of title names for the plots
plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep",
             "12h-LPS 1st rep", "12h-LPS 2nd rep")

## Let's set the stock colours of the classes to plot to be transparent
setStockcol(paste0(getStockcol(), "90"))

## plot the data
par(mfrow = c(2,2))
for (i in seq(thplopit))
    plot2D(thplopit[[i]], main = plot_id[i])
addLegend(thplopit[[4]], where = "topleft", cex = .75)

By default the plot2D uses principal components analysis (PCA) for the data transformation. Other options such as t-SNE, kernal PCA etc. are also available, see ?plot2D and the method argument. PCA sometimes will randomly flip the axis, because the eigenvectors only need to satisfy \(||v|| = 1\), which allows a sign flip. You will notice this is the case for the 3rd plot. If desired you can flip the axis/change the sign of the PCs by specifying any of the arguments mirrorX, mirrorY, axsSwitch to TRUE when you call plot2D.

3 Preparing bandle: fitting GPs and setting the priors

As mentioned in the first vignette, bandle uses a complex model to analyse the data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior distribution of parameters and latent variables from which statistics of interest can be computed. Again, here we only run a few iterations for brevity but typically one needs to run thousands of iterations to ensure convergence, as well as multiple parallel chains.

3.1 Fitting Gaussian processes

First, we need to fit non-parametric regression functions to the markers profiles. We use the fitGPmaternPC function using the default penalised complexity priors (see ?fitGP), which work well.

gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))

We apply the fitGPmaternPC function on to each dataset by using lapply over the thplopit list of data. The posterior predictive means, standard deviations and MAP hyperparamters for the GP are returned. If desired we can visualise the predictives overlaid onto the marker profiles of the data by using the plotGPmatern function.

The prior needs to form a K*3 matrix (where K is the number of subcellular classes in the data),

(mrkCl <- getMarkerClasses(thplopit[[1]], fcol = "markers"))
##  [1] "40S/60S Ribosome"      "Chromatin"             "Cytosol"              
##  [4] "Endoplasmic Reticulum" "Golgi Apparatus"       "Lysosome"             
##  [7] "Mitochondria"          "Nucleolus"             "Nucleus"              
## [10] "Peroxisome"            "Plasma Membrane"

So for this data we require a 11*3 matrix. Three columns are needed which represent the hyperparameters length-scale, amplitude, variance. We have found that the matrix(c(10, 60, 250), nrow = 1) worked well for the smaller datasets with a few hundred proteins, as tested in Crook et al. (2021). Here, we found that matrix(c(1, 60, 100) worked well. This is a bigger dataset with several thousand proteins and many more subcellular classes. This was visually assessed by passing these values and visualising the GP fit using the plotGPmatern function. Generally, (1) increasing the lengthscale parameter (the first column of the hyppar matrix) increases the spread of the covariance i.e. the similarity between points, (2) increasing the amplitude parameter (the second column of the hyppar matrix) increases the maximum value of the covariance and lastly (3) decreasing the variance (third column of the hyppar matrix) reduces the smoothness of the function to allow for local variations. We strongly recommend users start with the recommended parameters and change and assess them as necessary for their dataset by visually evaluating the fit of the GPs using the plotGPmatern function.

K <- length(mrkCl)
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(1, 60, 100),
                                       each = K), ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1   60  100
## [2,]    1   60  100
## [3,]    1   60  100
## [4,]    1   60  100
## [5,]    1   60  100
## [6,]    1   60  100

Now we have generated these complexity priors we can pass them as an argument to the fitGPmaternPC function. For example,

gpParams <- lapply(thplopit,
                   function(x) fitGPmaternPC(x, hyppar = pc_prior))

By plotting the predictives using the plotGPmatern function we see that the distributions and fit looks sensible for each class so we will proceed with setting the prior on the weights.

par(mfrow = c(4, 3))
plotGPmatern(thplopit[[1]], gpParams[[1]])