MSnSets. There is also features for quality control and visualisation of results. Please see other vignettes for convergence and other methodology.
In this vignette we use a real-life biological use-case to demonstrate how to analyse mass-spectrometry based proteomics data using the Bayesian ANalysis of Differential Localisation Experiments (BANDLE) method.
As mentioned in “Vignette 1: Getting Started with BANDLE” data from mass
spectrometry based proteomics methods most commonly yield a matrix of
measurements where we have proteins/peptides/peptide spectrum matches
(PSMs) along the rows, and samples/fractions along the columns. To use
the data must be stored as a
MSnSet, as implemented in the Bioconductor
MSnbase package. Please see the relevant vignettes in
MSnbase for constructing these data containers.
The data used in this vignette has been published in Mulvey et al. (2021) and is currently
MSnSet instances in the the pRolocdata package. We
will load it in the next section.
In this workflow we analyse the data produced by Mulvey et al. (2021). In this experiment triplicate hyperLOPIT experiments (Mulvey et al. 2017) were conducted on THP-1 human leukaemia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).
In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples. Please note to adhere to Bioconductor vignette build times we only load 2 of the 3 replicates for each condition to demonstrate the BANDLE workflow.
library("pRolocdata") data("thpLOPIT_unstimulated_rep1_mulvey2021") data("thpLOPIT_unstimulated_rep3_mulvey2021") data("thpLOPIT_lps_rep1_mulvey2021") data("thpLOPIT_lps_rep3_mulvey2021")
By typing the names of the datasets we get a
MSnSet data summary. For
## MSnSet (storageMode: lockedEnvironment) ## assayData: 5107 features, 20 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ... ## unstim_rep1_set2_131_F24 (20 total) ## varLabels: Tag Treatment ... Fraction (5 total) ## varMetadata: labelDescription ## featureData ## featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total) ## fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107 ## total) ## fvarMetadata: labelDescription ## experimentData: use 'experimentData(object)' ## Annotation: ## - - - Processing information - - - ## Loaded on Tue Jan 12 14:46:48 2021. ## Normalised to sum of intensities. ## MSnbase version: 2.14.2
## MSnSet (storageMode: lockedEnvironment) ## assayData: 4879 features, 20 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ... ## lps_rep1_set2_131_F25 (20 total) ## varLabels: Tag Treatment ... Fraction (5 total) ## varMetadata: labelDescription ## featureData ## featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total) ## fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107 ## total) ## fvarMetadata: labelDescription ## experimentData: use 'experimentData(object)' ## Annotation: ## - - - Processing information - - - ## Loaded on Tue Jan 12 14:46:57 2021. ## Normalised to sum of intensities. ## MSnbase version: 2.14.2
We see that the datasets
thpLOPIT_lps_rep1_mulvey2021 contain 5107 and 4879 proteins respectively,
across 20 TMT channels. The data is accessed through different slots of the
str(thpLOPIT_unstimulated_rep1_mulvey2021) for all available
slots). The 3 main slots which are used most frequently are those that contain
the quantitation data, the features i.e. PSM/peptide/protein information and the
sample information, and these can be accessed using the functions
First, let us load the
bandle package along with some other R packages needed
for visualisation and data manipulation,
library("bandle") library("pheatmap") library("viridis") library("dplyr") library("ggplot2")
bandle there are a few minimal requirements that the data must fulfill.
If we use the
dim function we see that the datasets we have loaded have the
same number of channels but a different number of proteins per experiment.
##  5107 20
##  5733 20
##  4879 20
##  5848 20
We use the function
commonFeatureNames to extract proteins that are common
across all replicates. This function has a nice side effect which is that it
also wraps the data into a
list, ready for input into
thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021, ## unstimulated rep thpLOPIT_unstimulated_rep3_mulvey2021, ## unstimulated rep thpLOPIT_lps_rep1_mulvey2021, ## 12h-LPS rep thpLOPIT_lps_rep3_mulvey2021)) ## 12h-LPS rep
## 3727 features in common
We now have our list of
MSnSets ready for
bandle with 3727 proteins common
across all 4 replicates/conditions.
## Instance of class 'MSnSetList' containig 4 objects.
We can visualise the data using the
plot2D function from
## create a character vector of title names for the plots plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep", "12h-LPS 1st rep", "12h-LPS 2nd rep") ## Let's set the stock colours of the classes to plot to be transparent setStockcol(NULL) setStockcol(paste0(getStockcol(), "90")) ## plot the data par(mfrow = c(2,2)) for (i in seq(thplopit)) plot2D(thplopit[[i]], main = plot_id[i]) addLegend(thplopit[], where = "topleft", cex = .75)
By default the
plot2D uses principal components analysis (PCA)
for the data transformation. Other options such as t-SNE, kernal
PCA etc. are also available, see
?plot2D and the
PCA sometimes will randomly flip the axis, because the eigenvectors
only need to satisfy \(||v|| = 1\), which allows a sign flip. You will
notice this is the case for the 3rd plot. If desired you can flip
the axis/change the sign of the PCs by specifying any of the arguments
axsSwitch to TRUE when you call
bandle: fitting GPs and setting the priors
As mentioned in the first vignette,
bandle uses a complex model to analyse the
data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior
distribution of parameters and latent variables from which statistics of
interest can be computed. Again, here we only run a few iterations for brevity
but typically one needs to run thousands of iterations to ensure convergence, as
well as multiple parallel chains.
First, we need to fit non-parametric regression functions to the markers
profiles. We use the
fitGPmaternPC function using the default penalised
complexity priors (see
?fitGP), which work well.
gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))
We apply the
fitGPmaternPC function on to each dataset by using
thplopit list of data. The posterior predictive means, standard deviations
and MAP hyperparamters for the GP are returned. If desired we can visualise the
predictives overlaid onto the marker profiles of the data by using the
The prior needs to form a
K*3 matrix (where
K is the number of subcellular
classes in the data),
(mrkCl <- getMarkerClasses(thplopit[], fcol = "markers"))
##  "40S/60S Ribosome" "Chromatin" "Cytosol" ##  "Endoplasmic Reticulum" "Golgi Apparatus" "Lysosome" ##  "Mitochondria" "Nucleolus" "Nucleus" ##  "Peroxisome" "Plasma Membrane"
So for this data we require a
11*3 matrix. Three columns are needed which
represent the hyperparameters length-scale, amplitude, variance. We have found
matrix(c(10, 60, 250), nrow = 1) worked well for the smaller datasets
with a few hundred proteins, as tested in Crook et al. (2021). Here, we found that
matrix(c(1, 60, 100) worked well. This is a bigger dataset with several
thousand proteins and many more subcellular classes. This was visually assessed
by passing these values and visualising the GP fit using the
function. Generally, (1) increasing the lengthscale parameter (the first column
of the hyppar matrix) increases the spread of the covariance i.e. the similarity
between points, (2) increasing the amplitude parameter (the second column of the
hyppar matrix) increases the maximum value of the covariance and lastly (3)
decreasing the variance (third column of the hyppar matrix) reduces the
smoothness of the function to allow for local variations. We strongly recommend
users start with the recommended parameters and change and assess them as
necessary for their dataset by visually evaluating the fit of the GPs using the
K <- length(mrkCl) pc_prior <- matrix(NA, ncol = 3, K) pc_prior[seq.int(1:K), ] <- matrix(rep(c(1, 60, 100), each = K), ncol = 3) head(pc_prior)
## [,1] [,2] [,3] ## [1,] 1 60 100 ## [2,] 1 60 100 ## [3,] 1 60 100 ## [4,] 1 60 100 ## [5,] 1 60 100 ## [6,] 1 60 100
Now we have generated these complexity priors we can pass them as an
argument to the
fitGPmaternPC function. For example,
gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x, hyppar = pc_prior))
By plotting the predictives using the
plotGPmatern function we see that
the distributions and fit looks sensible for each class so we will proceed with
setting the prior on the weights.
par(mfrow = c(4, 3)) plotGPmatern(thplopit[], gpParams[])