MSstats: Protein/Peptide significance analysis

Package: MSstats

Author: Meena Choi mnchoi67@gmail.com, Tsung-Heng Tsai tsai.tsungheng@gmail.com, Cyril Galitzine cyrildgg@gmail.com

Date: October 2, 2017

This vignette introduces the functionalities in MSstats and summarizes their options. More details are available in the User Manual.

SkylinetoMSstatsFormat

Preprocess the MSstats report from Skyline and convert it into the required input format for MSstats.

Arguments

Example

# 'MSstatsInput.csv' is the MSstats report from Skyline.
input <- read.csv(file="MSstatsInput.csv")

raw <- SkylinetoMSstatsFormat(input)

MaxQtoMSstatsFormat

Convert MaxQuant output into the required input format for MSstats.

Arguments

Example

# Read in MaxQuant files
proteinGroups <- read.table("proteinGroups.txt", sep="\t", header=TRUE)

infile <- read.table("evidence.txt", sep="\t", header=TRUE)

# Read in annotation including condition and biological replicates per run.
# Users should make this annotation file. It is not the output from MaxQuant.
annot <- read.csv("annotation.csv", header=TRUE)

raw <- MaxQtoMSstatsFormat(evidence=infile, 
                           annotation=annot, 
                           proteinGroups=proteinGroups)

ProgenesistoMSstatsFormat

Convert Progenesis output into the required input format for MSstats.

Arguments

Example

input <- read.csv("output_progenesis.csv", stringsAsFactors=FALSE)

# Read in annotation including condition and biological replicates per run.
# Users should make this annotation file. It is not the output from Progenesis.
annot <- read.csv("annotation.csv")

raw <- ProgenesistoMSstatsFormat(input, annotation=annot)

SpectronauttoMSstatsFormat

Convert Spectronaut output into the required input format for MSstats.

Arguments

Example

input <- read.csv("output_spectronaut.csv", stringsAsFactors=FALSE)

quant <- SpectronauttoMSstatsFormat(input)

dataProcess

Data pre-processing and quality control of MS runs, converting the original raw data into quantitative data for model fitting and group comparison. Log transformation is automatically applied, and additional columns are created for the model fitting and group comparison steps. The options for data pre-processing and quality control of MS runs in dataProcess include:

- Normalization: to remove systematic bias between MS runs.

Arguments

Example

QuantData <- dataProcess(SRMRawData)
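The normalization behavior can be adjusted through the normalization argument of dataProcess; a minimal sketch, assuming it accepts "equalizeMedians" (the default), "quantile", "globalStandards", or FALSE:

```r
# Illustration: choose the between-run normalization strategy.
# "equalizeMedians" (the default) equalizes median log-intensities across runs;
# FALSE skips normalization entirely.
QuantData.median <- dataProcess(SRMRawData, normalization="equalizeMedians")
QuantData.none   <- dataProcess(SRMRawData, normalization=FALSE)
```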

dataProcessPlots

Visualization for exploratory data analysis. To illustrate the quantitative data after data pre-processing and quality control of MS runs, dataProcessPlots takes the quantitative data from function dataProcess as input and automatically generates three types of figures as pdf files:

Arguments

Example

QuantData <- dataProcess(SRMRawData)

# Profile plot
dataProcessPlots(data=QuantData, type="ProfilePlot")

# Quality control plot 
dataProcessPlots(data=QuantData, type="QCPlot") 

# Quantification plot for conditions
dataProcessPlots(data=QuantData, type="ConditionPlot")
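Plots for a single protein can also be drawn; a sketch, assuming the which.Protein and address arguments (which.Protein selects proteins by name or index, address sets the prefix of the output pdf file), and that "PMG2" is a protein present in SRMRawData:

```r
# Illustration: profile plot for one protein only,
# writing the pdf with a custom file-name prefix.
dataProcessPlots(data=QuantData, type="ProfilePlot",
                 which.Protein="PMG2", address="PMG2_")
```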

groupComparison

Tests for significant changes in protein abundance across conditions based on a family of linear mixed-effects models, in targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiments. It is applicable to multiple types of sample preparation, including label-free workflows, workflows that use stable isotope labeled reference proteins and peptides, and workflows that use fractionation. The experimental design, either a case-control study (patients are not repeatedly measured) or a time course study (patients are repeatedly measured), is automatically detected, and the proper statistical model is applied.

Arguments

Example

QuantData <- dataProcess(SRMRawData)

levels(QuantData$ProcessedData$GROUP_ORIGINAL)
comparison <- matrix(c(-1,0,0,0,0,0,1,0,0,0), nrow=1)
row.names(comparison) <- "T7-T1"

# Test for differentially abundant proteins between conditions T7 and T1
testResultOneComparison <- groupComparison(contrast.matrix=comparison, data=QuantData)
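The testing results can then be inspected; a sketch of accessing the result table, whose columns include the log2 fold change, p-value, and adjusted p-value per protein:

```r
# The ComparisonResult element holds one row per protein and comparison,
# with columns such as log2FC, pvalue, and adj.pvalue.
head(testResultOneComparison$ComparisonResult)

# Proteins significant at an adjusted p-value below 0.05 (illustrative cutoff)
result <- testResultOneComparison$ComparisonResult
result[result$adj.pvalue < 0.05, ]
```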

groupComparisonPlots

Visualization for model-based analysis and summarization of differentially abundant proteins. To summarize the results of log-fold changes and adjusted p-values for differentially abundant proteins, groupComparisonPlots takes the testing results from function groupComparison as input and automatically generates three types of figures as pdf files:

Arguments

Example

QuantData <- dataProcess(SRMRawData)

# based on multiple comparisons  (T1 vs T3; T1 vs T7; T1 vs T9)
comparison1 <- matrix(c(-1,0,1,0,0,0,0,0,0,0), nrow=1)
comparison2 <- matrix(c(-1,0,0,0,0,0,1,0,0,0), nrow=1)
comparison3 <- matrix(c(-1,0,0,0,0,0,0,0,1,0), nrow=1)
comparison <- rbind(comparison1, comparison2, comparison3)
row.names(comparison) <- c("T3-T1","T7-T1","T9-T1")

testResultMultiComparisons <- groupComparison(contrast.matrix=comparison, data=QuantData)

# Volcano plot 
groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="VolcanoPlot")

# Heatmap 
groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="Heatmap")

# Comparison Plot
groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="ComparisonPlot")
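The significance and fold-change cutoffs shown on the volcano plot can be adjusted; a sketch, assuming the sig and FCcutoff arguments (adjusted p-value cutoff and fold-change cutoff, respectively):

```r
# Illustration: flag proteins with adjusted p-value < 0.05
# and fold change beyond 1.5 on the volcano plot.
groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult,
                     type="VolcanoPlot", sig=0.05, FCcutoff=1.5)
```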

modelBasedQCPlots

Results based on statistical models for whole plot level inference are accurate as long as the assumptions of the model are met. The model assumes that the measurement errors are normally distributed with mean 0 and constant variance. The assumption of a constant variance can be checked by examining the residuals from the model.

To check the assumptions of the linear model for whole plot inference, modelBasedQCPlots takes the results after model fitting from function groupComparison as input and automatically generates two types of figures as pdf files.

Arguments

Example

testResultOneComparison <- groupComparison(contrast.matrix=comparison, data=QuantData)

# normal quantile-quantile plots
modelBasedQCPlots(data=testResultOneComparison, type="QQPlots")

# residual plots
modelBasedQCPlots(data=testResultOneComparison, type="ResidualPlots")

designSampleSize

Calculate the sample size for future Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiments based on an intensity-based linear model. The function fits the model and uses the variance components to calculate the sample size. The underlying model is an intensity-based linear model with technical MS run replication. The estimated sample size is rounded to the nearest integer. Two calculations are available: the minimal number of biological replicates per condition for a desired power, or the power for a given number of samples.

Arguments

Example

QuantData <- dataProcess(SRMRawData)
head(QuantData$ProcessedData)

## based on multiple comparisons  (T1 vs T3; T1 vs T7; T1 vs T9)
comparison1 <- matrix(c(-1,0,1,0,0,0,0,0,0,0),nrow=1)
comparison2 <- matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1)
comparison3 <- matrix(c(-1,0,0,0,0,0,0,0,1,0),nrow=1)
comparison <- rbind(comparison1,comparison2, comparison3)
row.names(comparison) <- c("T3-T1","T7-T1","T9-T1")

testResultMultiComparisons <- groupComparison(contrast.matrix=comparison,data=QuantData)

# (1) Minimal number of biological replicates per condition
designSampleSize(data=testResultMultiComparisons$fittedmodel, numSample=TRUE,
  desiredFC=c(1.25,1.75), FDR=0.05, power=0.8)

# (2) Power calculation
designSampleSize(data=testResultMultiComparisons$fittedmodel, numSample=2,
  desiredFC=c(1.25,1.75), FDR=0.05, power=TRUE)

designSampleSizePlots

To illustrate the relationship between the desired fold change and either the calculated minimal number of biological replicates per condition or the power.

The input is the result from function designSampleSize.

Arguments

Example

# (1) Minimal number of biological replicates per condition
result.sample <- designSampleSize(data=testResultMultiComparisons$fittedmodel, numSample=TRUE,
                                desiredFC=c(1.25,1.75), FDR=0.05, power=0.8)
designSampleSizePlots(data=result.sample)

# (2) Power
result.power <- designSampleSize(data=testResultMultiComparisons$fittedmodel, numSample=2,
                               desiredFC=c(1.25,1.75), FDR=0.05, power=TRUE)
designSampleSizePlots(data=result.power)

designSampleSizeClassification

For classification problems (such as disease diagnosis), simulation is used to estimate the optimal sample size of the training data under different sizes of validation data, in Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiments. The function fits an intensity-based linear model and uses the variance components and mean abundances to simulate new data with different sample sizes. Training data and validation data are simulated independently. A random forest model is fitted on the training data and used to predict the validation data. This procedure is repeated several times. The mean predictive accuracy and its variance under different sizes of training and validation data are reported. For each size of validation data, the number of training samples with the best predictive accuracy is also returned.

Arguments

Example

# Consider quantitative data (i.e., QuantData) from a yeast study:
# a time course study with ten time points of interest and three biological replicates.
QuantSRM <- dataProcess(SRMRawData)
# Estimate the mean predictive accuracy under different sizes of training data and validation data.
# n is the number of biological replicates per condition
result.srm <- designSampleSizeClassification(data=QuantSRM, n=5, step=10)
result.srm$meanPA # mean predictive accuracy

# Consider another training set from a colorectal cancer study
# Subjects are from control group or colorectal cancer group
# 72 proteins were targeted with SRM
require(MSstatsBioData)
set.seed(1235)
data(SRM_crc_training)
QuantCRCSRM <- dataProcess(SRM_crc_training)
# Estimate the mean predictive accuracy under different sizes of training data and validation data.
# n is the number of biological replicates per condition
crc.srm <- designSampleSizeClassification(data=QuantCRCSRM, n=10, step=10)
crc.srm$meanPA # mean predictive accuracy

designSampleSizeClassificationPlots

To illustrate the mean classification accuracy under different sizes of training and validation data. The input is the result from function designSampleSizeClassification.

Arguments

Example

# Consider another training set from a colorectal cancer study
# Subjects are from control group or colorectal cancer group
# 72 proteins were targeted with SRM
require(MSstatsBioData)
set.seed(1235)
data(SRM_crc_training)
QuantCRCSRM <- dataProcess(SRM_crc_training)
crc.srm <- designSampleSizeClassification(data=QuantCRCSRM, n=10, step=10)
designSampleSizeClassificationPlots(data=crc.srm)

quantification

Model-based quantification for each condition or for each biological sample per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), or Data-Independent Acquisition (DIA or SWATH-MS) experiment. quantification takes the data set processed by dataProcess as input and automatically generates the quantification results (a data.frame) in long or matrix format. The quantification for endogenous samples is based on run summarization from the subplot model, with Tukey's median polish (TMP) robust estimation.

Arguments

Example

QuantData <- dataProcess(SRMRawData)

# Sample quantification
sampleQuant <- quantification(QuantData)

# Group quantification
groupQuant <- quantification(QuantData, type="Group")
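The output layout can be switched between long and matrix form; a sketch, assuming the format argument accepts "matrix":

```r
# Illustration: sample-level quantification as a wide matrix,
# with one row per protein and one column per biological sample.
sampleQuantMatrix <- quantification(QuantData, type="Sample", format="matrix")
```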

Assay characterization

nonlinear_quantlim

This function calculates the value of the LOB (limit of blank) and LOD (limit of detection) from (Concentration, Intensity) spiked-in data. It should be used instead of the linear function whenever a significant threshold is present at low concentrations. Such a threshold is characterized by a signal that is dominated by noise, where the mean intensity is constant and independent of concentration. The function also returns the values of the nonlinear curve fit so that it can be plotted. At least 2 blank samples (characterized by Intensity = 0) are required by this function; they are used to calculate the background noise. The LOB is defined as the concentration at which the value of the nonlinear fit is equal to the 95% upper bound of the noise. The LOD is the concentration at which the latter is equal to the 90% lower bound (5% quantile) of the prediction interval of the nonlinear fit. A weighted nonlinear fit is used, with weights for every unique concentration proportional to the inverse of the variance between replicates. The details behind the calculation of the nonlinear fit can be found in the Reference.

The output is a data frame that contains the following columns:

The LOB and LOD can only be calculated when more than 2 blank samples are included. The data should ideally be plotted using the companion function plot_quantlim to ensure that the fit is suited to the data.

Arguments

Example

# Consider spiked-in data contained in an example dataset
head(SpikeInDataNonLinear)

nonlinear_quantlim_out <- nonlinear_quantlim(SpikeInDataNonLinear)

# Get values of LOB/LOD
nonlinear_quantlim_out$LOB[1]
nonlinear_quantlim_out$LOD[1]

linear_quantlim

This function calculates the value of the LOB (limit of blank) and LOD (limit of detection) from (Concentration, Intensity) spiked-in data. The function also returns the values of the linear curve fit so that it can be plotted. At least 2 blank samples (characterized by Intensity = 0) are required by this function; they are used to calculate the background noise. The LOB is defined as the concentration at which the value of the linear fit is equal to the 95% upper bound of the noise. The LOD is the concentration at which the latter is equal to the 90% lower bound (5% quantile) of the prediction interval of the linear fit. A weighted linear fit is used, with weights for every unique concentration proportional to the inverse of the variance between replicates.

The output is a data frame that contains the following columns:

The LOB and LOD can only be calculated when more than 2 blank samples are included. The data should ideally be plotted using the companion function plot_quantlim to ensure that a linear fit is suited to the data.

Arguments

Example

# Consider spiked-in data contained in an example dataset
head(SpikeInDataLinear)

linear_quantlim_out <- linear_quantlim(SpikeInDataLinear)

# Get values of LOB/LOD
linear_quantlim_out$LOB[1]
linear_quantlim_out$LOD[1]

plot_quantlim

This function plots the curve fit that is used to calculate the LOB and LOD by functions nonlinear_quantlim and linear_quantlim. For each calibration curve, the function outputs two pdf files, each containing one plot. The first, designated by *_overall.pdf, plots the entire concentration range. The second, designated by *_zoom.pdf, plots the concentration range between 0 and xlim_plot (if specified in the function arguments). When no xlim_plot value is specified, the region close to the LOB and LOD is plotted automatically.

This plotting function should ideally be used every time nonlinear_quantlim or linear_quantlim are called to visually ensure that the fits and data are accurate.

Arguments

Example

# Consider spiked-in data contained in an example dataset
head(SpikeInDataNonLinear)

nonlinear_quantlim_out <- nonlinear_quantlim(SpikeInDataNonLinear, alpha = 0.05)

# Get values of LOB/LOD
nonlinear_quantlim_out$LOB[1]
nonlinear_quantlim_out$LOD[1]

plot_quantlim(spikeindata = SpikeInDataNonLinear, quantlim_out = nonlinear_quantlim_out,
              dir_output = getwd(), alpha = 0.05)