Author: Devon Kohler
MSstats+
is an extension to the MSstats
R package in Bioconductor. It is
designed to incorporate the underlying spectral peak quality into differential
analysis, going beyond peak intensity alone. It integrates the underlying
spectral peaks using an isolation forest (iForest) for anomaly detection, which
integrates all the peak quality metrics into a single score. This score is then
used a weight in protein-level summarization. After
protein-level summarization, the estimates of underlying protein abundance and
their associated variance are used as weights in a second weighted
linear model which is used to perform differential analysis. The models can be
applied to experiments with simple group comparison designs, or more complex designs
(e.g. comparing more than two experimental conditions, or a repeated measure
design, such as a time course).
The algorithm is designed to be used with data acquired using data independent
acquisition (DIA). Other acquisitions are possible but have yet to be tested.
For guidance on how to use MSstats+
with other acquisitions, please reach out
on the MSstats
Google Group: \url{https://groups.google.com/g/msstats}.
Below we outline the workflow for using MSstats+
, highlighting key options
and decisions.
library(MSstats)
library(MSstatsConvert)
library(data.table)
library(ggplot2)
The main iForest algorithm for MSstats+
is implemented in the MSstatsConvert
package (a dependency of MSstats
). The MSstatsConvert
package contains
individual converters for each spectral processing tool. Currently, MSstats+
only has a dedicated converter for Spectronaut
, however more converters are
in development and can be implemented manually (see below), For integration with
other tools, please reach out on the MSstats
Google Group: \url{https://groups.google.com/g/msstats}.
First we will load the Spectronaut report of identified and quantified peaks. It
is necessary that users output the spectral peak quality data along with
these peaks. These spectral peak quality metrics must be at least at the
precursor-level (i.e., the column headers must not start with F.
). There are
multiple options for columns that can be extracted, some
examples include: FG.ShapeQualityScore (MS1)
, FG.ShapeQualityScore (MS2)
,
and EG.DeltaRT
. Users can include as many quality metrics as they want,
however we recommend using only quality metrics that are key indicators of the
quality of a spectral peak. Including too many quality metrics may lead to
the iForest model fitting noise. We include additional functionality to assess the
performance of the model (discussed below).
An example output from Spectronaut is included in the
MSstatsConvert
package to provide an idea of what the input should look like.
In addition to the Spectronaut output, MSstats+
also requires an annotation
file which maps the MS run names to the corresponding experimental design. For
more details on creating an annotation file, please see the documentation in
standard MSstats
and the associated publications.
Finally, MSstats+
also requires the temporal order of MS run acquisition to
identify time-based drops in quality. This can be extracted from the Spectronaut
output as shown below.
# Spectronaut quant file
spectronaut_raw = system.file(
"tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv",
package = "MSstatsConvert")
spectronaut_raw = data.table::fread(spectronaut_raw)
# Annotation file
annotation = system.file(
"tinytest/raw_data/Spectronaut/annotation.csv",
package = "MSstatsConvert")
annotation = data.table::fread(annotation)
# Create temporal run ordering data.table
spectronaut_raw[, `R.Run Date (Formatted)` := as.POSIXct(
`R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")]
run_order = unique(spectronaut_raw[, c("R.Run Date (Formatted)",
"R.FileName")])
run_order = run_order[order(`R.Run Date (Formatted)`)]
run_order$Order = 1:nrow(run_order)
setnames(run_order, c("R.FileName"), c("Run"))
MSstats+
uses the same converter as base MSstats
. New parameters have been
added to indicate if a user wants to leverage the functionalities in MSstats+
.
Specifically, to use MSstats+
, users must indicate that the anomaly detection
algorithm should used by setting calculateAnomalyScores=TRUE
.
An example specification of the converter can be seen below, with key parameter options highlighted below the code.
msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat(
spectronaut_raw, annotation, intensity = 'PeakArea',
excludedFromQuantificationFilter = TRUE, # Pre-filtering
calculateAnomalyScores=TRUE, # Key parameter to use MSstats+
anomalyModelFeatures=c("FGShapeQualityScore(MS2)",
"FGShapeQualityScore(MS1)",
"EGDeltaRT"), # Quality metrics
anomalyModelFeatureTemporal=c(
"mean_decrease", "mean_decrease",
"dispersion_increase"), # Temporal direction of quality metrics
runOrder=run_order, # Temporal order of MS runs
n_trees=100,
max_depth="auto", # iForest parameter (depth of trees)
numberOfCores=1) # Cores to use for parallel processing (Mac or Linux only)
## INFO [2025-10-16 19:58:21] ** Raw data from Spectronaut imported successfully.
## INFO [2025-10-16 19:58:21] ** Raw data from Spectronaut cleaned successfully.
## INFO [2025-10-16 19:58:21] ** Using provided annotation.
## INFO [2025-10-16 19:58:21] ** Run labels were standardized to remove symbols such as '.' or '%'.
## INFO [2025-10-16 19:58:21] ** The following options are used:
## - Features will be defined by the columns: PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge
## - Shared peptides will be removed.
## - Proteins with single feature will not be removed.
## - Features with less than 3 measurements across runs will be removed.
## INFO [2025-10-16 19:58:21] ** Intensities with values of FExcludedFromQuantification equal to TRUE are replaced with NA
## INFO [2025-10-16 19:58:21] ** Features with all missing measurements across runs are removed.
## INFO [2025-10-16 19:58:21] ** Shared peptides are removed.
## INFO [2025-10-16 19:58:21] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
## INFO [2025-10-16 19:58:21] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:21] ** Run annotation merged with quantification data.
## INFO [2025-10-16 19:58:21] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:21] ** Fractionation handled.
## INFO [2025-10-16 19:58:21] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO [2025-10-16 19:58:22] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.
head(msstats_input)
## Run ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge
## <char> <char> <char> <int> <char> <int>
## 1: 20250102_Tulum_NeatK562_Seq1 Q9UFW8 AEFEEQNVR 2 y5 1
## 2: 20250102_Tulum_NeatK562_Seq1 Q9UFW8 AEFEEQNVR 2 y6 1
## 3: 20250102_Tulum_NeatK562_Seq1 Q9UFW8 AEFEEQNVR 2 y7 1
## 4: 20250102_Tulum_NeatK562_Seq2 Q9UFW8 AEFEEQNVR 2 y5 1
## 5: 20250102_Tulum_NeatK562_Seq2 Q9UFW8 AEFEEQNVR 2 y6 1
## 6: 20250102_Tulum_NeatK562_Seq2 Q9UFW8 AEFEEQNVR 2 y7 1
## IsotopeLabelType Condition BioReplicate Fraction Intensity AnomalyScores FGShapeQualityScore(MS2)
## <char> <char> <int> <int> <num> <num> <num>
## 1: L Condition1 31 1 79.12771 0.02482253 0.5149927
## 2: L Condition1 31 1 24.03648 0.02482253 0.5149927
## 3: L Condition1 31 1 128.56342 0.02482253 0.5149927
## 4: L Condition1 37 1 76.65041 0.01170805 0.5705628
## 5: L Condition1 37 1 80.78468 0.01170805 0.5705628
## 6: L Condition1 37 1 143.93866 0.01170805 0.5705628
## FGShapeQualityScore(MS1) EGDeltaRT FGShapeQualityScore(MS2).mean_decrease
## <num> <num> <num>
## 1: 0.5787569 -0.02157192 0
## 2: 0.5787569 -0.02157192 0
## 3: 0.5787569 -0.02157192 0
## 4: 0.3547189 -0.01764335 0
## 5: 0.3547189 -0.01764335 0
## 6: 0.3547189 -0.01764335 0
## FGShapeQualityScore(MS1).mean_decrease EGDeltaRT.dispersion_increase
## <num> <num>
## 1: 0 0
## 2: 0 0
## 3: 0 0
## 4: 0 0
## 5: 0 0
## 6: 0 0
Key parameters for MSstats+
:
calculateAnomalyScores
: Set to TRUE
to use the iForest algorithm to
calculate anomaly scores for each spectral peak.anomalyModelFeatures
: A character vector of the column names of the quality
metrics to be used in the iForest model. These must be at the precursor-level
or higher (i.e., the column headers must not start with F.
).anomalyModelFeatureTemporal
: A character vector of the same length as
anomalyModelFeatures
indicating the temporal direction of each quality
metric.9 Possible values are mean_decrease
, mean_increase
, and
dispersion_increase
. These indicate the direction a metric is changes which
indicates a drop in quality (i.e., decrease, increase, or can go in either
direction). For example, EG.DeltaRT
uses dispersion_increase
because a
deviation in expected retention can be either higher or lower.runOrder
: A data.table with two columns: R.FileName
and Order
. The
R.FileName
column should contain the MS run names as they appear in the
annotation file, and the Order
column should contain the temporal order of
acquisition (1 being the first run acquired, 2 being the second, etc.). This
is used to identify time-based drops in quality.n_trees
: The number of trees to use in the iForest. Default is 100. Too few
trees will make the algorithm unstable, while too many trees will increase
computation time without performance improvement.max_depth
: The maximum depth of each tree in the iForest. Default is
"auto"
, which sets the depth to ceiling(log2(number of samples))
.numberOfCores
: The number of cores to use for parallel processing. This is
recommended since the algorithm is fit on the precursor-level and can take
significant time to process if using a single core. Only available on Mac and
Linux.After running the converter, it is important to check the fit of the iForest.
The iForest is an unsupervised algorithm, so there is no ground truth to compare
to. However, we can check the distribution of the anomaly scores to ensure that
the model is performing as expected. The anomaly scores should be right-skewed,
with most of the scores being low (indicating normal peaks) and a small number
of scores being high (indicating anomalous peaks). If the distribution is not
right-skewed, it may indicate that the quality metrics used are not appropriate,
or that there are not many poor quality peaks (generally found in smaller
studies using less complex biological material). In these cases, it may make
sense to use base MSstats
instead of MSstats+
.
The fit of the model can be evaluated using the CheckDataHealth
function in MSstatsConvert
. This function will calculate Pearson’s moment coefficient of
skewness for each iForest model fit (each precursor). We recommend plotting
these coefficients using a histogram and confirming the distribution is positive
and shifted away from zero (with a median around 1).
An example of how to use the function is shown below.
health_info = CheckDataHealth(msstats_input)
## INFO [2025-10-16 19:58:23] 16.88% of the values are missing across all intensities
## INFO [2025-10-16 19:58:23] Intensity distribution standard truncated data.
## INFO [2025-10-16 19:58:23] 100% of features have a ratio of standard deviation to mean > 0.5. A high value here could indicate problematic measurements.
## INFO [2025-10-16 19:58:23] 5.48% of all measured intensities fall within the outlier range
## INFO [2025-10-16 19:58:23] 0% of features are missing in more than half the runs. These features may need to be removed.
skew_score = health_info[[2]]
ggplot(skew_score, aes(x = skew)) +
geom_histogram(fill = "#E69F00", color = "black", bins=12) +
geom_vline(xintercept = 0, linetype = "dashed",
color = "black", linewidth = 1.5) +
theme_minimal(base_size = 16) +
labs(
title="Distribution of skewness of anomaly scores",
x = "Pearson's moment coefficient of skewness",
y = "Count"
) +
theme(
axis.text.x = element_text(size = 16),
axis.text.y = element_text(size = 16),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18)
)
We can see in the example data that the distribution is shifted right away from zero. This indicates that the anomaly model was able to detect poor quality measurements and the user can proceed with the next step.
The MSstatsConvert
package includes an automatic MSstats+
converter for
Spectronaut, however other tools require manual formatting. This can be done by
leveraging the existing MSstats
converters, joining the quality metrics, and
running the anomaly model separately. An example workflow is included below.
# Extract quality metrics from PSM output
quality_join_cols = c("R.FileName", "PG.ProteinGroups", "EG.ModifiedSequence",
"FG.Charge", "FG.ShapeQualityScore (MS1)",
"FG.ShapeQualityScore (MS2)", "EG.DeltaRT")
quality_join_df = spectronaut_raw[, ..quality_join_cols]
quality_join_df$EG.ModifiedSequence = stringr::str_replace_all(
quality_join_df$EG.ModifiedSequence, "\\_", "")
quality_join_df = unique(quality_join_df)
# In case there are duplicates (not common)
quality_join_df = quality_join_df[
, lapply(.SD, mean, na.rm = TRUE),
by = .(R.FileName, PG.ProteinGroups, EG.ModifiedSequence, FG.Charge),
.SDcols = c("FG.ShapeQualityScore (MS1)",
"FG.ShapeQualityScore (MS2)",
"EG.DeltaRT")
]
# Extract run order
spectronaut_raw[, `R.Run Date (Formatted)` := as.POSIXct(
`R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")]
run_order = unique(spectronaut_raw[, c("R.Run Date (Formatted)",
"R.FileName")])
run_order = run_order[order(`R.Run Date (Formatted)`)]
run_order$Order = 1:nrow(run_order)
setnames(run_order, c("R.FileName"), c("Run"))
# Run standard MSstats converter without anomaly detection
msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat(
spectronaut_raw, annotation, intensity = 'PeakArea',
excludedFromQuantificationFilter = TRUE, # Pre-filtering
calculateAnomalyScores=FALSE) # Turn anomaly detection off
## INFO [2025-10-16 19:58:23] ** Raw data from Spectronaut imported successfully.
## INFO [2025-10-16 19:58:23] ** Raw data from Spectronaut cleaned successfully.
## INFO [2025-10-16 19:58:23] ** Using provided annotation.
## INFO [2025-10-16 19:58:23] ** Run labels were standardized to remove symbols such as '.' or '%'.
## INFO [2025-10-16 19:58:23] ** The following options are used:
## - Features will be defined by the columns: PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge
## - Shared peptides will be removed.
## - Proteins with single feature will not be removed.
## - Features with less than 3 measurements across runs will be removed.
## INFO [2025-10-16 19:58:23] ** Intensities with values of FExcludedFromQuantification equal to TRUE are replaced with NA
## INFO [2025-10-16 19:58:23] ** Features with all missing measurements across runs are removed.
## INFO [2025-10-16 19:58:23] ** Shared peptides are removed.
## INFO [2025-10-16 19:58:23] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
## INFO [2025-10-16 19:58:23] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:23] ** Run annotation merged with quantification data.
## INFO [2025-10-16 19:58:23] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:23] ** Fractionation handled.
## INFO [2025-10-16 19:58:23] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO [2025-10-16 19:58:23] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.
# Join in quality metrics
msstats_input = merge(as.data.table(msstats_input),
as.data.table(quality_join_df),
all.x=TRUE, all.y=FALSE,
by.x=c("Run", "ProteinName", "PeptideSequence",
"PrecursorCharge"),
by.y=c("R.FileName", "PG.ProteinGroups",
"EG.ModifiedSequence", "FG.Charge"))
# Run Anomaly Model separately
input = MSstatsConvert::MSstatsAnomalyScores(
msstats_input,
c("FG.ShapeQualityScore (MS1)", "FG.ShapeQualityScore (MS2)", "EG.DeltaRT"),
c("mean_decrease", "mean_decrease", "dispersion_increase"),
.5, 100, run_order, 100, "auto", 1)
The resulting data can now be passed into the remaining MSstats+
workflow as
listed below.
After running the converter with MSstats+
options enabled, the next step is to
run the standard MSstats
dataProcess
function. This function will
automatically leverage the information from the MSstats+
option, as long as
the linear summarization option is turned on. The code to do this is highlighted
below.
summarized = dataProcess(msstats_input,
normalization=FALSE, # no normalization
MBimpute=TRUE, # Use imputation
summaryMethod="linear" # Key MSstats+ parameter
)
## INFO [2025-10-16 19:58:25] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:25] ** Fractionation handled.
## INFO [2025-10-16 19:58:25] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO [2025-10-16 19:58:25] ** Log2 intensities under cutoff = 1.6097 were considered as censored missing values.
## INFO [2025-10-16 19:58:25] ** Log2 intensities = NA were considered as censored missing values.
## INFO [2025-10-16 19:58:25] ** Use top100 features that have highest average of log2(intensity) across runs.
## INFO [2025-10-16 19:58:25]
## # proteins: 3
## # peptides per protein: 1-5
## # features per peptide: 3-3
## INFO [2025-10-16 19:58:25]
## Condition1 Condition2
## # runs 20 20
## # bioreplicates 20 20
## # tech. replicates 1 1
## INFO [2025-10-16 19:58:25] The following runs have more than 75% missing values: 3,
## 39
## INFO [2025-10-16 19:58:25] == Start the summarization per subplot...
##
|
| | 0%
|
|================================= | 33%
|
|=================================================================== | 67%
|
|====================================================================================================| 100%
## INFO [2025-10-16 19:58:25] == Summarization is done.
head(summarized$ProteinLevelData)
## RUN Protein LogIntensities Variance originalRUN GROUP SUBJECT
## 1 2 Q96DR4 3.362811 0.1989436 20250102_Tulum_1to2K562_Seq32 Condition1 15
## 2 4 Q96DR4 3.409740 0.1989436 20250102_Tulum_1to4K562_Seq34 Condition1 22
## 3 5 Q96DR4 1.630112 0.1989436 20250102_Tulum_1to8K562_Seq36 Condition1 25
## 4 6 Q96DR4 4.306031 0.1989436 20250102_Tulum_NeatK562_Seq1 Condition1 31
## 5 7 Q96DR4 2.084798 0.1989436 20250102_Tulum_NeatK562_Seq10 Condition1 32
## 6 8 Q96DR4 4.026494 0.1989436 20250102_Tulum_NeatK562_Seq11 Condition1 33
## TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing NumImputedFeature
## 1 60 3 0.0000000 FALSE 0
## 2 60 2 0.3333333 FALSE 1
## 3 60 1 0.6666667 TRUE 2
## 4 60 3 0.0000000 FALSE 0
## 5 60 3 0.0000000 FALSE 0
## 6 60 3 0.0000000 FALSE 0
Key parameters for MSstats+
:
summaryMethod
: Set to linear
to use the peak quality-weighted linear
regression summarization. This is a key parameter for MSstats+
. Other
summarization methods (e.g., Tukey’s median polish) will not leverage the
peak quality information.When the dataProcess
function is run with summaryMethod=linear
a new column
will be included called Variance
indicating the variance of the underlying
protein abundance estimate. This variance is used in the next step of the
workflow.
The final step of the MSstats+
workflow is to perform differential analysis.
This is done using the standard groupComparison
function in MSstats
. Similar
to the dataProcess
function, the groupComparison
function will automatically
leverage the information from the MSstats+
option. The code to do this is highlighted below.
comparison = matrix(c(-1,1),nrow=1)
row.names(comparison) = "Condition2-Condition1"
colnames(comparison) = c("Condition1", "Condition2")
comparison_result = groupComparison(
contrast.matrix = comparison,
data=summarized)
## INFO [2025-10-16 19:58:25] == Start to test and get inference in whole plot ...
##
|
| | 0%
|
|================================= | 33%
|
|=================================================================== | 67%
|
|====================================================================================================| 100%
## INFO [2025-10-16 19:58:25] == Comparisons for all proteins are done.
head(comparison_result$ComparisonResult)
## Protein Label log2FC SE Tvalue DF pvalue adj.pvalue issue
## 1 Q96DR4 Condition2-Condition1 -0.7772097 0.2420924 -3.210385 33 0.002951009 0.008853028 NA
## 2 Q96S19 Condition2-Condition1 -0.8206449 0.4087581 -2.007654 37 0.052024697 0.054104875 NA
## 3 Q9UFW8 Condition2-Condition1 -1.0176838 0.5120332 -1.987535 38 0.054104875 0.054104875 NA
## MissingPercentage ImputationPercentage
## 1 0.3583333 0.2333333
## 2 0.3854167 0.3604167
## 3 0.1100000 0.1100000
The groupComparison
function is run the exact same in MSstats+
as MSstats
,
but under the hood the variation from the previous step is used in a second
weighted linear regression (in comparison to the unweighted version in
MSstats
).
MSstats
base_msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat(
spectronaut_raw, annotation, intensity = 'PeakArea',
excludedFromQuantificationFilter = TRUE, # Pre-filtering
calculateAnomalyScores=FALSE) # No anomaly model
## INFO [2025-10-16 19:58:25] ** Raw data from Spectronaut imported successfully.
## INFO [2025-10-16 19:58:25] ** Raw data from Spectronaut cleaned successfully.
## INFO [2025-10-16 19:58:25] ** Using provided annotation.
## INFO [2025-10-16 19:58:25] ** Run labels were standardized to remove symbols such as '.' or '%'.
## INFO [2025-10-16 19:58:25] ** The following options are used:
## - Features will be defined by the columns: PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge
## - Shared peptides will be removed.
## - Proteins with single feature will not be removed.
## - Features with less than 3 measurements across runs will be removed.
## INFO [2025-10-16 19:58:25] ** Intensities with values of FExcludedFromQuantification equal to TRUE are replaced with NA
## INFO [2025-10-16 19:58:25] ** Features with all missing measurements across runs are removed.
## INFO [2025-10-16 19:58:25] ** Shared peptides are removed.
## INFO [2025-10-16 19:58:25] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
## INFO [2025-10-16 19:58:25] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:25] ** Run annotation merged with quantification data.
## INFO [2025-10-16 19:58:25] ** Features with one or two measurements across runs are removed.
## INFO [2025-10-16 19:58:25] ** Fractionation handled.
## INFO [2025-10-16 19:58:25] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO [2025-10-16 19:58:25] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.
base_summarized = dataProcess(base_msstats_input,
normalization=FALSE, # no normalization
MBimpute=TRUE, # Use imputation
summaryMethod="TMP" # Tukey median polish summarization
)
## INFO [2025-10-16 19:58:25] ** Log2 intensities under cutoff = 1.6097 were considered as censored missing values.
## INFO [2025-10-16 19:58:25] ** Log2 intensities = NA were considered as censored missing values.
## INFO [2025-10-16 19:58:25] ** Use top100 features that have highest average of log2(intensity) across runs.
## INFO [2025-10-16 19:58:25]
## # proteins: 3
## # peptides per protein: 1-5
## # features per peptide: 3-3
## INFO [2025-10-16 19:58:25]
## Condition1 Condition2
## # runs 20 20
## # bioreplicates 20 20
## # tech. replicates 1 1
## INFO [2025-10-16 19:58:25] The following runs have more than 75% missing values: 3,
## 39
## INFO [2025-10-16 19:58:25] == Start the summarization per subplot...
##
|
| | 0%
|
|================================= | 33%
|
|=================================================================== | 67%
|
|====================================================================================================| 100%
## INFO [2025-10-16 19:58:25] == Summarization is done.
base_comparison_result = groupComparison(
contrast.matrix = comparison,
data=base_summarized)
## INFO [2025-10-16 19:58:25] == Start to test and get inference in whole plot ...
##
|
| | 0%
|
|================================= | 33%
|
|=================================================================== | 67%
|
|====================================================================================================| 100%
## INFO [2025-10-16 19:58:25] == Comparisons for all proteins are done.
head(base_comparison_result$ComparisonResult)
## Protein Label log2FC SE Tvalue DF pvalue adj.pvalue issue
## 1 Q96DR4 Condition2-Condition1 -0.9356333 0.2912517 -3.212457 33 0.002934893 0.00880468 NA
## 2 Q96S19 Condition2-Condition1 -0.8648916 0.4144781 -2.086701 37 0.043862456 0.06579368 NA
## 3 Q9UFW8 Condition2-Condition1 -0.9715015 0.5381389 -1.805299 38 0.078955739 0.07895574 NA
## MissingPercentage ImputationPercentage
## 1 0.3583333 0.2333333
## 2 0.3854167 0.3604167
## 3 0.1100000 0.1100000
We can see that MSstats
does not identify all the proteins as differentially abundant, even though there was a true log$_2$ fold change of -1. We can compare the fold change and standard error estimates between MSstats+
and base MSstats
to see why the p-values are different.
comparison_result$ComparisonResult$Model = "MSstats+"
base_comparison_result$ComparisonResult$Model = "MSstats"
merged_results = rbindlist(list(
comparison_result$ComparisonResult,
base_comparison_result$ComparisonResult
))
ggplot(data=merged_results) +
geom_col(aes(x=Protein, y=abs(log2FC), fill=Model), position="dodge") +
geom_hline(yintercept=1, linetype="dashed", color="black", linewidth=1) +
theme_bw(base_size =16) +
labs(title="Log fold change comparison",
x="MSstats+",
y="Base MSstats")
The fold changes are both near the ground truth and MSstats+
does not clearly outperform base MSstats
.
ggplot(data=merged_results) +
geom_col(aes(x=Protein, y=SE, fill=Model), position="dodge") +
theme_bw(base_size =16) +
labs(title="Standard error comparison",
x="MSstats+",
y="Base MSstats")
We can see the advantage of MSstats+
in the standard error estimates. MSstats+
downweights poorly quantified measurements and runs, which can lead to much better estimates of the true standard error. In this example MSstats+
downweighted the runs which added a large amount of uncertainty, leading to it identifying all proteins as differentially abundant.
MSstats+ can also be used in conjunction with MSstatsBig to analyze large DIA datasets. The workflow leverages the MSstatsBig converter and an external execution of the MSstats+ isolation forest. An example of how to do this is shown below.
This section of the vignette does not run automatically and must be executed manually.
library(MSstatsBig)
# Define file path to data
spectronaut_raw = system.file(
"tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv",
package = "MSstatsConvert")
# Run MSstatsBig converter
converted_data = bigSpectronauttoMSstatsFormat(
input_file=spectronaut_raw,
output_file_name="output_file.csv",
intensity = "F.PeakHeight",
filter_by_excluded = TRUE,
filter_by_identified = FALSE,
filter_by_qvalue = TRUE,
qvalue_cutoff = 0.01,
filter_unique_peptides = TRUE,
aggregate_psms = TRUE,
filter_few_obs = FALSE,
remove_annotation = FALSE,
calculateAnomalyScores=TRUE,
anomalyModelFeatures=c("FG.ShapeQualityScore (MS1)",
"FG.ShapeQualityScore (MS2)",
"EG.ApexRT"),
backend="arrow")
converted_data = dplyr::collect(converted_data)
# Prepare info for anomaly model
# Get temporal ordering from input file (can also be done externally)
temporal = fread(spectronaut_raw)[
, .SD[1], by = .(`R.Run Date (Formatted)`, R.FileName)]
temporal = temporal[, Run.TimeParsed := as.POSIXct(
`R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")][
order(Run.TimeParsed)][
, !"Run.TimeParsed"]
temporal$Order = 1:nrow(temporal)
temporal$Run = temporal$R.FileName
temporal = temporal[, c("Run", "Order")]
# Specify anomaly model parameters
anomalyModelFeatures=c("FG.ShapeQualityScore (MS2)",
"FG.ShapeQualityScore (MS1)",
"EG.ApexRT")
anomalyModelFeatures = MSstatsConvert:::.standardizeColnames(anomalyModelFeatures)
anomalyModelFeatureTemporal=c("mean_decrease",
"mean_decrease",
"dispersion_increase")
n_trees=100
max_depth="auto"
numberOfCores=1
# Run anomaly detection model
msstats_input = MSstatsConvert::MSstatsAnomalyScores(
as.data.table(converted_data), anomalyModelFeatures,
anomalyModelFeatureTemporal, .5, 100, temporal, n_trees,
max_depth, numberOfCores)
head(msstats_input)
This code will run the MSstatsBig converter and then run the anomaly model. This workflow leads to a much faster processing speed compared to using the standard converter, and allows users to work with Spectronaut files which may be too large to fit into memory. After running the anomaly model, the resulting data can be input into the MSstats+ workflow reviewed previously.
One note here is the temporal ordering. In this example, we extracted the temporal run order directly from the Spectronaut file. However, in practice this may not be appropriate for big data files. In cases where the data is large, this ordering should be extracted externally and provided by the user, or the user should sample the Spectronaut file rather than load it all into memory.