Contents

1 Overview of the analysis

This package provides an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. It requires tabular input (e.g. txt files) as generated by quantitative analysis softwares of raw mass spectrometry data, such as MaxQuant or IsobarQuant. Functions are provided for data preparation, filtering, variance normalization and imputation of missing values, as well as statistical testing of differentially enriched / expressed proteins. It also includes tools to check intermediate steps in the workflow, such as normalization and missing values imputation. Finally, visualization tools are provided to explore the results, including heatmap, volcano plot and barplot representations. For scientists with limited experience in R, the package also contains wrapper functions that entail the complete analysis workflow and generate a report (see the chapter on Workflow functions). Even easier to use are the interactive Shiny apps that are provided by the package (see the chapter on Shiny apps).

2 Installation

The most recent version of R can be downloaded from CRAN. Additional system requirements are Pandoc, as well as Rtools for Windows and NetCDF for Linux.

Start R and install the DEP package:

source("https://bioconductor.org/biocLite.R")
biocLite("DEP")

library("DEP")

3 Interactive analysis using the DEP Shiny apps

The package contains shiny apps, which allow for the interactive analysis entailing the entire workflow described below. These apps are especially relevant for users with limited or no experience in R. Currently, there are two different apps. One for label-free quantification (LFQ)-based analysis (output from MaxQuant) and the second for tandem-mass-tags (TMT)-based analysis (output from IsobarQuant). To run the apps, simply run this single command:

# For LFQ analysis
run_app("LFQ")

# For TMT analysis
run_app("TMT")

4 Differential analysis

4.1 Example dataset: Ubiquitin interactors

We analyze a proteomics dataset in which Ubiquitin-protein interactors were characterized (Zhang et al. Mol Cell 2017). The raw mass spectrometry data were first analyzed using MaxQuant (Cox and Mann, Nat Biotech 2007) and the resulting “proteinGroups.txt” file is used as input for the downstream analysis.

4.2 Loading of the data

# Loading a package required for data handling
library("dplyr")

# The data is provided with the package
data <- UbiLength

# We filter for contaminant proteins and decoy database hits, which are indicated by "+" in the columns "Potential.contaminants" and "Reverse", respectively. 
data <- filter(data, Reverse != "+", Potential.contaminant != "+")

The data.frame has the following dimensions:

dim(data)
## [1] 2941   23

The data.frame has the following column names:

colnames(data)
##  [1] "Protein.IDs"             "Majority.protein.IDs"   
##  [3] "Protein.names"           "Gene.names"             
##  [5] "Fasta.headers"           "Peptides"               
##  [7] "Razor...unique.peptides" "Unique.peptides"        
##  [9] "LFQ.intensity.Ubi4_1"    "LFQ.intensity.Ubi4_2"   
## [11] "LFQ.intensity.Ubi4_3"    "LFQ.intensity.Ubi6_1"   
## [13] "LFQ.intensity.Ubi6_2"    "LFQ.intensity.Ubi6_3"   
## [15] "LFQ.intensity.Ctrl_1"    "LFQ.intensity.Ctrl_2"   
## [17] "LFQ.intensity.Ctrl_3"    "LFQ.intensity.Ubi1_1"   
## [19] "LFQ.intensity.Ubi1_2"    "LFQ.intensity.Ubi1_3"   
## [21] "Only.identified.by.site" "Reverse"                
## [23] "Potential.contaminant"

The “LFQ.intensity” columns will be used for subsequent analysis.

4.3 Data preparation

The dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique.

# Are there any duplicated gene names?
data$Gene.names %>% duplicated() %>% any()
## [1] TRUE
# Make a table of duplicated gene names
data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>% 
  arrange(desc(frequency)) %>% filter(frequency > 1)
## # A tibble: 51 x 2
##    Gene.names frequency
##         <chr>     <int>
##  1                    7
##  2      ATXN2         4
##  3     ATXN2L         4
##  4        SF1         4
##  5      HSPA8         3
##  6      RBM33         3
##  7       UGP2         3
##  8     ACTL6A         2
##  9     BCLAF1         2
## 10       BRAP         2
## # ... with 41 more rows

For further analysis these proteins must get unique names. Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID.

# Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name.
data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";")
## Joining, by = c("Protein.IDs", "Gene.names")
# Are there any duplicated names?
data$name %>% duplicated() %>% any()
## [1] FALSE

4.4 Generate a SummarizedExperiment object

Many Bioconductor packages use SummarizedExperiment objects as input and/or output. This class of objects contains and coordinates the actual (assay) data, information on the samples as well as feature annotation. We can generate the SummarizedExperiment object from our data using two different approaches. We can extract sample information directly from the column names or we add sample information using an experimental design template. An experimental design is included in the package for our example dataset.

The experimental design must contain ‘label’, ‘condition’ and ‘replicate’ columns. The ‘label’ column contains the identifiers of the different samples and they should correspond to the column names containing the assay data. The ‘condition’ and ‘replicate’ columns contain the annotation of these samples as defined by the user.

label condition replicate
Ubi4_1 Ubi4 1
Ubi4_2 Ubi4 2
Ubi4_3 Ubi4 3
Ubi6_1 Ubi6 1
Ubi6_2 Ubi6 2
Ubi6_3 Ubi6 3
Ctrl_1 Ctrl 1
Ctrl_2 Ctrl 2
Ctrl_3 Ctrl 3
Ubi1_1 Ubi1 1
Ubi1_2 Ubi1 2
Ubi1_3 Ubi1 3
# Generate a SummarizedExperiment object using an experimental design
LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers
experimental_design <- UbiLength_ExpDesign
data_se <- make_se(data_unique, LFQ_columns, experimental_design)

# Generate a SummarizedExperiment object by parsing condition information from the column names
LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers
data_se_parsed <- make_se_parse(data_unique, LFQ_columns)

# Let's have a look at the SummarizedExperiment object
data_se
## class: SummarizedExperiment 
## dim: 2941 12 
## metadata(0):
## assays(1): ''
## rownames(2941): RBM47 UBA6 ... ATXN2.3 X6RHB9
## rowData names(13): Protein.IDs Majority.protein.IDs ... name ID
## colnames(12): Ubi4_1 Ubi4_2 ... Ubi1_2 Ubi1_3
## colData names(4): label ID condition replicate

4.5 Prerequisites of the SummarizedExperiment object

The make_se and make_se_parse functions generate a SummarizedExperiment object that has a couple of specifications. The assay data is log2-transformed and its rownames depict the protein names. The rowData contains, amongst others, the ‘name’ and ‘ID’ columns that were generated by make_unique. The colData contains the experimental design and thereby the sample annotation. Thereby the colData includes the ‘label’, ‘condition’ and ‘replicate’ columns as well as a newly generated ‘ID’ column. The log2-transformed assay data and the specified rowData and colData columns are prerequisites for the subsequent analysis steps.

4.6 Filter on missing values

The dataset contains proteins which are not quantified in all replicates. Some proteins are even only quantified in a single replicate.

# Plot a barplot of the protein identification overlap between samples
plot_frequency(data_se)

This leaves our dataset with missing values, which need to be imputed. However, this should not be done for proteins that contain too many missing values. Therefore, we first filter out proteins that contain too many missing values. This is done by setting the threshold for the allowed number of missing values per condition in the filter_missval function.

# Filter for proteins that are identified in all replicates of at least one condition
data_filt <- filter_missval(data_se, thr = 0)

# Less stringent filtering:
# Filter for proteins that are identified in 2 out of 3 replicates of at least one condition
data_filt2 <- filter_missval(data_se, thr = 1)

After filtering, the number of identified proteins per sample can be plotted as well as the overlap in identifications between samples.

# Plot a barplot of the number of identified proteins per samples
plot_numbers(data_filt)

# Plot a barplot of the protein identification overlap between samples
plot_coverage(data_filt)

4.7 Normalization

The data is background corrected and normalized by variance stabilizing transformation (vsn).

# Normalize the data
data_norm <- normalize_vsn(data_filt)

The normalization can be inspected by checking the distributions of the samples before and after normalization.

# Visualize normalization by boxplots for all samples before and after normalization
plot_normalization(data_filt, data_norm)

4.8 Impute data for missing values

The remaining missing values in the dataset need to be imputed. The data can be missing at random (MAR), for example if proteins are quantified in some replicates but not in others. Data can also be missing not at random (MNAR), for example if proteins are not quantified in specific conditions (e.g. in the control samples). MNAR can indicate that proteins are below the detection limit in specific samples, which could be very well the case in proteomics experiments. For these different conditions, different imputation methods have to be used, as described in the MSnbase vignette and more specifically in the impute function descriptions.

To explore the pattern of missing values in the data, a heatmap is plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized.

# Plot a heatmap of proteins with missing values
plot_missval(data_filt)

This heatmap indicates that missing values are highly biased to specific samples. The example dataset is an affinity enrichment dataset of ubiquitin interactors, which is likely to have proteins which are below the detection limit in specific samples. These can be proteins that are specifically enriched in the ubiquitin purifications, but are not enriched in the controls samples, or vice versa. To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values.

# Plot intensity distributions and cumulative fraction of proteins with and without missing values
plot_detect(data_filt)