title: “XINA: a workflow for the integration of multiplexed proteomics kinetics data with network analysis”
output: rmarkdown::html_vignette
vignette: >
%
%

1. Introduction

Quantitative proteomics experiments, using for instance isobaric tandem mass tagging approaches, are conducive to measuring changes in protein abundance over multiple time points in response to one or more conditions or stimulations. The aim of XINA is to determine which proteins exhibit similar patterns within and across experimental conditions, since proteins with co-abundance patterns may have common molecular functions. XINA imports multiple datasets, tags dataset in silico, and combines the data for subsequent subgrouping into multiple clusters. The result is a single output depicting the variation across all conditions. XINA, not only extracts co-abundance profiles within and across experiments, but also incorporates protein-protein interaction databases and integrative resources such as KEGG to infer interactors and molecular functions, respectively, and produces intuitive graphical outputs.

Main contribution: an easy-to-use software for non-expert users of clustering and network analyses.

Data inputs: any type of quantitative proteomics data, label or label-free

2. XINA references

https://cics.bwh.harvard.edu/software

https://github.com/langholee/XINA/

3. XINA installation

XINA requires R>=3.5.0.

# Install from Github
install.packages('devtools')
library('devtools')
install_github('langholee/XINA')

# Install from Bioconductor
install.packages("BiocManager")
BiocManager::install("XINA")

To follow this tutorial, you need these libraries. If you don’t have the packages below, please install them.

library(XINA)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect,
##     is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unsplit, which,
##     which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:BiocGenerics':
## 
##     normalize, path, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggplot2)
library(STRINGdb)

4. Example theoretical dataset

We generated an example dataset to show how XINA can be used for your research. To demonstrate XINA functions and allow users to perform similar exercises, we included a module that can generate multiplexed time-series datasets using theoretical data. This data consists of three treatment conditions, ‘Control’, ‘Stimulus1’ and ‘Stimulus2’. Each condition has time series data from 0 hour to 72 hours. As an example, we chose the mTOR pathway to be differentially regulated across the three conditions.

# Generate random multiplexed time-series data
random_data_info <- make_random_xina_data()

# The number of proteins
random_data_info$size
## [1] 500
# Time points
random_data_info$time_points
## [1] "0hr"  "2hr"  "6hr"  "12hr" "24hr" "48hr" "72hr"
# Three conditions
random_data_info$conditions
## [1] "Control"   "Stimulus1" "Stimulus2"

Read and check the randomly generated data

Control <- read.csv("Control.csv", check.names=FALSE, stringsAsFactors = FALSE)
Stimulus1 <- read.csv("Stimulus1.csv", check.names=FALSE, stringsAsFactors = FALSE)
Stimulus2 <- read.csv("Stimulus2.csv", check.names=FALSE, stringsAsFactors = FALSE)

head(Control)
##   Accession                                  Description 0hr    2hr    6hr
## 1  DEFB107B                           defensin beta 107B 0.5 0.4991 0.8238
## 2      GRM4            glutamate metabotropic receptor 4 0.5 0.2476 0.5050
## 3 TNFRSF10A          TNF receptor superfamily member 10a 0.5 0.1639 0.8403
## 4    CFAP57     cilia and flagella associated protein 57 0.5 0.5611 0.8145
## 5     SIN3B SIN3 transcription regulator family member B 0.5 0.5897 0.7149
## 6    TEX13C                        TEX13 family member C 0.5 0.4272 0.8812
##     12hr   24hr   48hr   72hr
## 1 0.9917 0.5909 0.7294 0.3718
## 2 0.8245 0.3878 0.1344 0.0499
## 3 0.4599 0.1372 0.3136 0.5415
## 4 0.9858 0.9947 0.4432 0.5924
## 5 0.8055 0.1112 0.1278 0.5653
## 6 0.5421 0.6511 0.0851 0.4517
head(Stimulus1)
##   Accession                                  Description 0hr    2hr    6hr
## 1  DEFB107B                           defensin beta 107B 0.5 0.4900 0.2093
## 2      GRM4            glutamate metabotropic receptor 4 0.5 0.7832 0.2214
## 3 TNFRSF10A          TNF receptor superfamily member 10a 0.5 0.1669 0.3250
## 4    CFAP57     cilia and flagella associated protein 57 0.5 0.1435 0.7788
## 5     SIN3B SIN3 transcription regulator family member B 0.5 0.9666 0.5469
## 6    TEX13C                        TEX13 family member C 0.5 0.8621 0.1685
##     12hr   24hr   48hr   72hr
## 1 0.8382 0.8725 0.3611 0.3109
## 2 0.9034 0.7616 0.7027 0.2497
## 3 0.6596 0.2576 0.2639 0.4044
## 4 0.9242 0.4418 0.8461 0.2009
## 5 0.0962 0.1736 0.9421 0.6933
## 6 0.7072 0.8025 0.1695 0.3302
head(Stimulus2)
##   Accession                                  Description 0hr    2hr    6hr
## 1  DEFB107B                           defensin beta 107B 0.5 0.6832 0.3568
## 2      GRM4            glutamate metabotropic receptor 4 0.5 0.0358 0.8346
## 3 TNFRSF10A          TNF receptor superfamily member 10a 0.5 0.2275 0.1563
## 4    CFAP57     cilia and flagella associated protein 57 0.5 0.8600 0.1462
## 5     SIN3B SIN3 transcription regulator family member B 0.5 0.6282 0.1640
## 6    TEX13C                        TEX13 family member C 0.5 0.0804 0.2515
##     12hr   24hr   48hr   72hr
## 1 0.1389 0.7790 0.4757 0.2128
## 2 0.8335 0.6704 0.8030 0.4551
## 3 0.8501 0.6934 0.7643 0.0629
## 4 0.8931 0.0157 0.2649 0.0406
## 5 0.9234 0.9320 0.1780 0.0468
## 6 0.1977 0.8172 0.6947 0.1073

Since XINA needs to know which columns have the kinetics data matrix, the user should give a vector containing column names of the kinetics data matrix. These column names have to be the same in all input datasets (Control input, Stimulus1 input and Stimulus2 input).

head(Control[random_data_info$time_points])
##   0hr    2hr    6hr   12hr   24hr   48hr   72hr
## 1 0.5 0.4991 0.8238 0.9917 0.5909 0.7294 0.3718
## 2 0.5 0.2476 0.5050 0.8245 0.3878 0.1344 0.0499
## 3 0.5 0.1639 0.8403 0.4599 0.1372 0.3136 0.5415
## 4 0.5 0.5611 0.8145 0.9858 0.9947 0.4432 0.5924
## 5 0.5 0.5897 0.7149 0.8055 0.1112 0.1278 0.5653
## 6 0.5 0.4272 0.8812 0.5421 0.6511 0.0851 0.4517

5. Package features

XINA is an R package and can examine, but is not limited to, time-series omics data from multiple experiment conditions. It has three modules: 1. Model-based clustering analysis, 2. coregulation analysis, and 3. Protein-protein interaction network analysis (we used STRING DB for this practice).

5.1 Clustering analysis using model-based clustering or k-means clustering algorithm

XINA implements model-based clustering to classify features (genes or proteins) depending on their expression profiles. The model-based clustering optimizes the number of clusters at minimum Bayesian information criteria (BIC). Model-based clustering is fulfilled by the ‘mclust’ R package [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/], which was used by our previously developed tool mIMT-visHTS [https://www.ncbi.nlm.nih.gov/pubmed/26232111]. By default, XINA performs sum-normalization for each gene/protein time-series profile [https://www.ncbi.nlm.nih.gov/pubmed/19861354]. This step is done to standardize all datasets. Most importantly, XINA assigns an electronic tag to each dataset’s proteins (similar to TMT) in order to combine the multiple datasets (Super dataset) for subsequent clustering.

XINA uses the ‘mclust’ package for the model-based clustering. ‘mclust’ requires the fixed random seed to get reproducible clustering results.

set.seed(0)

‘nClusters’ is the number of desired clusters. ‘mclust’ will choose the most optimized number of clusters by considering the Bayesian information criteria (BIC). BIC of ‘mclust’ is the negative of normal BIC, thus the higher BIC, the more optimized clustering scheme in ‘mclust’, while lowest BIC is better in statistics.

# Data files
data_files <- paste(random_data_info$conditions, ".csv", sep='')
data_files
## [1] "Control.csv"   "Stimulus1.csv" "Stimulus2.csv"
# time points of the data matrix
data_column <- random_data_info$time_points
data_column
## [1] "0hr"  "2hr"  "6hr"  "12hr" "24hr" "48hr" "72hr"

Run the model-based clustering

# Run the model-based clusteirng
clustering_result <- xina_clustering(data_files, data_column=data_column, nClusters=30)

If you think the clustering cannot be optimized by the automatically selected covariance model (scored highest BIC), you can adjust the clustering results by appointing specific parameterisations of the within-condition covariance matrix.

# Model-based clustering using VVI covariance model
clustering_result_vvi <- xina_clustering(data_files, data_column=data_column, nClusters=30, chosen_model='VVI')

XINA also supports k-means clustering as well as the model-based clustering

clustering_result_km <- xina_clustering(data_files, data_column=data_column, nClusters=30, chosen_model='kmeans')

The clustering results are stored in your working directory. XINA clustering generates the mclust BIC plot containing the BIC of each covariance matrix ifor each number of clusters. ‘xina_clusters.csv’ has a long list of XINA cluster results. ‘xina_clusters_aligned.csv’ has clustering results arranged by gene name. If you want to recall previous clustering results, you can use ‘load_previous_results’

clustering_result_reloaded <- load_previous_results(".")
head(clustering_result_reloaded$aligned)
##   Gene name                                      Description Control
## 1     OR6C6 olfactory receptor family 6 subfamily C member 6       1
## 2      ORM1                                    orosomucoid 1       1
## 3    TFAP2B                   transcription factor AP-2 beta       1
## 4     SAMD9          sterile alpha motif domain containing 9       1
## 5     DDX27                             DEAD-box helicase 27       1
## 6    ATP2C2     ATPase secretory pathway Ca2+ transporting 2       1
##   Stimulus1 Stimulus2
## 1        20         3
## 2        28        20
## 3         9        15
## 4        23         1
## 5         1        12
## 6         5         6

Load previously generated dataset for upcoming XINA analyses.

data(xina_example)

For visualizing clustering results, XINA draws line graphs of the clustering results using ‘plot_clusters’.

plot_clusters(example_clusters, xylab=FALSE)