1 Introduction

This vignette provides an overview of the Bioconductor package ASSIGN (Adaptive Signature Selection and InteGratioN) for signature-based profiling of heterogeneous biological pathways. ASSIGN is a computational tool used to evaluate the pathway deregulation/activation status in individual patient samples. ASSIGN employs a flexible Bayesian factor analysis approach that adapts predetermined pathway signatures, derived either from a literature search or from perturbation experiments, to create cell-/tissue-specific pathway signatures. The deregulation/activation level of each context-specific pathway is quantified as a score, which represents the extent to which a patient sample matches the pathway deregulation/activation signature.

Some distinctive features of ASSIGN are:

  1. Multiple Pathway Profiling: ASSIGN can profile multiple pathway signatures simultaneously, accounting for ‘cross-talk’ between interconnected pathway components.
  2. Context specificity in baseline gene expression: Baseline gene expression levels (i.e., the gene expression level under normal conditions) may vary widely due to differences across tissue types, disease statuses, or measurement platforms. ASSIGN can adaptively estimate background gene expression levels across a set of samples.
  3. Context-specific signature estimation: ASSIGN provides the flexibility to use either an input gene list or magnitudes of signature genes as prior information, allowing for adaptive refinement of pathway signatures in specific cell or tissue types.
  4. Regularization of signature strength estimates: ASSIGN regularizes the signature strength coefficients using a Bayesian ridge regression formulation by shrinking the strength of the irrelevant signature genes toward zero. The parameter regularization constrains the pathway signature to a small group of genes, making the results more biologically interpretable.

2 How to use the ASSIGN package

2.1 Example Data

In the following examples, we illustrate how to run ASSIGN using either the easy-to-use assign.wrapper function for a simple analysis, or each individual ASSIGN step for more detailed intermediate results.

For either analysis, we will first load ASSIGN and create a temporary directory 'tempdir' under the user’s current working directory. All output generated in this vignette will be saved in 'tempdir'.

library(ASSIGN)

dir.create("tempdir")
tempdir <- "tempdir"

Next, load the training data, test data, training labels, and test labels. The training dataset is a G (number of genomic measurements) x N (number of samples in pathway perturbation experiments) matrix, including five oncogenic pathways: B-Catenin, E2F3, MYC, RAS, and SRC pathways in this example. The training data labels denote the column indices of control and experimental samples for each perturbation experiment. For example, we specify the column indices of the 10 RAS control samples to be 1:10, and column indices of 10 RAS activated samples to be 39:48. The test dataset is a G (number of genomic measurements) x N (number of patient samples) matrix. The test data labels denote the classes of the N test samples. In our example, test samples 1-53 are adenocarcinoma and samples 54-111 are squamous cell carcinoma. We specify 'Adeno' and 'Squamous' in the vector of test data labels. Note that the test data labels are optional. ASSIGN outputs additional validation plots to evaluate classification accuracy when test data labels are provided.

data(trainingData1)
data(testData1)
data(geneList1)
trainingLabel1 <- list(control = list(bcat=1:10, e2f3=1:10,
                                      myc=1:10, ras=1:10, src=1:10),
                       bcat = 11:19, e2f3 = 20:28, myc= 29:38,
                       ras = 39:48, src = 49:55)
testLabel1 <- rep(c("Adeno", "Squamous"), c(53,58))
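
Before running ASSIGN, it can help to confirm that the label indices line up with the data columns. The following is a minimal base R sketch of such a check. The matrices toyTraining and toyTest are hypothetical stand-ins whose column counts match the labels above; they are not the packaged datasets, and the checks are not part of ASSIGN itself.

```r
# Hypothetical stand-ins for the example matrices (genes x samples).
set.seed(1)
toyTraining <- matrix(rnorm(20 * 55), nrow = 20)
toyTest <- matrix(rnorm(20 * 111), nrow = 20)

trainingLabel1 <- list(control = list(bcat = 1:10, e2f3 = 1:10,
                                      myc = 1:10, ras = 1:10, src = 1:10),
                       bcat = 11:19, e2f3 = 20:28, myc = 29:38,
                       ras = 39:48, src = 49:55)
testLabel1 <- rep(c("Adeno", "Squamous"), c(53, 58))

# Every index referenced by the labels must be a valid training column.
allIdx <- unlist(trainingLabel1)
stopifnot(all(allIdx >= 1), all(allIdx <= ncol(toyTraining)))

# Each pathway needs a control entry with the same name.
pathways <- setdiff(names(trainingLabel1), "control")
stopifnot(setequal(pathways, names(trainingLabel1$control)))

# One test label is expected per test sample.
stopifnot(length(testLabel1) == ncol(toyTest))

# Number of experimental (activated) samples per pathway.
nExperimental <- sapply(trainingLabel1[pathways], length)
```

With the real data, replacing toyTraining and toyTest with trainingData1 and testData1 should apply the same checks, assuming both are genes-by-samples matrices as described above.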

2.2 Run ASSIGN all-in-one using assign.wrapper

We developed an all-in-one assign.wrapper function to run ASSIGN with one command. For most users, assign.wrapper will be sufficient. The assign.wrapper function outputs the following files:

  • pathway_activity_testset.csv: ASSIGN predicted pathway activity in test samples.
  • signature_heatmap_testset_prior.pdf: heatmaps of the expression level of prior signature genes in test samples.
  • pathway_activity_scatterplot_testset.pdf: scatterplot of pathway activity in test samples. The x-axis represents test samples ordered by pathway activity; the y-axis represents pathway activity.
  • output.rda: The intermediate results of individual ASSIGN functions.
  • parameters.txt: A log file containing the parameters used for this ASSIGN run.

If training data is provided, assign.wrapper also outputs the following files:

  • pathway_activity_trainingset.csv: ASSIGN predicted pathway activity in training samples.
  • signature_heatmap_trainingset.pdf: heatmaps of the expression level of signature genes in training samples.
  • pathway_activity_scatterplot_trainingset.pdf: scatterplot of pathway activity in training samples.
  • signature_gene_list_prior.csv: the gene list and prior coefficients for the pathway signature.

When adaptive_S is TRUE, assign.wrapper also outputs the following files:

  • signature_heatmap_testset_posterior.pdf: heatmaps of the expression level of posterior signature genes in test samples.
  • posterior_delta.csv: a csv file of the prior and posterior change in expression and probability of inclusion for each gene in each signature.
  • Signature_convergence.pdf: A plot of the MCMC convergence.

Finally, if the testLabel argument is not NULL, assign.wrapper also outputs the following files:

  • pathway_activity_boxplot_testset.pdf: boxplot of pathway activity in every test class.
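
After a run, the file lists above can be turned into a quick check of the output directory. This is a minimal sketch in base R. The helper check_assign_outputs and its arguments are hypothetical names introduced here for illustration and are not part of the ASSIGN package; the file names are taken from the lists above.

```r
# Hypothetical helper: report which expected output files exist in a directory.
check_assign_outputs <- function(outputDir, hasTraining = TRUE,
                                 adaptiveS = FALSE, hasTestLabel = FALSE) {
  # Files that are always written.
  expected <- c("pathway_activity_testset.csv",
                "signature_heatmap_testset_prior.pdf",
                "pathway_activity_scatterplot_testset.pdf",
                "output.rda",
                "parameters.txt")
  # Files written only when training data is provided.
  if (hasTraining) {
    expected <- c(expected,
                  "pathway_activity_trainingset.csv",
                  "signature_heatmap_trainingset.pdf",
                  "pathway_activity_scatterplot_trainingset.pdf",
                  "signature_gene_list_prior.csv")
  }
  # Files written only when adaptive_S is TRUE.
  if (adaptiveS) {
    expected <- c(expected,
                  "signature_heatmap_testset_posterior.pdf",
                  "posterior_delta.csv",
                  "Signature_convergence.pdf")
  }
  # File written only when testLabel is not NULL.
  if (hasTestLabel) {
    expected <- c(expected, "pathway_activity_boxplot_testset.pdf")
  }
  found <- file.exists(file.path(outputDir, expected))
  names(found) <- expected
  found
}

# Usage on a scratch directory that holds a single placeholder file.
demoDir <- tempfile("assign_output_demo")
dir.create(demoDir)
invisible(file.create(file.path(demoDir, "output.rda")))
status <- check_assign_outputs(demoDir, hasTraining = FALSE)
```

The result is a named logical vector, so names(status)[!status] lists any files that are missing.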

Here we illustrate how to run the assign.wrapper function with three examples. Each example assumes that the directory 'tempdir' has been created and that the training and test datasets have been loaded, as in Section 2.1. The individual parameters are described in detail in the sections below and in the ASSIGN reference manual.

2.2.1 Example 1: Training data is available, but a gene list of pathway signature genes is not available:

dir.create(file.path(tempdir,"wrapper_example1"))
assign.wrapper(trainingData=trainingData1, testData=testData1,
               trainingLabel=trainingLabel1, testLabel=testLabel1,
               geneList=NULL, n_sigGene=rep(200,5), adaptive_B=TRUE,
               adaptive_S=FALSE, mixture_beta=TRUE,
               outputDir=file.path(tempdir,"wrapper_example1"),
               iter=2000, burn_in=1000)

2.2.2 Example 2: Training data is available, and a gene list of pathway signature genes is available:

dir.create(file.path(tempdir,"wrapper_example2"))
assign.wrapper(trainingData=trainingData1, testData=testData1,
               trainingLabel=trainingLabel1, testLabel=NULL,
               geneList=geneList1, n_sigGene=NULL, adaptive_B=TRUE,
               adaptive_S=FALSE, mixture_beta=TRUE,
               outputDir=file.path(tempdir,"wrapper_example2"),
               iter=2000, burn_in=1000)

2.2.3 Example 3: Training data is not available, but a gene list of pathway signature genes is available:

dir.create(file.path(tempdir,"wrapper_example3"))
assign.wrapper(trainingData=NULL, testData=testData1,
               trainingLabel=NULL, testLabel=NULL,
               geneList=geneList1, n_sigGene=NULL, adaptive_B=TRUE,
               adaptive_S=TRUE, mixture_beta=TRUE,
               outputDir=file.path(tempdir,"wrapper_example3"),
               iter=2000, burn_in=1000)
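
All three calls above set iter=2000 and burn_in=1000. Under the usual MCMC convention, which we assume applies here, the first burn_in iterations are discarded and posterior summaries are computed from the remaining iter - burn_in draws. The sketch below illustrates that idea on a toy chain in base R. The chain is a simple autoregressive process chosen for illustration only; it is not ASSIGN's sampler, and the variable names are hypothetical.

```r
set.seed(42)
iter <- 2000
burn_in <- 1000

# Toy chain standing in for one sampled parameter. It starts far from
# its long-run mean of 0.5 and drifts toward it over the early iterations.
chain <- numeric(iter)
chain[1] <- 5
for (t in 2:iter) {
  chain[t] <- 0.5 + 0.9 * (chain[t - 1] - 0.5) + rnorm(1, sd = 0.05)
}

# Discard the burn-in iterations and summarize the rest.
kept <- chain[(burn_in + 1):iter]
posteriorMean <- mean(kept)
```

Here 1000 draws are retained. Choosing burn_in too small would let the early, unconverged values bias the summary, which is one reason to inspect a convergence plot before trusting the estimates.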

2.3 Run ASSIGN step-by-step

We developed a series of functions: assign.preprocess, assign.mcmc, assign.convergence, assign.summary, assign.cv.output, and assign.output that work in concert to produce detailed results.

2.3.1 assign.preprocess

We first run the assign.preprocess function on the input datasets. When the genomic measurements (e.g., gene expression profiles) of training samples are provided, but predetermined pathway signature gene lists are not provided, the assign.preprocess function utilizes a Bayesian univariate regression module to select a gene set (usually 50-200 genes, but this can be specified by the user) based on the absolute value of the regression coefficient (fold change) and the posterior probability of the variable to be selected (statistical significance). Since we have no predetermined gene lists to provide, we leave the geneList option as default NULL. Here we specify 200 signature genes for each of the five pathways.

# training dataset is available;
# the gene list of pathway signature is NOT available
processed.data <- assign.preprocess(trainingData=trainingData1,
                                    testData=testData1,
                                    trainingLabel=trainingLabel1,
                                    geneList=NULL, n_sigGene=rep(200,5))
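
To make the selection idea concrete, the sketch below ranks genes by the absolute difference between perturbed and control group means and keeps the top n_sigGene. This is a deliberately simplified stand-in: it is not the Bayesian univariate regression used by assign.preprocess and it ignores the posterior inclusion probability. All data and variable names other than n_sigGene are hypothetical.

```r
set.seed(7)
nGenes <- 100
geneNames <- paste0("gene", seq_len(nGenes))

# Hypothetical log-scale expression: 10 control and 10 perturbed samples.
control <- matrix(rnorm(nGenes * 10), nrow = nGenes,
                  dimnames = list(geneNames, NULL))
perturbed <- matrix(rnorm(nGenes * 10), nrow = nGenes,
                    dimnames = list(geneNames, NULL))

# Shift the first five genes so they are truly differentially expressed.
perturbed[1:5, ] <- perturbed[1:5, ] + 4

# The difference of group means stands in for the regression coefficient.
effect <- rowMeans(perturbed) - rowMeans(control)

# Keep the n_sigGene genes with the largest absolute effect.
n_sigGene <- 5
selected <- names(sort(abs(effect), decreasing = TRUE))[seq_len(n_sigGene)]
```

In this toy setting the selected set should recover the five shifted genes. In the vignette's call, n_sigGene=rep(200,5) requests 200 genes for each of the five pathways.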

Alternatively, users may have both the training data and curated/predetermined pathway signatures. Some genes in the curated pathway signatures, although not significantly differentially expressed, need to be included for the purpose of prediction. In this case, we specify both the trainingData and geneList parameters.

# training dataset is available;
# the gene list of pathway signature is a