# 1. Introduction

pathwayPCA is an integrative analysis tool that implements the principal component analysis (PCA) based pathway analysis approaches described in Chen et al. (2008), Chen et al. (2010), and Chen (2011). pathwayPCA allows users to:

1. Test pathway association with binary, continuous, or survival phenotypes.
2. Extract relevant genes in the pathways using the SuperPCA and AESPCA approaches.
3. Compute principal components (PCs) based on the selected genes. These estimated latent variables represent pathway activities for individual subjects, which can then be used to perform integrative pathway analysis, such as multi-omics analysis.
4. Extract relevant genes that drive pathway significance as well as data corresponding to these relevant genes for additional in-depth analysis.
5. Perform analyses with enhanced computational efficiency with parallel computing and enhanced data safety with S4-class data objects.
6. Analyze studies with complex experimental designs, with multiple covariates, and with interaction effects, e.g., testing whether pathway association with clinical phenotype is different between male and female subjects.

## Installing the Package

pathwayPCA is a package for R, so you need R first. We also strongly recommend the RStudio integrated development environment as a user-friendly graphical wrapper for R.

### Stable Build

The stable build of our package will be available on Bioconductor in May of 2019. To access Bioconductor packages, first install BiocManager, then use BiocManager to install this package:

install.packages("BiocManager")
BiocManager::install("pathwayPCA")

### Development Build

Because we are currently in the development phase for version 2 of this package, you can install the package from GitHub. In order to install a package from GitHub, you will need the devtools:: package (https://github.com/r-lib/devtools) and either Rtools (for Windows) or Xcode (for Mac). Then you can install the development version of the pathwayPCA package from GitHub:

devtools::install_github("gabrielodom/pathwayPCA")

Throughout this vignette, we will make use of the tidyverse suite of utility packages (https://www.tidyverse.org/). The tidyverse and pathwayPCA can be loaded into R using:

# 2. Case study: identifying significant pathways in protein expressions associated with survival outcome in ovarian cancer data

## 2.1. Ovarian cancer dataset

For this example, we will use the mass spectrometry based global proteomics data for ovarian cancer recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The normalized protein abundance expression dataset can be obtained from the LinkedOmics database at http://linkedomics.org/data_download/TCGA-OV/. We used the dataset “Proteome (PNNL, Gene level)” which was generated by the Pacific Northwest National Laboratory (PNNL). One subject was removed due to missing survival outcome. Missing values in proteins expression data were imputed using the Bioconductor package impute under default settings. The final dataset consisted of 5162 protein expression values for 83 samples.

## 2.2. Creating an Omics data object for pathway analysis

First, we need to create an Omics-class data object that stores

1. the expression dataset
2. phenotype information for the samples
3. a collection of pathways

### 2.2.1 Expression and Phenotype Data

We can obtain datasets 1 and 2 for the ovarian cancer dataset by loading the ovarian_PNNL_survival.RDS data file included in the pathwayPCAdata supplement repository: https://github.com/lizhongliu1996/pathwayPCAdata.

The ovSurv_df dataset is a data frame with protein expression levels and survival outcome matched by sample IDs. The variables (columns) include overall survival time and censoring status, as well as expression data for 5162 proteins for each of the 83 samples.

### 2.2.2 Pathway Collections

For the collection of pathways in (3), we need to specify a .gmt file, a text file with each row corresponding to one pathway. Each row contains an ID (column TERMS), an optional description (column description), and the genes in the pathway (all subsequent columns). Pathway collections in .gmt form can be downloaded from the MSigDB database at http://software.broadinstitute.org/gsea/msigdb/collections.jsp.

For WikiPathways, one can download monthly data releases in .gmt format using the dowloadPathwayArchive() function in the rWikiPathways package from Bioconductor. For example, the following commands downloads the latest release of the human pathways from WikiPathways to your current directory:

trying URL 'http://data.wikipathways.org/current/gmt/wikipathways-20190110-gmt-Homo_sapiens.gmt'
Content type '' length 174868 bytes (170 KB)
#> [1] "wikipathways-20190110-gmt-Homo_sapiens.gmt"

pathwayPCA includes the June 2018 Wikipathways collection for homo sapiens, which can be loaded using the read_gmt function:

### 2.2.3 Create an OmicsSurv Data Container

Now that we have these three data components, we create an OmicsSurv data container. Note that when assayData_df and response are supplied from two different files, the user must match and merge these data sets by sample IDs.

To see a summary of the Omics data object we just created, simply type the name of the object:

## 2.3. Testing pathway association with phenotypes

### 2.3.1 Method Description

Once we have a valid Omics object, we can perform pathway analysis using the AES-PCA (Adaptive, Elastic-net, Sparse PCA) or Supervised PCA (SuperPCA) methodology described in Chen et al. (2008), Chen et al. (2010), and Chen (2011).

Briefly, in the AES-PCA method, we first extract PCs representing activities within each pathway using a dimension reduction approach based on adaptive, elastic-net, sparse principal component analysis (https://doi.org/10.2202/1544-6115.1697). The estimated latent variables are then tested against phenotypes using a permutation test that permutes sample labels. Note that the AESPCA approach does not use the response information to estimate pathway PCs, so it is an unsupervised approach.

This is in contrast to the SuperPCA approach, where a selected subset of genes most associated with disease outcome are used to estimate the latent variable for a pathway (https://doi.org/10.1002/gepi.20532). Because of this gene selection step, the test statistics from the SuperPCA model can no longer be approximated well using the Student’s $$t$$-distribution. To account for the gene selection step, pathwayPCA estimates $$p$$-values from a two-component mixture of Gumbel extreme value distributions instead.

### 2.3.2 Implementation

Because the syntax for performing SuperPCA is nearly identical to the AES-PCA syntax, we will illustrate only the AES-PCA workflow below.

Note that when the value supplied to the numReps argument is greater than 0, the AESPCA_pvals() function employs a parametric test when estimating pathway significance via the following model: “phenotype ~ intercept + PC1”. Pathway $$p$$-values are estimated based on a likelihood ratio test that compares this model to a null model (with intercept only).

This ovarian_aespcOut object contains 3 components: a table of pathway $$p$$-values, AESPCA-estimated PCs of each sample from each pathway, and the loadings of each protein onto the AESPCs.

### 2.3.3 The pathway analysis results

For this ovarian cancer dataset, the top 10 most significant pathways identified by AES-PCA are:

Before constructing a graph of the $$p$$-values, we extract the top 20 pathways (the default value for numPaths is 20):

Now we plot the pathway significance level for the top 20 pathways. In this figure, score indicates the negative natural logarithm of the unadjusted $$p$$-values for each pathway.

### 2.3.4 Extract relevant genes from significant pathways

Because pathways are defined a priori, typically only a subset of genes within each pathway are relevant to the phenotype and contribute to pathway significance. In AESPCA, these relevant genes are the genes with nonzero loadings in the first PC of AESPCs.

For example, for the “IL-1 signaling pathway” (Wikipathways WP195), we can extract the PCs and their protein Loadings using the getPathPCLs() function:

The proteins with non-zero loadings can be extracted using the following lines:

Now we will construct a column chart with ggplot2’s geom_col() function.

Alternatively, we can also plot the correlation of each gene with first PC for each gene. These correlations can be computed by using the TidyCorrelation() function in Section 3.3 of additional vignette “Chapter 5 - Visualizing the Results”.

### 2.3.5 Subject-specific PCA estimates

In the study of complex diseases, there is often considerable heterogeneity among different subjects with regard to underlying causes of disease and benefit of particular treatment. Therefore, in addition to identifying disease-relevant pathways for the entire patient group, successful (personalized) treatment regimens will also depend upon knowing if a particular pathway is dysregulated for an individual patient.

To this end, we can also assess subject-specific pathway activity. As we saw earlier, the getPathPCLs() function also returns subject-specific estimates for the individual pathway PCs.

This graph shows there can be considerable heterogeneity in pathway activities between the patients.

### 2.3.6 Extract analysis dataset for significant pathways

Users are often also interested in examining the actual data used for analysis of the top pathways, especially for the relevant genes with the pathway. To extract this dataset, we can use the SubsetPathwayData() function. These commands extract data for the most significant pathway (IL-1 signaling):

In order to plot a heatmap of the proteins in this pathway, we need to “gather” the data frame into three columns.

Now for the heatmap:

We can also perform analysis for individual genes belonging to the pathway:

Additionally, we can estimate Kaplan-Meier survival curves for patients with high or low expression values for individual genes:

Finally, we can plot these K-M curves over NFKB1 protein expression.

# 3. Case study: an integrative multi-omics pathway analysis of ovarian cancer data

While copy number alterations are common genomic aberrations in ovarian carcer, recent studies have shown these changes do not necessarily lead to concordant changes in protein expression. In Section 2.3 above, we illustrated testing pathway activities in protein expression against survival outcome. In this section, we will additionally test pathway activities in copy number against survival outcome. Moreover, we will perform integrative analysis to identify those survival -associated protein pathways, genes, and samples driven by copy number alterations.

## 3.1 Creating copy number Omics data object for pathway analysis

We can identify copy number (CNV) pathways significantly associated with survival in the same way as we did for protein expressions. This gene level CNV data was downloaded from UCSC Xena Functional Genomics Browser (http://xena.ucsc.edu/).

And now we create an Omics data container.

======  Creating object of class OmicsSurv  =======
The input pathway database included 5831 unique features.
The input assay dataset included 24776 features.
Only pathways with at least 5 or more features included in the assay dataset are
tested (specified by minPathSize parameter). There are 424 pathways which meet
this criterion.
Because pathwayPCA is a self-contained test (PMID: 17303618), only features in
both assay data and pathway database are considered for analysis. There are 5637
such features shared by the input assay and pathway database.

Finally, we can apply the AESPCA method to this copy-number data container. Due to the large sample size, this will take a few moments.

Part 1: Calculate Pathway AES-PCs
Initializing Computing Cluster: DONE
Extracting Pathway PCs in Parallel: DONE

Part 2: Calculate Pathway p-Values
Initializing Computing Cluster: DONE
Extracting Pathway p-Values in Parallel: DONE

Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame
DONE

## 3.2 Identifying significant pathways and relevant genes in both CNV and protein level

Next, we identify the intersection of significant pathways based on both CNV and protein data. First, we will create a data frame of the pathway $$p$$-values from both CNV and proteomics.

The results showed there are 23 pathways significantly associated with survival in both CNV and protein data, which is significantly more than expected by chance (p-value = 0.02088; Fisher’s Exact Test; shown in multi_pathway_overlap_fishers.R). Here are the top-10 most significant pathways (sorted by protein data significance):

Similar to the protein pathway analysis shown in Section 2.3.4, we can also identify relevant genes with nonzero loadings that drives pathway significance in CNV. The “IL-1 signaling pathway” (WP195) is significant in both CNV and protein data.

The result showed the NFKB1 gene was selected by AESPCA when testing IL-1 signaling pathway (WP195) with survival outcome in both CNV and protein pathway analysis.

Because the proteomics data were recorded on a subset of the subjects shown in the copy number data, we can further examine the relationship between CNV and protein expressions for this gene.

This Pearson test shows that the correlation between CNV and protein expression is highly significant for this gene.

We can then visualize the multi-omics relationship for the NFKB1 gene using a scatterplot.

## 3.3 An integrative view on patient-specific pathway activities

In Section 2.3.5, we have seen that there can be considerable heterogeneity in pathway activities between the patients. One possible reason could be that copy number changes might not result in changes in protein expression for some of the patients. pathwayPCA can be used to estimate pathway activities for each patient, both for protein expressions and copy number expressions separately. These estimates can then be viewed jointly using a Circos plot.

The accompanying Circos plot shown normalized copy number (outer circle) and protein expression (inner circle) pathway activities for the IL-1 signaling pathway in the ovarian cancer dataset samples. Each bar corresponds to a patient sample. Red bars indicate higher expression values and more pathway activity for the sample, while blue color bars indicate lower expression values and lower pathway activity for the sample. Note that only some patients have concordant changes in copy number and protein expression.

# 4. Case study: analysis of studies with complex designs

pathwayPCA is capable of analyzing studies with multiple experimental factors. In this section, we illustrate using pathwayPCA to test differential association of pathway expression with survival outcome in male and female subjects.

## 4.2 Test for sex interaction with first PC

Next we can test whether there is differential pathway association with survival outcome for males and females by the following model, $h(t) = h_0(t) \exp\left[ \beta_0 + \beta_1\text{PC}_1 + \beta_2\text{male} + \beta_3(\text{PC}_1 \times \text{male}) \right].$

In this model, $$h(t)$$ is expected hazard at time $$t$$, $$h_0 (t)$$ is baseline hazard when all predictors are zero, variable $$male$$ is an indicator variable for male samples, and $$PC_1$$ is a pathway’s estimated first principal component based on AESPCA.

In order to test the sex interaction effect for all pathways, we will write a function which tests the interaction effect for one pathway.

As an example, we can test it on patwhay WP195,

We can also apply this function to a list of pathways and select the significant pathways:

paths_char <- kidney_aespcOut$pVals_df$terms
interactions_ls <- lapply(
paths_char,
FUN = TestIntxn,
pcaOut = kidney_aespcOut,
resp_df = kidney_df[, 2:4]
)
names(interactions_ls) <- paths_char

interactions_df <-
# Take list of interactions
interactions_ls %>%
# select the first element (the data frame of model stats)
lapply([[, 1) %>%
# stack these data frames
bind_rows() %>%
as.tibble() %>%
# sort the rows by significance
arrange(Pr...z..)
#> Warning: as.tibble() was deprecated in tibble 2.0.0.
#> Please use as_tibble() instead.
#> The signature and semantics have changed, see ?as_tibble.
#> This warning is displayed once every 8 hours.
#> Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.

interactions_df %>%
filter(Pr...z.. < 0.05)
#> # A tibble: 14 × 7
#>    terms  description                     coef exp.coef. se.coef.     z Pr...z..
#>    <chr>  <chr>                          <dbl>     <dbl>    <dbl> <dbl>    <dbl>
#>  1 WP1559 TFs Regulate miRNAs related … -0.597     0.550    0.216 -2.76  0.00573
#>  2 WP453  Inflammatory Response Pathway -0.272     0.762    0.109 -2.50  0.0126
#>  3 WP3893 Development and heterogeneit… -0.329     0.719    0.137 -2.40  0.0166
#>  4 WP3929 Chemokine signaling pathway   -0.252     0.777    0.109 -2.32  0.0206
#>  5 WP2849 Hematopoietic Stem Cell Diff… -0.278     0.757    0.129 -2.16  0.0305
#>  6 WP1423 Ganglio Sphingolipid Metabol… -0.487     0.615    0.230 -2.11  0.0346
#>  7 WP3892 Development of pulmonary den… -0.316     0.729    0.151 -2.10  0.0360
#>  8 WP3678 Amplification and Expansion … -0.357     0.700    0.171 -2.09  0.0363
#>  9 WP4141 PI3K/AKT/mTOR - VitD3 Signal…  0.314     1.37     0.153  2.05  0.0407
#> 10 WP3941 Oxidative Damage              -0.354     0.702    0.176 -2.01  0.0442
#> 11 WP3863 T-Cell antigen Receptor (TCR… -0.233     0.792    0.117 -1.99  0.0462
#> 12 WP3872 Regulation of Apoptosis by P…  0.267     1.31     0.135  1.98  0.0472
#> 13 WP3672 LncRNA-mediated mechanisms o… -0.392     0.675    0.198 -1.98  0.0475
#> 14 WP3967 miR-509-3p alteration of YAP… -0.299     0.741    0.151 -1.98  0.0481

The results showed the most significant pathway is WP1559 (“TFs Regulate miRNAs related to cardiac hypertrophy”). We can inspect the model results for this pathway directly.

The results showed that although sex is not significantly associated with survival outcome, the association of pathway gene expression (PC1) with survival is highly dependent on sex of the samples.

## 4.3 Survival curves by sex interaction

For visualization, we can divide subjects according to sex and high or low PC-expression, and add a group indicator for the four groups.

Now we can plot survival curves for the four groups.

These Kaplan-Meier curves showed that while high or low pathway activities were not associated with survival in male subjects (green and purple curves, respectively), female subjects with high pathway activities (red) had significantly worse survival outcomes than those with low pathway activities (blue).

For addtional information on pathwayPCA, please see each of our supplementary vignette chapters for detailed tutorials on each of the three topics discussed above. These vignettes are:

# 6. References

Chen, X., Wang, L., Smith, J.D. and Zhang, B. (2008) Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics, 24, 2474-2481.

Chen, X., Wang, L., Hu, B., Guo, M., Barnard, J. and Zhu, X. (2010) Pathway-based analysis for genome-wide association studies using supervised principal components. Genetic epidemiology, 34, 716-724.

Chen, X. (2011) Adaptive elastic-net sparse principal component analysis for pathway association testing. Statistical applications in genetics and molecular biology, 10.

The R session information for this vignette is:

sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#>
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_GB              LC_COLLATE=C
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] survminer_0.4.9   ggpubr_0.4.0      survival_3.3-1    pathwayPCA_1.12.0
#>  [5] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.8       purrr_0.3.4
#>  [9] readr_2.1.2       tidyr_1.2.0       tibble_3.1.6      ggplot2_3.3.5
#> [13] tidyverse_1.3.1
#>
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.2        sass_0.4.1        jsonlite_1.8.0    splines_4.2.0
#>  [5] carData_3.0-5     modelr_0.1.8      bslib_0.3.1       assertthat_0.2.1
#>  [9] highr_0.9         cellranger_1.1.0  yaml_2.3.5        pillar_1.7.0
#> [13] backports_1.4.1   lattice_0.20-45   glue_1.6.2        digest_0.6.29
#> [17] ggsignif_0.6.3    rvest_1.0.2       colorspace_2.0-3  htmltools_0.5.2
#> [21] Matrix_1.4-1      pkgconfig_2.0.3   broom_0.8.0       haven_2.5.0
#> [25] xtable_1.8-4      scales_1.2.0      km.ci_0.5-6       KMsurv_0.1-5
#> [29] tzdb_0.3.0        mgcv_1.8-40       generics_0.1.2    farver_2.1.0
#> [33] car_3.0-12        ellipsis_0.3.2    withr_2.5.0       cli_3.3.0
#> [37] magrittr_2.0.3    crayon_1.5.1      readxl_1.4.0      evaluate_0.15
#> [41] fs_1.5.2          fansi_1.0.3       nlme_3.1-157      rstatix_0.7.0
#> [45] xml2_1.3.3        data.table_1.14.2 tools_4.2.0       hms_1.1.1
#> [49] lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1      compiler_4.2.0
#> [53] jquerylib_0.1.4   rlang_1.0.2       grid_4.2.0        rstudioapi_0.13
#> [57] labeling_0.4.2    rmarkdown_2.14    gtable_0.3.0      lars_1.3
#> [61] abind_1.4-5       DBI_1.1.2         R6_2.5.1          zoo_1.8-10
#> [65] gridExtra_2.3     lubridate_1.8.0   knitr_1.38        fastmap_1.1.0
#> [69] survMisc_0.5.6    utf8_1.2.2        stringi_1.7.6     parallel_4.2.0
#> [73] vctrs_0.4.1       dbplyr_2.1.1      tidyselect_1.1.2  xfun_0.30