Installation

To install and load methylGSA

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("methylGSA")
library(methylGSA)

Depending on the DNA methylation array type, other packages may be needed before running the analysis.

If analyzing 450K array, IlluminaHumanMethylation450kanno.ilmn12.hg19 needs to be loaded.

library(IlluminaHumanMethylation450kanno.ilmn12.hg19)

If analyzing EPIC array, IlluminaHumanMethylationEPICanno.ilm10b4.hg19 needs to be loaded.

library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

If analyzing user-supplied mapping between CpG ID and gene name, neither IlluminaHumanMethylation450kanno.ilmn12.hg19 nor IlluminaHumanMethylationEPICanno.ilm10b4.hg19 needs to be loaded.

Introduction

The methylGSA package contains functions to carry out gene set analysis adjusting for the number of CpGs per gene. It has been shown by Geeleher et al [1] that gene set analysis is extremely biased for DNA methylation data. This package contains three main functions to adjust for the bias in gene set analysis.

Supported gene sets and gene ID types

Supported array types

Description of methylglm

methylglm is an extention of GOglm [9]. GOglm adjusts length bias for RNA-Seq data by incorporating gene length as a covariate in logistic regression model. methylglm adjusts length bias in DNA methylation by the number of CpGs instead of gene length. For each gene set, we fit a logistic regression model:

\[ \text{logit} (\pi_{i}) = \beta_{0} + \beta_{1}x_{i} + \beta_{2}c_{i}\] For each gene \(i\), \(\pi_{i}\) = Pr(gene \(i\) is in gene set), \(x_{i}\) represents negative logarithmic transform of its minimum p-value in differential methylation analysis, and \(c_{i}\) is logarithmic transform of its number of CpGs.

methylglm requires only a simple named vector. This vector contains p-values of each CpG. Names should be their corresponding CpG IDs.

Example

Here is what the input vector looks like:

data(cpgtoy)
head(cpg.pval, 20)
cg00050873 cg00212031 cg00214611 cg01707559 cg02004872 cg02011394 cg02050847 
    0.2876     0.7883     0.4090     0.8830     0.9405     0.0456     0.5281 
cg02233190 cg02494853 cg02839557 cg02842889 cg03052502 cg03244189 cg03443143 
    0.8924     0.5514     0.4566     0.9568     0.4533     0.6776     0.5726 
cg03683899 cg03706273 cg03750315 cg04016144 cg04042030 cg04448376 
    0.1029     0.8998     0.2461     0.0421     0.3279     0.9545 

Please note that the p-values presented here in cpg.pval is for illustration purpose only. They are used to show how to utilize the functions in methylGSA. It does not represent p-values from any real data analysis. The actual p-values in differential methylation analysis may be quite different from the p-values in cpg.pval in terms of the magnitude.

Run methylglm:

res1 = methylglm(cpg.pval = cpg.pval, minsize = 200, 
                 maxsize = 500, GS.type = "KEGG")
head(res1, 15)
         ID                             Description Size    pvalue padj
04080 04080 Neuroactive ligand-receptor interaction  272 0.4811308    1
04060 04060  Cytokine-cytokine receptor interaction  265 0.6961531    1
04010 04010                  MAPK signaling pathway  268 1.0000000    1
04144 04144                             Endocytosis  201 1.0000000    1
04510 04510                          Focal adhesion  200 1.0000000    1
04740 04740                  Olfactory transduction  388 1.0000000    1
04810 04810        Regulation of actin cytoskeleton  213 1.0000000    1
05200 05200                      Pathways in cancer  326 1.0000000    1

Result is a data frame ranked by p-values of gene sets. The meaning of each column is given below.

Column Explanation
ID Gene set ID
Description (N/A for user supplied gene set) Gene set description
Size Number of genes in gene set
pvalue p-value in logistic regression
padj Adjusted p-value

Bioconductor package org.Hs.eg.db can be used to pull out the genes corresponding a specific pathway. For instance, one of the pathways in the methylglm output above is “Neuroactive ligand-receptor interaction” with KEGG ID 04080. The following code can be used to obtain its genes.

library(org.Hs.eg.db)
genes_04080 = select(org.Hs.eg.db, "04080", "SYMBOL", keytype = "PATH")
head(genes_04080)
   PATH    SYMBOL
1 04080 ADCYAP1R1
2 04080    ADORA1
3 04080   ADORA2A
4 04080   ADORA2B
5 04080    ADORA3
6 04080    ADRA1D

The following code can be used to get the genes in all the pathways in methylglm output.

# include all the IDs as the 2nd argument in select function
genes_all_pathway = select(org.Hs.eg.db, as.character(res1$ID), 
                     "SYMBOL", keytype = "PATH")
head(genes_all_pathway)

Description of methylRRA

Robust rank aggregation [2] is a parameter free model that aggregates several ranked gene lists into a single gene list. The aggregation assumes random order of input lists and assign each gene a p-value based on order statistics. We apply this order statistics idea to adjust for number of CpGs.

For gene \(i\), let \(P_{1}, P_{2}, ... P_{n}\) be the p-values of individual CpGs in differential methylation analysis. Under the null hypothesis, \(P_{1}, P_{2}, ... P_{n} ~ \overset{i.i.d}{\sim} Unif[0, 1]\). Let \(P_{(1)}, P_{(2)}, ... P_{(n)}\) be the order statistics. Define: \[\rho = \text{min}\{\text{Pr}(P_{(1)}<P_{(1)\text{obs}}), \text{Pr}(P_{(2)}<P_{(2)\text{obs}})..., \text{Pr}(P_{(n)}<P_{(n)\text{obs}}) \} \]

methylRRA supports two approaches to adjust for number of CpGs, ORA and GSEAPreranked [3]. In ORA approach, for gene \(i\), conversion from \(\rho\) score into p-value is done by Bonferroni correction [2]. We get a p-value for each gene and these p-values are then corrected for multiple testing use Benjamini & Hochberg procedure [10]. By default, genes with False Discovery Rate (FDR) below 0.05 are considered differentially expressed (DE) genes. If there are no DE genes under FDR 0.05, users are able to use sig.cut option to specify a higher FDR cut-off or topDE option to declare top genes to be differentially expressed. We then apply ORA based on these DE genes.

In GSEAPreranked approach, for gene \(i\), we also convert \(\rho\) score into p-value by Bonferroni correction. p-values are converted into z-scores. We then apply Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) on the gene list ranked by the z-scores.

Example

To apply ORA approach, we use argument method = "ORA" (default) in methylRRA

res2 = methylRRA(cpg.pval = cpg.pval, method = "ORA", 
                    minsize = 200, maxsize = 210)
head(res2, 15)

The meaning of each column in the output is given below.

Column Explanation
ID Gene set ID
Description (N/A for user supplied gene set) Gene set description
Count Number of significant genes in the gene set
overlap Names of significant genes in the gene set
Size Number of genes in gene set
pvalue p-value in ORA
padj Adjusted p-value

To apply GSEAPreranked approach, we use argument method = "GSEA" in methylRRA

res3 = methylRRA(cpg.pval = cpg.pval, method = "GSEA", 
                    minsize = 200, maxsize = 210)
head(res3, 10)

The meaning of each column in the output is given below.

Column Explanation
ID Gene set ID
Description (N/A for user supplied gene set) Gene set description
Size Number of genes in gene set
enrichmentScore Enrichment score (see [3] for details)
NES Normalized enrichment score (see [3] for details)
pvalue p-value in GSEA
padj Adjusted p-value

Description of methylgometh

methylgometh calls gometh or gsameth function in missMethyl package [4] to adjust number of CpGs in gene set testing. gometh modifies goseq method [11] by fitting a probability weighting function and resampling from Wallenius non-central hypergeometric distribution.

methylgometh requires two inputs, cpg.pval and sig.cut. sig.cut specifies the cut-off point to declare a CpG as differentially methylated. By default, sig.cut is 0.001. Similar to methylRRA, if no CpG is significant, users are able to specify a higher cut-off or use topDE option to declare top CpGs to be differentially methylated.

Example

res4 = methylgometh(cpg.pval = cpg.pval, sig.cut = 0.001, 
                        minsize = 200, maxsize = 210)
head(res4, 15)

Other options

methylGSA provides many other options for users to customize the analysis.

Examples

An example of user supplied gene sets is given below. The gene ID type is gene symbol.

data(GSlisttoy)
## to make the display compact, only a proportion of each gene set is shown
head(lapply(GS.list, function(x) x[1:30]), 3)   
$GS1
 [1] "ABCA11P"   "ACOT1"     "ACSM2A"    "ADAMTS4"   "ADH4"      "AGTR2"    
 [7] "AMAC1"     "AMY1B"     "AMY2A"     "ANKRD13C"  "ANKRD20A1" "ANKRD20A3"
[13] "ANKRD20A4" "ANXA2P3"   "ANXA8"     "ANXA8L1"   "ARGFXP2"   "ARL17B"   
[19] "ATF5"      "ATP5F1"    "BAGE2"     "BCL8"      "BET3L"     "C12orf24" 
[25] "C15orf62"  "C16orf93"  "C17orf86"  "C19orf70"  "C1orf182"  "C22orf29" 

$GS2
 [1] "KRTAP5-5"     "KRTAP6-1"     "KRTAP9-8"     "KTI12"        "LASS1"       
 [6] "LCMT2"        "LGALS9B"      "LOC100093631" "LOC100101115" "LOC100101266"
[11] "LOC100130264" "LOC100130932" "LOC100131193" "LOC100131726" "LOC100132247"
[16] "LOC100132832" "LOC100133920" "LOC100286938" "LOC144438"    "LOC145820"   
[21] "LOC148413"    "LOC202781"    "LOC286367"    "LOC339047"    "LOC388499"   
[26] "LOC392196"    "LOC399753"    "LOC440895"    "LOC441294"    "LOC441455"   

$GS3
 [1] "SNORA17"     "SNORA23"     "SNORA25"     "SNORA2B"     "SNORA31"    
 [6] "SNORA36C"    "SNORA64"     "SNORD11"     "SNORD113-5"  "SNORD113-7" 
[11] "SNORD114-1"  "SNORD114-16" "SNORD114-17" "SNORD114-18" "SNORD114-21"
[16] "SNORD114-27" "SNORD114-28" "SNORD114-5"  "SNORD114-6"  "SNORD114-8" 
[21] "SNORD114-9"  "SNORD116-16" "SNORD116-26" "SNORD116-29" "SNORD12"    
[26] "SNORD125"    "SNORD126"    "SNORD16"     "SNORD38B"    "SNORD50A"   

Below is an example of running methylglm with parallel option

library(BiocParallel)
res_p = methylglm(cpg.pval = cpg.pval, minsize = 200, 
                  maxsize = 500, GS.type = "KEGG", parallel = TRUE)

methylglm and methylRRA support user supplied CpG ID to gene mapping. The mapping is expected to be a matrix, or a data frame or a list. For a matrix or data frame, 1st column should be CpG ID and 2nd column should be gene name. For a list, entry names should be gene names and elements correpond to CpG IDs. Below an example of user supplied CpG to gene mapping.

data(CpG2Genetoy)
head(CpG2Gene)   
         CpG   Gene
1 cg00050873  TSPY4
2 cg00212031 TTTY14
3 cg00214611 TMSB4Y
4 cg01707559  TBL1Y
5 cg02004872 TMSB4Y
6 cg02011394  TSPY4

To use user supplied mapping in methylglm or methylRRA, first preprocess the mapping by prepareAnnot function

FullAnnot = prepareAnnot(CpG2Gene) 

Test the gene sets using “ORA” in methylRRA, use FullAnnot argument to provide the preprocessed CpG ID to gene mapping

GS.list = GS.list[1:10]
res5 = methylRRA(cpg.pval = cpg.pval, FullAnnot = FullAnnot, method = "ORA", 
                    GS.list = GS.list, GS.idtype = "SYMBOL", 
                    minsize = 100, maxsize = 300)
head(res5, 10)
       ID Count Size pvalue padj
GS1   GS1     0  157      1    1
GS2   GS2     0  257      1    1
GS3   GS3     0  181      1    1
GS4   GS4     0  274      1    1
GS5   GS5     0  285      1    1
GS6   GS6     0  108      1    1
GS7   GS7     0  202      1    1
GS8   GS8     0  273      1    1
GS9   GS9     0  206      1    1
GS10 GS10     0  187      1    1

Below is another example testing Reactome pathways using methylglm.

res6 = methylglm(cpg.pval = cpg.pval, array.type = "450K", 
                    GS.type = "Reactome", minsize = 100, maxsize = 110)
head(res6, 10)

Visualization

Following bar plot implemented in enrichplot [12], we also provide bar plot to visualize the gene set analysis results. The input of barplot function can be any result returned by methylglm, methylRRA, or methylgometh. Various options are provided for users to customize the plot.

Example

Below is an example of using barplot to visualize the result of methylglm.

barplot(res1, num = 8, colorby = "pvalue")

Session info

sessionInfo()
R version 4.4.0 RC (2024-04-16 r86468)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.19.1                               
 [2] AnnotationDbi_1.67.0                              
 [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
 [4] minfi_1.51.0                                      
 [5] bumphunter_1.47.0                                 
 [6] locfit_1.5-9.9                                    
 [7] iterators_1.0.14                                  
 [8] foreach_1.5.2                                     
 [9] Biostrings_2.73.0                                 
[10] XVector_0.45.0                                    
[11] SummarizedExperiment_1.35.0                       
[12] Biobase_2.65.0                                    
[13] MatrixGenerics_1.17.0                             
[14] matrixStats_1.3.0                                 
[15] GenomicRanges_1.57.0                              
[16] GenomeInfoDb_1.41.0                               
[17] IRanges_2.39.0                                    
[18] S4Vectors_0.43.0                                  
[19] BiocGenerics_0.51.0                               
[20] methylGSA_1.23.0                                  

loaded via a namespace (and not attached):
  [1] later_1.3.2                                        
  [2] splines_4.4.0                                      
  [3] BiocIO_1.15.0                                      
  [4] ggplotify_0.1.2                                    
  [5] bitops_1.0-7                                       
  [6] tibble_3.2.1                                       
  [7] polyclip_1.10-6                                    
  [8] preprocessCore_1.67.0                              
  [9] XML_3.99-0.16.1                                    
 [10] lifecycle_1.0.4                                    
 [11] lattice_0.22-6                                     
 [12] MASS_7.3-60.2                                      
 [13] base64_2.0.1                                       
 [14] scrime_1.3.5                                       
 [15] magrittr_2.0.3                                     
 [16] limma_3.61.0                                       
 [17] sass_0.4.9                                         
 [18] rmarkdown_2.26                                     
 [19] jquerylib_0.1.4                                    
 [20] yaml_2.3.8                                         
 [21] httpuv_1.6.15                                      
 [22] doRNG_1.8.6                                        
 [23] askpass_1.2.0                                      
 [24] cowplot_1.1.3                                      
 [25] DBI_1.2.2                                          
 [26] RColorBrewer_1.1-3                                 
 [27] abind_1.4-5                                        
 [28] zlibbioc_1.51.0                                    
 [29] quadprog_1.5-8                                     
 [30] purrr_1.0.2                                        
 [31] ggraph_2.2.1                                       
 [32] RCurl_1.98-1.14                                    
 [33] yulab.utils_0.1.4                                  
 [34] tweenr_2.0.3                                       
 [35] GenomeInfoDbData_1.2.12                            
 [36] enrichplot_1.25.0                                  
 [37] ggrepel_0.9.5                                      
 [38] tidytree_0.4.6                                     
 [39] reactome.db_1.88.0                                 
 [40] genefilter_1.87.0                                  
 [41] annotate_1.83.0                                    
 [42] DelayedMatrixStats_1.27.0                          
 [43] codetools_0.2-20                                   
 [44] DelayedArray_0.31.0                                
 [45] DOSE_3.31.0                                        
 [46] xml2_1.3.6                                         
 [47] ggforce_0.4.2                                      
 [48] tidyselect_1.2.1                                   
 [49] aplot_0.2.2                                        
 [50] UCSC.utils_1.1.0                                   
 [51] farver_2.1.1                                       
 [52] viridis_0.6.5                                      
 [53] beanplot_1.3.1                                     
 [54] illuminaio_0.47.0                                  
 [55] GenomicAlignments_1.41.0                           
 [56] jsonlite_1.8.8                                     
 [57] multtest_2.61.0                                    
 [58] tidygraph_1.3.1                                    
 [59] survival_3.6-4                                     
 [60] RobustRankAggreg_1.2.1                             
 [61] missMethyl_1.39.0                                  
 [62] tools_4.4.0                                        
 [63] treeio_1.29.0                                      
 [64] Rcpp_1.0.12                                        
 [65] glue_1.7.0                                         
 [66] gridExtra_2.3                                      
 [67] SparseArray_1.5.0                                  
 [68] xfun_0.43                                          
 [69] qvalue_2.37.0                                      
 [70] dplyr_1.1.4                                        
 [71] HDF5Array_1.33.0                                   
 [72] withr_3.0.0                                        
 [73] fastmap_1.1.1                                      
 [74] rhdf5filters_1.17.0                                
 [75] fansi_1.0.6                                        
 [76] openssl_2.1.2                                      
 [77] digest_0.6.35                                      
 [78] mime_0.12                                          
 [79] gridGraphics_0.5-1                                 
 [80] R6_2.5.1                                           
 [81] colorspace_2.1-0                                   
 [82] GO.db_3.19.1                                       
 [83] RSQLite_2.3.6                                      
 [84] utf8_1.2.4                                         
 [85] tidyr_1.3.1                                        
 [86] generics_0.1.3                                     
 [87] data.table_1.15.4                                  
 [88] rtracklayer_1.65.0                                 
 [89] graphlayouts_1.1.1                                 
 [90] httr_1.4.7                                         
 [91] S4Arrays_1.5.0                                     
 [92] scatterpie_0.2.2                                   
 [93] pkgconfig_2.0.3                                    
 [94] gtable_0.3.5                                       
 [95] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
 [96] blob_1.2.4                                         
 [97] siggenes_1.79.0                                    
 [98] shadowtext_0.1.3                                   
 [99] clusterProfiler_4.13.0                             
[100] htmltools_0.5.8.1                                  
[101] fgsea_1.31.0                                       
[102] scales_1.3.0                                       
[103] png_0.1-8                                          
[104] ggfun_0.1.4                                        
[105] knitr_1.46                                         
[106] tzdb_0.4.0                                         
[107] reshape2_1.4.4                                     
[108] rjson_0.2.21                                       
[109] nlme_3.1-164                                       
[110] curl_5.2.1                                         
[111] cachem_1.0.8                                       
[112] rhdf5_2.49.0                                       
[113] stringr_1.5.1                                      
[114] HDO.db_0.99.1                                      
[115] restfulr_0.0.15                                    
[116] GEOquery_2.73.0                                    
[117] pillar_1.9.0                                       
[118] grid_4.4.0                                         
[119] reshape_0.8.9                                      
[120] vctrs_0.6.5                                        
[121] promises_1.3.0                                     
[122] xtable_1.8-4                                       
[123] evaluate_0.23                                      
[124] readr_2.1.5                                        
[125] GenomicFeatures_1.57.0                             
[126] cli_3.6.2                                          
[127] compiler_4.4.0                                     
[128] Rsamtools_2.21.0                                   
[129] rlang_1.1.3                                        
[130] crayon_1.5.2                                       
[131] rngtools_1.5.2                                     
[132] labeling_0.4.3                                     
[133] nor1mix_1.3-3                                      
[134] mclust_6.1.1                                       
[135] plyr_1.8.9                                         
[136] fs_1.6.4                                           
[137] stringi_1.8.3                                      
[138] viridisLite_0.4.2                                  
[139] BiocParallel_1.39.0                                
[140] munsell_0.5.1                                      
[141] lazyeval_0.2.2                                     
[142] GOSemSim_2.31.0                                    
[143] Matrix_1.7-0                                       
[144] patchwork_1.2.0                                    
[145] hms_1.1.3                                          
[146] sparseMatrixStats_1.17.0                           
[147] bit64_4.0.5                                        
[148] ggplot2_3.5.1                                      
[149] Rhdf5lib_1.27.0                                    
[150] shiny_1.8.1.1                                      
[151] KEGGREST_1.45.0                                    
[152] statmod_1.5.0                                      
[153] highr_0.10                                         
[154] igraph_2.0.3                                       
[155] memoise_2.0.1                                      
[156] bslib_0.7.0                                        
[157] ggtree_3.13.0                                      
[158] fastmatch_1.1-4                                    
[159] bit_4.0.5                                          
[160] gson_0.1.0                                         
[161] ape_5.8                                            

References

[1] Geeleher, Paul, Lori Hartnett, Laurance J Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe. 2013. Gene-Set Analysis Is Severely Biased When Applied to Genome-Wide Methylation Data. Bioinformatics 29 (15). Oxford University Press: 1851–7.

[2] Kolde, Raivo, Sven Laur, Priit Adler, and Jaak Vilo. 2012. Robust Rank Aggregation for Gene List Integration and Meta-Analysis. Bioinformatics 28 (4). Oxford University Press: 573–80.

[3] Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proceedings of the National Academy of Sciences 102 (43). National Acad Sciences: 15545–50.

[4] Phipson, Belinda, Jovana Maksimovic, and Alicia Oshlack. 2015. MissMethyl: An R Package for Analyzing Data from Illumina’s Humanmethylation450 Platform. Bioinformatics 32 (2). Oxford University Press: 286–88.

[5] Carlson M (2018). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.6.0.

[6] Ligtenberg W (2018). reactome.db: A set of annotation maps for reactome. R package version 1.64.0.

[7] Hansen, KD. (2016). IlluminaHumanMethylation450kanno.ilmn12.hg19: Annotation for Illumina’s 450k Methylation Arrays. R Package, Version 0.6.0 1.

[8] Hansen, KD. (2017). IlluminaHumanMethylationEPICanno.ilm10b4.hg19: Annotation for Illumina’s Epic Methylation Arrays. R Package, Version 0.6.0 1.

[9] Mi, Gu, Yanming Di, Sarah Emerson, Jason S Cumbie, and Jeff H Chang. 2012. Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression. PloS One 7 (10). Public Library of Science: e46128.

[10] Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.

[11] Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2012. Goseq: Gene Ontology Testing for Rna-Seq Datasets. R Bioconductor.

[12] Yu G (2018). enrichplot: Visualization of Functional Enrichment Result. R package version 1.0.2, https://github.com/GuangchuangYu/enrichplot.