Applying mitch to pathway analysis of Infinium Methylation array data

Antony Kaspi & Mark Ziemann

2024-04-19

Background

Applying enrichment analysis to methylation array data is difficult due to the presence of a variable number of probes per gene and the fact that a probe could belong to overlapping genes. There are existing over-representation based approaches to this, however they appear to lack sensitivity. To address this, we have developed a simple approach to aggregating differential methylation data to make it suitable for downstream use by mitch. The process begins with the differential probe methylation results from limma. Here, we summarise the limma t-statistics by gene using the arithmetic mean. The resulting gene level differential methylation scores then undergo mitch as usual.

Requirements

In addition to mitch v1.15.0 of higher, you will need an annotation set for the array you are using. These are conveniently provided as Bioconductor packages for HM450K and EPIC arrays.

HM450k: IlluminaHumanMethylation450kanno.ilmn12.hg19

EPIC: IlluminaHumanMethylationEPICanno.ilm10b4.hg19

One issue with these is that the annotations are quite old, which means that some of the gene names have changed (~12%), so it is a good idea to screen the list of gene names and update obsolete names using the HGNChelper package.

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mitch")
suppressPackageStartupMessages({
    library("mitch")
    library("HGNChelper")
    library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
    library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
})

Gene sets

In this cut down example we will be using a sample of 200 Reactome gene sets.

data(genesetsExample)
head(genesetsExample,3)
## $`2-LTR circle formation`
##  [1] "Reactome Pathway" "BANF1"            "HMGA1"            "LIG4"            
##  [5] "PSIP1"            "XRCC4"            "XRCC5"            "XRCC6"           
##  [9] "gag"              "gag-pol"          "rev"              "vif"             
## [13] "vpr"              "vpu"             
## 
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "Reactome Pathway" "PRPS1"            "PRPS1L1"          "PRPS2"           
## 
## $`A tetrasaccharide linker sequence is required for GAG synthesis`
##  [1] "Reactome Pathway" "AGRN"             "B3GALT6"          "B3GAT1"          
##  [5] "B3GAT2"           "B3GAT3"           "B4GALT7"          "BCAN"            
##  [9] "BGN"              "CSPG4"            "CSPG5"            "DCN"             
## [13] "GPC1"             "GPC2"             "GPC3"             "GPC4"            
## [17] "GPC5"             "GPC6"             "HSPG2"            "NCAN"            
## [21] "SDC1"             "SDC2"             "SDC3"             "SDC4"            
## [25] "VCAN"             "XYLT1"            "XYLT2"

Probe gene relationship for HM450K array data

In order to get mitch working, we need a 2 column table that maps probes to genes. The workflow shown here is for the HM450k array, and an EPIC example is show at the end of the report.

anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
head(myann)
##            UCSC_RefGene_Name   UCSC_RefGene_Group           Islands_Name
## cg00050873    TSPY4;FAM197Y2         Body;TSS1500   chrY:9363680-9363943
## cg00212031            TTTY14               TSS200 chrY:21238448-21240005
## cg00213748                                          chrY:8147877-8148210
## cg00214611     TMSB4Y;TMSB4Y        1stExon;5'UTR chrY:15815488-15815779
## cg00455876                                          chrY:9385471-9385777
## cg01707559 TBL1Y;TBL1Y;TBL1Y TSS200;TSS200;TSS200   chrY:6778574-6780028
##            Relation_to_Island
## cg00050873            N_Shore
## cg00212031             Island
## cg00213748            S_Shore
## cg00214611             Island
## cg00455876             Island
## cg01707559             Island
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt1 <- stack(gp2)
colnames(gt1) <- c("gene","probe")
gt1$probe <- as.character(gt1$probe)
dim(gt1)
## [1] 407090      2
head(gt1)
##       gene      probe
## 1    TSPY4 cg00050873
## 2 FAM197Y2 cg00050873
## 3   TTTY14 cg00212031
## 4   TMSB4Y cg00214611
## 5    TBL1Y cg01707559
## 6   TMSB4Y cg02004872

Update deprecated gene symbols

Update old gene symbols using HGNChelper (13% of 21k names).

new.hgnc.table <- getCurrentHumanMap()
## Fetching gene symbols from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
fix <- checkGeneSymbols(gt1$gene,map=new.hgnc.table)
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): Human gene symbols
## should be all upper-case except for the 'orf' in open reading frames. The case
## of some letters was corrected.
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): x contains
## non-approved gene symbols
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
## [1] 2789
gt1$gene <- fix$Suggested.Symbol
head(gt1)
##       gene      probe
## 1    TSPY4 cg00050873
## 2 FAM197Y2 cg00050873
## 3   TTTY14 cg00212031
## 4   TMSB4Y cg00214611
## 5    TBL1Y cg01707559
## 6   TMSB4Y cg02004872

Importing profiling data

Here we will read in a table of differential probe methylation data generated by limma. We will use the t-statistics for downstream analysis.

x <- read.table("https://ziemann-lab.net/public/gmea/dma3a.tsv",header=TRUE,row.names=1)
head(x)
##                 logFC   AveExpr         t      P.Value adj.P.Val        B
## cg04905210 -0.2926788 -2.812451 -5.380966 2.372336e-07 0.1002013 5.906039
## cg09338148 -0.2938597 -1.155475 -5.114443 8.261969e-07 0.1744820 4.880538
## cg04247967 -0.2070451  3.180593 -4.986000 1.484616e-06 0.2090211 4.399278
## cg06458106 -0.1906572  1.675741 -4.793141 3.511038e-06 0.2413324 3.693145
## cg26425904 -0.2540917 -4.019515 -4.761482 4.034832e-06 0.2413324 3.579161
## cg19590707 -0.2471850 -1.237702 -4.716409 4.912691e-06 0.2413324 3.417844

Now that the profiling data is loaded, we need to import with mitch package, which establishes the probe-gene relationships and aggregates the data to gene level scores. As you can see, the input was a table of 422k probes and the output is 19,380 gene scores. Many probes not annotated to genes are discarded.

y <- mitch_import(x,DEtype="limma",geneTable=gt1)
## The input is a single dataframe; one contrast only. Converting
##         it to a list for you.
## Note: Mean no. genes in input = 422374
## Note: no. genes in output = 19380
## Warning in mitch_import(x, DEtype = "limma", geneTable = gt1): Warning: less than half of the input genes are also in the
##         output
head(y)
##                   x
## A1BG     -0.3505339
## A1BG-AS1 -0.1904379
## A1CF     -0.8443613
## A2M      -0.6794687
## A2ML1     0.2157592
## A4GALT   -0.4001430
dim(y)
## [1] 19380     1

Calculating enrichment

The mitch_calc function performs an enrichment test. If you imported multiple data tables in the previous step, mitch will conduct a multivariate enrichment test. The results can be prioritised by significance or effect size. My recommendation is to discard results with FDR>0.05 then prioritise by effect size, which for us is the mitch enrichment score called S distance. In this example I also set the minimum gene set size to 5.

res<-mitch_calc(y,genesetsExample,priority="effect",cores=2,minsetsize=5)
## Note: Enrichments with large effect sizes may not be
##             statistically significant.
head(res$enrichment_result,10)
##                                                 set setSize       pANOVA
## 114   Apoptotic cleavage of cell adhesion  proteins      11 6.207579e-04
## 1                            2-LTR circle formation       7 4.362176e-02
## 34     Acetylcholine Neurotransmitter Release Cycle      17 1.672291e-03
## 53                          Activation of C3 and C5       6 6.479358e-02
## 156                        Bicarbonate transporters      10 2.806954e-02
## 69                                Activation of SMO      18 4.483409e-03
## 89             Adenylate cyclase inhibitory pathway      14 1.240799e-02
## 88             Adenylate cyclase activating pathway      10 3.530185e-02
## 81             Acyl chain remodeling of DAG and TAG       5 1.416075e-01
## 135 Autodegradation of the E3 ubiquitin ligase COP1      50 4.346153e-06
##         s.dist p.adjustANOVA
## 114 -0.5958772  4.978187e-03
## 1    0.4403846  1.550996e-01
## 34  -0.4402727  1.070266e-02
## 53   0.4353429  2.115709e-01
## 156 -0.4010945  1.124647e-01
## 69  -0.3869091  2.561948e-02
## 89  -0.3859415  5.672226e-02
## 88  -0.3843986  1.285486e-01
## 81  -0.3795716  3.504316e-01
## 135  0.3755137  7.726494e-05

Downstream presentation

For presentation of the results you could consider making a volcano plot and/or making a barplot of S.dist values for selected gene sets that meet the FDR cutoff. You can also use the built in functions to make a set of charts and html report. It is vitally important to inspect the detailed resports to understand which genes are driving the enrichment in your dataset.

mitch_plots(res,outfile="methcharts.pdf")
## png 
##   2
mitch_report(res,"methreport.html")
## Dataset saved as " /tmp/RtmpOqGtaq/methreport.rds ".
## processing file: mitch.Rmd
## output file: /tmp/Rtmpa0jifH/Rbuild35bd544c8a8499/mitch/vignettes/mitch.knit.md
## /usr/bin/pandoc +RTS -K512m -RTS /tmp/Rtmpa0jifH/Rbuild35bd544c8a8499/mitch/vignettes/mitch.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/RtmpOqGtaq/mitch_report.html --lua-filter /home/biocbuild/bbs-3.19-bioc/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/biocbuild/bbs-3.19-bioc/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --section-divs --template /home/biocbuild/bbs-3.19-bioc/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/RtmpOqGtaq/rmarkdown-str35c06939a21f3c.html
## 
## Output created: /tmp/RtmpOqGtaq/mitch_report.html
## [1] TRUE

Probe gene relationship for EPIC array data

Here is the code to create a probe-gene table for EPIC array data. Be sure to update the gene symbols with HGNChelper before conducting enrichment analysis.

anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt <- stack(gp2)
colnames(gt) <- c("gene","probe")
gt$probe <- as.character(gt$probe)
dim(gt)
## [1] 684970      2
str(gt)
## 'data.frame':    684970 obs. of  2 variables:
##  $ gene : chr  "YTHDF1" "EIF2S3" "PKN3" "CCDC57" ...
##  $ probe: chr  "cg18478105" "cg09835024" "cg14361672" "cg01763666" ...

Session Info

sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0                                   
##  [2] pkgload_1.3.4                                      
##  [3] GGally_2.2.1                                       
##  [4] ggplot2_3.5.0                                      
##  [5] reshape2_1.4.4                                     
##  [6] beeswarm_0.4.0                                     
##  [7] gplots_3.1.3.1                                     
##  [8] gtools_3.9.5                                       
##  [9] tibble_3.2.1                                       
## [10] dplyr_1.1.4                                        
## [11] echarts4r_0.4.5                                    
## [12] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
## [13] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1 
## [14] minfi_1.49.1                                       
## [15] bumphunter_1.45.1                                  
## [16] locfit_1.5-9.9                                     
## [17] iterators_1.0.14                                   
## [18] foreach_1.5.2                                      
## [19] Biostrings_2.71.5                                  
## [20] XVector_0.43.1                                     
## [21] SummarizedExperiment_1.33.3                        
## [22] Biobase_2.63.1                                     
## [23] MatrixGenerics_1.15.1                              
## [24] matrixStats_1.3.0                                  
## [25] GenomicRanges_1.55.4                               
## [26] GenomeInfoDb_1.39.14                               
## [27] IRanges_2.37.1                                     
## [28] S4Vectors_0.41.6                                   
## [29] BiocGenerics_0.49.1                                
## [30] HGNChelper_0.8.1                                   
## [31] mitch_1.15.6                                       
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.0             later_1.3.2              
##   [3] BiocIO_1.13.0             bitops_1.0-7             
##   [5] preprocessCore_1.65.0     XML_3.99-0.16.1          
##   [7] lifecycle_1.0.4           lattice_0.22-6           
##   [9] MASS_7.3-60.2             base64_2.0.1             
##  [11] scrime_1.3.5              magrittr_2.0.3           
##  [13] limma_3.59.8              sass_0.4.9               
##  [15] rmarkdown_2.26            jquerylib_0.1.4          
##  [17] yaml_2.3.8                httpuv_1.6.15            
##  [19] doRNG_1.8.6               askpass_1.2.0            
##  [21] DBI_1.2.2                 RColorBrewer_1.1-3       
##  [23] abind_1.4-5               zlibbioc_1.49.3          
##  [25] quadprog_1.5-8            purrr_1.0.2              
##  [27] RCurl_1.98-1.14           GenomeInfoDbData_1.2.12  
##  [29] genefilter_1.85.1         annotate_1.81.2          
##  [31] svglite_2.1.3             DelayedMatrixStats_1.25.3
##  [33] codetools_0.2-20          DelayedArray_0.29.9      
##  [35] xml2_1.3.6                tidyselect_1.2.1         
##  [37] UCSC.utils_0.99.7         beanplot_1.3.1           
##  [39] illuminaio_0.45.0         GenomicAlignments_1.39.5 
##  [41] jsonlite_1.8.8            multtest_2.59.0          
##  [43] survival_3.5-8            systemfonts_1.0.6        
##  [45] tools_4.4.0               Rcpp_1.0.12              
##  [47] glue_1.7.0                gridExtra_2.3            
##  [49] SparseArray_1.3.5         xfun_0.43                
##  [51] HDF5Array_1.31.6          withr_3.0.0              
##  [53] fastmap_1.1.1             rhdf5filters_1.15.5      
##  [55] fansi_1.0.6               openssl_2.1.1            
##  [57] caTools_1.18.2            digest_0.6.35            
##  [59] R6_2.5.1                  mime_0.12                
##  [61] colorspace_2.1-0          RSQLite_2.3.6            
##  [63] utf8_1.2.4                tidyr_1.3.1              
##  [65] generics_0.1.3            data.table_1.15.4        
##  [67] rtracklayer_1.63.2        httr_1.4.7               
##  [69] htmlwidgets_1.6.4         S4Arrays_1.3.7           
##  [71] ggstats_0.6.0             pkgconfig_2.0.3          
##  [73] gtable_0.3.4              blob_1.2.4               
##  [75] siggenes_1.77.0           htmltools_0.5.8.1        
##  [77] scales_1.3.0              png_0.1-8                
##  [79] knitr_1.46                rstudioapi_0.16.0        
##  [81] tzdb_0.4.0                rjson_0.2.21             
##  [83] nlme_3.1-164              curl_5.2.1               
##  [85] cachem_1.0.8              rhdf5_2.47.7             
##  [87] stringr_1.5.1             KernSmooth_2.23-22       
##  [89] AnnotationDbi_1.65.2      restfulr_0.0.15          
##  [91] GEOquery_2.71.0           pillar_1.9.0             
##  [93] grid_4.4.0                reshape_0.8.9            
##  [95] vctrs_0.6.5               promises_1.3.0           
##  [97] xtable_1.8-4              evaluate_0.23            
##  [99] readr_2.1.5               GenomicFeatures_1.55.4   
## [101] cli_3.6.2                 compiler_4.4.0           
## [103] Rsamtools_2.19.4          rlang_1.1.3              
## [105] crayon_1.5.2              rngtools_1.5.2           
## [107] nor1mix_1.3-3             mclust_6.1               
## [109] plyr_1.8.9                stringi_1.8.3            
## [111] viridisLite_0.4.2         BiocParallel_1.37.1      
## [113] munsell_0.5.1             Matrix_1.7-0             
## [115] hms_1.1.3                 sparseMatrixStats_1.15.1 
## [117] bit64_4.0.5               Rhdf5lib_1.25.3          
## [119] KEGGREST_1.43.0           statmod_1.5.0            
## [121] shiny_1.8.1.1             highr_0.10               
## [123] memoise_2.0.1             bslib_0.7.0              
## [125] bit_4.0.5