library(knitr)

Introduction

Welcome to MEAT (Muscle Epigenetic Age Test)! If you are reading these lines, you are probably an inquisitive scientist who has put a lot of effort into collecting skeletal muscle samples from – hopefully – consenting humans. Your coin purse (grant) is now lighter after profiling these muscle samples with the Illumina HumanMethylation technology (HM27, HM450 and HMEPIC) and you are yearning to know what the skeletal muscle epigenome has to say about your samples' age. I am here to guide you in your quest to find out how old your skeletal muscle samples are, by simply looking at their DNA methylation profiles. DNA methylation doesn't lie, but it can be tricky to understand what it says. Are you ready to undertake your quest to uncover the secrets of the muscle epigenome?

You can view MEAT as a spellbook (package) that contains all the necessary spells (functions) to estimate epigenetic age in human skeletal muscle samples. However, the spells will only work if you cast them in a particular order (1. data cleaning, 2. data calibration, and 3. epigenetic age estimation). Starting from preprocessed data (matrix of beta-values that has been normalized/batch corrected, etc.), MEAT will estimate epigenetic age in each sample, based on a penalized regression model (elastic net regression) essentially similar to Horvath's original pan-tissue clock. There are two versions of MEAT:

Once MEAT has calculated epigenetic age, you may provide the actual age of each sample (if known), so MEAT can also calculate age acceleration as the difference between epigenetic age and real age (AAdiff), and as the residuals from a linear regression of epigenetic age against real age (AAresid). For more information on the distinction between AAdiff and AAresid, see our original paper.

Installation

Install the MEAT package:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("MEAT")

Then, load the package:

library(MEAT)

Step-by-step guide

Data requirements

To use this guide, you will need in your inventory:

data("GSE121961", envir = environment())

Table: Top rows of the GSE121961 matrix before cleaning and calibration.

GSM3450524 GSM3450528
cg00000029 0.35 0.35
cg00000103 NA NA
cg00000109 0.77 0.79
cg00000155 0.94 0.96
cg00000158 0.96 0.96
cg00000165 0.23 0.20
data("GSE121961_pheno", envir = environment())

Table: Phenotypes corresponding to GSE121961.

ID Sex Age Group Batch Position
GSM3450524 M NA Control 202128330007 R01C01
GSM3450528 F 14 SELENON 202128330007 R07C01

The phenotype table is useful if you want to discover the AA of your samples and to associate this AA with a phenotype of interest (e.g. do elves show systematically lower AA than humans, therefore explaining their exceptional longevity?)

Step 1: Data formatting

A good adventurer never embarks a quest without a minimum of preparation. That is particularly true for your inventory! The beta matrix, the phenotype table and the annotation table should be all bundled up into a single object, to coordinate the meta-data and assays when subsetting. For example, if you have skeletal muscle DNA methylation profiles from humans, elves and dwarves, but you only want to select the samples from humans and elves, you can select these samples in a single operation in both the beta-matrix and phenotype table. This ensures the beta matrix, phenotype table and annotation table remain in sync. SummarizedExperiment objects) have the ideal format for your inventory. Let's create such an object with the beta matrix and optional phenotype and annotation tables. Please ensure that you call the beta-matrix “beta” as it is essential for the upcoming functions.

library(SummarizedExperiment)
GSE121961_SE <- SummarizedExperiment(assays=list(beta=GSE121961),
colData=GSE121961_pheno)
GSE121961_SE
#> class: SummarizedExperiment 
#> dim: 866091 2 
#> metadata(0):
#> assays(1): beta
#> rownames(866091): cg00000029 cg00000103 ... ch.X.97737721F
#>   ch.X.98007042R
#> rowData names(0):
#> colnames(2): GSM3450524 GSM3450528
#> colData names(6): ID Sex ... Batch Position

Step 2: Data cleaning

The first important step is data 'cleaning', which essentially means reducing the beta matrix to the CpGs common to all datasets used in the muscle clock. If some of the CpGs are not present in your beta-matrix, these missing values will be imputed.

GSE121961_SE_clean <- clean_beta(SE = GSE121961_SE,
                                 version = "MEAT2.0")
#> Cluster size 18630 broken into 13250 5380 
#> Cluster size 13250 broken into 11328 1922 
#> Cluster size 11328 broken into 4159 7169 
#> Cluster size 4159 broken into 2956 1203 
#> Cluster size 2956 broken into 1771 1185 
#> Cluster size 1771 broken into 1010 761 
#> Done cluster 1010 
#> Done cluster 761 
#> Done cluster 1771 
#> Done cluster 1185 
#> Done cluster 2956 
#> Done cluster 1203 
#> Done cluster 4159 
#> Cluster size 7169 broken into 3481 3688 
#> Cluster size 3481 broken into 2253 1228 
#> Cluster size 2253 broken into 1450 803 
#> Done cluster 1450 
#> Done cluster 803 
#> Done cluster 2253 
#> Done cluster 1228 
#> Done cluster 3481 
#> Cluster size 3688 broken into 2428 1260 
#> Cluster size 2428 broken into 1101 1327 
#> Done cluster 1101 
#> Done cluster 1327 
#> Done cluster 2428 
#> Done cluster 1260 
#> Done cluster 3688 
#> Done cluster 7169 
#> Done cluster 11328 
#> Cluster size 1922 broken into 881 1041 
#> Done cluster 881 
#> Done cluster 1041 
#> Done cluster 1922 
#> Done cluster 13250 
#> Cluster size 5380 broken into 3374 2006 
#> Cluster size 3374 broken into 2031 1343 
#> Cluster size 2031 broken into 941 1090 
#> Done cluster 941 
#> Done cluster 1090 
#> Done cluster 2031 
#> Done cluster 1343 
#> Done cluster 3374 
#> Cluster size 2006 broken into 1020 986 
#> Done cluster 1020 
#> Done cluster 986 
#> Done cluster 2006 
#> Done cluster 5380

Table: Top rows of the GSE121961 beta matrix after cleaning.

GSM3450524 GSM3450528
cg21432842 0.30 0.39
cg15376097 0.20 0.26
cg05876918 0.11 0.10
cg25771195 0.64 0.67
cg21380842 0.12 0.13
cg00602891 0.09 0.08

Step 3: Data calibration

The second step is data 'calibration', which essentially means re-scaling the DNA methylation profiles to that of the gold standard dataset used to develop the muscle clock. This step harmonises differences in data processing, sample preparation, lab-to-lab variability, to obtain accurate measures of epigenetic age in your samples. It should be noted that this 'calibration' is entirely different from the previously mentioned data preprocessing (i.e. probe and sample filtering, normalisation of Type I and Type II probes, and correction for batch effects). The calibration implemented in BMIQcalibration() does use code from the original BMIQ algorithm, but it is not used to normalize TypeI and TypeII probe methylation distribution. The BMIQcalibration() of the MEAT package re-scales the methylation distribution of your samples to the gold standard dataset GSE50498.

GSE121961_SE_calibrated <- BMIQcalibration(SE = GSE121961_SE_clean,
                                           version = "MEAT2.0")
#> [1] Inf
#> [1] 0.001868228
#> [1] 0.00257888
#> [1] 0.002780609
#> [1] 0.002702647
#> ii= 1
#> ii= 2

Table: Top rows of the GSE121961 beta matrix after cleaning and calibration.

GSM3450524 GSM3450528
cg21432842 0.2990203 0.3925026
cg15376097 0.2017772 0.2653758
cg05876918 0.1084865 0.1002290
cg25771195 0.6296469 0.6663140
cg21380842 0.1189314 0.1325671
cg00602891 0.0877007 0.0788814

You can have a look at the distribution of DNA methylation before and after calibration with a density plot. On this plot, each line is an individual sample, and you can clearly see the bimodal distribution of DNA methylation data, with most CpGs harboring very low methylation levels (left side of the graph), very few CpGs with intermediate methylation levels, and some CpGs with high methylation levels. Before calibration, the profiles do not align well with that of the gold standard, and this is problematic to obtain accurate estimates of epigenetic age. However, after calibration, the samples' profiles overlap nicely with that of the gold standard.

data("gold.mean.MEAT2.0", envir = environment())
GSE121961_SE_clean_with_gold_mean <- cbind(assays(GSE121961_SE_clean)$beta,
                                           gold.mean.MEAT2.0$gold.mean) # add the gold mean
GSE121961_SE_calibrated_with_gold_mean <- cbind(assays(GSE121961_SE_calibrated)$beta,
                                                gold.mean.MEAT2.0$gold.mean) # add the gold mean
groups <- c(rep("GSE121961",
                ncol(GSE121961_SE_clean)), "Gold mean")

library(minfi)
par(mfrow = c(2, 1))
densityPlot(GSE121961_SE_clean_with_gold_mean,
  sampGroups = groups,
  main = "Before calibration",
  legend = FALSE
)
densityPlot(GSE121961_SE_calibrated_with_gold_mean,
  sampGroups = groups,
  main = "After calibration"
)

plot of chunk DNA methylation profile distribution before and after calibration

Step 4: Epigenetic age estimation

Your quest is almost over! The only spell left to cast is epiage_estimation() that uses methylation levels at the clock CpGs to estimate epigenetic age. If you do not have information on age, epiage_estimation() will only return epigenetic age (“DNAmage”). However, if you have information on age (and other phenotypes), epiage_estimation() will return:

While AAdiff is a straightforward way of calculating the error in age prediction, it is sensitive to the mean age of the dataset and to the pre-processing of the DNA methylation dataset; AAdiff can be biased upwards or downwards depending on how the dataset was normalized, and depending on the mean age and age variance of the dataset. In contrast, AAresid is insensitive to the mean age of the dataset and is robust against different pre-processing methods.

GSE121961_SE_epiage <- epiage_estimation(SE = GSE121961_SE_calibrated,
                                         age_col_name = "Age",
                                         version = "MEAT2.0")
#> function (x)  .Primitive("dim")

Table: Phenotypes corresponding to GSE121961 with AAdiff for each sample.

ID Sex Age Group Batch Position DNAmage AAdiff
GSM3450524 GSM3450524 M NA Control 202128330007 R01C01 47.46698 NA
GSM3450528 GSM3450528 F 14 SELENON 202128330007 R07C01 17.61979 3.619792

Session information

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] minfi_1.35.1                bumphunter_1.31.0          
#>  [3] locfit_1.5-9.4              iterators_1.0.12           
#>  [5] foreach_1.5.0               Biostrings_2.57.2          
#>  [7] XVector_0.29.3              SummarizedExperiment_1.19.9
#>  [9] Biobase_2.49.1              GenomicRanges_1.41.6       
#> [11] GenomeInfoDb_1.25.11        IRanges_2.23.10            
#> [13] S4Vectors_0.27.13           BiocGenerics_0.35.4        
#> [15] MatrixGenerics_1.1.3        matrixStats_0.57.0         
#> [17] MEAT_1.1.3                  knitr_1.30                 
#> [19] BiocStyle_2.17.1           
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_1.4-1          ellipsis_0.3.1           
#>   [3] siggenes_1.63.0           dynamicTreeCut_1.63-1    
#>   [5] mclust_5.4.6              base64_2.0               
#>   [7] affyio_1.59.0             bit64_4.0.5              
#>   [9] AnnotationDbi_1.51.3      xml2_1.3.2               
#>  [11] codetools_0.2-16          splines_4.0.2            
#>  [13] impute_1.63.0             methylumi_2.35.0         
#>  [15] scrime_1.3.5              Rsamtools_2.5.3          
#>  [17] annotate_1.67.1           cluster_2.1.0            
#>  [19] dbplyr_1.4.4              HDF5Array_1.17.14        
#>  [21] wateRmelon_1.33.0         readr_1.4.0              
#>  [23] BiocManager_1.30.10       compiler_4.0.2           
#>  [25] httr_1.4.2                assertthat_0.2.1         
#>  [27] Matrix_1.2-18             limma_3.45.14            
#>  [29] RPMM_1.25                 htmltools_0.5.0          
#>  [31] prettyunits_1.1.1         tools_4.0.2              
#>  [33] affy_1.67.1               glue_1.4.2               
#>  [35] GenomeInfoDbData_1.2.4    dplyr_1.0.2              
#>  [37] rappdirs_0.3.1            doRNG_1.8.2              
#>  [39] Rcpp_1.0.5                vctrs_0.3.4              
#>  [41] rhdf5filters_1.1.3        multtest_2.45.0          
#>  [43] preprocessCore_1.51.0     nlme_3.1-149             
#>  [45] rtracklayer_1.49.5        DelayedMatrixStats_1.11.1
#>  [47] xfun_0.18                 stringr_1.4.0            
#>  [49] ROC_1.65.0                lifecycle_0.2.0          
#>  [51] rngtools_1.5              XML_3.99-0.5             
#>  [53] beanplot_1.2              nleqslv_3.3.2            
#>  [55] zlibbioc_1.35.0           MASS_7.3-53              
#>  [57] hms_0.5.3                 rhdf5_2.33.11            
#>  [59] GEOquery_2.57.0           RColorBrewer_1.1-2       
#>  [61] yaml_2.2.1                curl_4.3                 
#>  [63] memoise_1.1.0             biomaRt_2.45.3           
#>  [65] reshape_0.8.8             stringi_1.5.3            
#>  [67] RSQLite_2.2.1             highr_0.8                
#>  [69] genefilter_1.71.0         lumi_2.41.0              
#>  [71] GenomicFeatures_1.41.3    BiocParallel_1.23.2      
#>  [73] shape_1.4.5               rlang_0.4.7              
#>  [75] pkgconfig_2.0.3           bitops_1.0-6             
#>  [77] nor1mix_1.3-0             evaluate_0.14            
#>  [79] lattice_0.20-41           purrr_0.3.4              
#>  [81] Rhdf5lib_1.11.3           GenomicAlignments_1.25.3 
#>  [83] bit_4.0.4                 tidyselect_1.1.0         
#>  [85] plyr_1.8.6                magrittr_1.5             
#>  [87] R6_2.4.1                  generics_0.0.2           
#>  [89] DelayedArray_0.15.16      DBI_1.1.0                
#>  [91] mgcv_1.8-33               pillar_1.4.6             
#>  [93] survival_3.2-7            RCurl_1.98-1.2           
#>  [95] tibble_3.0.3              crayon_1.3.4             
#>  [97] KernSmooth_2.23-17        BiocFileCache_1.13.1     
#>  [99] rmarkdown_2.4             progress_1.2.2           
#> [101] grid_4.0.2                data.table_1.13.0        
#> [103] blob_1.2.1                digest_0.6.25            
#> [105] xtable_1.8-4              tidyr_1.1.2              
#> [107] illuminaio_0.31.1         openssl_1.4.3            
#> [109] glmnet_4.0-2              askpass_1.1              
#> [111] quadprog_1.5-8