# Contents

R version: R Under development (unstable) (2022-10-25 r83175)

Bioconductor version: 3.17

Package: 1.23.0

# 1 Introduction

DNA methylation, the addition of a methyl group to a CG dinucleotide of the DNA, is the most extensively studied epigenetic mark due to its role in both development and disease (Bird 2002; Laird 2003). Although DNA methylation can be measured in several ways, the epigenetics community has enthusiastically embraced the Illumina HumanMethylation450 (450k) array (Bibikova et al. 2011) as a cost-effective way to assay methylation across the human genome. More recently, Illumina has increased the genomic coverage of the platform to >850,000 sites with the release of their MethylationEPIC (850k) array. As methylation arrays are likely to remain popular for measuring methylation for the foreseeable future, it is necessary to provide robust workflows for methylation array analysis.

Measurement of DNA methylation by Infinium technology (Infinium I) was first employed by Illumina on the HumanMethylation27 (27k) array (Bibikova et al. 2009), which measured methylation at approximately 27,000 CpGs, primarily in gene promoters. Like bisulfite sequencing, the Infinium assay detects methylation status at single base resolution. However, due to its relatively limited coverage the array platform was not truly considered “genome-wide” until the arrival of the 450k array. The 450k array increased the genomic coverage of the platform to over 450,000 gene-centric sites by combining the original Infinium I assay with the novel Infinium II probes. Both assay types employ 50bp probes that query a [C/T] polymorphism created by bisulfite conversion of unmethylated cytosines in the genome, however, the Infinium I and II assays differ in the number of beads required to detect methylation at a single locus. Infinium I uses two bead types per CpG, one for each of the methylated and unmethylated states (Figure 1a). In contrast, the Infinium II design uses one bead type and the methylated state is determined at the single base extension step after hybridization (Figure 1b). The 850k array also uses a combination of the Infinium I and II assays but achieves additional coverage by increasing the size of each array; a 450k slide contains 12 arrays whilst the 850k has only 8.

Regardless of the Illumina array version, for each CpG, there are two measurements: a methylated intensity (denoted by $$M$$) and an unmethylated intensity (denoted by $$U$$). These intensity values can be used to determine the proportion of methylation at each CpG locus. Methylation levels are commonly reported as either beta values ($$\beta = M/(M + U)$$) or M-values ($$Mvalue = log2(M/U)$$). For practical purposes, a small offset, $$\alpha$$, can be added to the denominator of the $$\beta$$ value equation to avoid dividing by small values, which is the default behaviour of the getBeta function in minfi. The default value for $$\alpha$$ is 100. It may also be desirable to add a small offset to the numerator and denominator when calculating M-values to avoid dividing by zero in rare cases, however the default getM function in minfi does not do this. Beta values and M-values are related through a logit transformation. Beta values are generally preferable for describing the level of methylation at a locus or for graphical presentation because percentage methylation is easily interpretable. However, due to their distributional properties, M-values are more appropriate for statistical testing (Du et al. 2010).

In this workflow, we will provide examples of the steps involved in analysing methylation array data using R (R Core Team 2014) and Bioconductor (Huber et al. 2015), including: quality control, filtering, normalization, data exploration and probe-wise differential methylation analysis. We will also cover other approaches such as differential methylation analysis of regions, differential variability analysis, gene ontology analysis and estimating cell type composition. Finally, we will provide some examples of useful ways to visualise methylation array data.

# 2 Differential methylation analysis

## 2.1 Obtaining the data

The data required for this workflow has been bundled with the R package that contains this workflow document. Alternatively, it can be obtained from figshare. If you choose to download it seperately, once the data has been downloaded, it needs to be extracted from the archive. This will create a folder called data, which contains all the files necessary to execute the workflow.

Once the data has been downloaded and extracted, there should be a folder called data that contains all the files necessary to execute the workflow.

# set up a path to the data directory
dataDirectory <- system.file("extdata", package = "methylationArrayAnalysis")
# list the files
list.files(dataDirectory, recursive = TRUE)
##  [1] "48639-non-specific-probes-Illumina450k.csv"
##  [2] "5975827018/5975827018_R06C02_Grn.idat"
##  [3] "5975827018/5975827018_R06C02_Red.idat"
##  [4] "6264509100/6264509100_R01C01_Grn.idat"
##  [5] "6264509100/6264509100_R01C01_Red.idat"
##  [6] "6264509100/6264509100_R01C02_Grn.idat"
##  [7] "6264509100/6264509100_R01C02_Red.idat"
##  [8] "6264509100/6264509100_R02C01_Grn.idat"
##  [9] "6264509100/6264509100_R02C01_Red.idat"
## [10] "6264509100/6264509100_R02C02_Grn.idat"
## [11] "6264509100/6264509100_R02C02_Red.idat"
## [12] "6264509100/6264509100_R03C01_Grn.idat"
## [13] "6264509100/6264509100_R03C01_Red.idat"
## [14] "6264509100/6264509100_R03C02_Grn.idat"
## [15] "6264509100/6264509100_R03C02_Red.idat"
## [16] "6264509100/6264509100_R04C01_Grn.idat"
## [17] "6264509100/6264509100_R04C01_Red.idat"
## [18] "6264509100/6264509100_R04C02_Grn.idat"
## [19] "6264509100/6264509100_R04C02_Red.idat"
## [20] "6264509100/6264509100_R05C01_Grn.idat"
## [21] "6264509100/6264509100_R05C01_Red.idat"
## [22] "6264509100/6264509100_R05C02_Grn.idat"
## [23] "6264509100/6264509100_R05C02_Red.idat"
## [24] "6264509100/6264509100_R06C01_Grn.idat"
## [25] "6264509100/6264509100_R06C01_Red.idat"
## [26] "6264509100/6264509100_R06C02_Grn.idat"
## [27] "6264509100/6264509100_R06C02_Red.idat"
## [28] "SampleSheet.csv"
## [29] "ageData.RData"
## [30] "human_c2_v5.rdata"
## [31] "model-based-cpg-islands-hg19-chr17.txt"
## [32] "wgEncodeRegDnaseClusteredV3chr17.bed"

To demonstrate the various aspects of analysing methylation data, we will be using a small, publicly available 450k methylation dataset (GSE49667)(Zhang et al. 2013). The dataset contains 10 samples in total: there are 4 different sorted T-cell types (naive, rTreg, act_naive, act_rTreg, collected from 3 different individuals (M28, M29, M30). For details describing sample collection and preparation, see Zhang et al. (2013). An additional birth sample (individual VICS-72098-18-B) is included from another study (GSE51180)(Cruickshank et al. 2013) to illustrate approaches for identifying and excluding poor quality samples.

There are several R Bioconductor packages available that have been developed for analysing methylation array data, including minfi (Aryee et al. 2014), missMethyl (Phipson, Maksimovic, and Oshlack 2016), wateRmelon (Pidsley et al. 2013), methylumi (Davis et al. 2015), ChAMP (Morris et al. 2014) and charm (Aryee et al. 2011). Some of the packages, such as minfi and methylumi include a framework for reading in the raw data from IDAT files and various specialised objects for storing and manipulating the data throughout the course of an analysis. Other packages provide specialised analysis methods for normalisation and statistical testing that rely on either minfi or methylumi objects. It is possible to convert between minfi and methylumi data types, however, this is not always trivial. Thus, it is advisable to consider the methods that you are interested in using and the data types that are most appropriate before you begin your analysis. Another popular method for analysing methylation array data is limma (Ritchie et al. 2015), which was originally developed for gene expression microarray analysis. As limma operates on a matrix of values, it is easily applied to any data that can be converted to a matrix in R. For a complete list of Bioconductor packages for analysing DNA methylation data, one can search for “DNAMethylation” in BiocViews (https://www.bioconductor.org/packages/release/BiocViews.html#___DNAMethylation) on the Bioconductor website.

We will begin with an example of a probe-wise differential methylation analysis using minfi and limma. By probe-wise analysis we mean each individual CpG probe will be tested for differential methylation for the comparisons of interest and p-values and moderated t-statistics (Smyth 2004) will be generated for each CpG probe.

It is useful to begin an analysis in R by loading all the packages that are likely to be required.

# load packages required for analysis
library(knitr)
library(limma)
library(minfi)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(IlluminaHumanMethylation450kmanifest)
library(RColorBrewer)
library(missMethyl)
library(minfiData)
library(Gviz)
library(DMRcate)
library(stringr)

The minfi, IlluminaHumanMethylation450kanno.ilmn12.hg19, IlluminaHumanMethylation450kmanifest, missMethyl, minfiData and DMRcate are methylation specific packages, while RColorBrewer and Gviz are visualisation packages. We use limma for testing differential methylation, and matrixStats and stringr have functions used in the workflow. The IlluminaHumanMethylation450kmanifest package provides the Illumina manifest as an R object which can easily be loaded into the environment. The manifest contains all of the annotation information for each of the CpG probes on the 450k array. This is useful for determining where any differentially methylated probes are located in a genomic context.

# get the 450k annotation data
ann450k <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
head(ann450k)
## DataFrame with 6 rows and 33 columns
##                    chr       pos      strand        Name    AddressA
##            <character> <integer> <character> <character> <character>
## cg00050873        chrY   9363356           -  cg00050873    32735311
## cg00212031        chrY  21239348           -  cg00212031    29674443
## cg00213748        chrY   8148233           -  cg00213748    30703409
## cg00214611        chrY  15815688           -  cg00214611    69792329
## cg00455876        chrY   9385539           -  cg00455876    27653438
## cg01707559        chrY   6778695           +  cg01707559    45652402
##               AddressB              ProbeSeqA              ProbeSeqB
##            <character>            <character>            <character>
## cg00050873    31717405 ACAAAAAAACAACACACAAC.. ACGAAAAAACAACGCACAAC..
## cg00212031    38703326 CCCAATTAACCACAAAAACT.. CCCAATTAACCGCAAAAACT..
## cg00213748    36767301 TTTTAACACCTAACACCATT.. TTTTAACGCCTAACACCGTT..
## cg00214611    46723459 CTAACTTCCAAACCACACTT.. CTAACTTCCGAACCGCGCTT..
## cg00455876    69732350 AACTCTAAACTACCCAACAC.. AACTCTAAACTACCCGACAC..
## cg01707559    64689504 ACAAATTAAAAACACTAAAA.. GCGAATTAAAAACACTAAAA..
##                   Type    NextBase       Color    Probe_rs Probe_maf
##            <character> <character> <character> <character> <numeric>
## cg00050873           I           A         Red          NA        NA
## cg00212031           I           T         Red          NA        NA
## cg00213748           I           A         Red          NA        NA
## cg00214611           I           A         Red          NA        NA
## cg00455876           I           A         Red          NA        NA
## cg01707559           I           A         Red          NA        NA
##                 CpG_rs   CpG_maf      SBE_rs   SBE_maf           Islands_Name
##            <character> <numeric> <character> <numeric>            <character>
## cg00050873          NA        NA          NA        NA   chrY:9363680-9363943
## cg00212031          NA        NA          NA        NA chrY:21238448-21240005
## cg00213748          NA        NA          NA        NA   chrY:8147877-8148210
## cg00214611          NA        NA          NA        NA chrY:15815488-15815779
## cg00455876          NA        NA          NA        NA   chrY:9385471-9385777
## cg01707559          NA        NA          NA        NA   chrY:6778574-6780028
##            Relation_to_Island       Forward_Sequence              SourceSeq
##                   <character>            <character>            <character>
## cg00050873            N_Shore TATCTCTGTCTGGCGAGGAG.. CGGGGTCCACCCACTCCAAA..
## cg00212031             Island CCATTGGCCCGCCCCAGTTG.. CGCACGTCTTCCCGACCGCA..
## cg00213748            S_Shore TCTGTGGGACCATTTTAACG.. CGCCCCCTCCTGCAGAACCT..
## cg00214611             Island GCGCCGGCAGGACTAGCTTC.. CGCCCGCGCCACACTGCAGC..
## cg00455876             Island CGCGTGTGCCTGGACTCTGA.. GACTCTGAGCTACCCGGCAC..
## cg01707559             Island AGCGGCCGCTCCCAGTGGTG.. CGCCCTCTGTCGCTGCAGCC..
##            Random_Loci Methyl27_Loci UCSC_RefGene_Name UCSC_RefGene_Accession
##            <character>   <character>       <character>            <character>
## cg00050873                              TSPY4;FAM197Y2 NM_001164471;NR_001553
## cg00212031                                      TTTY14              NR_001543
## cg00213748
## cg00214611                               TMSB4Y;TMSB4Y    NM_004202;NM_004202
## cg00455876
## cg01707559                           TBL1Y;TBL1Y;TBL1Y NM_134259;NM_033284;..
##              UCSC_RefGene_Group     Phantom         DMR    Enhancer
##                     <character> <character> <character> <character>
## cg00050873         Body;TSS1500
## cg00212031               TSS200
## cg00213748
## cg00214611        1stExon;5'UTR
## cg00455876
## cg01707559 TSS200;TSS200;TSS200
##                     HMM_Island Regulatory_Feature_Name Regulatory_Feature_Group
##                    <character>             <character>              <character>
## cg00050873   Y:9973136-9976273
## cg00212031 Y:19697854-19699393
## cg00213748   Y:8207555-8208234
## cg00214611 Y:14324883-14325218     Y:15815422-15815706   Promoter_Associated_..
## cg00455876   Y:9993394-9995882
## cg01707559   Y:6838022-6839951
##                    DHS
##            <character>
## cg00050873
## cg00212031
## cg00213748
## cg00214611
## cg00455876
## cg01707559

As for their many other BeadArray platforms, Illumina methylation data is usually obtained in the form of Intensity Data (IDAT) Files. This is a proprietary format that is output by the scanner and stores summary intensities for each probe on the array. However, there are Bioconductor packages available that facilitate the import of data from IDAT files into R (Smith et al. 2013). Typically, each IDAT file is approximately 8MB in size. The simplest way to import the raw methylation data into R is using the minfi function read.metharray.sheet, along with the path to the IDAT files and a sample sheet. The sample sheet is a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. The format expected by the read.metharray.sheet function is based on the sample sheet file that usually accompanies Illumina methylation array data. It is also very similar to the targets file described by the limma package. Importing the sample sheet into R creates a data.frame with one row for each sample and several columns. The read.metharray.sheet function uses the specified path and other information from the sample sheet to create a column called Basename which specifies the location of each individual IDAT file in the experiment.

# read in the sample sheet for the experiment
targets

Now that we have imported the information about the samples and where the data is located, we can read the raw intensity signals into R from the IDAT files using the read.metharray.exp function. This creates an RGChannelSet object that contains all the raw intensity data, from both the red and green colour channels, for each of the samples. At this stage, it can be useful to rename the samples with more descriptive names.

# read in the raw data from the IDAT files
rgSet
## class: RGChannelSet
## dim: 622399 11
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): 6264509100_R01C01 6264509100_R02C01 ... 6264509100_R04C02
##   5975827018_R06C02
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
# give the samples descriptive names
targets$ID <- paste(targets$Sample_Group,targets$Sample_Name,sep=".") sampleNames(rgSet) <- targets$ID
rgSet
## class: RGChannelSet
## dim: 622399 11
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): naive.1 rTreg.2 ... act_rTreg.10 birth.11
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19

## 2.3 Quality control

Once the data has been imported into R, we can evaluate its quality. Firstly, we need to calculate detection p-values. We can generate a detection p-value for every CpG in every sample, which is indicative of the quality of the signal. The method used by minfi to calculate detection p-values compares the total signal $$(M + U)$$ for each probe to the background signal level, which is estimated from the negative control probes. Very small p-values are indicative of a reliable signal whilst large p-values, for example >0.01, generally indicate a poor quality signal.

Plotting the mean detection p-value for each sample allows us to gauge the general quality of the samples in terms of the overall signal reliability (Figure 2). Samples that have many failed probes will have relatively large mean detection p-values.

# calculate the detection p-values
detP <- detectionP(rgSet)
head(detP)
##                  naive.1       rTreg.2  act_naive.3     naive.4   act_naive.5
## cg00050873  0.000000e+00  0.000000e+00 0.000000e+00 0.00000e+00  0.000000e+00
## cg00212031  0.000000e+00  0.000000e+00 0.000000e+00 0.00000e+00  0.000000e+00
## cg00213748  2.139652e-88  4.213813e-31 1.181802e-12 1.29802e-47  8.255482e-15
## cg00214611  0.000000e+00  0.000000e+00 0.000000e+00 0.00000e+00  0.000000e+00
## cg00455876 1.400696e-234 9.349236e-111 4.272105e-90 0.00000e+00 3.347145e-268
## cg01707559  0.000000e+00  0.000000e+00 0.000000e+00 0.00000e+00  0.000000e+00
##              act_rTreg.6      naive.7       rTreg.8  act_naive.9  act_rTreg.10
## cg00050873  0.000000e+00  0.00000e+00  0.000000e+00 0.000000e+00  0.000000e+00
## cg00212031  0.000000e+00  0.00000e+00  0.000000e+00 0.000000e+00  0.000000e+00
## cg00213748  2.592206e-23  1.16160e-28  1.469801e-05 1.543654e-21  1.365951e-08
## cg00214611  0.000000e+00  0.00000e+00  0.000000e+00 0.000000e+00  0.000000e+00
## cg00455876 4.690740e-308 1.08647e-219 5.362780e-178 0.000000e+00 7.950724e-295
## cg01707559  0.000000e+00  0.00000e+00  0.000000e+00 0.000000e+00  0.000000e+00
##                 birth.11
## cg00050873  0.000000e+00
## cg00212031 2.638199e-237
## cg00213748  6.735224e-01
## cg00214611  7.344451e-01
## cg00455876 4.403634e-174
## cg01707559  0.000000e+00
# examine mean detection p-values across all samples to identify any failed samples
pal <- brewer.pal(8,"Dark2")
par(mfrow=c(1,2))
barplot(colMeans(detP), col=pal[factor(targets$Sample_Group)], las=2, cex.names=0.8, ylab="Mean detection p-values") abline(h=0.05,col="red") legend("topleft", legend=levels(factor(targets$Sample_Group)), fill=pal,
bg="white")

barplot(colMeans(detP), col=pal[factor(targets$Sample_Group)], las=2, cex.names=0.8, ylim=c(0,0.002), ylab="Mean detection p-values") abline(h=0.05,col="red") legend("topleft", legend=levels(factor(targets$Sample_Group)), fill=pal,
bg="white")

The minfi qcReport function generates many other useful quality control plots. The minfi vignette describes the various plots and how they should be interpreted in detail. Generally, samples that look poor based on mean detection p-value will also look poor using other metrics and it is usually advisable to exclude them from further analysis.

qcReport(rgSet, sampNames=targets$ID, sampGroups=targets$Sample_Group,
pdf="qcReport.pdf")

Poor quality samples can be easily excluded from the analysis using a detection p-value cutoff, for example >0.05. For this particular dataset, the birth sample shows a very high mean detection p-value, and hence it is excluded from subsequent analysis (Figure 2).

# remove poor quality samples
keep <- colMeans(detP) < 0.05
rgSet <- rgSet[,keep]
rgSet
## class: RGChannelSet
## dim: 622399 10
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
# remove poor quality samples from targets data
targets <- targets[keep,]
targets[,1:5]
##    Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label
## 1            1          A1           M28        naive        naive
## 2            2          B1           M28        rTreg        rTreg
## 3            3          C1           M28    act_naive    act_naive
## 4            4          D1           M29        naive        naive
## 5            5          E1           M29    act_naive    act_naive
## 6            6          F1           M29    act_rTreg    act_rTreg
## 7            7          G1           M30        naive        naive
## 8            8          H1           M30        rTreg        rTreg
## 9            9          A2           M30    act_naive    act_naive
## 10          10          B2           M30    act_rTreg    act_rTreg
# remove poor quality samples from detection p-value table
detP <- detP[,keep]
dim(detP)
## [1] 485512     10

## 2.4 Normalisation

To minimise the unwanted variation within and between samples, various data normalisations can be applied. Many different types of normalisation have been developed for methylation arrays and it is beyond the scope of this workflow to compare and contrast all of them (Fortin et al. 2014; Wu et al. 2014; Sun et al. 2011; Wang et al. 2012; Maksimovic, Gordon, and Oshlack 2012; Mancuso et al. 2011; Touleimat and Tost 2012; Teschendorff et al. 2013; Pidsley et al. 2013; Triche et al. 2013). Several methods have been built into minfi and can be directly applied within its framework (Fortin et al. 2014; Triche et al. 2013; Maksimovic, Gordon, and Oshlack 2012; Touleimat and Tost 2012), whilst others are methylumi-specific or require custom data types (Wu et al. 2014; Sun et al. 2011; Wang et al. 2012; Mancuso et al. 2011; Teschendorff et al. 2013; Pidsley et al. 2013). Although there is no single normalisation method that is universally considered best, a recent study by Fortin et al. (2014) has suggested that a good rule of thumb within the minfi framework is that the preprocessFunnorm (Fortin et al. 2014) function is most appropriate for datasets with global methylation differences such as cancer/normal or vastly different tissue types, whilst the preprocessQuantile function (Touleimat and Tost 2012) is more suited for datasets where you do not expect global differences between your samples, for example a single tissue. Further discussion on appropriate choice of normalisation can be found in (Hicks and Irizarry 2015), and the accompanying quantro package includes data-driven tests for the assumptions of quantile normalisation. As we are comparing different blood cell types, which are globally relatively similar, we will apply the preprocessQuantile method to our data (Figure 3). This function implements a stratified quantile normalisation procedure which is applied to the methylated and unmethylated signal intensities separately, and takes into account the different probe types. Note that after normalisation, the data is housed in a GenomicRatioSet object. This is a much more compact representation of the data as the colour channel information has been discarded and the $$M$$ and $$U$$ intensity information has been converted to M-values and beta values, together with associated genomic coordinates. Note, running the preprocessQuantile function on this dataset produces the warning: ‘An inconsistency was encountered while determining sex’; this can be ignored as it is due to all the samples being from male donors.

# normalize the data; this results in a GenomicRatioSet object
mSetSq <- preprocessQuantile(rgSet) 
## [preprocessQuantile] Mapping to genome.
## Warning in .getSex(CN = CN, xIndex = xIndex, yIndex = yIndex, cutoff = cutoff):
## An inconsistency was encountered while determining sex. One possibility is
## that only one sex is present. We recommend further checks, for example with the
## plotSex function.
## [preprocessQuantile] Fixing outliers.
## [preprocessQuantile] Quantile normalizing.
# create a MethylSet object from the raw data for plotting
mSetRaw <- preprocessRaw(rgSet)
# visualise what the data looks like before and after normalisation
par(mfrow=c(1,2))
densityPlot(rgSet, sampGroups=targets$Sample_Group,main="Raw", legend=FALSE) legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))
densityPlot(getBeta(mSetSq), sampGroups=targets$Sample_Group, main="Normalized", legend=FALSE) legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))