Illumina Infinium HumanMethylation 450K BeadChip assay has become a standard tool to analyse methylation in human samples. Developed in 2011, it has already been used in projects such as The Cancer Genome Atlas (TCGA). Their 450.000 probes provide a good overall image of the methylation state of the genome, being one of the reasons of its success.
Given its complex design1 More information can be found at this minfi tutorial, many Bioconductor packages have been developed to assess normalization and pre-processing issues (e.g. minfi (Aryee et al. 2014) or lumi (Du, Kibbe, and Lin 2008)). In addition, these packages can detect differentially methylated probes (DMPs) and differentially methylated regions (DMRs). However, the interfaces are not very intuitive and several scripting steps are usually required.
MEAL aims to facilitate the analysis of Illumina Methylation 450K chips. We have included two methods to analyze DMPs (Differentially Methylated Probes), that test differences in means (limma) or differences in variance (DiffVar). We have included three DMRs (Differentially Methylated Regions) detection algorithms (bumphunter, blockFinder and DMRcate) and a new method to test differences in methylation in a target region (RDA). Finally, we have prepared plots for all these analyses as well as a wrapper to run all the analyses in the same dataset.
MEAL is meant to analyze methylation data already preprocessed. All our functions accept a
GenomicRatioSet as input, which is a class from minfi package designed to manage preprocessed methylation data. Users willing to preprocess their own data are encouraged to take a look to minfi’s vignette
In this vignette, we will use methylation data from minfiData package.
library(MEAL) library(MultiDataSet) library(minfiData) library(minfi) library(ggplot2) data("MsetEx")
MsetEx is a
MethylationRatioSet that contains measurements for 485512 CpGs and 6 samples, as well as some phenotypic variables such as age or sex. The first step will be to convert it to a
GenomicRatioSet. Then, we will add some extra features annotation. Finally, we will remove probes not measuring methylation, with SNPs or with NAs. For the sake of speed, we will select a subset of CpGs:
meth <- mapToGenome(ratioConvert(MsetEx)) rowData(meth) <- getAnnotation(meth)[, -c(1:3)] ## Remove probes measuring SNPs meth <- dropMethylationLoci(meth) ## Remove probes with SNPs meth <- dropLociWithSnps(meth) ## Remove probes with NAs meth <- meth[!apply(getBeta(meth), 1, function(x) any(is.na(x))), ] ## Select a subset of samples set.seed(0) meth <- meth[sample(nrow(meth), 100000), ]
runPipeline run all methods included in MEAL to the same dataset. We only need to pass to this function a
GenomicRatioSet and the name of our variable of interest. In our case, we will analyze the effect of cancer on methylation:
res <- runPipeline(set = meth, variable_names = "status")
runPipeline includes several parameters to customize the analyses. The most important parameters are
covariable_names is used to include covariates in our models.
betas allows the user choosing between running the analyis with beta (TRUE) or M-values (FALSE). If
sva is TRUE, Surrogate Variable Analysis is run and surrogate variables are included in the models. Finally, some parameters modify the behaviour of the methods included in the wrapper and they will be covered later on. More information about the parameters can be found in the documentation (by typing ?runPipeline).
We will run a new analysis including age as covariate:
resAdj <- runPipeline(set = meth, variable_names = "status", covariable_names = "age", analyses = c("DiffMean", "DiffVar")) resAdj
## Object of class 'ResultSet' ## . created with: runPipeline ## . sva: no ## . #results: 2 ( error: 0 ) ## . featureData: 100000 probes x 35 variables
runPipeline generates a
ResultSet is a class designed to encapsulate different results from the same dataset. It contains the results of the different methods, the feature data and other data required to get tables or plots. We can examine the analyses included in a
ResultSet with the function
##  "DiffMean" "DiffVar"
Both objects contains five analyses. DiffMean is an analysis of difference of means performed with limma while the others are named with the method name (DiffVar, bumphunter, blockFinder and dmrcate).
We can use the function
getAssociation to get a data.frame with the results, independent of the original method. This function has two main arguments:
object is the
ResultSet with our data and
rid is the name or the index of the analysis we want to extract.
## logFC CI.L CI.R AveExpr t P.Value ## cg09383816 -0.5938196 -0.6310536 -0.5565855 0.4885486 -42.93778 7.098649e-07 ## cg26729197 -0.5779628 -0.6186821 -0.5372435 0.3630791 -38.21419 1.176118e-06 ## cg19333963 -0.6532176 -0.7032282 -0.6032071 0.4092923 -35.16587 1.685708e-06 ## cg09150117 -0.6045812 -0.6509778 -0.5581846 0.4853409 -35.08275 1.703065e-06 ## cg13307451 -0.5417324 -0.5838896 -0.4995751 0.3924193 -34.59696 1.809029e-06 ## cg19859781 0.4257460 0.3904540 0.4610380 0.5983762 32.47878 2.377898e-06 ## adj.P.Val B SE ## cg09383816 0.01629257 5.595805 0.03090437 ## cg26729197 0.01629257 5.410263 0.02239892 ## cg19333963 0.01629257 5.258877 0.07809844 ## cg09150117 0.01629257 5.254326 0.04401591 ## cg13307451 0.01629257 5.227227 0.01771050 ## cg19859781 0.01629257 5.098421 0.07097404
## logFC CI.L CI.R AveExpr t P.Value ## cg11847929 -2.327910 -3.146772 -1.5090476 1.4656792 -6.813262 0.0003320046 ## cg06265760 -1.730019 -2.362333 -1.0977058 0.9442413 -6.557196 0.0004149426 ## cg02076607 -1.541192 -2.106092 -0.9762907 1.1829703 -6.538591 0.0004218303 ## cg05635754 -2.438282 -3.333548 -1.5430149 1.2934636 -6.527263 0.0004260870 ## cg22891595 -1.525000 -2.085639 -0.9643620 0.9589610 -6.519089 0.0004291892 ## cg01287975 -1.881061 -2.575312 -1.1868104 0.9752451 -6.493611 0.0004390231 ## adj.P.Val B SE ## cg11847929 0.1787794 0.23218162 0.10537499 ## cg06265760 0.1787794 0.08995047 0.09685284 ## cg02076607 0.1787794 0.07929272 0.16051859 ## cg05635754 0.1787794 0.07278219 0.12877918 ## cg22891595 0.1787794 0.06807333 0.16428448 ## cg01287975 0.1787794 0.05334128 0.15043591
DiffMean and DiffVar are internally stored as a MArrayLM, the class from limma results. This class allows testing different constrasts or evaluating different variables simultaneously. The function
getProbeResults helps the user performing these operations. It also has the arguments
coef is a numeric with the index of the coefficient from which we want the results. If we did not pass a custom model to
runPipeline, the first coefficient (coef = 1) is the intercept and the second coefficient (coef = 2) is the first variable that we included in
variable_names. We can evaluate different coefficients simultaneously by passing a vector to
contrast is a matrix with the contrasts that we want to evaluate. This option is useful when our variable of interest is a factor with several levels and we want to do all the different comparisons. Finally, the argument
fNames is used to select the variables from features annotation that will be added to the tables.
To exemplify the use of this function, we will evaluate our whole adjusted model, including age coefficient. We will also add some annotation of the CpGs:
head(getProbeResults(resAdj, rid = 1, coef = 2:3, fNames = c("chromosome", "start")))
## statusnormal age AveExpr F P.Value ## cg09383816 -0.5938196 -0.0026657333 0.4885486 929.9010 1.924508e-06 ## cg26729197 -0.5779628 -0.0005001219 0.3630791 730.3999 3.246391e-06 ## cg13307451 -0.5417324 -0.0064613777 0.3924193 635.4818 4.387672e-06 ## cg09150117 -0.6045812 0.0032853408 0.4853409 623.2986 4.575331e-06 ## cg19333963 -0.6532176 0.0019529459 0.4092923 620.7215 4.616527e-06 ## cg19859781 0.4257460 -0.0020866765 0.5983762 532.9428 6.419871e-06 ## adj.P.Val SE.statusnormal SE.age chromosome start ## cg09383816 0.04313762 0.0309043657 0.0014823190 chr8 67344556 ## cg26729197 0.04313762 0.0223989196 0.0010743577 chr17 71161415 ## cg13307451 0.04313762 0.0780984369 0.0037459690 chr10 22541442 ## cg09150117 0.04313762 0.0440159139 0.0021112106 chr7 96653867 ## cg19333963 0.04313762 0.0177105042 0.0008494792 chr19 1467979 ## cg19859781 0.04313762 0.0709740354 0.0034042491 chr13 40174685
When more than one coefficient is evaluated, a estimate for each coefficient is returned and the t-statistic is substituted by a F-statistic. More information about linear models, including a detailed section of how to create a constrast matrix can be found in limma users’ guide.
Finally, we can obtain the results of CpGs mapped to some genes with the function
getGeneVals. This function accepts the same arguments than
getProbeResults but includes the arguments
genecol to pass the names of the genes to be selected and the column name of feature data containing gene names.
We will retrieve the difference in variance results for all CpGs mapped to ARMS2. We can see in the rowData of
meth that gene names are in the column ‘UCSC_RefGene_Name’:
getGeneVals(resAdj, "ARMS2", genecol = "UCSC_RefGene_Name", fNames = c("chromosome", "start"))
## logFC CI.L CI.R AveExpr t P.Value ## cg00676728 0.1106853 0.01259430 0.2087763 0.8206485 3.037987 0.03458282 ## cg18222240 0.1874701 -0.01570234 0.3906426 0.5432686 2.484233 0.06300344 ## adj.P.Val B SE chromosome start ## cg00676728 0.1946025 -4.130757 0.01303783 chr10 124213760 ## cg18222240 0.2718642 -4.801220 0.02864301 chr10 124213527 ## UCSC_RefGene_Name ## cg00676728 ARMS2 ## cg18222240 ARMS2
We can easily get Manhattan plots, Volcano plots and QQ-plots for the probes results (DiffMean and DiffVar) using
plot method. Our extension of
plot method to
ResultSet includes the arguments
coef that were already present in
getProbeResult. In addition, the argument
type allows choosing between a Manhattan plot (“manhattan”), a Volcano plot (“volcano”) or a qq-plot (“qq”).
We can customize different aspects of a Manhattan plot. We can highlight the CpGs of a target region by passing a
GenomicRanges to the argument
highlight. Similarly, we can get a Manhattan plot with only the CpGs of our target region passing a
GenomicRanges to the argument
subset. It should be noticed that the
GenomicRange should have the chromosome as a number (1-24).
We will show these capabilities by highlighting and subsetting a region of ten Mb in chromosome X:
targetRange <- GRanges("23:13000000-23000000") plot(resAdj, rid = "DiffMean", type = "manhattan", highlight = targetRange)