1 Introduction

The ENmix package provides quality control, preprocessing/correction and data analysis tools for Illumina Methylation BeadChips. It includes functions for reading raw IDAT data, background correction, dye bias correction and probe-type bias adjustment, along with a number of additional tools. These functions can be used to remove unwanted experimental noise and thus improve the accuracy and reproducibility of methylation measures. ENmix functions are flexible and transparent. Users can either run a single pipeline command that completes all data preprocessing steps (quality control, background correction, dye-bias adjustment, between-array normalization and probe-type bias correction) or call individual functions sequentially to preprocess data in a more customized manner.

In addition, the ENmix package has complementary functions for: efficient data visualization (such as QC plots, data distribution plots, Manhattan plots and Q-Q plots); quality control (identifying and filtering low-quality data points, samples, probes and outliers, along with imputation of missing values); identification of probes with multimodal distributions due to SNPs or other factors; exploration of data variance structure using principal component regression analysis plots; preparation of surrogate control variables related to experimental factors, to be adjusted for in downstream statistical analysis; an efficient algorithm, oxBS-MLE, to estimate 5-methylcytosine and 5-hydroxymethylcytosine levels; estimation of cell type proportions; methylation age calculation; and differentially methylated region (DMR) analysis.

Most ENmix functions also support the data structures used by several related R packages, such as minfi, wateRmelon and ChAMP, providing straightforward integration of ENmix-corrected datasets into subsequent data analysis.

The ENmix readidat() function does not depend on array annotation R packages. It can read Illumina manifest files directly, which makes it easier to work with newer arrays, such as MethylationEPIC v2.0 and the mouse BeadChip.

The software is designed to support large-scale data analysis and provides multi-processor parallel computing options for most functions.

2 List of functions

Data acquisition

  • readidat(): Read idat files into R
  • readmanifest(): Read array manifest file into R

Quality control

  • QCinfo(): Extract and visualize QC information
  • plotCtrl(): Generate internal control plots
  • getCGinfo(): Extract CpG probe annotation information
  • calcdetP(): Compute detection P values
  • qcfilter(): Remove low quality values, samples or CpGs; remove outlier samples and perform imputation
  • nmode(): Identify “gap” probes, i.e. those with multimodal distributions caused by underlying SNPs
  • dupicc(): Calculate intraclass correlation coefficient (ICC) using data from duplicates
  • freqpoly(): Frequency polygon plot for single variable
  • multifreqpoly(): Frequency polygon plot for multiple variables

Preprocessing

  • mpreprocess(): Preprocessing pipeline
  • preprocessENmix(): ENmix background correction and dye bias correction
  • relic(): RELIC dye bias correction
  • norm.quantile(): Quantile normalization
  • rcp(): RCP probe design type bias correction

Differentially methylated region (DMR) analysis

  • ipdmr(): ipDMR differentially methylated region analysis
  • combp(): Combp differentially methylated region analysis

Other functions

  • oxBS.MLE(): MLE estimates of 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC)
  • estimateCellProp(): Estimate white blood cell type proportions
  • methyAge(): Calculate methylation age
  • methscore(): Calculate various methylation predictors, including DNA methylation age, exposures and plasma protein levels
  • predSex(): Estimate sample sex
  • ctrlsva(): Derive surrogate variables to control for experimental confounding using non-negative internal control probes
  • pcrplot(): Principal component regression plot
  • mhtplot(): P value Manhattan plot
  • p.qqplot(): P value Q-Q plot
  • B2M(): Convert Beta value to M value
  • M2B(): Convert M value to Beta value
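
The Beta/M conversions listed above can be illustrated with a minimal base R sketch. It assumes the commonly used definitions M = log2(Beta / (1 - Beta)) and Beta = 2^M / (1 + 2^M). The helper names beta_to_m and m_to_beta are hypothetical and are not part of ENmix; the package's own B2M() and M2B() may treat boundary values (Beta equal to 0 or 1) differently.

```r
# Hypothetical helpers (not ENmix functions) illustrating the
# standard Beta <-> M transformations on the log2 scale.

# Beta -> M: base-2 logit of the methylation proportion.
# Note: Beta values of exactly 0 or 1 give -Inf or Inf here.
beta_to_m <- function(beta) {
  log2(beta / (1 - beta))
}

# M -> Beta: inverse of the transformation above.
m_to_beta <- function(m) {
  2^m / (1 + 2^m)
}

# Toy Beta values for three CpGs (illustrative numbers only)
beta <- c(0.1, 0.5, 0.9)

m <- beta_to_m(beta)       # M values; Beta = 0.5 maps to M = 0
beta_back <- m_to_beta(m)  # should recover the original Beta values

# Round-trip check
stopifnot(isTRUE(all.equal(beta, beta_back)))
```

M values are unbounded and symmetric around 0, which is why they are often preferred for statistical modeling, while Beta values are easier to interpret as methylation proportions.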

3 ENmix classes

ENmix organizes data with two different classes.

rgDataSet contains raw data (including internal control probes) from IDAT files, CpG annotation from the Illumina manifest file and/or sample information (plate, array, and phenotypes) provided by users. Array intensity data are organized by probe (not CpG locus) for the red and green channels.

methDataSet contains methylated and unmethylated intensity values (organized by CpG), CpG annotation from the Illumina manifest file and/or sample information (plate, array, and phenotypes) provided by users.