1 Introduction

The EpiDISH package provides tools to infer the fractions of a priori known cell subtypes present in a DNA methylation (DNAm) sample that represents a mixture of such cell types. Inference proceeds via one of 3 methods, chosen by the user: Robust Partial Correlations (RPC) (Teschendorff et al. 2017), Cibersort (CBS) (Newman et al. 2015), or Constrained Projection (CP) (Houseman et al. 2012). We also provide a function, CellDMC, which identifies differentially methylated cell types in Epigenome-Wide Association Studies (EWAS) (Zheng, Breeze, et al. 2018). The package currently contains 7 DNAm reference matrices. Four are designed for blood: three for adult whole blood (Teschendorff et al. 2017; Luo et al. 2023), and one for blood tissue of any age, including cord blood and blood from infants, children, adolescents and adults.

  1. centDHSbloodDMC.m: This DNAm reference matrix for blood will estimate fractions for 7 immune cell types (B-cells, NK-cells, CD4T and CD8T-cells, Monocytes, Neutrophils and Eosinophils).
  2. cent12CT.m: This DNAm reference matrix for blood and EPIC-arrays will estimate fractions for 12 immune-cell types (naive and mature B-cells, naive and mature CD4T-cells, naive and mature CD8T-cells, T-regulatory cells, NK-cells, Neutrophils, Monocytes, Eosinophils, Basophils).
  3. cent12CT450k.m: This DNAm reference matrix for blood and Illumina 450k-arrays will estimate fractions for 12 immune-cell types (naive and mature B-cells, naive and mature CD4T-cells, naive and mature CD8T-cells, T-regulatory cells, NK-cells, Neutrophils, Monocytes, Eosinophils, Basophils).
  4. centUniLIFE.m: This DNAm reference matrix for blood tissue of any age will estimate fractions for 19 immune-cell types, comprising 7 cord-blood subtypes (B-cells, NK-cells, Granulocytes, Monocytes, nRBCs, CD4T and CD8T-cells) and 12 adult immune-cell types (naive and mature B-cells, naive and mature CD4T-cells, naive and mature CD8T-cells, T-regulatory cells, NK-cells, Neutrophils, Monocytes, Eosinophils, Basophils).

The other 3 DNAm reference matrices are designed for solid tissue-types (Zheng, Webster, et al. 2018):

  1. centEpiFibIC.m: This DNAm reference matrix is designed for a generic solid tissue that is dominated by an epithelial, stromal and immune-cell component. It will estimate fractions for 3 broad cell-types: a generic epithelial, fibroblast and immune-cell type.
  2. centBloodSub.m: This DNAm reference matrix is designed for a solid tissue-type and will estimate immune cell infiltration for 7 immune cell subtypes. This DNAm reference matrix is meant to be applied after centEpiFibIC.m to yield proportions for 7 immune cell subtypes alongside the total epithelial and total fibroblast fractions.
  3. centEpiFibFatIC.m: This DNAm reference matrix is a more specialised version for breast tissue and will estimate total epithelial, fibroblast, immune-cell and fat fractions.

2 How to estimate cell-type fractions in blood

Here we use the package to estimate 7 immune cell-type fractions in adult whole blood. The example data are a subset of the beta-value matrix from GSE42861 (described in detail in the manual page of LiuDataSub.m). First, we load the required objects and estimate the fractions with the RPC method:

library(EpiDISH)
data(centDHSbloodDMC.m)
data(LiuDataSub.m)
BloodFrac.m <- epidish(beta.m = LiuDataSub.m, ref.m = centDHSbloodDMC.m, method = "RPC")$estF

We can check the inferred fractions with boxplots. As expected, the boxplots show that neutrophils are the most abundant cell type in whole blood.

boxplot(BloodFrac.m)

To infer fractions at the higher resolution of 12 immune cell subtypes, we would replace centDHSbloodDMC.m in the call above with cent12CT450k.m, because this is a 450k DNAm dataset. For an EPIC whole blood dataset, we would use cent12CT.m instead.
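
As a minimal sketch of that substitution, assuming the installed EpiDISH version ships cent12CT450k.m as a data object (as listed in the Introduction), only the ref.m argument changes. The object name BloodFrac12.m is ours, chosen for illustration.

```r
library(EpiDISH)
data(cent12CT450k.m)  # 12 cell-type reference for 450k arrays
data(LiuDataSub.m)    # subset of GSE42861 (450k data)

# Same call as before; only the reference matrix is swapped.
# BloodFrac12.m is an illustrative name, not one defined by the package.
BloodFrac12.m <- epidish(beta.m = LiuDataSub.m, ref.m = cent12CT450k.m,
                         method = "RPC")$estF

# One row per sample, one column per cell type in the reference.
dim(BloodFrac12.m)

# las = 2 rotates the axis labels so that all 12 names stay readable.
boxplot(BloodFrac12.m, las = 2)
```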

3 How to estimate generic cell-type fractions in a solid tissue

To illustrate how this works, we first read in a dummy beta value matrix DummyBeta.m, which contains 2000 CpGs and 10 samples, representing a solid tissue:

data(centEpiFibIC.m)
data(DummyBeta.m)

Notice that centEpiFibIC.m has 3 columns, named Epi, Fib and IC. We now call the epidish function with the RPC method to infer the cell-type fractions.

out.l <- epidish(beta.m = DummyBeta.m, ref.m = centEpiFibIC.m, method = "RPC") 

Then, we check the output list. estF is the matrix of estimated cell-type fractions. ref is the reference centroid matrix used, and dataREF is the subset of the input data matrix over the probes defined in the reference matrix.

out.l$estF
##            Epi        Fib           IC
## S1  0.08836819 0.06109607 0.8505357378
## S2  0.07652115 0.57326994 0.3502089007
## S3  0.15417391 0.75663136 0.0891947251
## S4  0.77082647 0.04171941 0.1874541181
## S5  0.03960599 0.31921224 0.6411817742
## S6  0.12751711 0.79642919 0.0760537000
## S7  0.18144315 0.72889883 0.0896580171
## S8  0.20220823 0.40929344 0.3884983293
## S9  0.19398079 0.80540932 0.0006098973
## S10 0.27976647 0.23671333 0.4835201992
dim(out.l$ref)
## [1] 599   3
dim(out.l$dataREF)
## [1] 599  10
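
Two quick checks on a fraction matrix are often useful: whether each sample's fractions add up to about one, and which cell type dominates each sample. The sketch below uses only base R and re-enters the estF values printed above, rounded to four decimals, so that it runs without the package; the object names estF and dominant are ours.

```r
# Fractions copied from the out.l$estF printout, rounded to 4 decimals.
estF <- matrix(
  c(0.0884, 0.0611, 0.8505,
    0.0765, 0.5733, 0.3502,
    0.1542, 0.7566, 0.0892,
    0.7708, 0.0417, 0.1875,
    0.0396, 0.3192, 0.6412,
    0.1275, 0.7964, 0.0761,
    0.1814, 0.7289, 0.0897,
    0.2022, 0.4093, 0.3885,
    0.1940, 0.8054, 0.0006,
    0.2798, 0.2367, 0.4835),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("S", 1:10), c("Epi", "Fib", "IC"))
)

# Check 1: per-sample fractions should add up to roughly one
# (the tolerance absorbs the rounding applied above).
sums_ok <- all(abs(rowSums(estF) - 1) < 1e-3)

# Check 2: the cell type with the largest fraction in each sample.
dominant <- colnames(estF)[max.col(estF, ties.method = "first")]
names(dominant) <- rownames(estF)

sums_ok
table(dominant)
```

With a real analysis, the same two checks can be applied directly to out.l$estF.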

Note: As part of the quality control step in DNAm data preprocessing, we might have to remove bad probes; consequently, not all probes in the reference matrix may be available in a given dataset. By checking ref and dataREF, we can extract the probes actually used for estimating cell-type fractions. As shown by us (Zheng, Webster, et al. 2018), if the proportion of missing reference matrix probes is more than a third, then estimated fractions may be unreliable.
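
The coverage check described in this note can be sketched in base R. The helper ref_coverage below is hypothetical (it is not part of EpiDISH), the probe IDs and beta values are made up, and the one-third cutoff is taken from the note above.

```r
# Hypothetical helper, not an EpiDISH function: compares the probes of a
# reference matrix with the probes available in a data matrix.
ref_coverage <- function(beta.m, ref.m, max.missing = 1/3) {
  present <- rownames(ref.m) %in% rownames(beta.m)
  missing.frac <- sum(!present) / length(present)
  list(
    n.ref        = length(present),  # probes in the reference
    n.used       = sum(present),     # reference probes found in the data
    missing.frac = missing.frac,     # proportion of reference probes missing
    reliable     = missing.frac <= max.missing
  )
}

# Toy data with made-up probe IDs: 9 reference probes, 5 present in the data.
set.seed(1)
ref.toy <- matrix(runif(27), nrow = 9,
                  dimnames = list(paste0("cg", 1:9), c("Epi", "Fib", "IC")))
beta.toy <- matrix(runif(20), nrow = 10,
                   dimnames = list(c(paste0("cg", 1:5), paste0("other", 1:5)),
                                   c("S1", "S2")))

cov.toy <- ref_coverage(beta.toy, ref.toy)
str(cov.toy)
```

Here 4 of the 9 reference probes are missing, which exceeds one third, so the helper flags the estimates as potentially unreliable. With real data the same helper could be called on the input beta matrix and the reference matrix passed to epidish.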

4 How to estimate immune cell-type fractions in a solid tissue using HEpiDISH

HEpiDISH is an iterative hierarchical procedure of EpiDISH designed for solid tissues with significant immune-cell infiltration. HEpiDISH uses two distinct DNAm references, a primary reference for the estimation of total epithelial, fibroblast and immune-cell fractions, and a separate secondary non-overlapping DNAm reference for the estimation of underlying immune cell subtype fractions.
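
The arithmetic of combining the two levels can be illustrated with a small base R sketch. It assumes that the second-level estimates are proportions within the immune compartment and that they are rescaled by the first-level total immune fraction; this is an illustration of the idea, not the package's implementation. All numbers are made up, and only three immune subtypes are shown for brevity.

```r
# First-level fractions for two hypothetical samples (each row sums to 1).
level1 <- matrix(c(0.6, 0.3, 0.1,
                   0.2, 0.3, 0.5),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("S1", "S2"), c("Epi", "Fib", "IC")))

# Second-level proportions within the immune compartment (each row sums to 1).
level2 <- matrix(c(0.5, 0.3, 0.2,
                   0.1, 0.6, 0.3),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("S1", "S2"), c("B", "NK", "Mono")))

# Rescale each sample's immune subtype proportions by its total immune
# fraction; the length-2 vector is recycled down the rows of level2.
immune <- level2 * level1[, "IC"]

# Final table: total Epi and Fib alongside the rescaled immune subtypes.
combined <- cbind(level1[, c("Epi", "Fib")], immune)
combined
rowSums(combined)
```

Under this assumption the combined fractions of each sample still sum to one, because the immune subtypes together account for exactly the total immune fraction.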