Contents

1 Introduction

The quantification of methylation can be expressed as the log-transformed ratio of methylated over unmethylated signal (M) or the ratio of methylated over total (methylated plus unmethylated) signal (β). It is important to use M values when batch effect correcting Infinium methylation data as these are unbounded. By definition, β is constrained between 0 and 1, and after correction there is no guarantee the values will still be within this range. M values can be batch-adjusted and then transformed back into the more readily interpretable β methylation values by a simple inverse logit transformation. With this transformation, very large negative and positive M values become asymptotic to β of 0 and 1, respectively.

Before detecting clusters, this conversion from M-values to β must be undertaken.

The Harman package is available from Bioconductor (Harman).

2 Transcriptome data examples

2.1 Working with a simple transciptome microarray dataset

First off, let’s load an example dataset from HarmanData where the batches and experimental variable are balanced.

library(Harman)
library(HarmanData)
data(OLF)

The olfactory stem cell study is an experiment to gauge the response of human olfactory neurosphere-derived (hONS) cells established from adult donors to ZnO nanoparticles. A total of 28 Affymetrix HuGene 1.0 ST arrays were normalised together using RMA.

The data comprises six treatment groups plus a control group, each consisting of four replicates:

table(olf.info)
##          Batch
## Treatment 1 2 3 4
##         1 1 1 1 1
##         2 1 1 1 1
##         3 1 1 1 1
##         4 1 1 1 1
##         5 1 1 1 1
##         6 1 1 1 1
##         7 1 1 1 1

The olf.info data.frame has the expt and batch variables across 2 columns, with each sample described in one row to give 28 rows.

olf.info[1:7,]
##    Treatment Batch
## c1         1     1
## c2         2     1
## c3         3     1
## c4         4     1
## c5         5     1
## c6         6     1
## c7         7     1

The data is a data.frame of normalised log2 expression values with dimensions: 33297 rows (features) x 28 columns (samples). This data can be handed to Harman as is, or first coerced into a matrix.

head(olf.data)
##         c1      c2      c3      c4      c5      c6      c7      c8      c9
## p1 5.05866 4.58076 5.58438 2.90481 5.39752 4.24041 2.46891 5.34241 2.86128
## p2 4.23886 4.08143 3.21386 3.53045 4.18741 3.70027 3.05552 4.62957 4.09687
## p3 3.66121 2.79664 4.13699 2.86271 3.17795 2.92988 3.05603 3.42135 2.70507
## p4 8.61399 9.09654 9.16841 9.10928 8.94949 8.70754 8.75558 9.31429 8.63934
## p5 2.84004 2.66609 3.03612 3.26561 3.22945 3.32247 3.05079 3.02775 3.18419
## p6 3.12234 3.05058 3.85761 3.07707 3.67759 3.72965 3.43910 3.15980 2.40544
##        c10     c11     c12     c13     c14     c15     c16     c17     c18
## p1 3.03601 4.16908 2.58603 3.14912 2.80076 5.84228 4.51905 3.63710 5.40139
## p2 3.59235 3.61548 3.87856 3.28656 4.12426 4.75150 4.44983 4.59084 4.87332
## p3 3.02844 3.11758 2.78865 3.82057 2.98588 4.34385 3.04461 3.47405 2.96032
## p4 8.67534 8.68344 9.06311 8.57974 9.01710 8.80506 8.82719 9.17436 8.72230
## p5 3.39400 3.27891 2.20645 3.26020 3.10178 3.72065 3.11529 3.75199 3.43958
## p6 3.33047 2.81670 3.34086 2.61524 2.38151 3.06240 4.51134 3.69132 3.08967
##        c19     c20     c21     c22     c23     c24     c25     c26     c27
## p1 4.61271 5.07018 5.11652 3.11507 3.63363 4.00769 4.57562 4.34872 4.81612
## p2 4.24876 4.43532 5.59789 2.64399 3.54669 3.26678 3.31107 4.03487 3.60095
## p3 3.31022 2.89191 3.84660 3.24867 2.56718 2.99377 3.18528 2.99099 3.14193
## p4 9.14509 9.27561 9.17592 8.79834 8.92382 9.42164 9.07371 8.91382 8.76901
## p5 3.41088 3.42294 3.77549 3.47989 2.79997 2.41906 2.64609 3.14506 3.39344
## p6 5.20479 4.35899 3.80101 2.95787 2.82707 2.76646 2.89902 3.28622 4.27340
##        c28
## p1 2.64101
## p2 4.10095
## p3 2.74735
## p4 8.30663
## p5 3.82901
## p6 3.21685

2.1.1 Running Harman

Okay, so let’s run Harman.

olf.harman <- harman(olf.data, expt=olf.info$Treatment, batch=olf.info$Batch, limit=0.95)

This creates a harmanresults S3 object which has a number of slots. These include the centre and rotation slots of a prcomp object which is returned after calling prcomp(t(x)), where x is the datamatrix supplied. These two slots are required for reconstructing the data later. The other slots are the factors supplied for the analysis (specified as expt and batch), the runtime parameters and some summary stats for Harman. Finally, there are the original and corrected PC scores.

str(olf.harman)
## List of 7
##  $ factors   :'data.frame':  28 obs. of  4 variables:
##   ..$ expt         : Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7 1 2 3 ...
##   ..$ batch        : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 2 2 ...
##   ..$ expt.numeric : int [1:28] 1 2 3 4 5 6 7 1 2 3 ...
##   ..$ batch.numeric: int [1:28] 1 1 1 1 1 1 1 2 2 2 ...
##  $ parameters:List of 4
##   ..$ limit     : num 0.95
##   ..$ numrepeats: int 100000
##   ..$ randseed  : num 7.35e+08
##   ..$ forceRand : logi FALSE
##  $ stats     :'data.frame':  28 obs. of  3 variables:
##   ..$ dimension : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
##   ..$ confidence: num [1:28] 0.95 0.949 0.95 0.95 0.951 ...
##   ..$ correction: num [1:28] 0.25 0.33 0.5 0.9 0.44 0.85 0.74 1 1 1 ...
##  $ center    : Named num [1:33297] 4.13 3.95 3.19 8.92 3.19 ...
##   ..- attr(*, "names")= chr [1:33297] "p1" "p2" "p3" "p4" ...
##  $ rotation  : num [1:33297, 1:28] -0.020637 -0.011028 -0.006488 -0.000785 -0.00551 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:33297] "p1" "p2" "p3" "p4" ...
##   .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
##  $ original  : num [1:28, 1:28] -26.72 3.79 -19.76 25.66 -3.14 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:28] "c1" "c2" "c3" "c4" ...
##   .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
##  $ corrected : num [1:28, 1:28] -27.17 3.34 -20.21 25.21 -3.59 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:28] "c1" "c2" "c3" "c4" ...
##   .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "harmanresults"

2.1.2 Inspecting results

Harman objects can be inspected with methods such as pcaPlot and arrowPlot, as well as the generic functions plot and summary.

plot(olf.harman)

arrowPlot(olf.harman, main='Arrowplot of correction')

Using summary or inspecting the stats slot we can see Harman corrected the first 7 PCs and left the rest uncorrected.

summary(olf.harman)
## Total factor levels:
## 
##  expt batch 
##     7     4 
## 
## Experiment x Batch Design:
## 
##     batch
## expt 1 2 3 4
##    1 1 1 1 1
##    2 1 1 1 1
##    3 1 1 1 1
##    4 1 1 1 1
##    5 1 1 1 1
##    6 1 1 1 1
##    7 1 1 1 1
## 
## Simulation parameters:
## 100000 simulations (with seed of 734668396). ForceRand is FALSE.
## 
## Harman results with confidence limit of 0.95:
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 
## 0.25 0.33 0.50 0.90 0.44 0.85 0.74 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
## PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
## 
## Top batch-effected PCs:
##  PC1  PC2  PC5  PC3  PC7 
## 0.25 0.33 0.44 0.50 0.74
dimension confidence correction
PC1 0.9500996 0.25
PC2 0.9490076 0.33
PC3 0.9500139 0.50
PC4 0.9497873 0.90
PC5 0.9509498 0.44
PC6 0.9505247 0.85
PC7 0.9507247 0.74
PC8 0.8353300 1.00
PC9 0.8897871 1.00
PC10 0.7478036 1.00
PC11 0.7534338 1.00
PC12 0.6490559 1.00
PC13 0.8538693 1.00
PC14 0.5361958 1.00
PC15 0.7655424 1.00
PC16 0.4155103 1.00
PC17 0.6931610 1.00
PC18 0.7644623 1.00
PC19 0.2869585 1.00
PC20 0.1424034 1.00
PC21 0.8732809 1.00
PC22 0.8205077 1.00
PC23 0.8612620 1.00
PC24 0.9488360 1.00
PC25 0.5062507 1.00
PC26 0.5842571 1.00
PC27 0.9466016 1.00
PC28 0.9111912 1.00

As the confidence (limit) was 0.95, Harman will look for batch effects until this limit is met. It does this by reducing the batch means in an iterative way using a binary search algorithm until the chance that biological variance is being removed (the factor given in expt) is too high. Consider the values specified in the confidence column as the likelihood that separation of samples in this PC is due to batch alone and not due to the experimental variable. If this confidence value is less than limit, Harman will not make another iterative correction. For example, on PC8 the confidence is about 0.835 which is below the limit of 0.95, so Harman did not alter data on this PC. Let’s check this on a plot.

arrowPlot(olf.harman, 1, 8)

The horizontal arrows show us that, on PC1 the scores were shifted, but on PC8 they were not.

If both dimensions were not shifted, the arrowPlot will default to printing x instead of arrows (can’t have 0 length arrows!)

arrowPlot(olf.harman, 8, 9)

We can also easily colour the PCA plot by the experimental variable to compare:

par(mfrow=c(1,2))
pcaPlot(olf.harman, colBy='batch', pchBy='expt', main='Col by Batch')
pcaPlot(olf.harman, colBy='expt', pchBy='batch', main='Col by Expt')

2.1.3 Reconstruct the corrected data

So, if corrections have been made and we’re happy with the value of limit, then we reconstruct corrected data from the Harman adjusted PC scores. To do this, a harmanresults object is handed to the function reconstructData(). This function produces a matrix of the same dimensions as the input matrix filled with corrected data.

olf.data.corrected <- reconstructData(olf.harman)

This new data can be over-written into something like an ExpressionSet object with a command like exprs(eset) <- data.corrected. An example of this is below.

To convince you that Harman is working as it should in reconstructing the data back from the PCA domain, let’s test this principle on the original data.

olf.data.remade <- reconstructData(olf.harman, this='original')
all.equal(as.matrix(olf.data), olf.data.remade)
## [1] TRUE

So, the data is indeed the same (to within the machine epsilon limit for floating point error)!

2.1.4 How does the new data differ from the old data?

Let’s try graphically plotting the differences using image(). First, the rows (assays) are ordered by the variation in the difference between the original and corrected data.

olf.data.diff <- abs(as.matrix(olf.data) - olf.data.corrected)

diff_rowvar <- apply(olf.data.diff, 1, var)
by_rowvar <- order(diff_rowvar)

par(mfrow=c(1,1))
image(x=1:ncol(olf.data.diff),
      y=1:nrow(olf.data.diff),
      z=t(olf.data.diff[by_rowvar, ]),
      xlab='Sample',
      ylab='Probeset',
      main='Harman probe adjustments (ordered by variance)',
      cex.main=0.7,
      col=brewer.pal(9, 'Reds'))

We can see that most Harman adjusted probes were from batch 3 (samples 15 to 21), while batch 1 is relatively unchanged. In batches 2 and 3, often the same probes were adjusted to a larger extent in batch 3. This suggests some probes on this array design are prone to a batch effect.

2.2 Another simple transciptome dataset

The IMR90 Human lung fibroblast cell line data that was published in a paper by Johnson et al doi: 10.1093/biostatistics/kxj037 comes with Harman as an example dataset.

require(Harman)
require(HarmanData)
data(IMR90)

The data is structured like so:

Treatment Batch
_23_1CON 1 1
_23_2CON 2 1
_23_3NO0 3 1
_23_4NO7 4 1
_26_1CON 1 2
_26_2CON 2 2
_26_3NO0 3 2
_26_4NO7 4 2
CONTROLZ 1 3
CONTROL7 2 3
NO_ZERO 3 3
NO_7_5 4 3

It can be seen from the below table the experimental structure is completely balanced.

table(expt=imr90.info$Treatment, batch=imr90.info$Batch)
##     batch
## expt 1 2 3
##    1 1 1 1
##    2 1 1 1
##    3 1 1 1
##    4 1 1 1

2.2.1 Evidence of batch effects

While there isn’t glaring batch effects in PC1 and PC2, they are more apparent when plotting PC1 and PC3.

par(mfrow=c(1,2), mar=c(4, 4, 4, 2) + 0.1)
imr90.pca <- prcomp(t(imr90.data), scale. = TRUE)
prcompPlot(imr90.pca, pc_x=1, pc_y=2, colFactor=imr90.info$Batch,
           pchFactor=imr90.info$Treatment, main='IMR90, PC1 v PC2')
prcompPlot(imr90.pca, pc_x=1, pc_y=3, colFactor=imr90.info$Batch,
           pchFactor=imr90.info$Treatment, main='IMR90, PC1 v PC3')

2.2.2 Running Harman

imr90.hm <- harman(as.matrix(imr90.data), expt=imr90.info$Treatment,
                   batch=imr90.info$Batch)

We can see the most batch correction was actually on the 3rd principal component and the data was also corrected on the 1st and 4th component.

dimension confidence correction
PC1 0.9498640 0.76
PC2 0.9247775 1.00
PC3 0.9503938 0.35
PC4 0.9502018 0.69
PC5 0.6654655 1.00
PC6 0.2424223 1.00
PC7 0.2775396 1.00
PC8 0.7810260 1.00
PC9 0.1267338 1.00
PC10 0.3193122 1.00
PC11 0.1313642 1.00
PC12 0.5495756 1.00

Plotting PC1 v. PC2 and PC1 v. PC3…

plot(imr90.hm, pc_x=1, pc_y=2)