**PLSDA-batch** is a new batch effect correction method based on Projection
to Latent Structures Discriminant Analysis to correct data prior to any
downstream analysis. It estimates latent components related to treatment and
batch effects to remove batch variation. PLSDA-batch is highly suitable for
microbiome data as it is non-parametric, multivariate and allows for ordination
and data visualisation. Combined with centered log ratio transformation for
addressing uneven library sizes and compositional structure, PLSDA-batch
addresses all characteristics of microbiome data that existing correction
methods have ignored so far.

Apart from the main method, the R package also includes two variants called
1/ *weighted PLSDA-batch* for unbalanced batch x treatment designs that are
commonly encountered in studies with small sample size, and
2/ *sparse PLSDA-batch* for selection of discriminative variables to avoid
overfitting in classification problems. These two variants have widened the
scope of applicability of PLSDA-batch to different data
settings (**???**).

This vignette includes microbiome data pre-processing, batch effect detection and visualisation, the usage of PLSDA-batch series methods, assessment of batch effect removal and variable selection after batch effect correction. See “Batch Effects Management in Case Studies” for different choices of methods for batch effect management according to experimental purposes and designs.

First, we load the packages necessary for analysis, and check the version of each package.

```
# CRAN
library(pheatmap)
library(vegan)
library(gridExtra)
# Bioconductor
library(mixOmics)
library(Biobase)
library(TreeSummarizedExperiment)
library(PLSDAbatch)
# print package versions
package.version('pheatmap')
```

`## [1] "1.0.12"`

`package.version('vegan')`

`## [1] "2.6-4"`

`package.version('gridExtra')`

`## [1] "2.3"`

`package.version('mixOmics')`

`## [1] "6.29.0"`

`package.version('Biobase')`

`## [1] "2.65.0"`

`package.version('PLSDAbatch')`

`## [1] "1.1.0"`

We considered a case study to illustrate the application of PLSDA-batch. This study is described as follows:

\(\color{blue}{\bf{\text{Anaerobic digestion.}}}\) This study explored the
microbial indicators that could improve the efficacy of anaerobic digestion
(AD) bioprocess and prevent its failure (**???**). This data
include 75 samples and 567 microbial variables. The samples were treated with
two different ranges of phenol concentration (effect of interest) and
processed at five different dates (batch effect). This study includes a clear
and strong batch effect with an approx. balanced batch x treatment design.

We load the \(\color{blue}{\text{AD data}}\) stored internally with function
`data()`

, and then extract the batch and treatment information out.

```
# AD data
data('AD_data')
ad.count <- assays(AD_data$FullData)$Count
dim(ad.count)
```

`## [1] 75 567`

```
ad.metadata <- rowData(AD_data$FullData)
ad.batch = factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date))
ad.trt = as.factor(ad.metadata$initial_phenol_concentration.regroup)
names(ad.batch) <- names(ad.trt) <- rownames(ad.metadata)
```

The raw \(\color{blue}{\text{AD data}}\) include 567 OTUs and 75 samples.
We then use the function `PreFL()`

from our \(\color{orange}{\text{PLSDAbatch}}\)
R package to filter the data.

```
ad.filter.res <- PreFL(data = ad.count)
ad.filter <- ad.filter.res$data.filter
dim(ad.filter)
```

`## [1] 75 231`

```
# zero proportion before filtering
ad.filter.res$zero.prob
```

`## [1] 0.6328042`

```
# zero proportion after filtering
sum(ad.filter == 0)/(nrow(ad.filter) * ncol(ad.filter))
```

`## [1] 0.3806638`

After filtering, 231 OTUs remained, and the proportion of zeroes decreased from 63% to 38%.

Note: The `PreFL()`

function is only dedicated to raw counts, rather than
relative abundance data. We also recommend to start the pre-filtering on raw
counts, rather than relative abundance data to mitigate the compositionality
issue.

Prior to CLR transformation, we recommend adding 1 as the offset for the data
(e.g., \(\color{blue}{\text{AD data}}\)) that are raw count data, and 0.01 as
the offset for the data that are relative abundance data. We use
`logratio.transfo()`

function in \(\color{orange}{\text{mixOmics}}\) package to
CLR transform the data.

```
ad.clr <- logratio.transfo(X = ad.filter, logratio = 'CLR', offset = 1)
class(ad.clr) = 'matrix'
```

We apply `pca()`

function from \(\color{orange}{\text{mixOmics}}\) package to
the \(\color{blue}{\text{AD data}}\) and `Scatter_Density()`

function from
\(\color{orange}{\text{PLSDAbatch}}\) to represent the PCA sample plot with
densities.

```
# AD data
ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
Scatter_Density(object = ad.pca.before, batch = ad.batch, trt = ad.trt,
title = 'AD data', trt.legend.title = 'Phenol conc.')
```

In the above figure, we observed 1) the distinction between samples treated with different phenol concentrations and 2) the differences between samples sequenced at “14/04/2016”, “21/09/2017” and the other dates. Therefore, the batch effect related to dates needs to be removed.

We first identify the top OTU driving the major variance in PCA using
`selectVar()`

in \(\color{orange}{\text{mixOmics}}\) package. Each identified
OTU can then be plotted as boxplots and density plots using `box_plot()`

and
`density_plot()`

in \(\color{orange}{\text{PLSDAbatch}}\).

```
ad.OTU.name <- selectVar(ad.pca.before, comp = 1)$name[1]
ad.OTU_batch <- data.frame(value = ad.clr[,ad.OTU.name], batch = ad.batch)
box_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, '(AD data)'),
x.angle = 30)
```

`density_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, '(AD data)'))`

The boxplot and density plot indicated a strong date batch effect because of the differences between “14/04/2016”, “21/09/2017” and the other dates in the “OTU28”.

We also apply a linear regression model to the “OTU28” using `linear_regres()`

from \(\color{orange}{\text{PLSDAbatch}}\) with batch and treatment effects as
covariates. We set “14/04/2016” and “21/09/2017” as the reference batch
respectively with `relevel()`

from \(\color{orange}{\text{stats}}\).

```
# reference batch: 14/04/2016
ad.batch <- relevel(x = ad.batch, ref = '14/04/2016')
ad.OTU.lm <- linear_regres(data = ad.clr[,ad.OTU.name],
trt = ad.trt, batch.fix = ad.batch,
type = 'linear model')
summary(ad.OTU.lm$model$data)
```

```
##
## Call:
## lm(formula = data[, i] ~ trt + batch.fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.3279 0.1635 0.3849 0.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8501 0.2196 3.871 0.000243 ***
## trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
## batch.fix09/04/2015 1.5963 0.2950 5.410 8.55e-07 ***
## batch.fix01/07/2016 2.0839 0.2345 8.886 4.82e-13 ***
## batch.fix14/11/2016 1.7405 0.2480 7.018 1.24e-09 ***
## batch.fix21/09/2017 1.2646 0.2690 4.701 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
## F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
```

```
# reference batch: 21/09/2017
ad.batch <- relevel(x = ad.batch, ref = '21/09/2017')
ad.OTU.lm <- linear_regres(data = ad.clr[,ad.OTU.name],
trt = ad.trt, batch.fix = ad.batch,
type = 'linear model')
summary(ad.OTU.lm$model$data)
```

```
##
## Call:
## lm(formula = data[, i] ~ trt + batch.fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.3279 0.1635 0.3849 0.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1147 0.2502 8.453 2.97e-12 ***
## trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
## batch.fix14/04/2016 -1.2646 0.2690 -4.701 1.28e-05 ***
## batch.fix09/04/2015 0.3317 0.3139 1.056 0.29446
## batch.fix01/07/2016 0.8193 0.2573 3.185 0.00218 **
## batch.fix14/11/2016 0.4759 0.2705 1.760 0.08292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
## F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
```

From the results of linear regression, we observed P < 0.001 for the regression coefficients associated with all the other batches when the reference batch was “14/04/2016”, which confirmed the difference between the samples from batch “14/04/2016” and the other samples as observed from previous plots. When the reference batch was “21/09/2017”, we also observed significant differences between batch “21/09/2017” and “14/04/2016”, between “21/09/2017” and “01/07/2016”. Therefore, the batch effect because of “21/09/2017” also exists.

We produce a heatmap using \(\color{orange}{\text{pheatmap}}\) package. The data first need to be scaled on both OTUs and samples.

```
# scale the clr data on both OTUs and samples
ad.clr.s <- scale(ad.clr, center = TRUE, scale = TRUE)
ad.clr.ss <- scale(t(ad.clr.s), center = TRUE, scale = TRUE)
ad.anno_col <- data.frame(Batch = ad.batch, Treatment = ad.trt)
ad.anno_colors <- list(Batch = color.mixo(seq_len(5)),
Treatment = pb_color(seq_len(2)))
names(ad.anno_colors$Batch) = levels(ad.batch)
names(ad.anno_colors$Treatment) = levels(ad.trt)
pheatmap(ad.clr.ss,
cluster_rows = FALSE,
fontsize_row = 4,
fontsize_col = 6,
fontsize = 8,
clustering_distance_rows = 'euclidean',
clustering_method = 'ward.D',
treeheight_row = 30,
annotation_col = ad.anno_col,
annotation_colors = ad.anno_colors,
border_color = 'NA',
main = 'AD data - Scaled')
```

In the heatmap, samples in the \(\color{blue}{\text{AD data}}\) from batch dated “14/04/2016” were clustered and distinct from other samples, indicating a batch effect.

We apply pRDA with `varpart()`

function from \(\color{orange}{\text{vegan}}\)
R package.

```
# AD data
ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch)
class(ad.clr) <- 'matrix'
ad.rda.before <- varpart(ad.clr, ~ trt, ~ batch,
data = ad.factors.df, scale = TRUE)
ad.rda.before$part$indfract
```

```
## Df R.squared Adj.R.squared Testable
## [a] = X1|X2 1 NA 0.08943682 TRUE
## [b] = X2|X1 4 NA 0.26604420 TRUE
## [c] 0 NA 0.01296248 FALSE
## [d] = Residuals NA NA 0.63155651 FALSE
```

In the result, `X1`

and `X2`

represent the first and second covariates fitted
in the model. `[a]`

, `[b]`

represent the independent proportion of variance
explained by `X1`

and `X2`

respectively, and `[c]`

represents the intersection
variance shared between `X1`

and `X2`

. In the \(\color{blue}{\text{AD data}}\),
batch variance (`X2`

) was larger than treatment variance (`X1`

) with some
interaction proportion (indicated in line `[c]`

, Adj.R.squared = 0.013). The
greater the intersection variance, the more unbalanced batch x treatment design
is. In this study, we considered the design as approx. balanced.

The `PLSDA_batch()`

function is implemented in
\(\color{orange}{\text{PLSDAbatch}}\) package. To use this function, we need to
specify the optimal number of components related to treatment (`ncomp.trt`

) or
batch effects (`ncomp.bat`

).

Here in the \(\color{blue}{\text{AD data}}\), we use `plsda()`

from
\(\color{orange}{\text{mixOmics}}\) with only treatment grouping information to
estimate the optimal number of treatment components to preserve.

```
# estimate the number of treatment components
ad.trt.tune <- plsda(X = ad.clr, Y = ad.trt, ncomp = 5)
ad.trt.tune$prop_expl_var #1
```

```
## $X
## comp1 comp2 comp3 comp4 comp5
## 0.18619506 0.07873817 0.08257396 0.09263497 0.06594757
##
## $Y
## comp1 comp2 comp3 comp4 comp5
## 1.00000000 0.33857374 0.17315267 0.10551296 0.08185822
```

We choose the number that explains 100% variance in the outcome matrix `Y`

,
thus from the result, 1 component was enough to preserve the treatment
information.

We then use `PLSDA_batch()`

function with both treatment and batch grouping
information to estimate the optimal number of batch components to remove.

```
# estimate the number of batch components
ad.batch.tune <- PLSDA_batch(X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 10)
ad.batch.tune$explained_variance.bat #4
```

```
## $X
## comp1 comp2 comp3 comp4 comp5 comp6 comp7
## 0.17470922 0.11481264 0.10122717 0.07507395 0.03940245 0.03652759 0.02823771
## comp8 comp9 comp10
## 0.03431120 0.02100109 0.01193436
##
## $Y
## comp1 comp2 comp3 comp4 comp5 comp6 comp7
## 0.24746537 0.26157408 0.23013824 0.26082231 0.23015803 0.25859381 0.24530888
## comp8 comp9 comp10
## 0.26196649 0.00397279 0.23010220
```

`sum(ad.batch.tune$explained_variance.bat$Y[seq_len(4)])`

`## [1] 1`

Using the same criterion as choosing treatment components, we choose the number of batch components that explains 100% variance in the outcome matrix of batch. According to the result, 4 components were required to remove batch effects.

We then can correct for batch effects applying `PLSDA_batch()`

with treatment,
batch grouping information and corresponding optimal number of related
components.

```
ad.PLSDA_batch.res <- PLSDA_batch(X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 4)
ad.PLSDA_batch <- ad.PLSDA_batch.res$X.nobatch
```

We apply sPLSDA-batch using the same function `PLSDA_batch()`

, but we specify
the number of variables to select on each component (usually only
treatment-related components `keepX.trt`

). To determine the optimal number of
variables to select, we use `tune.splsda()`

function from
\(\color{orange}{\text{mixOmics}}\) package (**???**) with all
possible numbers of variables to select for each component (`test.keepX`

).

```
# estimate the number of variables to select per treatment component
set.seed(777)
ad.test.keepX = c(seq(1, 10, 1), seq(20, 100, 10),
seq(150, 231, 50), 231)
ad.trt.tune.v <- tune.splsda(X = ad.clr, Y = ad.trt,
ncomp = 1, test.keepX = ad.test.keepX,
validation = 'Mfold', folds = 4,
nrepeat = 50)
ad.trt.tune.v$choice.keepX #100
```

Here the optimal number of variables to select for the treatment component was
100. Since we have adjusted the amount of treatment variation to preserve, we
need to re-choose the optimal number of components related to batch effects
using the same criterion mentioned in section *PLSDA-batch*.

```
# estimate the number of batch components
ad.batch.tune <- PLSDA_batch(X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, keepX.trt = 100,
ncomp.bat = 10)
ad.batch.tune$explained_variance.bat #4
```

```
## $X
## comp1 comp2 comp3 comp4 comp5 comp6 comp7
## 0.17420018 0.11477097 0.09813477 0.07894965 0.03072455 0.03485135 0.04756088
## comp8 comp9 comp10
## 0.02480997 0.01694516 0.01730399
##
## $Y
## comp1 comp2 comp3 comp4 comp5 comp6
## 0.2474774921 0.2606715531 0.2301080926 0.2617428622 0.2606902783 0.2504437944
## comp7 comp8 comp9 comp10
## 0.2463579460 0.2419577742 0.0005502072 0.2615359394
```

`sum(ad.batch.tune$explained_variance.bat$Y[seq_len(4)])`

`## [1] 1`

According to the result, we needed 4 batch related components to remove batch
variance from the data with function `PLSDA_batch()`

.

```
ad.sPLSDA_batch.res <- PLSDA_batch(X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, keepX.trt = 100,
ncomp.bat = 4)
ad.sPLSDA_batch <- ad.sPLSDA_batch.res$X.nobatch
```

Note: for unbalanced batch x treatment design (with the exception of the
nested design), we can specify `balance = FALSE`

in `PLSDA_batch()`

function
to apply weighted PLSDA-batch.

We apply different visualisation and quantitative methods to assessing batch effect correction.

**PCA**

In the \(\color{blue}{\text{AD data}}\), we compared the PCA sample plots before and after batch effect correction.

```
ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch <- pca(ad.PLSDA_batch, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch <- pca(ad.sPLSDA_batch, ncomp = 3, scale = TRUE)
```

```
# order batches
ad.batch = factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date))
ad.pca.before.plot <- Scatter_Density(object = ad.pca.before,
batch = ad.batch,
trt = ad.trt,
title = 'Before correction')
```

```
ad.pca.PLSDA_batch.plot <- Scatter_Density(object = ad.pca.PLSDA_batch,
batch = ad.batch,
trt = ad.trt,
title = 'PLSDA-batch')
```

```
ad.pca.sPLSDA_batch.plot <- Scatter_Density(object = ad.pca.sPLSDA_batch,
batch = ad.batch,
trt = ad.trt,
title = 'sPLSDA-batch')
```