About This Document »

Package name | ExpressionNormalizationWorkflow |

Built with Bioconductor (R) | 3.5 (3.4.0) |

Last Built | Mon, 10 Apr 2017 16:31:08 -0700 |

Last Modified | Fri, 07 Oct 2016 14:22:16 -0700 (r122071) |

Source Package | ExpressionNormalizationWorkflow_1.2.0.tar.gz |

Windows Binary | ExpressionNormalizationWorkflow_1.2.0.zip |

Mac OS X 10.10 (Yosemite) | ExpressionNormalizationWorkflow_1.2.0.tgz |

R Script | ExpressionNormalizationWorkflow.R |

To install this workflow under Bioconductor 3.4, start R and enter:

source("http://bioconductor.org/workflows.R") workflowInstall("ExpressionNormalizationWorkflow")

- Introduction
- Package Content
- Workflow steps
- A. Reading Input Data
- B. Creating an ExpressionSet Object
- C. Principal Variance Component Analysis of the raw data
- D. Surrogate Variable Analysis
- E. Computing the correlation between the surrogate variables and the covariates
- F. Principal Variance Component Analysis of the raw data with the surrogate variables included as covariates
- G. Supervised normalization of Microarrays
- H. Principal Variance Component Analysis on the normalized data

- Discussion
- References

Every Gene Expression study has an underlying question which the
experimenter tries to address.Before proceeding towards any downstream
analysis, the researcher has to decide how to normalize the gene
expression data to minimize the impact of technical and other
confounding factors on estimation of the environmental or biological
factors of interest. One approach is to simply remove all of the major
principal components of variation or specified technical effects, but
this can also remove the biological signal of interest.Supervised
normalization strategies encourage the user to define the major
parameters influencing the expression variation, and then systematically
remove them while shielding the biological factors of interest.

Here we outline a dynamic workflow that recognizes that there is not any
‘one algorithm fits all data sets’ analysis approach that works for
every experiment.The workflow employs three successive procedures
available in Bioconductor: Principal Variance Component Analysis (PVCA)
to explore how technical and biological factors correlate with the major
components of variance in the data set; Surrogate Variable Analysis
(SVA) to identify major unwanted sources of variation; and Supervised
Normalization of Microarrays (SNM) to efficiently remove these sources
of variation while retaining the biological factor(s) of interest. The
data is based on a study contrasting peripheral blood gene expression in
acute myocardial infarction and coronary artery disease patients,
described in Kim et al. (2014). Only a subset of the data is used, and
the labels have been changed for privacy protection, but the full data
set is accessible from the GEO data repository at
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49925.

The workflow package contains five functions:

1. expSetobj – function to create an Expression Set Object which
encapsulates the expression values, covariate information, and the
experimental metadata

2. pvcAnaly - Principal Variance Component Analysis functionality

3. surVarAnaly – function which identifies hidden or new surrogate
variables that are potential covariates hidden in the data

4. conTocat – function to convert continuous variables to categorical
variables (discretize)

5. snmAnaly – function which implements the Supervised Normalization of
Microarrays (SNM) normalization technique

The workflow runs on these sample two data files:

1. CAD_Expression.csv - file containing gene expression values (8000
probes for 100 samples)

2. CAD_Exptdsgn.csv - file containing the phenotype information (100
samples with 10 covariates)

The covariates are Study (a large batch effect since samples were run 2
years apart), Rin (RNA integrity number, a measure of RNA quality), BMI,
Age, Gender, CAD (disease status as Acute MI or Fine at the time of
sampling of patients in a CAD cohort), Rin3, BMI3, Age3 (discrete
categorizations of the three continuous measures, used in the PVCA), and
Array (a sample number) The concept of the analysis is to explore how
adjusting for Study, Rin, and other covariates if desired, influences
the interpretation of the impact of CAD on gene expression.That is, CAD
is the biological variable of interest.

- File containing the gene expression values (log transformed)

Format - Features x Samples (where the features could be probes in the case of microarrays, or gene or exon names in the case of RNASeq data), preferably a comma separated file (.csv) - File containing the covariates

Covariates are the different phenotypes that describe sample attributes, and can be biological, technical, or environmental.In the workflow we will also identify hidden covariates

Format - Samples x Covariates, preferably a comma separated file (.csv)

You should start by examining the sample data to get an idea of how the files should be structured.

```
## Bioconductor version 3.5 (BiocInstaller 1.25.3), ?biocLite for help
## BioC_mirror: https://bioconductor.org
## Using Bioconductor 3.5 (BiocInstaller 1.25.3), R 3.4.0 alpha (2017-04-04
## r72482).
## installation path not writeable, unable to update packages: survival
## Old packages: 'MSnbase', 'scran'
## Warning: replacing previous import 'Biobase::anyMissing' by
## 'matrixStats::anyMissing' when loading 'ExpressionNormalizationWorkflow'
## Warning: replacing previous import 'Biobase::rowMedians' by
## 'matrixStats::rowMedians' when loading 'ExpressionNormalizationWorkflow'
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colMeans,
## colnames, colSums, do.call, duplicated, eval, evalq, Filter,
## Find, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, Map, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Set your working directory (or use Control + Shift + H to choose it)
setwd("working directory")
## read in the file containing the gene expression values
expData_file_name <- system.file("folder containing the data", "CAD_Expression.csv", package="ExpressionNormalizationWorkflow")
exprs <- read.table(expData_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)
## read in the file containing the covariates
expDesign_file_name <- system.file("folder containing the data", "CAD_ExptDsgn.csv", package="ExpressionNormalizationWorkflow")
covrts <- read.table(expDesign_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)
## A001 A002 A003 A004 A005
## ILMN_2389211 14.091 13.892 13.747 14.056 14.243
## ILMN_1667796 13.872 13.903 14.273 14.276 14.125
## ILMN_1683271 13.419 13.665 13.180 13.487 14.252
## ILMN_2100437 14.237 13.638 14.160 14.004 13.945
## ILMN_2242491 13.692 13.830 13.667 13.756 14.015
## Study Rin3 Gender Age3 BMI3 CAD Array Rin Age BMI
## A001 A HIGH FEM ADULT OVER ACUTE 1 9.2 56 31.0
## A002 A HIGH FEM ADULT OVER ACUTE 2 9.1 54 26.9
## A003 A HIGH MAL ADULT NORM ACUTE 3 9.1 51 20.7
## A004 A HIGH MAL ADULT NORM ACUTE 4 9.3 52 24.4
## A005 A HIGH MAL MATURE OVER ACUTE 5 9.3 62 26.1
```

An ExpressionSet Class (Falcon, Morgan and Gentleman, 2006) is a Biobase data structure that is used to conveniently store experimental information and associated meta data, all in one place. This command creates an object of the the ExpressionSet Class, which stores the gene expression values and the covariate data.The function GEOquery (Davis, 2016) can be used to download ExpressionSet objects from directly from GEO (https://www.ncbi.nlm.nih.gov/gds)

```
inpData <- expSetobj(exprs, covrts)
```

PVCA (Bushel, 2013) estimates the proportion of the variance of the principal components of the gene expression data that can be attributed to each of the given covariates.The remaining fraction is assigned as “residual” variance.It efficiently combines principal component analysis (PCA) to reduce the feature space and variance component analysis (VCA) which fits a mixed linear model using factors of interest as random effects to estimate and partition the total variability. The variance explained is computed as a weighted average of the contributions of each factor to each PC, and you have the option of specifying how many principal components to include, by setting a threshold amount of the variance that needs to be explained by the identified PCs. Here PVCA is used to estimate the proportion of variance due to CAD as well as due to the covariates like Gender, BMI, Rin, and Study.Since BMI and Rin are continuous variables, we use a categorization of each into Obese, Overweight, Normal weight and High, Moderate, and Low quality RNA respectively, through the BMI3 and Rin3 columns.

```
## Set the covariates whose effect size on the data needs to be calculated
cvrts_eff_var <- c("CAD", "BMI3", "Rin3", "Gender", "Study")
## Set a PVCA Threshold Value between 0 & 1
## PVCA Threshold Value is the percentile value of the minimum amount of the variabilities that the selected principal components need to explain, here requiring 75% of the expression variance to be captured by the PCs
pct_thrsh <- 0.75
## Perform the PVCA analysis
pvcAnaly(inpData, pct_thrsh, cvrts_eff_var)
```

```
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.02675992 0.03947968 0.2685283 0.02055835 0.03676139 0.6079124
##
## $label
## [1] "Rin3" "BMI3" "Study" "Gender" "CAD" "resid"
```

The analysis (also generated as a histogram) shows that the Study batch effect is a major technical artifact that explains ~27% of the expression variance.In comparison, CAD and BMI each only explain around 4% of the variance, RNA quality less than 3%, and Gender is a minor component.60% of the variance is residual.

Surrogate variables (Leek and Storey, 2007) are covariates constructed directly from high-dimensional data (gene expression or RNA-Seq data) that can be used in subsequent analyses to adjust for unknown/unmodeled covariates or latent sources of noise. The user provides one or more biological variable(s) that they are interested in estimating, and then estimates the surrogate variables (sv) given the presence of the biological signal. These are then appended as new covariates to the existing list of covariates, and we also add a step to convert them to categorical variables so as to estimate their contributions to the total variance.For a thorough introduction to SVA, refer to the paper by Leek and Storey (2007) .SVA is available from Bioconductor at https://bioconductor.org/packages/release/bioc/html/sva.html, and a version specifically for RNASeq data (svaseq) is also available from Github at https://github.com/jtleek/svaseq.

```
## Choose a biological variable that is to be used to calculate the surrogate variables
biol_var_sva <- "CAD"
## Perform SVA
sur_var_obj <- surVarAnaly(inpData, biol_var_sva)
## Number of significant surrogate variables is: 4
## Iteration (out of 5 ):1 2 3 4 5
## The newly generated surrogate variables sv1 through sv4 are appended to the ExpressionSet object
inpData_sv <- sur_var_obj$expSetobject
```

This step helps the researcher explore whether each newly identified surrogate variable is entirely independent of all the existing covariates or if they are correlated with either a technical factor or one of the biological covariates. If they are associated with a biological factor (especially the focus of the analysis, in this case CAD), then it would not be advisable to remove them during the normalization step, though they might be adjusted for.Otherwise, it will usually be advisable to remove them, or to remove the technical factor they capture.To see the correlations, we run a series of generalized linear models where the identified surrogate variables are modeled as a function of the existing covariates (Study, CAD, Rin, BMI; you could add Gender and Age if you wish – but they are not significant).

```
## Fit a generalized linear model for sv1
glm.sv1 <- glm(pData(inpData_sv)[,"sv1"]~pData(inpData_sv)[,"BMI"]+pData(inpData_sv)[,"Rin"]+pData(inpData_sv)[,"CAD"]
+pData(inpData_sv)[,"Study"])
summary(glm.sv1)
##
## Call:
## glm(formula = pData(inpData_sv)[, "sv1"] ~ pData(inpData_sv)[,
## "BMI"] + pData(inpData_sv)[, "Rin"] + pData(inpData_sv)[,
## "CAD"] + pData(inpData_sv)[, "Study"])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.127062 -0.040490 -0.009979 0.030200 0.139604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1832540 0.0711053 2.577 0.0115 *
## pData(inpData_sv)[, "BMI"] 0.0009422 0.0009208 1.023 0.3088
## pData(inpData_sv)[, "Rin"] -0.0159186 0.0074152 -2.147 0.0344 *
## pData(inpData_sv)[, "CAD"]FINE 0.0095449 0.0119928 0.796 0.4281
## pData(inpData_sv)[, "Study"]B -0.1699332 0.0129778 -13.094 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.00357263)
##
## Null deviance: 1.0000 on 99 degrees of freedom
## Residual deviance: 0.3394 on 95 degrees of freedom
## AIC: -272.79
##
## Number of Fisher Scoring iterations: 2
```

The analysis shows that the Study batch effect is very strongly associated with sv1.You can also see this by plotting the two measures from the pData(inpData_sv) file. Subsequent analyses of sv2 through sv4 with similar code shows that BMI is weakly associated with sv2, but strongly with sv3 (as is Study), while Rin is strongly associate with sv4.CAD is not associated with any of the surrogate variables, but Age contributes slightly to sv2 and sv3.

```
## Fit a generalized linear model for sv2
glm.sv2 <- glm(pData(inpData_sv)[,"sv2"]~pData(inpData_sv)[,"BMI"]+pData(inpData_sv)[,"Rin"]+pData(inpData_sv)[,"CAD"]
+pData(inpData_sv)[,"Study"])
summary(glm.sv2)
##
## Call:
## glm(formula = pData(inpData_sv)[, "sv2"] ~ pData(inpData_sv)[,
## "BMI"] + pData(inpData_sv)[, "Rin"] + pData(inpData_sv)[,
## "CAD"] + pData(inpData_sv)[, "Study"])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.21940 -0.04787 -0.00900 0.04228 0.33138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.261993 0.115073 -2.277 0.0250 *
## pData(inpData_sv)[, "BMI"] 0.003209 0.001490 2.154 0.0338 *
## pData(inpData_sv)[, "Rin"] 0.019394 0.012000 1.616 0.1094
## pData(inpData_sv)[, "CAD"]FINE 0.030581 0.019409 1.576 0.1184
## pData(inpData_sv)[, "Study"]B -0.016918 0.021003 -0.806 0.4225
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.009356872)
##
## Null deviance: 1.0000 on 99 degrees of freedom
## Residual deviance: 0.8889 on 95 degrees of freedom
## AIC: -176.51
##
## Number of Fisher Scoring iterations: 2
## Output should be similar to that shown above for sv1
```

The following PVCA step is performed to estimate the contributions of the newly identified surrogate variables to the overall expression variance, with or without the Study covariate in the model.

```
## First discretize the continuous surrogate variables
var_names <- c("sv1", "sv2", "sv3", "sv4")
pData(inpData_sv)<-conTocat(pData(inpData_sv), var_names)
## View them appended to the covariate matrix as additional covariate columns
View(pData(inpData_sv))
## Include the surrogate variables as covariates in addition to BMI3, Rin3, CAD and Study (be sure to use categorical measures of BMI and Rin rather than the continuous measure)
cvrts_eff_var <- c("BMI3", "Rin3", "CAD", "Study", "sv1_cat", "sv2_cat", "sv3_cat", "sv4_cat")
## Again set the PVCA Threshold to explain 75% of the expression variation
pct_thrsh <- 0.75
## Perform PVCA
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var)
```

```
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0812861 0.05778324 0.144267 0.1313617 0.00861124 0.006383906
## [,7] [,8] [,9]
## [1,] 0.1957262 0.01316701 0.3614136
##
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3" "BMI3" "Study"
## [8] "CAD" "resid"
```

The analysis shows that each of the surrogate variables capture a large amount of the variance, while the Study batch effect remains important.The residual variance has reduced by almost half to 36%.For comparison, remove Study to see how much of it the surrogate variables capture

```
cvrts_eff_var <- c("BMI3", "Rin3", "CAD","sv1_cat", "sv2_cat", "sv3_cat", "sv4_cat")
## Again set the PVCA Threshold to explain 75% of the expression variation
pct_thrsh <- 0.75
## Perform PVCA
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var)
```

```
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0806385 0.075551 0.1537442 0.2760277 0.007578122 0.007032033
## [,7] [,8]
## [1,] 0.01256834 0.3868601
##
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3" "BMI3" "CAD"
## [8] "resid"
```

This implies that together the 4 SVs capture as much variance as Study alone.Both analyses have slightly reduced the CAD contribution relative to the analysis without surrogate variables.

SNM (Mecham, Nelson and Storey, 2010) is a study specific, customizable normalization approach that adjusts for specified biological and technical variables as far as possible without biasing the estimate of the biological source of interest. Much like SVA, the user chooses the biological variable(s), and then fits desired covariates, while also allowing for “intensity dependent” effects (here, just array/sample differences) to be adjusted for. The SNM software at http://www.bioconductor.org/packages/release/bioc/html/snm.html actually allows you to either remove the adjustment variables using the rm=TRUE option, or simply to fit them.You can also run both sequentially with different covariates (for example, removing a batch effect but adjusting for gender). SNM fits both categorical and continuous variables.In the workflow below we remove the Study or the surrogate variable effects, running 5 iterations.

```
## Choose the biological variableof interest
bv <- c("CAD")
## Choose your adjustment variable of interest, starting with just 'Study'
av <- c("Study")
## The intensity-dependent adjustment variables adjust for array effects
iv <- c("Array")
## Run SNM
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)
##
Iteration: 1
```

```
##
Iteration: 2
```

```
##
Iteration: 3
```

```
##
Iteration: 4
```

```
##
Iteration: 5
```

After 5 iterations, the π_{0} estimate suggests that almost 63%
of the probes are true negatives for the CAD effect: in other words,
that there is evidence that just over one third of the probes are
differentially expressed with acute MI. This is seen as an enrichment of
transcripts with small p-values. At each iteration, the π_{0}
estimates should have been 0.92, 0.62, 0.63, 0.63 , 0.63 implying
convergence. The output .csv file should have been saved to your working
directory as “snm_normalized_data.csv”.

```
## Now choose the adjustment variables to be all four SVs plus Rin
av <- c("Rin", "sv1", "sv2", "sv3", "sv4")
## Run SNM
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)
```

This time the π_{0} estimates are 0.76, 0.75, 0.76, 0.76 and
0.75.However, the intensity-dependent effects are reduced considerably.A
third possibility is to remove Study, Rin and sv2 (which is
independent).This time the π_{0} estimates are 0.59, 0.54, 0.60,
0.63 and 0.64, and the intensity-dependent effects are more similar to
the first model.The SNM ietrations are shown below-

```
av <- c("Rin", "sv2", "Study")
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)
##
Iteration: 1
```

```
##
Iteration: 2
```

```
##
Iteration: 3
```

```
##
Iteration: 4
```

```
##
Iteration: 5
```

```
## Create an expressionSet object of the normalized dataset(s)
sv_snmNorm_data <- sv_snmObj$norm.dat
colnames(sv_snmNorm_data) <- colnames(exprs)
sv_snm_data <- expSetobj(sv_snmNorm_data, pData(inpData_sv))
## Write this dateset to a table with rows as genes and columns as samples (with names the same as that from the input file)
write.table(sv_snmNorm_data, file="CAD_SNM.csv", sep=",")
```

This post SNM PVCA asks how the process of SNM has influenced the estimated contributions of the various covariates. SNM brings down their effect to a minimal possible value while trying to preserve the variation due to the biological signals.

```
## Specify the covariates whose effect size is to be calculated
cvrts_eff_var <- c("BMI3", "Rin3", "CAD", "sv1_cat", "sv2_cat","sv3_cat", "sv4_cat")
## If needed, keep the same PC Threshold Value
pct_thrsh <- 0.75
## Perform PVCA
pvcAnaly(sv_snm_data, pct_thrsh, cvrts_eff_var)
```

```
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0798637 0.1092931 0.01151301 0.06695379 0.009375329 0.01174651
## [,7] [,8]
## [1,] 0.02221493 0.6890396
##
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3" "BMI3" "CAD"
## [8] "resid"
```

Fitting Study, sv2, and RIN has not completely removed those effects, but they are much reduced, and the residual unexplained variance is back up to 69%.Most of it is probably among individual differences.The CAD effect remains low, at 2% of the variance, but is slightly increased on the raw data, whereas BMI has dropped to 1%.The sv3 and sv4 effects have not been removed.

**Further exploration of different normalization approaches should help
you decide whether the data is now ap for downstream analysis.**

The workflow presents a general strategy for exploring how technical and biological factors influence the inference of gene expression differences between categories of interest.The PVCA routine identifies how much of the variance is explained by given covariates; the SVA routine finds latent variables not influenced by the biological factor; and the SNM normalizes the data set by removing known or unknown sources of variance – in this case including a large batch effect. Readers are encouraged to compare the distributions of overall transcript abundance before and after normalization to gain a sense of the impact of the procedures on the data set. Although the impact on detection of genes influences by acute MI was minimal in the sub-set of the Kim et al. (2014) data, application to the full data set has a larger influence.Similar strategies can be applied to RNASeq data, with the complication that low expression values have higher variance and are generally dealt with differently.

Bushel, P.R. (2013) *Principal Variance Component Analysis (PVCA)*.
Bioconductor.

Davis, S. (2016) *GEOquery*. Bioconductor.

Falcon, S., Morgan, M. and Gentleman, R. (2006) *An Introduction to
Bioconductor’s ExpressionSet Class*. Bioconductor.

Kim, J., Ghasemzadeh, N., Eapen, D.J., Chung, N.C., Storey, J.D.,
Quyyumi, A.A., et al. (2014) Gene expression profiles associated with
acute myocardial infarction and risk of cardiovascular death. *Genome
Medicine*, **6**.

Leek, J.T. and Storey, J.D. (2007) Capturing heterogeneity in gene
expression studies by surrogate variable analysis. *PLOS Genetics*,
**3**.

Mecham, B.H., Nelson, P.S. and Storey, J.D. (2010) Supervised
normalization of microarrays. *Bioinformatics*, **26**, 1308–1315.

Common Bioconductor workflows include:

- Introduction to Bioconductor for Sequence Data
- Oligonucleotide Arrays
- Counting Reads for Differential Expression (parathyroideSE vignette)
- Annotation Resources
- Annotating Genomic Ranges
- Annotating Genomic Variants
- Flow Cytometry and other assays
- Candidate Binding Sites for Known Transcription Factors
- Cloud-enabled cis-eQTL search and annotation
- RNA-Seq workflow: gene-level exploratory analysis and differential expression
- Changing genomic coordinate systems with rtracklayer::liftOver
- Mass spectrometry and proteomics data analysis
- Differential Binding from ChIP-seq data