About This Document »

Package name ExpressionNormalizationWorkflow
Built with Bioconductor (R) 3.5 (3.4.0)
Last Built Mon, 10 Apr 2017 16:31:08 -0700
Last Modified Fri, 07 Oct 2016 14:22:16 -0700 (r122071)
Source Package ExpressionNormalizationWorkflow_1.2.0.tar.gz
Windows Binary ExpressionNormalizationWorkflow_1.2.0.zip
Mac OS X 10.10 (Yosemite) ExpressionNormalizationWorkflow_1.2.0.tgz
R Script ExpressionNormalizationWorkflow.R

To install this workflow under Bioconductor 3.4, start R and enter:
source("http://bioconductor.org/workflows.R")
workflowInstall("ExpressionNormalizationWorkflow")

Gene Expression Normalization Workflow

Karthikeyan Murugesan and Greg Gibson

Georgia Institute of Technology, Atlanta, USA
http://cig.gatech.edu

Contents

Introduction

Every Gene Expression study has an underlying question which the experimenter tries to address.Before proceeding towards any downstream analysis, the researcher has to decide how to normalize the gene expression data to minimize the impact of technical and other confounding factors on estimation of the environmental or biological factors of interest. One approach is to simply remove all of the major principal components of variation or specified technical effects, but this can also remove the biological signal of interest.Supervised normalization strategies encourage the user to define the major parameters influencing the expression variation, and then systematically remove them while shielding the biological factors of interest.
Here we outline a dynamic workflow that recognizes that there is not any ‘one algorithm fits all data sets’ analysis approach that works for every experiment.The workflow employs three successive procedures available in Bioconductor: Principal Variance Component Analysis (PVCA) to explore how technical and biological factors correlate with the major components of variance in the data set; Surrogate Variable Analysis (SVA) to identify major unwanted sources of variation; and Supervised Normalization of Microarrays (SNM) to efficiently remove these sources of variation while retaining the biological factor(s) of interest. The data is based on a study contrasting peripheral blood gene expression in acute myocardial infarction and coronary artery disease patients, described in Kim et al. (2014). Only a subset of the data is used, and the labels have been changed for privacy protection, but the full data set is accessible from the GEO data repository at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49925.

Package Content

The workflow package contains five functions:
1. expSetobj – function to create an Expression Set Object which encapsulates the expression values, covariate information, and the experimental metadata
2. pvcAnaly - Principal Variance Component Analysis functionality
3. surVarAnaly – function which identifies hidden or new surrogate variables that are potential covariates hidden in the data
4. conTocat – function to convert continuous variables to categorical variables (discretize)
5. snmAnaly – function which implements the Supervised Normalization of Microarrays (SNM) normalization technique

Sample data

The workflow runs on these sample two data files:
1. CAD_Expression.csv - file containing gene expression values (8000 probes for 100 samples)
2. CAD_Exptdsgn.csv - file containing the phenotype information (100 samples with 10 covariates)
The covariates are Study (a large batch effect since samples were run 2 years apart), Rin (RNA integrity number, a measure of RNA quality), BMI, Age, Gender, CAD (disease status as Acute MI or Fine at the time of sampling of patients in a CAD cohort), Rin3, BMI3, Age3 (discrete categorizations of the three continuous measures, used in the PVCA), and Array (a sample number) The concept of the analysis is to explore how adjusting for Study, Rin, and other covariates if desired, influences the interpretation of the impact of CAD on gene expression.That is, CAD is the biological variable of interest.

Workflow steps

A. Reading Input Data

  1. File containing the gene expression values (log transformed)
    Format - Features x Samples (where the features could be probes in the case of microarrays, or gene or exon names in the case of RNASeq data), preferably a comma separated file (.csv)
  2. File containing the covariates
    Covariates are the different phenotypes that describe sample attributes, and can be biological, technical, or environmental.In the workflow we will also identify hidden covariates
    Format - Samples x Covariates, preferably a comma separated file (.csv)

You should start by examining the sample data to get an idea of how the files should be structured.

## Bioconductor version 3.5 (BiocInstaller 1.25.3), ?biocLite for help

## BioC_mirror: https://bioconductor.org

## Using Bioconductor 3.5 (BiocInstaller 1.25.3), R 3.4.0 alpha (2017-04-04
##   r72482).

## installation path not writeable, unable to update packages: survival

## Old packages: 'MSnbase', 'scran'

## Warning: replacing previous import 'Biobase::anyMissing' by
## 'matrixStats::anyMissing' when loading 'ExpressionNormalizationWorkflow'

## Warning: replacing previous import 'Biobase::rowMedians' by
## 'matrixStats::rowMedians' when loading 'ExpressionNormalizationWorkflow'

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Set your working directory (or use Control + Shift + H to choose it)
setwd("working directory")
## read in the file containing the gene expression values
expData_file_name <- system.file("folder containing the data", "CAD_Expression.csv", package="ExpressionNormalizationWorkflow")
exprs <- read.table(expData_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)
## read in the file containing the covariates
expDesign_file_name <- system.file("folder containing the data", "CAD_ExptDsgn.csv", package="ExpressionNormalizationWorkflow")
covrts <- read.table(expDesign_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)

##                A001   A002   A003   A004   A005
## ILMN_2389211 14.091 13.892 13.747 14.056 14.243
## ILMN_1667796 13.872 13.903 14.273 14.276 14.125
## ILMN_1683271 13.419 13.665 13.180 13.487 14.252
## ILMN_2100437 14.237 13.638 14.160 14.004 13.945
## ILMN_2242491 13.692 13.830 13.667 13.756 14.015

##      Study Rin3 Gender   Age3 BMI3   CAD Array Rin Age  BMI
## A001     A HIGH    FEM  ADULT OVER ACUTE     1 9.2  56 31.0
## A002     A HIGH    FEM  ADULT OVER ACUTE     2 9.1  54 26.9
## A003     A HIGH    MAL  ADULT NORM ACUTE     3 9.1  51 20.7
## A004     A HIGH    MAL  ADULT NORM ACUTE     4 9.3  52 24.4
## A005     A HIGH    MAL MATURE OVER ACUTE     5 9.3  62 26.1

B. Creating an ExpressionSet Object

An ExpressionSet Class (Falcon, Morgan and Gentleman, 2006) is a Biobase data structure that is used to conveniently store experimental information and associated meta data, all in one place. This command creates an object of the the ExpressionSet Class, which stores the gene expression values and the covariate data.The function GEOquery (Davis, 2016) can be used to download ExpressionSet objects from directly from GEO (https://www.ncbi.nlm.nih.gov/gds)

inpData <- expSetobj(exprs, covrts)

C. Principal Variance Component Analysis of the raw data

PVCA (Bushel, 2013) estimates the proportion of the variance of the principal components of the gene expression data that can be attributed to each of the given covariates.The remaining fraction is assigned as “residual” variance.It efficiently combines principal component analysis (PCA) to reduce the feature space and variance component analysis (VCA) which fits a mixed linear model using factors of interest as random effects to estimate and partition the total variability. The variance explained is computed as a weighted average of the contributions of each factor to each PC, and you have the option of specifying how many principal components to include, by setting a threshold amount of the variance that needs to be explained by the identified PCs. Here PVCA is used to estimate the proportion of variance due to CAD as well as due to the covariates like Gender, BMI, Rin, and Study.Since BMI and Rin are continuous variables, we use a categorization of each into Obese, Overweight, Normal weight and High, Moderate, and Low quality RNA respectively, through the BMI3 and Rin3 columns.

## Set the covariates whose effect size on the data needs to be calculated
cvrts_eff_var <- c("CAD", "BMI3", "Rin3", "Gender", "Study")
## Set a PVCA Threshold Value between 0 & 1
## PVCA Threshold Value is the percentile value of the minimum amount of the variabilities that the selected principal components need to explain, here requiring 75% of the expression variance to be captured by the PCs
pct_thrsh <- 0.75 
## Perform the PVCA analysis
pvcAnaly(inpData, pct_thrsh, cvrts_eff_var)

## $dat
##            [,1]       [,2]      [,3]       [,4]       [,5]      [,6]
## [1,] 0.02675992 0.03947968 0.2685283 0.02055835 0.03676139 0.6079124
## 
## $label
## [1] "Rin3"   "BMI3"   "Study"  "Gender" "CAD"    "resid"

The analysis (also generated as a histogram) shows that the Study batch effect is a major technical artifact that explains ~27% of the expression variance.In comparison, CAD and BMI each only explain around 4% of the variance, RNA quality less than 3%, and Gender is a minor component.60% of the variance is residual.

D. Surrogate Variable Analysis

Surrogate variables (Leek and Storey, 2007) are covariates constructed directly from high-dimensional data (gene expression or RNA-Seq data) that can be used in subsequent analyses to adjust for unknown/unmodeled covariates or latent sources of noise. The user provides one or more biological variable(s) that they are interested in estimating, and then estimates the surrogate variables (sv) given the presence of the biological signal. These are then appended as new covariates to the existing list of covariates, and we also add a step to convert them to categorical variables so as to estimate their contributions to the total variance.For a thorough introduction to SVA, refer to the paper by Leek and Storey (2007) .SVA is available from Bioconductor at https://bioconductor.org/packages/release/bioc/html/sva.html, and a version specifically for RNASeq data (svaseq) is also available from Github at https://github.com/jtleek/svaseq.

## Choose a biological variable that is to be used to calculate the surrogate variables
biol_var_sva <- "CAD" 
## Perform SVA
sur_var_obj <- surVarAnaly(inpData, biol_var_sva)

## Number of significant surrogate variables is:  4 
## Iteration (out of 5 ):1  2  3  4  5

## The newly generated surrogate variables sv1 through sv4 are appended to the ExpressionSet object
inpData_sv <- sur_var_obj$expSetobject

E. Computing the correlation between the surrogate variables and the covariates

This step helps the researcher explore whether each newly identified surrogate variable is entirely independent of all the existing covariates or if they are correlated with either a technical factor or one of the biological covariates. If they are associated with a biological factor (especially the focus of the analysis, in this case CAD), then it would not be advisable to remove them during the normalization step, though they might be adjusted for.Otherwise, it will usually be advisable to remove them, or to remove the technical factor they capture.To see the correlations, we run a series of generalized linear models where the identified surrogate variables are modeled as a function of the existing covariates (Study, CAD, Rin, BMI; you could add Gender and Age if you wish – but they are not significant).

## Fit a generalized linear model for sv1
glm.sv1 <- glm(pData(inpData_sv)[,"sv1"]~pData(inpData_sv)[,"BMI"]+pData(inpData_sv)[,"Rin"]+pData(inpData_sv)[,"CAD"]
               +pData(inpData_sv)[,"Study"]) 
summary(glm.sv1)

## 
## Call:
## glm(formula = pData(inpData_sv)[, "sv1"] ~ pData(inpData_sv)[, 
##     "BMI"] + pData(inpData_sv)[, "Rin"] + pData(inpData_sv)[, 
##     "CAD"] + pData(inpData_sv)[, "Study"])
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.127062  -0.040490  -0.009979   0.030200   0.139604  
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     0.1832540  0.0711053   2.577   0.0115 *  
## pData(inpData_sv)[, "BMI"]      0.0009422  0.0009208   1.023   0.3088    
## pData(inpData_sv)[, "Rin"]     -0.0159186  0.0074152  -2.147   0.0344 *  
## pData(inpData_sv)[, "CAD"]FINE  0.0095449  0.0119928   0.796   0.4281    
## pData(inpData_sv)[, "Study"]B  -0.1699332  0.0129778 -13.094   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.00357263)
## 
##     Null deviance: 1.0000  on 99  degrees of freedom
## Residual deviance: 0.3394  on 95  degrees of freedom
## AIC: -272.79
## 
## Number of Fisher Scoring iterations: 2

The analysis shows that the Study batch effect is very strongly associated with sv1.You can also see this by plotting the two measures from the pData(inpData_sv) file. Subsequent analyses of sv2 through sv4 with similar code shows that BMI is weakly associated with sv2, but strongly with sv3 (as is Study), while Rin is strongly associate with sv4.CAD is not associated with any of the surrogate variables, but Age contributes slightly to sv2 and sv3.

## Fit a generalized linear model for sv2
glm.sv2 <- glm(pData(inpData_sv)[,"sv2"]~pData(inpData_sv)[,"BMI"]+pData(inpData_sv)[,"Rin"]+pData(inpData_sv)[,"CAD"]
               +pData(inpData_sv)[,"Study"]) 
summary(glm.sv2)

## 
## Call:
## glm(formula = pData(inpData_sv)[, "sv2"] ~ pData(inpData_sv)[, 
##     "BMI"] + pData(inpData_sv)[, "Rin"] + pData(inpData_sv)[, 
##     "CAD"] + pData(inpData_sv)[, "Study"])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21940  -0.04787  -0.00900   0.04228   0.33138  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                    -0.261993   0.115073  -2.277   0.0250 *
## pData(inpData_sv)[, "BMI"]      0.003209   0.001490   2.154   0.0338 *
## pData(inpData_sv)[, "Rin"]      0.019394   0.012000   1.616   0.1094  
## pData(inpData_sv)[, "CAD"]FINE  0.030581   0.019409   1.576   0.1184  
## pData(inpData_sv)[, "Study"]B  -0.016918   0.021003  -0.806   0.4225  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.009356872)
## 
##     Null deviance: 1.0000  on 99  degrees of freedom
## Residual deviance: 0.8889  on 95  degrees of freedom
## AIC: -176.51
## 
## Number of Fisher Scoring iterations: 2

## Output should be similar to that shown above for sv1

F. Principal Variance Component Analysis of the raw data with the surrogate variables included as covariates

The following PVCA step is performed to estimate the contributions of the newly identified surrogate variables to the overall expression variance, with or without the Study covariate in the model.

## First discretize the continuous surrogate variables 
var_names <- c("sv1", "sv2", "sv3", "sv4") 
pData(inpData_sv)<-conTocat(pData(inpData_sv), var_names) 
## View them appended to the covariate matrix as additional covariate columns
View(pData(inpData_sv))
## Include the surrogate variables as covariates in addition to BMI3, Rin3, CAD and Study (be sure to use categorical measures of BMI and Rin rather than the continuous measure)
cvrts_eff_var <- c("BMI3", "Rin3", "CAD", "Study", "sv1_cat", "sv2_cat", "sv3_cat", "sv4_cat")
## Again set the PVCA Threshold to explain 75% of the expression variation
pct_thrsh <- 0.75 
## Perform PVCA
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var) 

## $dat
##           [,1]       [,2]     [,3]      [,4]       [,5]        [,6]
## [1,] 0.0812861 0.05778324 0.144267 0.1313617 0.00861124 0.006383906
##           [,7]       [,8]      [,9]
## [1,] 0.1957262 0.01316701 0.3614136
## 
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3"    "BMI3"    "Study"  
## [8] "CAD"     "resid"

The analysis shows that each of the surrogate variables capture a large amount of the variance, while the Study batch effect remains important.The residual variance has reduced by almost half to 36%.For comparison, remove Study to see how much of it the surrogate variables capture

cvrts_eff_var <- c("BMI3", "Rin3", "CAD","sv1_cat", "sv2_cat", "sv3_cat", "sv4_cat")
## Again set the PVCA Threshold to explain 75% of the expression variation
pct_thrsh <- 0.75 
## Perform PVCA
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var)

## $dat
##           [,1]     [,2]      [,3]      [,4]        [,5]        [,6]
## [1,] 0.0806385 0.075551 0.1537442 0.2760277 0.007578122 0.007032033
##            [,7]      [,8]
## [1,] 0.01256834 0.3868601
## 
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3"    "BMI3"    "CAD"    
## [8] "resid"

This implies that together the 4 SVs capture as much variance as Study alone.Both analyses have slightly reduced the CAD contribution relative to the analysis without surrogate variables.

G. Supervised normalization of Microarrays

SNM (Mecham, Nelson and Storey, 2010) is a study specific, customizable normalization approach that adjusts for specified biological and technical variables as far as possible without biasing the estimate of the biological source of interest. Much like SVA, the user chooses the biological variable(s), and then fits desired covariates, while also allowing for “intensity dependent” effects (here, just array/sample differences) to be adjusted for. The SNM software at http://www.bioconductor.org/packages/release/bioc/html/snm.html actually allows you to either remove the adjustment variables using the rm=TRUE option, or simply to fit them.You can also run both sequentially with different covariates (for example, removing a batch effect but adjusting for gender). SNM fits both categorical and continuous variables.In the workflow below we remove the Study or the surrogate variable effects, running 5 iterations.

## Choose the biological variableof interest
bv <- c("CAD")
## Choose your adjustment variable of interest, starting with just 'Study'
av <- c("Study") 
## The intensity-dependent adjustment variables adjust for array effects 
iv <- c("Array") 
## Run SNM
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv) 

## 
 Iteration:  1

## 
 Iteration:  2

## 
 Iteration:  3

## 
 Iteration:  4

## 
 Iteration:  5

After 5 iterations, the π0 estimate suggests that almost 63% of the probes are true negatives for the CAD effect: in other words, that there is evidence that just over one third of the probes are differentially expressed with acute MI. This is seen as an enrichment of transcripts with small p-values. At each iteration, the π0 estimates should have been 0.92, 0.62, 0.63, 0.63 , 0.63 implying convergence. The output .csv file should have been saved to your working directory as “snm_normalized_data.csv”.

## Now choose the adjustment variables to be all four SVs plus Rin
av <- c("Rin", "sv1", "sv2", "sv3", "sv4")
## Run SNM
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv) 

This time the π0 estimates are 0.76, 0.75, 0.76, 0.76 and 0.75.However, the intensity-dependent effects are reduced considerably.A third possibility is to remove Study, Rin and sv2 (which is independent).This time the π0 estimates are 0.59, 0.54, 0.60, 0.63 and 0.64, and the intensity-dependent effects are more similar to the first model.The SNM ietrations are shown below-

av <- c("Rin", "sv2", "Study")
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)

## 
 Iteration:  1

## 
 Iteration:  2

## 
 Iteration:  3

## 
 Iteration:  4

## 
 Iteration:  5

## Create an expressionSet object of the normalized dataset(s)
sv_snmNorm_data <- sv_snmObj$norm.dat
colnames(sv_snmNorm_data) <- colnames(exprs)
sv_snm_data <- expSetobj(sv_snmNorm_data, pData(inpData_sv))

## Write this dateset to a table with rows as genes and columns as samples (with names the same as that from the input file)
write.table(sv_snmNorm_data, file="CAD_SNM.csv", sep=",")

H. Principal Variance Component Analysis on the normalized data

This post SNM PVCA asks how the process of SNM has influenced the estimated contributions of the various covariates. SNM brings down their effect to a minimal possible value while trying to preserve the variation due to the biological signals.

## Specify the covariates whose effect size is to be calculated
cvrts_eff_var <- c("BMI3", "Rin3", "CAD", "sv1_cat", "sv2_cat","sv3_cat", "sv4_cat")
## If needed, keep the same PC Threshold Value
pct_thrsh <- 0.75 
## Perform PVCA
pvcAnaly(sv_snm_data, pct_thrsh, cvrts_eff_var)

## $dat
##           [,1]      [,2]       [,3]       [,4]        [,5]       [,6]
## [1,] 0.0798637 0.1092931 0.01151301 0.06695379 0.009375329 0.01174651
##            [,7]      [,8]
## [1,] 0.02221493 0.6890396
## 
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3"    "BMI3"    "CAD"    
## [8] "resid"

Fitting Study, sv2, and RIN has not completely removed those effects, but they are much reduced, and the residual unexplained variance is back up to 69%.Most of it is probably among individual differences.The CAD effect remains low, at 2% of the variance, but is slightly increased on the raw data, whereas BMI has dropped to 1%.The sv3 and sv4 effects have not been removed.

Further exploration of different normalization approaches should help you decide whether the data is now ap for downstream analysis.

Discussion

The workflow presents a general strategy for exploring how technical and biological factors influence the inference of gene expression differences between categories of interest.The PVCA routine identifies how much of the variance is explained by given covariates; the SVA routine finds latent variables not influenced by the biological factor; and the SNM normalizes the data set by removing known or unknown sources of variance – in this case including a large batch effect. Readers are encouraged to compare the distributions of overall transcript abundance before and after normalization to gain a sense of the impact of the procedures on the data set. Although the impact on detection of genes influences by acute MI was minimal in the sub-set of the Kim et al. (2014) data, application to the full data set has a larger influence.Similar strategies can be applied to RNASeq data, with the complication that low expression values have higher variance and are generally dealt with differently.

References

Bushel, P.R. (2013) Principal Variance Component Analysis (PVCA). Bioconductor.

Davis, S. (2016) GEOquery. Bioconductor.

Falcon, S., Morgan, M. and Gentleman, R. (2006) An Introduction to Bioconductor’s ExpressionSet Class. Bioconductor.

Kim, J., Ghasemzadeh, N., Eapen, D.J., Chung, N.C., Storey, J.D., Quyyumi, A.A., et al. (2014) Gene expression profiles associated with acute myocardial infarction and risk of cardiovascular death. Genome Medicine, 6.

Leek, J.T. and Storey, J.D. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLOS Genetics, 3.

Mecham, B.H., Nelson, P.S. and Storey, J.D. (2010) Supervised normalization of microarrays. Bioinformatics, 26, 1308–1315.