Gordon Smyth
6 August 2007
This case study illustrates more advanced linear modeling with Affymetrix single-channel microarrays. The popular 2x2 factorial design is considered. Use of Bioconductor annotation for Affymetrix arrays is illustrated. The case study goes on to consider significance tests using gene sets.
The estrogen data set is required for this lab. Set your R working directory to be bioc2007limma/data/estrogen. By typing dir() you should see eight CEL files and three text files. On my computer, I see:
> setwd("estrogen") > dir() [1] "ERgenes.txt" "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" [7] "low10-2.cel" "low48-1.cel" "low48-2.cel" "targets.txt"
To repeat this case study in full you will need to have the R packages affy, hgu95av2cdf and hgu95av2 installed.
The data gives results from a 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays. The factors in this experiment were estrogen (present or absent) and length of exposure (10 or 48 hours). The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders. Genes which respond early are putative direct-target genes while those which respond late are probably downstream targets in the molecular pathway.
This data is from the estrogen data package on the Bioconductor website http://www.bioconductor.org/data/experimental.html. Rather than loading the data package here we simply using the nine basic data files from that package, in order to save storage space and to more closely mimic a real data analysis situation. The data set is discussed further by Scholtens [1,2] and in the Limma User's Guide.
The first step in most analyses is to read the targets file which describes what RNA target has been hybridized to each array and, equally importantly, gives the names of the corresponding data files.
> library(limma) > targets <- readTargets() > targets
Now read the CEL file data into an AffyBatch object and normalize using RMA:
> library(affy) > library(hgu95av2cdf) > abatch <- ReadAffy(filenames=targets$filename) > eset <- rma(abatch)
Here eset is a data object of class exprSet.
It is usual and appropriate to check data quality before continuing your analysis. Due to brevity we will skip over this in this lab. A full set of quality assessment plots can be found at http://www.stat.berkeley.edu/~bolstad/PLMImageGallery/ under the title "estrogen". These plots show no significant quality problems with any arrays in this dataset.
We have four pairs of replicate arrays so we should estimate four parameters
in the linear model. There are many valid ways to choose a design matrix, but
perhaps the simplest is to make each column correspond to a particular treatment
combination. In this formulation, the four columns of the matrix correspond to absent10
,
present10
, absent48
and present48
,
respectively. This design matrix given above can be computed in R as follows:
> f <- paste(targets$estrogen,targets$time.h,sep="") > f <- factor(f) > f > design <- model.matrix(~0+f) > colnames(design) <- levels(f) > design
Now that we have defined our design matrix, fitting a linear model is as simple as:
> fit <- lmFit(eset, design)
fit
is an object of class MArrayLM
.
> names(fit)
The fitted coefficents fit$coef from the model fit are just the mean
log-expression for each treatment combination for each probe set. For this
reason, this choice of design matrix is called the group means
parametrization in the Limma User's Guide.
The idea now is to use contrasts to make any comparisons of interest between
the four treatment combination. Contrasts are linear combinations of the
coefficients from the linear model fit. We will estimate three contrasts (so our contrasts matrix will have three
columns). The first contrast is an estrogen effect at time 10 hours,
the second as an estrogen effect at time 48 hours and the third is the time effect in
the absence of estrogen. Many other contrasts could be made.
> cont.matrix <- makeContrasts(E10="present10-absent10",E48="present48-absent48",Time="absent48-absent10",levels=design) > cont.matrix
> fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2)
We now use the function topTable
to obtain a list genes differentially
expressed between Estrogen-Present and Estrogen-Absent at time 10 hours, followed by a list of
genes differentially expressed between Estrogen-Present and Estrogen-Absent at time 48 hours.
> colnames(fit2) > topTable(fit2,coef=1) > topTable(fit2,coef=2)
The function decideTests() provides a variety of ways to assign statistical significance to the contrasts while controlling for multiple testing.
> results <- decideTests(fit2) > summary(results) > vennDiagram(results)
Load the annotation package hgu95av2
, which can be obtained from
http://www.bioconductor.org/data/metaData.html.
> library(hgu95av2) > library(annotate) > fit2$genes$Symbol <- getSYMBOL(fit2$genes$ID, "hgu95av2") > topTable(fit2, coef=2)
In this lab, we move beyond the analysis of individual genes, and consider sets of genes in microarray experiments. Another approach is to form gene sets based on a priori knowledge of common biological features shared by the genes. We consider a particular approach called gene set enrichment. We begin with a known set of genes and then test whether this set as a whole is differentially expressed in a microarray experiment. This type of test is useful when comparing one's microarray data with that of previous authors who have performed similar microarray experiments, because the lists of most differentially expressed genes reported by the previous authors can be regarded as a "gene set" and tested to determine whether the genes are also differentially expressed in the current context.
Gene set testing was introduced by Mootha et al (2003) and Lamb et al (2003). Mootha et al define the concept of a gene set enrichment test. For a given set of genes, one can test whether the set as a whole is up-regulated, down-regulated or differentially expressed with individual genes possibly going in either direction. Sometimes performing the traditional differential expression analysis of individual genes will yield no statistically significant results, but there may be stronger evidence for differential expression of gene sets.
Now we turn our attention to tests for differential expression involving a set of genes. Mootha et al. [5] and Lamb et al. [6] made this methods popular in 2003. We will use a "gene set enrichment test", which is closely based on the one defined by Mootha et al. The gene set test can be used to test whether previous author's lists of differentially expressed genes are also differentially expressed in a current experiment similar to that of the previous authors. Another possible application is to try to find differential expression in microarray experiments which show no strong differential expression when testing for individual differentially expressed genes, but they might show more evidence of differential expression when testing a predefined set of genes. Defining a useful gene set for this sort of analysis is not always trivial. One possibility is to use a set of genes which share common gene ontologies, i.e. choose a set of genes which are all associated with GOs below a certain node in the GO DAG (Directed Acyclic Graph). We will begin with some artificial examples to illustrate the concept of gene set tests with a small number of made-up t-statistics. Then we will use two sets of genes thought to be regulated by the Estrogen Receptor (ERalpha) to demonstrate testing for differential expression of gene sets in the Estrogen data set.
We will again use the Estrogen data set through the fit2
linear model fit object.
We will use two sets of genes which are thought to be ER-regulated, i.e. regulated by the Estrogen Receptor alpha. The first set, from Jin et al (2003), contains genes which have been experimentally verified to be ER-regulated.
This gene set should be differentially expressed between the breast cancer cells with estrogen reintroduced and the serum-starved breast cancer cells with no estrogen, because in the cells reintroduced to estrogen, the estrogen receptors (ERs) will bind the estrogen and as a result become activated, gaining the ability to regulate gene expression in the cells, hence resulting in differential expression between the cells with and without estrogen.
The data required for this exercise is available from ERgenes.txt. Read the gene lists into R:
> fit2$genes$LocusLink <- getLL(fit2$genes$ID, "hgu95av2") > ERgenes <- read.delim("ERgenes.txt",as.is=TRUE) > inSet <- fit2$genes$LocusLink %in% ERgenes$LocusLink > geneSetTest(inSet, fit2$t[,"E10"], alt="mixed") > geneSetTest(inSet, fit2$t[,"E10"], alt="up") > geneSetTest(inSet, fit2$t[,"E48"], alt="mixed") > geneSetTest(inSet, fit2$t[,"E48"], alt="up")
We can see the results graphically:
> boxplot(fit2$t[,"E10"] ~ inSet) > boxplot(fit2$t[,"E48"] ~ inSet) > boxplot(fit2$t[,"Time"] ~ inSet)
Thanks to Denise Scholtens and Robert Gentleman for providing the estrogen data http://www.bioconductor.org/data/experimental.html. Thanks to Ben Bolstad for an analysis of this data. Thanks to Hui Tang and Terry Speed for suggesting the known ER-regulated gene sets of Jin et al.