- Home
- Help
- Courses and Conferences
- 2005
- BioC2005 Conference
- Differential expression analysis of microarray experiments, BioC2005
- Lab 5: Advanced Linear Modeling and Time-Series Analysis

Gordon Smyth

August 16, 2005

This exercise uses advanced linear modeling and empirical Bayes moderated F-statistics to identify genes with interesting patterns in a multiple time course experiment. The primary dataset we will use is a time-series dataset. Microarray time course datasets typically involve measurements of thousands of genes over a small number of time points, 4-10 time points are typical. Very few datasets have more than 20 timepoints.

This exercise uses the Drosophila Embryogenesis dataset, which is available for installation as an R package:

Package | Windows | MacOS X | Source |
---|---|---|---|

drosEmbryo_1.0 |
drosEmbryo_1.0.zip | drosEmbryo_1.0.tar.gz | drosEmbryo_1.0.tar.gz |

This lab uses the data from Tomancak et al (2002) from 36 DrosGenome1 Affymetrix microarrays. The goal was to investigate Drosophila embryogenesis using a time course experiment. Wild type CantonS flies were expanded to large quantities and the entire population was split into 12 population cages that were subsequently treated equally. For three consecutive days fresh apple juice plates were introduced to each cage in the morning to allow two hours clearing of the retained embryos. Subsequently females in each cage were allowed to lay eggs for one hour before all twelve plates were removed simultaneously and transferred to 25-degree incubator and aged for 30 minutes. From then on at the end of each hour for the next 12 hours embryos from one plate were washed of the plate dechorionated and frozen in liquid nitrogen. The overall effect was to produce three separate replicates of twelve one-hour embryogenesis windows.

First, load the packages and the dataset. Here
`drosEmbryoRMA`

is an `exprSet`

object containing RMA expression values for this
dataset.

library(Biobase) library(limma) library(drosEmbryo) data(drosEmbryoRMA)

It is important to check the quality of your data before proceeding to further analysis. For brevity this quality check is skipped in this lab. However, a set of quality plots for this dataset, available at http://www.stat.berkeley.edu/~bolstad/PLMImageGallery, show few significant quality problems. One thing to note is that the arrays for the second replicate Day have slightly, thought not significantly elevated NUSE values.

Examining the targets information for this dataset:

pData(drosEmbryoRMA)

shows the sample names for this lab, the original
filenames and time points. For the sample names the
nomenclature is *X*hr*Y* where *X* is
the time-point, with value 1 through 12, and *Y*
is the replication day, with value 1,2 or 3.

To begin our analysis the first thing we need to do is set up the model that will be fitted to each probe-set. It seems sensible to include an effect both for time-point and replicate day. To prevent confusion between time point and replicate day we will label the days using letters. This may be accomplished using:

times <- pData(drosEmbryoRMA)$Time rep.day <- rep(c("A","B","C"),12) design <- model.matrix(~factor(times) + factor(rep.day)) colnames(design) <- c(as.character(1:12),"B","C")

Next we should set up the contrast matrix for all the possible comparisons we are interested in. In this case we are interested in looking for any changes in expression so we use the following contrast matrix:

cont.matrix <- rbind(0,diag(11),0,0)

This contrast matrix simply drops the intercept
term, which is of no interest, as well as the effects
for the different days.

The next stage is to fit our model for each probe-set, find the values of the contrasts we wish to examine, then compute the moderated F statistic. Type:

fit <- lmFit(drosEmbryoRMA,design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) modF <- fit2$F

This yields, for each probe-set, an empirical Bayes
moderated F-statistic which tests for any change in
expression level over the time-course. P-values for the
F-statistics are in fit2$F.p.value.

We have just computed moderated F statistic values, looking at all possible changes between time points, for each probe-set. In would now be useful to examine the expression values of the most significant genes. We will do this graphically. First we need to set up a few things:

par(cex=0.7,mfrow=c(2,2),ask=TRUE) which.A <- seq(1,34,3) which.B <- seq(2,35,3) which.C <- seq(3,36,3)

Note that `par(ask=TRUE)`

means that R
will ask you to press enter to move to the next
plot.

In this lab we choose to look at the 500 probesets with most extreme values of the moderated F statistic. For each probe-set we will plot each of the three replicate time series with a different symbol and color. Each plot will have the probeset identifier, moderated test statistic value and overall ranking displayed in the title. This can all be accomplished using the following code:

o <- order(modF, decreasing=TRUE) for(i in 1:500) { indx <- o[i] plot(1:12, exprs(drosEmbryoRMA)[indx,which.A], type="b",pch="A", lwd=2,col="red",xlab="Time", ylab="Expression", main=paste(rownames(exprs(drosEmbryoRMA))[indx], "modF=", round(modF[indx],2), "Ranking=",i),ylim=range(exprs(drosEmbryoRMA)[indx,])) points(1:12, exprs(drosEmbryoRMA)[indx,which.B],pch="B", col="green", lwd=2,type="b") points(1:12, exprs(drosEmbryoRMA)[indx,which.C],pch="C", col="blue", lwd=2,type="b") }

Press enter after you have finished examining each set of four probesets. You can use ESC to abort the plots.

As you progress through the plots you will see a variety of different expression profiles. Some probesets have initially high expression which drops after a few time points. Others have low initial expression but increase in expression near the end of the time course. There are a few probesets which begin low in expression, increase in the middle of the time course and decrease in expression at the end of the time period examined. Other temporal patterns are also visible. In general we notice that the A,B and C replicate time courses seem to match up very well to each other. If we were looking only for specific temporal patterns, for instance only early responding genes, we would repeat the procedure with a different contrast matrix.

As we noted earlier, quality assessment showed that the arrays from the second replication of the time course had slightly elevated NUSE values. We can check if this has resulted in any differences in the temporal patterns for the three replicates. Do this using:

cont.matrix <- rbind(matrix(0,nrow=12,ncol=2), c(1,0), c(0,1)) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) modF <- fit2$F o <- order(modF, decreasing=TRUE) for(i in 1:100) { indx <- o[i] plot(1:12, exprs(drosEmbryoRMA)[indx,which.A], type="b",pch="A", lwd=2,col="red",xlab="Time", ylab="Expression", main=paste(rownames(exprs(drosEmbryoRMA))[indx], "modF=", round(modF[indx],2), "Ranking=",i),ylim=range(exprs(drosEmbryoRMA)[indx,])) points(1:12, exprs(drosEmbryoRMA)[indx,which.B],pch="B", col="green", lwd=2,type="b") points(1:12, exprs(drosEmbryoRMA)[indx,which.C],pch="C", col="blue", lwd=2,type="b") }

You will notice that in almost all cases the temporal patterns for A and C track each other closely, while the profile for B seems to show more difference.

- Tomancak P, Beaton A, Weiszmann R, Kwan E, Shu S,
Lewis SE, Richards S, Ashburner M, Hartenstein V,
Celniker SE, Rubin GM.
*Systematic determination of patterns of gene expression during Drosophila embryogenesis.*Genome Biol. 2002;3(12):RESEARCH0088. Epub 2002 Dec 23. http://genomebiology.com/2002/3/12/research/0088.1 - Berkeley Drosophila Genome Project http://www.fruitfly.org/cgi-bin/ex/insitu.pl
- Smyth, G. K. (2004). Linear models and empirical
Bayes methods for assessing differential expression
in microarray experiments.
*Statistical Applications in Genetics and Molecular Biology***3**, No. 1, Article 3. http://www.bepress.com/sagmb/vol3/iss1/art3/ (Introduces moderated F-statistics)

Thanks to Ben Bolstad for an earlier version of this
laboratory exercise and to Yu Chuan Tai http://www.stat.berkeley.edu/~yuchuan/
for suggestions about the analysis. Yu Chuan Tai will
be releasing a `timecourse`

software package
later in 2005 with a more sophisticated test statistic
(called MB) for time series analysis.

The authors of Tomancak et al (2002) should be commended for making their dataset available for public download.