Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 4} %\VignetteDepends{Biobase,ctest,multtest,bioclabs} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}


\title{Lab 4: Differential Gene Expression}



In this lab, we demonstrate how to use R to find genes that are differentially expressed in two populations. We demonstrate two useful plots, the MA-plot and the volcano plot. For a more formal assessment, we use the \verb+multtest+ package for obtaining adjusted $p$-values. <>= library(Biobase) library(ctest) library(multtest) library(bioclabs) @

We use expression data from an experiment where sixteen genes were spiked in at different known concentrations in different hybridizations and are thus differentially expressed. Expression measures are stored in an \Robject{exprSet} object available through the \Rpackage{bioclabs} package. <>= data("eset3") genenames <- colnames(pData(eset3))[-1] @ %%FIXME: couldn't we have better varLabels? The concentrations of the sixteen genes in each of six hybridizations are stored in the \Robject{phenoData} slot of the \Robject{exprSet} object \Robject{eset3}. <>= pData(eset3) @ Notice there are two populations and 3 replicates in each. We are interested in identifying genes that are differentially expressed in these two populations.

Let us create a matrix containing for each of the 12626 genes on the HGU95a chip (note this really is A and not Av2), its {\em A-value} or average log intensity, its {\em M-value} or difference of log intensities (log ratio), its two-sample $t$-statistic, and its nominal $p$-value from the $t$-distribution. The rows of the \Robject{scores} matrix correspond to genes and the columns to the four different types of statistics.

The data have already been transformed to the log scale. Base 2 logarithms were used.

<>= Index1 <- which(eset3$population==0) Index2 <- which(eset3$population==1) scores <- esApply(eset3,1,function(x){ tmp <- t.test(x[Index2],x[Index1],var.equal=TRUE) c(mean(tmp$estimate),-diff(tmp$estimate),tmp$statistic,tmp$p.value) }) scores <- t(scores) ##put genes back in rows colnames(scores) <- c("A","M","t.stat","p.value") @

The following commands produce an {\em MA-plot} of the differences of log intensities in the two populations, M, vs. the average log intensities, A. <>= plot(scores[,1],scores[,2],xlab="A",ylab="M",pch=".") #points(scores[genenames,1],scores[genenames,2],pch=16,col=1:16) text(scores[genenames,1],scores[genenames,2],genenames,col=1:16) abline(h=c(-1,1)) ##fc = 2 @

In the MA-plot, points with large vertical deviations (absolute M-values) suggest differentially expressed genes. The horizontal lines show the typical two-fold-change cutoff (because the expression data are on a log 2 scale). The colored symbols correspond to the sixteen genes that are truly differentially expressed, i.e., were spiked in at different concentrations. Other points with large M-values correspond to false positives. One of the sixteen known genes has a low M-value, corresponding to a false negative ({\tt 1597\_at}). Based on our knowledge of the concentrations for each of the sixteen genes, one expects only one of these genes ({\tt 684\_at}) to have a negative M-value, i.e., higher expression measure in population 0 than 1 (infinitely more abundant in population 0). This is indeed the point with the very low M-value.

Should we take the variability of the estimates into account? There are only 3 replicates but we can try a $t$-test. The following is a so-called {\em volcano plot} of the $t$-statistic vs. the numerator of the $t$-statistic, or M-value. Genes in the top left and right corners of the plot correspond to genes with both large absolute differences and large relative (to standard error) differences in expression between the two populations. Although the sixteen known genes tend to have large absolute $t$-statistics, they standout more in terms of their M-values.

In other situations you may have seen the volcano plot defined as a plot of minus the base 10 logarithm of the data versus fold change. The two versions (ours and this one) are equivalent since all $t$-statistics are based on the same number of degrees of freedom. <>= plot(scores[,2],abs(scores[,3]),xlab="M",ylab="t.stat",pch=".") points(scores[genenames,2],abs(scores[genenames,3]),pch=16,col=1:16) abline(v=c(-1,1)) a <- qt(.975,4) abline(h=a) @

Note that the gene with small M value really distorts our visual impression of the data. In order to remove that effect we replot the data with new limits. For a close-up, simply change the \verb+xlim+ and \verb+ylim+. <>= plot(scores[,2],abs(scores[,3]),xlab="M", ylab="t.stat",pch=".",xlim=c(-1.5,1.5),ylim=c(0,15)) text(scores[genenames,2],abs(scores[genenames,3]),genenames,col=1:16) abline(v=c(-1,1)); a <- qt(.975,4); abline(h=a) @

How many genes have $p$-values less than 0.05? How about 0.01? <>= sum(scores[,"p.value"]<=0.05) sum(scores[,"p.value"]<=0.01) @

One can adjust the $p$-values to account for multiple hypothesis testing using the \Rpackage{multtest} package. The function \Robject{mt.rawp2adjp} gives adjusted $p$-values according to various methods using only the raw $p$-values. <>= tmp <- mt.rawp2adjp(scores[,4]) adj.p.values <- tmp$adjp[order(tmp$index),-1] scores <- cbind(scores,adj.p.values) @ The function \Rfunction{maxT} computes permutation adjusted $p$-values for the Westfall \& Young step-down maxT procedure. One needs to use the original data again. Similarly, for the step-down minP procedure, one can use the function \verb+mt.minP+. <>= tmp <- mt.maxT(exprs(eset3),eset3$population) scores <- cbind(scores,tmp$adjp[order(tmp$index)]) colnames(scores)[12] <- "maxT" @ %$ -- to keep fontlocking happier

Let's see how we fared with the sixteen known to be differentially expressed genes. %' <>= round(scores[genenames,],2) @

One can make some pretty substantial arguments against this procedure. Make an adjustment for all 12626 probes is probably not appropriate in any situation and is certainly hard to support here. These data arose from a designed experiment where we know which genes are going to change. Not all 12626 were candidates for change. In a real experiment you are likely to find that approximately 40\% of the genes are not expressed in the tissue that you are studying. In that case you have corrected for something which should have been excluded from the analysis. There is of course an issue in determining which genes are not differentially expressed is very difficult but the gains from an approximate solution are likely to be quite large.

With 3 replicates we don't expect to have much power. Should we even use $t$-statistics over the more simple fold change estimates? Let's see which one does better at ranking the sixteen truly differentially expressed genes. Ideally, one would like the sixteen genes to have ranks 1 through 16. It seems like the simple M-value or fold change measure was more successful at identifying the sixteen known genes. %' <>= m.ranks <- rank(-abs(scores[,2])) ##by fold change names(m.ranks) <- rownames(scores) t.ranks <- rank(-abs(scores[,3])) ##by fold change names(t.ranks) <- rownames(scores) cbind(m.ranks,t.ranks)[genenames,] @

We can easily repeat the whole procedure with the dataset comprising 12 replicates, by simply using \verb+data(eset12)+ instead of \verb+data(eset3)+.



BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.