Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



%%%%% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 7} %\VignetteDepends{Biobase,affy,ctest,multtest,bioclabs,annotate,hgu95av2} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}


\title{Lab 7: Analyzing Microarray Data: From Images to List of Candidate Genes}



In this lab, we demonstrate how to use R and Bioconductor to 1) read-in Affymetrix cel files, 2) do some quality control checks, 3) pre-process the cel level data to obtain expression measures, 4) obtain scores for differential expression and cut-offs to create list, and finally 5) create user-friendly web pages to report results.

<>= library(Biobase,warn.conflicts = FALSE) library(affy,warn.conflicts = FALSE) library(ctest) library(multtest) library(bioclabs) library(annotate) library(hgu95av2) @

\section{Read-In Cel Files} The function \verb+ReadAffy+ can be used to read in cel files and enter phenotypic data as well as MIAME information.

\begin{Sinput} R> spikein <- ReadAffy(phenoData="phenodata.txt",description="miame.txt", verbose=TRUE) R> phenodata <- read.phenoData("phenodata.txt",check.names=FALSE) R> phenoData(spikein) <- phenodata \end{Sinput}

The data one would obtain by doing this is available from the \verb+bioclabs+ package.

<>= data("spikein") @

\section{Quality Control} Various functions exist in the \verb+affy+ package that can be used for quality control. Let's try a few...

<>= pops <- pData(spikein)[,1]+2 hist(spikein,col=pops,type="l") @

The different colors represent the two different populations.

We can also use \verb+image+, \verb+boxplot+, among others...

\section{Pre-processing} Now we need to convert the probe-level data into expression measures. Various methods are available through the function \verb+expresso+ and one can easily create a new one using the function \verb+express+.

Now we will use the function \verb+rma+ which is implemented in C and is therefore quite fast. <>= eset <- rma(spikein) @

\section{Differential Expression} In this section we will compute the average log ratio for between the two populations and the t-test as well. We will then obtain adjusted p-values and create a list with genes that are statistically significant.

First, notice there are two populations and 12 replicates in each. <>= eset$population Index1 <- which(eset$population==0) Index2 <- which(eset$population==1) @

Let's get average intensities, average log ratios, t-tests, and p-values using the \verb+t.test+ function. <>= scores <- esApply(eset,1,function(x){ tmp <- t.test(x[Index2],x[Index1],var.equal=TRUE) c(mean(tmp$estimate),-diff(tmp$estimate),tmp$statistic,tmp$p.value) }) @

Now let's make the genes be in the rows and give appropriate names to the columns:

<>= scores <- t(scores) ##put genes back in rows colnames(scores) <- c("a","m","t.test","p.value") @

Now let's make an M (average log ratio) vs A (average intensity plot). The horizontal line shows the typical two-fold-change cutoff.

<>= plot(scores[,1],scores[,2],xlab="A",ylab="M",pch=".") abline(h=c(-1,1)) ##fc = 2 @

Should we take the variability of the estimates into account? It's only 3 replicates but we can try a t-test. The following is a so-called volcano plot which plots the t-test versus the estimate.

<>= plot(scores[,2],abs(scores[,3]),xlab="M",ylab="t.test",pch=".") abline(v=c(-1,1)) a <- qt(.975,4) abline(h=a) @

How many genes have p-values less than 0.05? How about 0.01? <>= sum(scores[,4]<=0.05) sum(scores[,4]<=0.01) @

Maybe we should adjust the p-values. Let's use the \verb+multtest+ package to obtain adjusted p-values using Benjamini and Yekutieli's procedures for (strong) control of the false discovery rate (FDR).

The function \verb+mt.rawp2adjp+ gives adjusted p-values according to various methods using only the raw p-values.

<>= tmp <- mt.rawp2adjp(scores[,4],proc="BH") adj.p.values <- tmp$adjp[order(tmp$index),] scores <- cbind(scores,adj.p.values[,-1]) colnames(scores)[5]<- "FDR" @

This assumes that the t-test are actually t-distributed. If we had more time we could try a non-parametric method such as maxT using the function \verb+mt.maxT+.

At what FDR would we be happy? 0.01 is pretty conservative. Let's try it anyway. In the next section we will make a

\section{Creating Report}

First let's pick the AffyIDs corresponding to the genes with adjusted p-values of less than 0.01.

<>= Names <- geneNames(eset)[scores[,5]<=0.01] Names <- Names[order(scores[Names,5])] @

Now using the data available through the metadata package \verb+hgu95av2+ lets find the corresponding gene names and locus link IDs.

<>= ll <- multiget(Names,env=hgu95av2LOCUSID) sym <- multiget(Names,env=hgu95av2SYMBOL) @

We can now make a nice web-page <>= res <- data.frame(unlist(sym),signif(scores[Names,],2)) ll.htmlpage(ll,filename="report.html", title="HTML report", othernames=res, table.head=c("Locus ID","Gene Symbol",colnames(scores)), @ Use a web browser to look at this page.

How we faired with the known to be differentially expressed genes? Of the genes we called how many where actually differential expressed? <>= true <- colnames(pData(eset))[-1] tp <- sum(Names%in%true) cat(tp,"true positives",length(Names)-tp,"false positives.\n") @

Not bad... but not 0.01 FDR either.

Let's make an MVA plot with the gene names.

<>= plot(scores[,1],scores[,2],xlab="A",ylab="M",pch=".",ylim=c(-1,1)) text(scores[Names,1],scores[Names,2],sym,pch=16,col=rainbow(length(Names))) abline(h=c(-1,1)) ##fc = 2 fp <- Names[!Names%in%true] ##lets show the false positves points(scores[fp,1],scores[fp,2],pch=4,cex=4,col="red") @

Was it worth using a t-tests over the more simple fold change estimates?

Let's see which one does better at ranking the truly differentially expressed genes:

<>= m.ranks <- rank(-abs(scores[,2])) ##by fold change names(m.ranks) <- rownames(scores) t.ranks <- rank(-abs(scores[,3])) ##by fold change names(t.ranks) <- rownames(scores) cbind(m.ranks,t.ranks)[true,] @



BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.