Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}


\title{Lab 3B: An Introduction to the {\tt affy} package}

\begin{document} \maketitle

In this lab, we demonstrate the main functions in the \Rpackage{affy} package for pre-processing Affymetrix microarray data. We first load the packages that we will be using. <>= library(affy) library(affydata) @

For a more detailed introduction, consult the package vignettes which can be listed by the command \texttt{openVignette("affy")}. A demo can also be accessed by \texttt{demo(affy)}. A number of sample datasets are available in the package and in the \Rpackage{affydata} add-on; to list these, type \texttt{data(package="affy")} and \texttt{ data(package="affydata")}. To interact with this and other vignettes you can use \Rfunction{vExplorer}.

The function \Rfunction{ReadAffy} is available for reading CEL files. However, in this lab we will work mainly with the \Robject{Dilution} dataset, which is included in the package. For a description of \Robject{Dilution}, type \texttt{ ?Dilution}. To load this dataset

<>= data(Dilution) @

%%%%%%%%%%%%%%%%%%%%%%%%% %%% affy classes

One of the main classes in \Rpackage{affy} is the \Rfunction{AffyBatch} class. For details on this class consult the help file, \texttt{help("AffyBatch-class")}; methods for manipulating instances of this class are also described in the help file. Other classes include \verb+ProbeSet+ (PM and MM intensities for individual probe sets), \verb+Cdf+ (information contained in a CDF file), and \verb+Cel+ (single array cel intensity data). The object \verb+Dilution+ is an instance of the class \verb+AffyBatch+. Try the following commands to obtain information of this object

<>= class(Dilution) slotNames(Dilution) Dilution annotation(Dilution) @

For a description of the target samples hybridized to the arrays <>= phenoData(Dilution) pData(Dilution) @

The \verb+exprs+ slot contains a matrix with columns corresponding to arrays and rows to individual probes on the array. To obtain the matrix of intensities for all four arrays <>= e<-exprs(Dilution) nrow(Dilution)*ncol(Dilution) dim(e) @ The values in this array are the raw values for the mean probe expression in your \texttt{CEL} files.

You can access probe-level PM and MM intensities using <>= PM<-pm(Dilution) dim(PM) PM[1:5,] @ The array \Robject{PM} contains only the perfect match probes.

To get the probe set names (Affy IDs)

<>= gnames<-geneNames(Dilution) length(gnames) gnames[1:5] nrow(e)/length(gnames) @ The length of \Robject{gnames}, 12625, indicates that there are that many probesets on the chip.

As with other microarray objects in Bioconductor packages, you can use standard subsetting commands on {\tt AffyBatch} objects. The <>= dil1<-Dilution[1] class(dil1) dil1 cel1<-Dilution[[1]] class(cel1) cel1 @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Reading in data

One of the main functions for reading in Affymetrix data is \verb+ReadAffy+. It reads in data from \verb+CEL+ and \verb+CDF+ files and creates objects of class \verb+AffyBatch+. Using \verb+ReadAffy(widget=TRUE)+ uses a widget for interactive data input.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Diagnostic plots

To produce a spatial image of probe log intensities and probe raw intensities <>= # Log transformation image(Dilution[1]) @

<>= # No transformation image(cel1) @

To produce boxplots of probe log intensities <>= boxplot(Dilution,col=c(2,2,3,3)) @ Note that scanner effects seem stronger than concentration effect (the first and third boxplots are from the same scanner).

To produce density plots of probe log intensities <>= hist(Dilution, type="l", col=c(2,2,3,3), lty=rep(1:2,2), lwd=3) @

These boxplots and histograms show that the Dilution data needs normalizaton. Arrays that should be the same are different. Arrays that should be different are similar. Because of these arrays have increasing concentrations they have to be normalized in concnetration groups.

<>= Dil10 <- normalize(Dilution[1:2]) ##conc. group 10 micrograms Dil20 <- normalize(Dilution[3:4]) ##conc. groun 20 mictograms normDil <- merge(Dil20,Dil10) @

Notice how the boxplot now looks better: <>= boxplot(normDil,col=c(2,2,3,3)) @

The \verb+affy+ package provides implementations for a number of methods for background correction, probe-level normalization (e.g., quantile, curve-fitting (Bolstad et al., 2002)), and computation of expression measures (e.g., MAS 4.0, MAS 5.0, MBEI (Li \& Wong, 2001), RMA (Irizarry et al., 2003)). To list available methods for \verb+AffyBatch+ objects

<>= bgcorrect.methods normalize.AffyBatch.methods pmcorrect.methods express.summary.stat.methods @

The main probe level pre-processing function is \verb+expresso+. You can select pre-processing methods interactively using widgets by typing {\tt expresso(Dilution, widget=TRUE)}. The function operates on objects of class \verb+AffyBatch+ and returns objects of class \verb+exprSet+.

The function \verb+rma+ provides a more efficient implementation of Robust Multi-array Average (RMA)

<>= ##rmaDil<-rma(normDil,normalize=FALSE) ##class(rmaDil) @

We don't normalize because we already did above. %' %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % CDF data packages %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Data packages for CDF information can be download from \url{}|. These packages contain environment objects which provide mappings between Affymetrix identifiers and matrices of probe locations, with rows corresponding to probe-pairs and columns to PM and MM cels. CDF environments for HGU95Av2 and HGU133A chips are already in the package. For information on the environment object {\tt ? hgu95av2cdf}

<>= annotation(Dilution) data(hgu95av2cdf) pnames<-ls(env=hgu95av2cdf) length(gnames) gnames[1:5] get(gnames[1],env=hgu95av2cdf) @

You can also use the \Rfunction{indexProbe}, \Rfunction{pmindex}, and \Rfunction{mmindex} to get information on probe location

<>= plocs<-indexProbes(Dilution,which="both") plocs[[1]] pmindex(Dilution,genenames=gnames[1], xy=TRUE) pmindex(Dilution,genenames=gnames[1]) @

Having access to PM and MM data can be useful. Let's look at a plot of PM vs. MM %' <>= plot(mm(Dilution[1]),pm(Dilution[1]),pch=".",log="xy") abline(0,1,col="red") @

An example of a \verb+ProbeSet+ is included in the package. The object represents the $PM$ and $MM$ intensities of a particular control probeset, \verb+AFFX-BioB-5\_at+, from 12 arrays from the spike-in data set. The transcripts related to this probeset where spiked-in at known concentrations in the hybridization mixture. The concentrations were varied from chip to chip. The different concentration values are used as the sample names:

<<>>= data(SpikeIn) #loads objects: concentrations and SpikeIn sampleNames(SpikeIn) @ Notice that the concentrations are growing exponentially.

The help file describes the data set in more detail. The The next figure uses this data set to demonstrate that the $MM$ also detect some transcript signal.

In the next plot we see that both the $MM$ values and the $PM$ values increase as the concentration does. This means that the $MM$ probes are detecting signal and not just background or non-specific hybridization.

<>= pms <- pm(SpikeIn) mms <- mm(SpikeIn)

##pms follow concentration par(mfrow=c(1,2)) concentrations <- matrix(as.numeric(sampleNames(SpikeIn)),20,12,byrow=TRUE) matplot(concentrations,pms,log="xy",main="PM",ylim=c(30,20000)) lines(concentrations[1,],apply(pms,2,mean),lwd=3) ##so do mms, but no as much matplot(concentrations,mms,log="xy",main="MM",ylim=c(30,20000)) lines(concentrations[1,],apply(mms,2,mean),lwd=3) @

These $PM$ and $MM$ intensities were obtained using the \Rfunction{probeset} method on an \Robject{AffyBatch} (not included or shown) representing a large spike-in experiment.

\section{Using sequence information}

Sequence information can be used to improve on pre-processing. Two examples are the use of GC content to adjust for non-specific hybridization and the use of probe location to take into account RNA degradation. We demonstrate this with some examples

\subsection{GC content} To explore issues of GC content we need to load a special library that contains information on the GC content of the probes. For any chip the R package \Rpackage{makeMatchAffy} can be used, together with the appropriate metadata package to construct the necessary package. The package named \Rpackage{mAffyu95A} was built using the HGU95Av2 chip. This package contains the probe sequences (for most probes) and their location on the RNA transcript represented by their probeset. We start by loading this information

<<>>= if(.Platform$OS.type == "unix") { library(mAffyu95A) data(affyprobeids) data(affylocs) } @

Now we are ready to put together each probe with their sequence and their GC-content.

<<>>= if( .Platform$OS.type=="unix") { Index <- affylocs[,1]+ affylocs[,2]*ncol(Dilution) + 1 probenames <- probeNames(Dilution) myindex <- unlist(pmindex(Dilution)) probenames <- probenames[match(Index,myindex)] seq <- getprobes(seq(along=Index)) tmp <- sapply(seq,cgcontent) tmp <- matrix(unlist(tmp),nrow=4) atc <- tmp[1,]+tmp[2,] gcc <- tmp[3,]+tmp[4,] } @

To see that the GC has an effect on hybridization notice how as GC content grows so does the measured intensity. So probes with higher GC content have higher intensities. This appears to be independent of the amount of transcript available.

<>= if( .Platform$OS.type=="unix") { y <- intensity(Dilution)[Index,1] boxplot(split(y, gcc),xlab="GC content", ylab="Iintensity", log="y",las=1, range=0,col="grey") } @

Now does this matter at the expression level? To see that the average GC across probes within probeset can be variable notice the distribution shown in the next figure. The average GC content in probesets is quite variable.

One still might ask whether this matters. If all we are doing is comparing samples for specific genes then it shouldn't matter too %' much, but there may be some cases where it does. For example if a probeset has a GC rich probe to which some mRNA other than the target hybridizes then we may perceive differences in the target genes expression, between the samples (if the samples differ in abundance of the other mRNA). We are still unsure as to how large that effect might be. These tools will help you to explore the data more extensively if desired.

<>= if(.Platform$OS.type=="unix") { avggcc <- tapply(gcc, probenames, mean) barplot(table(round(avggcc)),xlab="Average GC content in probeset",las=1) } @

Subtracting the $MM$ does a relatively good job at removing this effect. In the next figure the boxplots are centered around zero. <>= if(.Platform$OS.type=="unix") { ##pm - mm pmi <- unlist(pmindex(Dilution)) mmi <- unlist(mmindex(Dilution)) pmIndex <- match(pmi,Index)

y <- intensity(Dilution)[pmi,1]-intensity(Dilution)[mmi,1] x <- gcc[pmIndex] boxplot(split(y,x),xlab="GC content",ylab="PM-MM",las=1, range=0,col="grey",ylim=c(-300,500)) } @

If instead of using RMA we use MASS 5.0 (at least our interpretation of it) the effect due to GC content is much smaller. The reason for this is that MASS 5.0 does subtract the MM values and hence removes the GC content effect.

<>= if( .Platform$OS.type=="unix") { ##the following command will take a few minutes mas5Dil <- mas5(Dilution,normalize=FALSE)

boxplot(split(exprs(mas5Dil)[names(avggcc),1],round(avggcc)), log="y",range=0,las=1,yaxt="n", xlab="Average GC content in probe set", ylab="MAS 5.0 expression",col="grey")

axis(2,c(.1,10,1000),c("0.01","10","1000"),las=1) } @

However, as pointed out by Irizarry et al. (2003) subtracting $MM$ results in expression measures that are very noisy at low expression values. RMA is not as noisy, but the next plot shows it does not do as well in removing the dependence on GC content.

<>= if( .Platform$OS.type=="unix") {

RMA <- rma(Dilution)

## boxplot(split(exprs(rmaDil)[names(avggcc),1],round(avggcc)), ## log="y",range=0,las=1,yaxt="n", ## xlab="Average GC content in probe set", ## ylab="RMA (pm-only) expression",col="grey") ## ## axis(2,c(.1,10,1000),c("0.01","10","1000"),las=1) } @

We could write a new version of RMA that does the background adjustment within GC content strata (and indeed that work is under way).

\subsection{Probe Location}

The package also includes methods useful for assessing RNA quality. Individual probes in a probe set are ordered by location relative to the $5^\prime$ end of the targeted RNA molecule. Since RNA degradation typically starts from the $5'$ end of the molecule, we would expect probe intensities to be systematically lowered at that end of a probe set when compared to the $3'$ end. Affymetrix software includes some $5^\prime$ and $3^\prime$ probe sets that are used to assess this. The \Rpackage{affy} package includes some simple methods that complement the Affymetrix assessment.

When using the probe location functions described above the $PM$ and $MM$ are, in general, returned in location order. This makes it easy to perform the following assessment: On each chip, probe intensities are averaged by location in probe set, with the average taken over probe sets. The package provides a method that produces side-by-side plots of these means, making it easy to notice any $5^\prime$ to $3^\prime$ trend. The function \Rfunction{AffyRNAdeg} does all the computation. The object returned can be summarized and plotted using \Rfunction{summaryAffyRNAdeg} and \Rfunction{plotAffyRNAdeg} respectively.

The summary shows standardized slope estimates of the regression of intensity versus probe number (not location per se). The plots shows the average across all probes versus probe number. This plot can be used to compare arrays. An array with degradation should stand out becuase it has a bigger slope. In this example we dont see any evidence of this.

<<>>= if( .Platform$OS.type=="unix") {

rnadeg <- AffyRNAdeg(Dilution) summaryAffyRNAdeg(rnadeg) plotAffyRNAdeg(rnadeg) } @

If one wants to delve deaper one can use the MatchAffy package, which returns the exact location (as opposed to order location). The following code gives an example of how one can obtain a similar plot to the above:

The next plot uses the actual location instead of probe number, however, we re-scale because otherwise one does not see much (there is too much noise).

<>= if( .Platform$OS.type=="unix") {

scale01 <- function(x){r <- range(x); x-r[1]/(r[2]-r[1]) } locs <- tapply(as.double(affylocs[,3]), probenames, scale01) locs <- unlist(locs)

plot(locs, locs, type="n",xlab="5' <-----> 3'\nRelative position", ylab="Relative log intesity", yaxt="n", xaxt="n", ylim=c(-.02,.35)) legend(.1,.3,c("normal RNA","degraded RNA"), col=c("blue","red"), lty=c(2,1)) ##lets just show array 1 i <- 1 o <- sample(length(gcc), 5000) pms <- log2(intensity(Dilution)[Index, i]) res <- tapply(pms, probenames, function(x) x-median(x)) tmp <- loess(unlist(res)~gcc, subset=o, span=4/5, degree=1) oo <- seq(1,length(tmp$x), len=100) lines(sort(tmp$x)[oo], tmp$fitted[order(tmp$x)][oo], col="red") } @



BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.