Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{GSEA Case Study} %\VignetteDepends{genefilter, GO, hgu95av2, xtable, KEGG, ALL} %\VignetteKeywords{gene set enrichment} \documentclass[11pt]{article}

\usepackage{../Common/bioc} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in


\title{Gene Set Enrichment Analysis} \author{R. Gentleman and Seth Falcon}




We begin by loading up the appropriate libraries.

<>= library("Biobase") library("annotate") library("Category") library("hgu95av2") library("genefilter") @

In this case study we will see how to use gene set enrichment analysis \citep{Mootha, Tian}. We will make use of the data from an investigation into acute lymphoblastic leukemia (ALL) reported in \citet{Chiaretti2004} for our examples. We will primarily concentrate on the two sample problem, where the data can be divided into two distinct groups, and we want to understand differences in gene expression between the two groups. For the ALL data we will compare those samples with BCR/ABL to those that have no observed cytogenetic abnormalities (NEG).

The basic idea behind gene set enrichment analysis is that we want to use predefined sets of genes, perhaps based on function, in order to better interpret the observed gene expression data. In some ways the ideas here are quite similar to those that the usual Hypergeomtric testing is based on.


As for all analyses, we must first load the data and process it. In the code chunk below we load the data and then select the subset we are interested in.

<>= library("ALL") data(ALL) Bcell <- grep("^B", as.character(ALL$BT)) bcrAblOrNegIdx <- which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL")) eset <- ALL[, intersect(Bcell, bcrAblOrNegIdx)] eset$mol.biol <- factor(eset$mol.biol) numBN <- length(eset$mol.biol) @

\begin{Q} How many samples are in our subset? How many are BCR/ABL and how many NEG? \end{Q}

Next, we remove from our data set those probes that have no mapping to EntrezGene, since we will not be able to find any metadata for these probes.

<>= entrezIds <- mget(geneNames(eset), envir=hgu95av2LOCUSID) haveEntrezId <- names(entrezIds)[sapply(entrezIds, function(x) !] numNoEntrezId <- length(geneNames(eset)) - length(haveEntrezId) eset <- eset[haveEntrezId, ] @

Next we do some basic prefiltering. My preference is to filter genes according to their variability across samples. In the code below we compute the IQR (approximately) and then select for our gene set those genes that have an IQR above the median value.

<>= ## Non-specific filtering based on IQR lowQ = rowQ(eset, floor(0.25 numBN)) upQ = rowQ(eset, ceiling(0.75 numBN)) iqrs = upQ - lowQ

selected <- iqrs > 0.5

nsFiltered <- eset[selected, ]


In the next code chunk, we find all probes that map to a single gene. We want only one probe set to represent each gene (otherwise we have to do a lot of downstream adjustments) and our decision here is to chose the one that shows the most variation, as measured by the IQR, across samples.

<>= ## Reduce to unique probe <--> gene mapping by keeping largest IQR ## This gives us "unique genes" in the non-specific filtered gene ## set which simplifies further calculations. nsFilteredIqr <- iqrs[selected] uniqGenes <- findLargest(geneNames(nsFiltered), nsFilteredIqr, "hgu95av2") nsFiltered <- nsFiltered[uniqGenes, ]

## basic stats on our non-specific filter result numSelected <- length(geneNames(nsFiltered)) numBcrAbl <- sum(nsFiltered$mol.biol == "BCR/ABL") numNeg <- sum(nsFiltered$mol.biol == "NEG")



How many genes have been selected for our analysis?


Since we have done this selection without regard to phenotype, or the type of gene sets we are going to use we will be able to use this same subset of the data in all examples.

\section*{Using KEGG}

We now want to use KEGG to assign genes to pathways, and to then use those pathways for our gene sets.

<>= ## Remove genes with no PATH mapping havePATH <- sapply(mget(geneNames(nsFiltered), hgu95av2PATH), function(x) if (length(x) == 1 && FALSE else TRUE) numNoPATH<- sum(!havePATH) nsF <- nsFiltered[havePATH, ]



How many genes are we left with?


Now we must compute the incidence matrix, that is the matrix that maps between probes and the pathways. This matrix has zero's and ones in it. The last two commands rearrange the rows of the A matrix so that they are in the same order as the genes in our \Rclass{exprSet} object.


Am = PWAmat("hgu95av2") egN = unlist(mget(geneNames(nsF), hgu95av2LOCUSID))

sub1 = match(egN, row.names(Am))

Am = Am[sub1,] dim(Am) table(colSums(Am))



How many categories and how many genes are represented by the A matrix? How many categories have fewer than 10 genes in them? What is the largest number of categories a gene is in? \end{Q}

Next we will compute the per gene test statistics using the \Rfunction{rowttests} function from the \Rpackage{genefilter} package. There are several other fast test statistic computations that you can do as well (e.g. \Rfunction{rowFtests}).


rtt = rowttests(nsF, "mol.biol") rttStat = rtt$statistic


\begin{Q} How many test statistics are positive? How many are negative? How many have a $p$-value less than $0.01$? \end{Q}

Next we further reduce the A matrix, by removing all categories that have fewer than 10 genes in them. When carrying out your own analyses you should select a value you are comfortable with.

<>= Amat = t(Am) rs = rowSums(Amat) Amat2 = Amat[rs>10,] rs2 = rs[rs>10] nCats = length(rs2) @

And now it is fairly easy to compute the per category test statistics and to produce a qq-plot.


tA = as.vector(Amat2 %*% rttStat) tAadj = tA/sqrt(rs2)

names(tA) = names(tAadj) = row.names(Amat2)


And now for the qq-plot. We see that there is one pathway that has a remarkably low observed value (less than $-5$) so we will take a closer look at this pathway.

\begin{figure}[tp] \centering <>= qqnorm(tAadj) @ \includegraphics{GSEA-qqplot} \caption{\label{fig:qq}% The per category qq-plot.} \end{figure}

To find the pathway, we first find the value, and then use


smPW = tAadj[tAadj < -5] pwName = KEGGPATHID2NAME[[names(smPW)]] pwName @

Now we can produce some summary plots based on the genes annotated at this pathway. The mean plot presents a comparison of the average expression value for each of our two groups, for each gene in the specified pathway. That is, each point in this plot represents one gene and the value on the $x$-axis is the mean in the BCR/ABL samples while the value on the $y$-axis is the mean value in the NEG samples.

\begin{Q} What do you notice in this plot? \end{Q}

\begin{figure}[tp] \centering <>= KEGGmnplot(names(smPW), nsF, "hgu95av2", nsF$"mol.biol") @ \includegraphics{GSEA-mnplot} \caption{\label{fig:mnplot}% The mean plot for the \Sexpr{pwName} pathway.} \end{figure}

And finally a heatmap. \begin{Q} What sorts of things do you notice in the heatmap? The gene labeled \verb+41214_at+ has a very unusual pattern of expression. Can you guess what is happening? \textit{Hint:} look at which chromosome it is on. \end{Q}

\begin{figure}[tp] \centering <>= KEGG2heatmap(names(smPW), nsF, "hgu95av2") @ \includegraphics{GSEA-heatmap} \caption{\label{fig:hm}% A heatmap for the \Sexpr{pwName} pathway.} \end{figure}

\begin{Ex} Repeat these plots for the pathway with the largest observed average $t$-statistic. \end{Ex}

\subsection*{Permutation testing}

The analysis above was based on a presumption that the data are approximately Normally distributed. However, this is sometimes viewed as a relatively strong assumption and there is some interest in using another approach. For GSEA it is relatively straightforward to compute a permutation $t$-test. The \Rfunction{ttperm} in the \Rpackage{Category} package can be used for this purpose.

In the next code chunk we compute the permutation distribution for this same problem. The value returned by \Rfunction{ttperm} is a list, the first entry is the observed $t$-statistic (the return value of a call to \Rfunction{rowttests}) while the second element is itself a list of length \Rfunarg{B}, the number of permutations.

In the code chunk below we compute the permutation distribution based on 100 permutations, in practice you would typically use a much larger value.


NPERM = 100 ttp = ttperm(exprs(nsF), nsF$mol.biol, NPERM)

permDm ="cbind", lapply(ttp$perms, function(x) x$statistic))

permD = Amat2 %*% permDm

pvals = matrix(NA, nr=nCats, ncol=2) dimnames(pvals) = list(row.names(Amat2), c("Lower", "Upper"))

for(i in 1:nCats) { pvals[i,1] = sum(permD[i,] < tA[i])/NPERM pvals[i,2] = sum(permD[i,] > tA[i])/NPERM }

ord1 = order(pvals[,1]) lowC = (row.names(pvals)[ord1])[pvals[ord1,1]< 0.05]

highC = row.names(pvals)[pvals[,2] < 0.05]



lnhC = length(highC)



How many pathways have low $t$-statistics? How many have high? How do you interpret these? What $p$-value is the most extreme? What does the heatmap look like for this gene set?


\section*{Chromosome Bands}

Another interesting application is to use chromosome band information as the gene set. In doing this, you are studying whether there are anomolies in the pattern of gene expression that relate to chromosomal location. Unfortunately this analysis is incomplete in the sense that we have not adjusted for nesting of bands, but software for doing this will be available in the 1.9 release of Bioconductor.

The chromosome band information is contained in the metadata variable with the suffix \texttt{MAP}, so for us it is in \Robject{hgu95av2MAP}.

\begin{Q} What does the manual page say the interpretation of the MAP position \verb+17p33.2+ is? \end{Q}

A preliminary approach to finding relevant MAP positions is available in the \Rfunction{MAPAmat} function from the \Rpackage{Category} package (note this function will be improved over the next few months). Since we want to focus on relatively few categories we are going to restrict attention to MAP locations with at least 10 genes.


AmChr = MAPAmat("hgu95av2", minCount=5)


\begin{Q} How many genes were selected? How many map positions? \end{Q}

Next we must reduce this matrix to those genes we have selected for analysis. So in the next code chunk we repeat the steps performed above, for KEGG, to match row names to those selected and then to reduce the incidence matrix accordingly.


subC = row.names(AmChr) %in% egN

AmChr = AmChr[subC,] dim(AmChr) table(colSums(AmChr))


\begin{Ex} Further reduce \Robject{AmChr} so that only bands with at least 5 genes are retained. Produce a qq-plot and identify the interesting bands. Use Google, or some other search engine to determine whether others have identified these bands as interesting.

Repeat the analysis performed on the KEGG pathways. Produce mean plots, heatmaps etc. Try to identify a set of interesting bands.


The version number of R and the packages and their versions that were used to generate this document are listed below.

\begin{verbatim} <>= sessionInfo() @ \end{verbatim}




BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.