Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 2a} %\VignetteDepends{Biobase} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\title{Lab 2: Bioinformatics (annotation package)} \bibliographystyle{plainnat}




In this lab you will be introduced to some of the functionality provided by the annotate package and by the different meta-data packages that are produced through the Bioconductor project.

While the experimental data are very large, as we have noted in the lectures the biological meta-data are much larger. The biological meta-data are constantly evolving and improving as more experiments are done, reported and verified. Since most analyses are concentrated on data from a single chip we have decided to package the meta-data in per chip or per instrument data packages. Many of the more popular Affymetrix chips are prepackaged and available from the Bioconductor website. For cDNA or custom arrays the software tools provided by \Rpackage{AnnBuilder} can be used to create custom data packages. This should only take a few hours and need only be done once or twice a year (depending on changes/improvements to the data resources that we rely on). Note, the Bioconductor project does not generate or validate any biological meta-data. We simply package it into what we believe are useful objects that can be used to enhance data analysis. The reader is cautioned to check all mappings and encouraged to report any failings to us.


Now we will look at how to map from Affymetrix identifiers to the different biological meta-data available from the Bioconductor Web site. Other data can be obtained from other sources and used in a similar fashion.

For each chip a separate data package is available. In this vignette we explore the \Rpackage{hu6800} data package. Each specific mapping is contained in an \textit{environment}. For our purposes an environment is simply a \textit{hash} table. It provides a very fast way of looking up values for specific text keys. Depending on which way you would like to map the keys are either Affymetrix identifiers (for example \Robject{hu6800CHR} maps from Affymetrix identifiers to the chromosome the gene is located on) or from a specific identifier to the Affymetrix, or probe, identifiers. For example, \Robject{hu6800GO2PROBE} maps from Gene Ontology numbers to the probes that are annotated at that GO term. We will describe GO and its uses more in a subsequent section.

Load the data and obtain the Affymetrix identifiers.

<>= library(annotate) library(hu6800) ls(pos=2)

affynames <- ls(env=hu6800CHR) length(affynames)

hu6800() @

After loading the data we see that there are a number of different data sets that have been loaded. These provide the many different mappings from Affymetrix identifiers to other biological data, such as chromosomal location which is found in \texttt{hu6800CHR}.

We see that there are 7129 identifiers (which matches the \texttt{exprSets} we have.

Finally, quality control data and counts etc. of the mappings found is available by calling the function \texttt{hu6800()} and by examining the help page, \texttt{?hu6800}. These data provide you with information on when the meta-data package was constructed, what data sources were used and how many different mappings were found.


mygene <- affynames[4001] get(mygene, env = hu6800ACCNUM) get(mygene, env = hu6800LOCUSID) get(mygene, env = hu6800SYMBOL) get(mygene, env = hu6800GENENAME) get(mygene, env = hu6800SUMFUNC) get(mygene, env = hu6800UNIGENE) get(mygene, env = hu6800CHR) get(mygene, env = hu6800CHRLOC) get(mygene, env = hu6800MAP) get(mygene, env = hu6800PMID) get(mygene, env = hu6800GO)

multiget(affynames[1:10], env = hu6800PMID)

@ We have arbitrarily selected the probe set with Affymetrix identifier \verb+"U18237_at"+. And then obtained all the mappings for it. A variety of information about this gene can now easily be obtained.

In some cases we need to obtain data on several genes at once. To do that we wrote a special function called \texttt{multiget}. Notice that in some cases there are a number of PubMed abstracts associated with the different genes.

We can do some other interesting things as well. <>=

whChrom <- multiget(affynames, env=hu6800CHR)


vv<-sapply(whChrom, length) table(vv)

whChrom[vv==2] @

So we see the distribution of probe sets according to chromosome. Also note that there are 10 genes that have two chromosomes assigned. We might want to explore these further by examining resources at the NCBI.

\section*{Querying PubMed}

For more details please look at the short article in R News, Volume 2/2, pages 28-30, by R. Gentleman and J. Gentry, entitled \textit{Querying PubMed}. The National Library of Medicine (NLM) provides a great deal of information on biological data. We have only just started to explore these data. We need further functionality on MeSH and other data that are there; but for now we concentrate on PubMed.

A mapping has been done between LocusLink and PubMed. This associates specific articles with specific genes. These can be queried interactively using tools available in R and other systems. Full text abstracts are readily available, in some cases other full text articles are available (we just don't have the tools to make use of them). We next demonstrate how you could interact with these.


pmids<-get(mygene, env = hu6800PMID) if(interactive() ) pubmed(pmids, disp="browser")

# With pm.getabst absts2<-pm.getabst(mygene,"hu6800") pm.titles(absts2) sapply(absts2, function(x) pm.abstGrep("[Pp]rotein", x)) @

We obtain the abstracts and then search for the word \texttt{protein}.

Another source of data is GenBank. We can explore interactions with it in much the same way as with PubMed.

<>= g10<-affynames[1:10] gbacc<-multiget(g10,hu6800ACCNUM) if( interactive() ) genbank(gbacc,disp="browser") gb<-genbank(gbacc[1],disp="data") @

Finally we consider interactions with LocusLink. Here, we use local data (LocusLink does not provide a good interface for automatic querying). We generate HTML output. This is often an easy way to communicate your results with other researchers working on the same analysis.

<>= llid<-multiget(affynames[1:10],hu6800LOCUSID) map<-multiget(affynames[1:10],hu6800MAP) symb<-multiget(affynames[1:10],hu6800SYMBOL)

res<-data.frame(affynames[1:10],cbind(unlist(symb),unlist(map))) names(res)<-c("Affy ID","Gene symbol", "Chromosomal location") ll.htmlpage(llid, filename="LocusLink.html", title="HTML report", othernames=res, table.head=c("LocusID",names(res)), = TRUE, repository="ll") @

The last command creates a page markedup using HTML with embedded links. You can use this page to explore the data yourself or you can supply it to your collaborators.

\section*{Other Meta-Data}

The discussion above considered the use of different NLM and NCBI resources for annotating genomic data. This is largely done by providing data on a per gene basis.

There are many other sources of biologic meta-data. These include KEGG (the Kyoto Encyclopedia of Genes and Genomes) located at \url{} and GO (the Gene Ontology Consortium) located at \url{}. Bioconductor produces and releases meta-data packages specific to each of these data resources. The packages are call \Rpackage{KEGG} and \Rpackage{GO} respectively. You are encouraged to explore these (and any other primary sources) for more details and if you find something new we would be happy to work with you to incorporate it into Bioconductor.

As part of our project to develop interactive widgets to help navigate the different data packages we have produced \Rfunction{DPExplorer} which is a tcl/tk widget for exploring the different data packages.

It allows the user to explore the contents of the different data packages and to select items from those packages for us in other settings.

In the next exercise we will locate and select the set of Affymetrix identifiers that are associated with apoptosis. We will use both GO and KEGG to do this. The KEGG mapping provides us with a list of genes that are engaged in the apoptosis pathway while the GO mappings will provide us with the list of genes which have been annotated at the apoptosis term. That basically means that they are believed to play a role in this biological process (but it need not be as direct as being part of the pathway).

<>= library(GO) bpcGO <- contents(GOBPID2TERM) go.bpc <- bpcGO[grep("^[Aa]poptosis", unlist(bpcGO))] go.bpc @

So we see that the GO label that has been associated with the term \textit{apoptosis} is \texttt{GO:0006915}. We can now find all the probes that are annotated at that term using the chip specific data package. Here we use the \Rpackage{hu6800} data package.


affyApop <- get(names(go.bpc), env=hu6800GO2PROBE) length(affyApop) affyAllApop <- get(names(go.bpc), env=hu6800GO2ALLPROBES) length(affyAllApop) @

These two lists of probe identifiers are quite different. The first is the set of probes (acutally LocusLink identifiers) that have been mapped to specifically to the GO term \texttt{GO:0006915}. The second, \Robject{affyAllApop} is the set of probe identifiers that have been mapped either to the term \texttt{GO:0006915} or to a more specific term. The idea behind GO is that genes associated with a biological process such as apoptosis can either be annotated at that node or at a more specific node.

We now repeat the procedure for KEGG.

<>= library(KEGG) apKEGG <- contents(KEGGPATHID2NAME) kegg.ap <- apKEGG[grep("^[Aa]poptosis", unlist(apKEGG))] kegg.ap

affykegg.ap <- get(names(kegg.ap), env=hu6800PATH2PROBE)

sum(affykegg.ap %in% affyApop)

sum(affykegg.ap %in% affyAllApop)


So we see that all of the probes in the pathway (according to KEGG) are annotated at the GO term \texttt{apoptosis}. Which is pretty nice.

What we really do not know how to do well is to incorporate these data into an analysis. We are hoping that by making available the infrastructure that you are using today that you will develop good analytic methods for using these meta-data and will make them available to others.



BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.