About This Document »

Package: annotation
Built with Bioconductor (R): 2.13 (3.0.1)
Source Package annotation_0.99.78934.tar.gz
Windows Binary annotation_0.99.78934.zip
Mac OS 10.6 (Snow Leopard) annotation_0.99.78934.tgz
R Script
Last Built At: Sat Aug 3 11:23:31 PDT 2013
SVN Revision: 78934
Install with (under BioC 2.13):
source("http://bioconductor.org/workflows.R")
workflowInstall("annotation")
  

Using Bioconductor for Annotation

Bioconductor has extensive facilities for mapping between microarray probe, gene, pathway, gene ontology, homology and other annotations.

Bioconductor has built-in representations of GO, KEGG, vendor, and other annotations, and can easily access NCBI, Biomart, UCSC, and other sources.

Package Types

Bioconductor contains many different types of annotation packages. You can browse the currently available types here here by simply using the bioconductor web site.

You will see that there are packages that contain annotation data about a particular microarray platform (ChipDb), there are packages that contain gene centered data about an organism (OrgDb), and even packages that contain genome centered data about an organisms transcriptome (TranscriptDb). This document will talk about typical uses for most of these more popular kinds of annotation package. As well as describe a newer meta package that wraps access to several different kinds of packages (OrganismDb).

Sample ChipDb Workflow

The following illustrates a typical R / Bioconductor session for a ChipDb package. It continues the differential expression workflow, taking a 'top table' of differentially expressed probesets and discovering the genes probed, and the Gene Ontology pathways to which they belong.

First lets consider some typical probe Ids. If you have done a microarray analysis before you have probably already run into IDs like this. They are typically manufacturer assigned and normally only relevant to a small number of chips. Below I am just going to demonstrate on 6 probe Ids from the u133 2.0 affymetrix platform.

## Affymetrix U133 2.0 array IDs of interest; these might be obtained from
## 
## tbl <- topTable(efit, coef=2) ids <- tbl[['ID']]
## 
## as part of a more extensive workflow.
ids <- c("39730_at", "1635_at", "1674_at", "40504_at", "40202_at")

Load libraries as sources of annotation

library("hgu95av2.db")

To list the kinds of things that can be retrieved, use the columns method.

columns(hgu95av2.db)

##  [1] "PROBEID"      "ENTREZID"     "PFAM"         "IPI"         
##  [5] "PROSITE"      "ACCNUM"       "ALIAS"        "CHR"         
##  [9] "CHRLOC"       "CHRLOCEND"    "ENZYME"       "MAP"         
## [13] "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [17] "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
## [21] "GENENAME"     "UNIPROT"      "GO"           "EVIDENCE"    
## [25] "ONTOLOGY"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL" 
## [29] "OMIM"         "UCSCKG"

To list the kinds of things that can be used as keys we can use the keytypes method

keytypes(hgu95av2.db)

##  [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
##  [5] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"      
##  [9] "CHRLOCEND"    "ENZYME"       "MAP"          "PATH"        
## [13] "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"     
## [17] "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
## [21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"    
## [25] "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "PROBEID"     
## [29] "OMIM"         "UCSCKG"

To extract viable keys of a particular kind, use the keys method.

head(keys(hgu95av2.db, keytype = "ENTREZID"))

## [1] "10"        "100"       "1000"      "10000"     "100008586" "10001"

The select method allows you to map probe ids to ENTREZ gene ids...

select(hgu95av2.db, keys = ids, columns = "ENTREZID", keytype = "PROBEID")

##    PROBEID ENTREZID
## 1 39730_at       25
## 2  1635_at       25
## 3  1674_at     7525
## 4 40504_at     5445
## 5 40202_at      687

Or to gene names and gene symbols etc.

select(hgu95av2.db, keys = ids, columns = c("ENTREZID", "GENENAME", "SYMBOL"), 
    keytype = "PROBEID")

##    PROBEID ENTREZID                                           GENENAME
## 1 39730_at       25     c-abl oncogene 1, non-receptor tyrosine kinase
## 2  1635_at       25     c-abl oncogene 1, non-receptor tyrosine kinase
## 3  1674_at     7525 v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1
## 4 40504_at     5445                                      paraoxonase 2
## 5 40202_at      687                              Kruppel-like factor 9
##   SYMBOL
## 1   ABL1
## 2   ABL1
## 3   YES1
## 4   PON2
## 5   KLF9

We can also find and extract the GO ids associated with the first id (there are quite a few of these)

res <- select(hgu95av2.db, keys = ids[1], columns = "GO", keytype = "PROBEID")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

head(res)

##    PROBEID         GO EVIDENCE ONTOLOGY
## 1 39730_at GO:0000115      TAS       BP
## 2 39730_at GO:0000287      IDA       MF
## 3 39730_at GO:0003677      NAS       MF
## 4 39730_at GO:0003785      TAS       MF
## 5 39730_at GO:0004515      TAS       MF
## 6 39730_at GO:0004713      IDA       MF

And then once we have done that, we can also use the GO.db package to find the Terms associated with those GOIDs. How will this work? Well the GO.db package will load a GO.db object that can be used in a manner similar to what we just saw with our ChipDb object hgu95av2.db. So we can use the same four methods that we just learned about (columns, keytypes, keys and select), to extract whatever data we need from this object.

library("GO.db")

## 

head(res$GO)  ## shows what we are using as keys

## [1] "GO:0000115" "GO:0000287" "GO:0003677" "GO:0003785" "GO:0004515"
## [6] "GO:0004713"

head(select(GO.db, keys = res$GO, columns = "TERM", keytype = "GOID"))

##         GOID
## 1 GO:0000115
## 2 GO:0000287
## 3 GO:0003677
## 4 GO:0003785
## 5 GO:0004515
## 6 GO:0004713
##                                                                    TERM
## 1 regulation of transcription involved in S phase of mitotic cell cycle
## 2                                                 magnesium ion binding
## 3                                                           DNA binding
## 4                                                 actin monomer binding
## 5                    nicotinate-nucleotide adenylyltransferase activity
## 6                                      protein tyrosine kinase activity

[ Back to top ]

Sample OrgDb Workflow

The organism wide gene centered packages (OrgDb packages) all contain gene centered data for an organism. These packages are accessed in much the same way as the platform based (ChipDb) package and the Go.db package the we just discussed. But because they are general, they don't contain information like probe IDs that would relate to any specific microarray platform.

But the more important thing to understand is that the same methods apply. So for example you can look up information in this way:

library(org.Hs.eg.db)
keys <- head(keys(org.Hs.eg.db, keytype = "ENTREZID"), n = 2)
columns <- c("PFAM", "GO", "SYMBOL")
select(org.Hs.eg.db, keys, columns, keytype = "ENTREZID")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

##   ENTREZID    PFAM         GO EVIDENCE ONTOLOGY SYMBOL
## 1        1    <NA> GO:0003674       ND       MF   A1BG
## 2        1    <NA> GO:0005576      IDA       CC   A1BG
## 3        1    <NA> GO:0008150       ND       BP   A1BG
## 4       10 PF00797 GO:0004060      IEA       MF   NAT2
## 5       10 PF00797 GO:0005829      TAS       CC   NAT2
## 6       10 PF00797 GO:0006805      TAS       BP   NAT2
## 7       10 PF00797 GO:0044281      TAS       BP   NAT2

[ Back to top ]

Exercises for ChipDb and OrgDb objects.

Exercise 1: Look at the help page for the different columns and keytypes values with: help("SYMBOL"). Now use this information and what we just described to look up the entrez gene, probe id and chromosome for the gene symbol "MSX2".

Exercise 2: Examine the gene symbols for both the hgu95av2.db and the org.Hs.eg.db packages. Which one has more gene symbols? Which one has more gene symbols that can be mapped to an entrez gene ID? Which object seems to contain more information?

Exercise 3: In the previous exercise we had to use gene symbols as keys. But in the past this kind of behavior has sometimes been inadvisable because some gene symbols are used as the official symbol for more than one gene. To learn if this is still happening take advantage of the fact that entrez gene ids are uniquely assigned, and extract all of the gene symbols and their associated entrez gene ids from the org.Hs.eg.db package. Then check the symbols for redundancy.

[ Back to top ]

Sample TranscriptDb Workflow

The genome centered TranscriptDb packages support the same interface as that ChipDb and the OrgDb packages.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene  ## done for convenience
keys <- head(keys(txdb, keytype = "GENEID"), n = 2)
columns <- c("TXNAME", "TXSTART", "TXSTRAND")
select(txdb, keys, columns, keytype = "GENEID")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

##   GENEID     TXNAME TXSTRAND  TXSTART
## 1      1 uc002qsd.4        - 58858172
## 2      1 uc002qsf.2        - 58859832
## 3     10 uc003wyw.1        + 18248755

But in addition to supporting the standard set of methods (select, keytypes, keys and columns). The TranscriptDb objects also support methods to retrieve the annotations as ranges. These accessors break down into two basic categories. The most basic will return annotations as GRanges objects. Some examples of these are: transcripts(), exons() and cds().

This for example will return all the transcripts as ranges:

transcripts(txdb)

## GRanges with 80922 ranges and 2 metadata columns:
##           seqnames               ranges strand   |     tx_id     tx_name
##              <Rle>            <IRanges>  <Rle>   | <integer> <character>
##       [1]     chr1     [ 11874,  14409]      +   |         1  uc001aaa.3
##       [2]     chr1     [ 11874,  14409]      +   |         2  uc010nxq.1
##       [3]     chr1     [ 11874,  14409]      +   |         3  uc010nxr.1
##       [4]     chr1     [ 69091,  70008]      +   |         4  uc001aal.1
##       [5]     chr1     [321084, 321115]      +   |         5  uc001aaq.2
##       ...      ...                  ...    ... ...       ...         ...
##   [80918]     chrY [27605645, 27605678]      -   |     76946  uc004fwx.1
##   [80919]     chrY [27606394, 27606421]      -   |     76947  uc022cpc.1
##   [80920]     chrY [27607404, 27607432]      -   |     76948  uc004fwz.3
##   [80921]     chrY [27635919, 27635954]      -   |     76949  uc022cpd.1
##   [80922]     chrY [59358329, 59360854]      -   |     76950  uc011ncc.1
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

And this will return all the exons as ranges:

exons(txdb)

## GRanges with 286852 ranges and 1 metadata column:
##            seqnames               ranges strand   |   exon_id
##               <Rle>            <IRanges>  <Rle>   | <integer>
##        [1]     chr1       [11874, 12227]      +   |         1
##        [2]     chr1       [12595, 12721]      +   |         2
##        [3]     chr1       [12613, 12721]      +   |         3
##        [4]     chr1       [12646, 12697]      +   |         4
##        [5]     chr1       [13221, 14409]      +   |         5
##        ...      ...                  ...    ... ...       ...
##   [286848]     chrY [27607404, 27607432]      -   |    274767
##   [286849]     chrY [27635919, 27635954]      -   |    274768
##   [286850]     chrY [59358329, 59359508]      -   |    274769
##   [286851]     chrY [59360007, 59360115]      -   |    274770
##   [286852]     chrY [59360501, 59360854]      -   |    274771
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

But these operations will also support the extraction of extra metadata. All extra data will be inserted into the metadata slot of the returned GRanges object. So for example you could spice up your call to transcripts by using the columns argument like this.

transcripts(txdb, columns = c("tx_id", "tx_name", "gene_id"))

## GRanges with 80922 ranges and 3 metadata columns:
##           seqnames               ranges strand   |     tx_id     tx_name
##              <Rle>            <IRanges>  <Rle>   | <integer> <character>
##       [1]     chr1     [ 11874,  14409]      +   |         1  uc001aaa.3
##       [2]     chr1     [ 11874,  14409]      +   |         2  uc010nxq.1
##       [3]     chr1     [ 11874,  14409]      +   |         3  uc010nxr.1
##       [4]     chr1     [ 69091,  70008]      +   |         4  uc001aal.1
##       [5]     chr1     [321084, 321115]      +   |         5  uc001aaq.2
##       ...      ...                  ...    ... ...       ...         ...
##   [80918]     chrY [27605645, 27605678]      -   |     76946  uc004fwx.1
##   [80919]     chrY [27606394, 27606421]      -   |     76947  uc022cpc.1
##   [80920]     chrY [27607404, 27607432]      -   |     76948  uc004fwz.3
##   [80921]     chrY [27635919, 27635954]      -   |     76949  uc022cpd.1
##   [80922]     chrY [59358329, 59360854]      -   |     76950  uc011ncc.1
##                   gene_id
##           <CharacterList>
##       [1]                
##       [2]                
##       [3]                
##       [4]           79501
##       [5]                
##       ...             ...
##   [80918]                
##   [80919]                
##   [80920]                
##   [80921]                
##   [80922]                
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

The 2nd kind of range accessor supported by TranscriptDb objects are the ones that return GRangesList objects. Some examples of these are: transcriptsBy(), exonsBy() or cdsBy(). These accessors just allow you to return a GRangesList object that contains the desired ranges by split up by some important feature type that is specified using the "by" argument. A typical case is to extract all the transcript ranges known for all the genes. You can do that like this:

transcriptsBy(txdb, by = "gene")

## GRangesList of length 22932:
## $1 
## GRanges with 2 ranges and 2 metadata columns:
##       seqnames               ranges strand |     tx_id     tx_name
##          <Rle>            <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 [58858172, 58864865]      - |     68796  uc002qsd.4
##   [2]    chr19 [58859832, 58874214]      - |     68797  uc002qsf.2
## 
## $10 
## GRanges with 1 range and 2 metadata columns:
##       seqnames               ranges strand | tx_id    tx_name
##   [1]     chr8 [18248755, 18258723]      + | 31203 uc003wyw.1
## 
## $100 
## GRanges with 1 range and 2 metadata columns:
##       seqnames               ranges strand | tx_id    tx_name
##   [1]    chr20 [43248163, 43280376]      - | 70442 uc002xmj.3
## 
## ...
## <22929 more elements>
## ---
## seqlengths:
##                  chr1                 chr2 ...       chrUn_gl000249
##             249250621            243199373 ...                38502

[ Back to top ]

Exercises for TranscriptDb objects.

Exercise 4: Use the accessors for the TxDb.Hsapiens.UCSC.hg19.knownGene package to retrieve the gene id, transcript name and transcript chromosome for all the transcripts. Do this using both the select() method and also using the transcripts() method. What is the difference in the output?

Exercise 5: Load the TxDb.Athaliana.BioMart.plantsmart16 package. This package is not from UCSC and it is based on plantsmart. Now use select or one of the range based accessors to look at the gene ids from this TranscriptDb object. How tdo they compare to what you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package?

[ Back to top ]

Sample OrganismDb Workflow

What if you wanted to combine all the good stuff from the GO.db package with what you find in the appropriate TranscriptDb and OrgDb packages for an organism? Then you would want to use an OrganismDb package. An example of an OrganismDb package is the Homo.sapiens package. Like the OrgDb, ChipDb and TranscriptDb packages, it supports the use of select, keytypes, keys and columns.

library(Homo.sapiens)
keys <- head(keys(Homo.sapiens, keytype = "ENTREZID"), n = 2)
columns <- c("SYMBOL", "TXNAME")
select(Homo.sapiens, keys, columns, keytype = "ENTREZID")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

## Warning: 'select' resulted in 1:many mapping between keys and return rows

##   ENTREZID SYMBOL     TXNAME
## 1        1   A1BG uc002qsd.4
## 2        1   A1BG uc002qsf.2
## 3       10   NAT2 uc003wyw.1

When an OrganismDb package knows about a relevant TranscriptDb package, it can also support the ranged accessors introduced with the TranscriptDb objects.

transcripts(Homo.sapiens, columns = c("TXNAME", "SYMBOL"))

## GRanges with 80922 ranges and 2 metadata columns:
##           seqnames               ranges strand   |          TXNAME
##              <Rle>            <IRanges>  <Rle>   | <CharacterList>
##       [1]     chr1     [ 11874,  14409]      +   |      uc001aaa.3
##       [2]     chr1     [ 11874,  14409]      +   |      uc010nxq.1
##       [3]     chr1     [ 11874,  14409]      +   |      uc010nxr.1
##       [4]     chr1     [ 69091,  70008]      +   |      uc001aal.1
##       [5]     chr1     [321084, 321115]      +   |      uc001aaq.2
##       ...      ...                  ...    ... ...             ...
##   [80918]     chrY [27605645, 27605678]      -   |      uc004fwx.1
##   [80919]     chrY [27606394, 27606421]      -   |      uc022cpc.1
##   [80920]     chrY [27607404, 27607432]      -   |      uc004fwz.3
##   [80921]     chrY [27635919, 27635954]      -   |      uc022cpd.1
##   [80922]     chrY [59358329, 59360854]      -   |      uc011ncc.1
##                    SYMBOL
##           <CharacterList>
##       [1]              NA
##       [2]              NA
##       [3]              NA
##       [4]           OR4F5
##       [5]              NA
##       ...             ...
##   [80918]              NA
##   [80919]              NA
##   [80920]              NA
##   [80921]              NA
##   [80922]              NA
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

Making an OrganismDb package

You might be surprised to learn that an OrganismDb package does not itself contain very much information. Instead, it "knows where to find it", but referencing other packages that themselves implement a select interface. So to create an OrganismDb package, you really only need to specify where the information needs to come from. Configuring an OrganismDb object is therefore pretty simple. You simply create a special list object that describes which IDs from each package are the same kind of IDs in other packages to be included, along with the relevant package names. So in the following example, the "GOID" values from the GO.db package act as foreign keys for the "GO" values in the org.Hs.eg.db package and so on.

gd <- list(join1 = c(GO.db = "GOID", org.Hs.eg.db = "GO"), join2 = c(org.Hs.eg.db = "ENTREZID", 
    TxDb.Hsapiens.UCSC.hg19.knownGene = "GENEID"))

makeOrganismPackage(pkgname = "Homo.sapiens", graphData = gd, organism = "Homo sapiens", 
    version = "1.0.0", maintainer = "Package Maintainer<maintainer@somewhere.org>", 
    author = "Some Body", destDir = ".", license = "Artistic-2.0")

In this way, you can create a custom OrganismDb package for any organism of interest, providing that you have also have access to the supporting packages. There is a vignette that covers this topic in more detail here.

[ Back to top ]

Exercises for OrganismDb objects.

Exercise 6: Use the Homo.sapiens object to look up the gene symbol, transcript start and chromosome using select(). Then do the same thing using transcripts. You might expect that this call to transcripts will look the same as it did for the TranscriptDb object, but (temporarily) it will not.

Exercise 7: Look at the results from call the columns method on the Homo.sapiens object and compare that to what happens when you call columns on the org.Hs.eg.db object and then look at a call to columns on the TxDb.Hsapiens.UCSC.hg19.knownGene object. What is the difference between TXSTART and CHRLOC? Which one do you think you should use for transcripts or other genomic information?

[ Back to top ]

Sample AnnotationHub Workflow

So far we have been discussing annotations that are fairly well established and that represent consensus findings from the scientific community. These kinds of annotations are usually curated at large governmental institutions like NCBI or ensembl and for the most part everyone basically agrees about what they mean and how to use them.

But sometimes the annotations that you need are not as well established. Sometimes (for example) we just need to compare our results to the data from a recent large study such as the encode project. The AnnotationHub package is designed to be useful for getting access to data like this. AnnotationHub allows you to get access to data from a range of different data reposotories, with the caveat that the data objects in AnnotationHub have all been pre-processed into appropriate R objects for you.

To make use of AnnotationHub, you need to load the package and then create an AnnotationHub object. Notice that unlike the other packages, with AnnotationHub, you have to create an AnnotationHub object when you 1st start up your R session.

library(AnnotationHub)

ah = AnnotationHub()

Once you have done this, you can "find" any of the available resources just by tab completing along a path like this.

res <- ah$goldenpath.hg19.encodeDCC.wgEncodeUwTfbs.wgEncodeUwTfbsMcf7CtcfStdPkRep1.narrowPeak_0.0.1.RData

res

## GRanges with 82163 ranges and 6 metadata columns:
##           seqnames                 ranges strand   |        name     score
##              <Rle>              <IRanges>  <Rle>   | <character> <integer>
##       [1]     chr1       [237640, 237790]      *   |           .         0
##       [2]     chr1       [544660, 544810]      *   |           .         0
##       [3]     chr1       [567480, 567630]      *   |           .         0
##       [4]     chr1       [569820, 569970]      *   |           .         0
##       [5]     chr1       [714200, 714350]      *   |           .         0
##       ...      ...                    ...    ... ...         ...       ...
##   [82159]     chrX [154764540, 154764690]      *   |           .         0
##   [82160]     chrX [154807400, 154807550]      *   |           .         0
##   [82161]     chrX [154881060, 154881210]      *   |           .         0
##   [82162]     chrX [154892100, 154892250]      *   |           .         0
##   [82163]     chrX [154916040, 154916190]      *   |           .         0
##           signalValue    pValue    qValue      peak
##             <numeric> <numeric> <numeric> <integer>
##       [1]          30    26.892        -1        -1
##       [2]           6     8.164        -1        -1
##       [3]         100    56.718        -1        -1
##       [4]          85    49.654        -1        -1
##       [5]          17    13.184        -1        -1
##       ...         ...       ...       ...       ...
##   [82159]          26     25.29        -1        -1
##   [82160]          22     27.65        -1        -1
##   [82161]          17     16.42        -1        -1
##   [82162]          72    101.61        -1        -1
##   [82163]          32     32.52        -1        -1
##   ---
##   seqlengths:
##         chr1     chr10     chr11     chr12 ...      chr8      chr9      chrX
##    249250621 135534747 135006516 133851895 ... 146364022 141213431 155270560

In the above example, AnnotationHub will retrieve, download and cache locally the file that you tab-completed to, and then store the results in "res".

Now you can see how many ways there are to currently complete that path, by checking the length of the AnnotationHub object:

length(ah)

## [1] 7855

The AnnotationHub is still a pretty new resource, and we already hav a LOT of things in there! How can we narrow this down? Right now we can use filters. By default, there are no filters applied, so calling filters() on our AnnotationHub is just an empty list.

filters(ah)

## list()

What things can be used as filters? We can use the columns() method to find out.

columns(ah)

##  [1] "AnnotationHubRoot"     "BiocVersion"          
##  [3] "Coordinate_1_based"    "DataProvider"         
##  [5] "Description"           "Genome"               
##  [7] "Maintainer"            "RDataClass"           
##  [9] "RDataDateAdded"        "RDataLastModifiedDate"
## [11] "RDataPath"             "RDataSize"            
## [13] "RDataVersion"          "Recipe"               
## [15] "RecipeArgs"            "SourceFile"           
## [17] "SourceSize"            "SourceUrl"            
## [19] "SourceVersion"         "Species"              
## [21] "Tags"                  "TaxonomyId"           
## [23] "Title"

What values can be used with these filters? Here, the keys method will give us an answer.

head(keys(ah, keytype = "Species"))

## [1] "Ailuropoda melanoleuca" "Anolis carolinensis"   
## [3] "Anopheles gambiae"      "Apis mellifera"        
## [5] "Aplysia californica"    "Bos taurus"

So now we know what we need to apply a filter to our AnnotationHub. The following filter will limit our AnnotationHub to just those entries that correspond to cattle (Bos taurus).

filters(ah) <- list(Species = "Bos taurus")

length(ah)

## [1] 116

[ Back to top ]

Exercises for AnnotationHub.

Exercise 8: Set the AnnotationHub filter to NULL to clear it out, and then set ip up so that it is extracting data that originated with the UCSC data provider and that is also limited to Homo sapiens and the hg19 genome.

Exercise 9: Now that you have basically narrowed things down to the hg19 annotations from UCSC genome browser, lets get one of these annotations. Now tab complete your way to the oreganno track and save it into a local variable.

[ Back to top ]

Using biomaRt

Another valuable resource is the biomaRt package. The biomaRt package exposes a huge family of online annotation resources called marts. Here is a brief run down of how to use it. For the first step, load the package and decide which "mart" you want to use, then use the useMart() method to create a mart object

library("biomaRt")
head(listMarts())

##               biomart                             version
## 1             ensembl        ENSEMBL GENES 72 (SANGER UK)
## 2                 snp    ENSEMBL VARIATION 72 (SANGER UK)
## 3 functional_genomics   ENSEMBL REGULATION 72 (SANGER UK)
## 4                vega                VEGA 52  (SANGER UK)
## 5       fungi_mart_18           ENSEMBL FUNGI 18 (EBI UK)
## 6 fungi_variations_18 ENSEMBL FUNGI VARIATION 18 (EBI UK)

ensembl <- useMart("ensembl")
ensembl

## Object of class 'Mart':
##  Using the ensembl BioMart database
##  Using the  dataset

Next you need to decide on a dataset. This can also be specified in the mart object that is created when you call the the useMart() method.

head(listDatasets(ensembl))

##                          dataset
## 1         oanatinus_gene_ensembl
## 2          tguttata_gene_ensembl
## 3        cporcellus_gene_ensembl
## 4        gaculeatus_gene_ensembl
## 5         lafricana_gene_ensembl
## 6 itridecemlineatus_gene_ensembl
##                                  description     version
## 1     Ornithorhynchus anatinus genes (OANA5)       OANA5
## 2    Taeniopygia guttata genes (taeGut3.2.4) taeGut3.2.4
## 3            Cavia porcellus genes (cavPor3)     cavPor3
## 4     Gasterosteus aculeatus genes (BROADS1)     BROADS1
## 5         Loxodonta africana genes (loxAfr3)     loxAfr3
## 6 Ictidomys tridecemlineatus genes (spetri2)     spetri2

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl

## Object of class 'Mart':
##  Using the ensembl BioMart database
##  Using the hsapiens_gene_ensembl dataset

Next we need to think about filters and values. In the biomaRt package, filters are things that can be used with values to restrict or choose what comes back. So you might choose a filter of "affy_hg_u133_plus_2" to go with specific values. For example you might choose c("202763_at","209310_s_at","207500_at") to go with the filter "affy_hg_u133_plus_2". Together these two things would request things that matched those probeset IDs on the platform listed as the filter. There is an accessor for the kinds of filters that are available from a given mart/dataset:

head(listFilters(ensembl))

##              name     description
## 1 chromosome_name Chromosome name
## 2           start Gene Start (bp)
## 3             end   Gene End (bp)
## 4      band_start      Band Start
## 5        band_end        Band End
## 6    marker_start    Marker Start

Also, you need to know about attributes. Attributes here mean the things that you want returned. So if you want to know the gene symbol or something like that. You would list that as an attribute. There are accessors to list the kinds of attributes you can look up too:

head(listAttributes(ensembl))

##                    name           description
## 1       ensembl_gene_id       Ensembl Gene ID
## 2 ensembl_transcript_id Ensembl Transcript ID
## 3    ensembl_peptide_id    Ensembl Protein ID
## 4       ensembl_exon_id       Ensembl Exon ID
## 5           description           Description
## 6       chromosome_name       Chromosome Name

Once you are done exploring and know what you want to extract, you can call the getBM method to get your data like this:

affyids = c("202763_at", "209310_s_at", "207500_at")
getBM(attributes = c("affy_hg_u133_plus_2", "entrezgene"), filters = "affy_hg_u133_plus_2", 
    values = affyids, mart = ensembl)

##   affy_hg_u133_plus_2 entrezgene
## 1         209310_s_at        837
## 2           207500_at        838
## 3           202763_at        836

Now what would you do if you didn't know what the possible values are for a given filter? Well you could just request all the possible values by not specifying the filter, and instead only specifying it as an attribute like this:

head(getBM(attributes = "affy_hg_u133_plus_2", mart = ensembl))

##   affy_hg_u133_plus_2
## 1           231136_at
## 2         210338_s_at
## 3           236573_at
## 4           235207_at
## 5           244669_at
## 6           225155_at

[ Back to top ]

Exercises for biomaRt.

Exercise 10: Pull down GO terms for entrez gene id "1" from human by using the ensembl "hsapiens_gene_ensembl" dataset.

Exercise 11: Now compare the GO terms you just pulled down to the same GO terms from the org.Hs.eg.db package (which you can now retrieve using select()). What differences do you notice? Why do you suspect that is?

[ Back to top ]

Installation and Use

Follow installation instructions to start using these packages. To install the annotations associated with the Affymetrix Human Genome U95 V 2.0, and with Gene Ontology, use

source("http://bioconductor.org/biocLite.R")
biocLite(c("hgu95av2.db", "GO.db"))

Package installation is required only once per R installation. View a full list of available software and annotation packages.

To use the AnnotationDbi and GO.db package, evaluate the commands

library(AnnotationDbi)
library(GO.db)

These commands are required once in each R session.

[ Back to top ]

Exploring Package Content

Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like

help(package="GO.db")
?select

to obtain an overview of help on the GO.db package, and the select method. The AnnotationDbi package is used by most .db packages. View the vignettes in the AnnotationDbi package with

browseVignettes(package = "AnnotationDbi")

To view vignettes (providing a more comprehensive introduction to package functionality) in the AnnotationDbi package. Use

help.start()

To open a web page containing comprehensive help resources.

[ Back to top ]

Annotation Resources

The following guides the user through key annotation packages. Users interested in how to create custom chip packages should see the vignettes in the AnnotationForge package. There is additional information in the AnnotationDbi, OrganismDbi and GenomicFeatures packages for how to use some of the extra tools provided. You can also refer to the complete list of annotation packages.

Key Packages

Types of Annotation Packages

[ Back to top ]

Answers for exercises:

Exercise 1:

keys <- "MSX2"
columns <- c("ENTREZID", "PROBEID", "CHR")
select(hgu95av2.db, keys, columns, keytype = "SYMBOL")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

##   SYMBOL ENTREZID    PROBEID CHR
## 1   MSX2     4488 40733_f_at   5
## 2   MSX2     4488 40734_r_at   5

Exercise 2:

Initially you might expect that hgu95av2.db will have less information in it. After all, it's an old Affymetrix platform that was developed before we even had a very complete human genome. So you might try something like this:

chipSymbols <- keys(hgu95av2.db, keytype = "SYMBOL")
orgSymbols <- keys(org.Hs.eg.db, keytype = "SYMBOL")
length(orgSymbols)

## [1] 43819

length(chipSymbols)

## [1] 43819

And you might feel confused and so you might try this:

dim(select(org.Hs.eg.db, orgSymbols, "ENTREZID", "SYMBOL"))

## Warning: 'select' resulted in 1:many mapping between keys and return rows

## [1] 43827     2

dim(select(hgu95av2.db, chipSymbols, "ENTREZID", "SYMBOL"))

## Warning: 'select' resulted in 1:many mapping between keys and return rows

## [1] 43827     2

And you might also have noticed this:

length(columns(org.Hs.eg.db)) < length(columns(hgu95av2.db))

## [1] TRUE

Well the answer you have in front of you is actually correct. There actually is more information available in the hgu95av2.db object than in the org.Hs.eg.db object. This is because even though the hgu95av2.db object technically can only have probes for some genes in the genome, it still (behind the scenes) retrieves data about gene names etc. from the org.Hs.eg.db package. So it effectively has access to all the data from the org package PLUS the probes for that platform and what those map to. So that means that for there will be information about many gene symbols that don't actually match up to any probeset Ids. And that is what we see if we use gene symbols to look up the probes Ids.

head(select(hgu95av2.db, chipSymbols, "PROBEID", "SYMBOL"))

##   SYMBOL  PROBEID
## 1   A1BG     <NA>
## 2    A2M     <NA>
## 3  A2MP1     <NA>
## 4   NAT1 38187_at
## 5   NAT2 38912_at
## 6   AACP     <NA>

Exercise 3:

egr <- select(org.Hs.eg.db, orgSymbols, "ENTREZID", "SYMBOL")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

length(egr$ENTREZID)

## [1] 43827

length(unique(egr$ENTREZID))

## [1] 43827

## VS:
length(egr$SYMBOL)

## [1] 43827

length(unique(egr$SYMBOL))

## [1] 43819

## So lets trap these symbols that are redundant and look more closely...
redund <- egr$SYMBOL
badSymbols <- redund[duplicated(redund)]
select(org.Hs.eg.db, badSymbols, "ENTREZID", "SYMBOL")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

##     SYMBOL  ENTREZID
## 1      HBD      3045
## 2      HBD 100187828
## 3     RNR1      4549
## 4     RNR1      6052
## 5     RNR2      4550
## 6     RNR2      6053
## 7      TEC      7006
## 8      TEC 100124696
## 9    MEMO1      7795
## 10   MEMO1     51072
## 11    PRG4     10216
## 12    PRG4     23572
## 13 KIR3DL3    115653
## 14 KIR3DL3 100133046
## 15    MMD2    221938
## 16    MMD2 100505381

Exercise 4:

So to retrieve this information using select you need to do it like this:

res1 <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys(TxDb.Hsapiens.UCSC.hg19.knownGene, 
    keytype = "TXID"), columns = c("GENEID", "TXNAME", "TXCHROM"), keytype = "TXID")

head(res1)

##   TXID GENEID     TXNAME TXCHROM
## 1    1   <NA> uc001aaa.3    chr1
## 2    2   <NA> uc010nxq.1    chr1
## 3    3   <NA> uc010nxr.1    chr1
## 4    4  79501 uc001aal.1    chr1
## 5    5   <NA> uc001aaq.2    chr1
## 6    6   <NA> uc001aar.2    chr1

And to do it using transcripts you do it like this:

res2 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("gene_id", 
    "tx_name"))
head(res2)

## GRanges with 6 ranges and 2 metadata columns:
##       seqnames           ranges strand |         gene_id     tx_name
##          <Rle>        <IRanges>  <Rle> | <CharacterList> <character>
##   [1]     chr1 [ 11874,  14409]      + |                  uc001aaa.3
##   [2]     chr1 [ 11874,  14409]      + |                  uc010nxq.1
##   [3]     chr1 [ 11874,  14409]      + |                  uc010nxr.1
##   [4]     chr1 [ 69091,  70008]      + |           79501  uc001aal.1
##   [5]     chr1 [321084, 321115]      + |                  uc001aaq.2
##   [6]     chr1 [321146, 321207]      + |                  uc001aar.2
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

Notice that in the 2nd case we don't have to ask for the chromosome, as transcripts() returns a GRanges object, so the chromosome will automatically be returned as part of the object.

Exercise 5:

library(TxDb.Athaliana.BioMart.plantsmart16)
res <- transcripts(TxDb.Athaliana.BioMart.plantsmart16, columns = c("gene_id"))

You will notice that the gene ids for this package are TAIR locus IDs and are NOT entrez gene IDs like what you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package. It's important to always pay attention to the kind of gene id is being used by the TranscriptDb you are looking at.

Exercise 6:

library(Homo.sapiens)
keys <- keys(Homo.sapiens, keytype = "TXID")
res1 <- select(Homo.sapiens, keys = keys, columns = c("SYMBOL", "TXSTART", "TXCHROM"), 
    keytype = "TXID")

head(res1)

And to do it using transcripts you do it like this:

library(Homo.sapiens)
res2 <- transcripts(Homo.sapiens, columns = "SYMBOL")
head(res2)

## GRanges with 6 ranges and 1 metadata column:
##       seqnames           ranges strand |          SYMBOL
##          <Rle>        <IRanges>  <Rle> | <CharacterList>
##   [1]     chr1 [ 11874,  14409]      + |              NA
##   [2]     chr1 [ 11874,  14409]      + |              NA
##   [3]     chr1 [ 11874,  14409]      + |              NA
##   [4]     chr1 [ 69091,  70008]      + |           OR4F5
##   [5]     chr1 [321084, 321115]      + |              NA
##   [6]     chr1 [321146, 321207]      + |              NA
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

Exercise 7:

columns(Homo.sapiens)

##  [1] "GOID"         "TERM"         "ONTOLOGY"     "DEFINITION"  
##  [5] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
##  [9] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"      
## [13] "CHRLOCEND"    "ENZYME"       "MAP"          "PATH"        
## [17] "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"     
## [21] "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
## [25] "UNIPROT"      "GO"           "EVIDENCE"     "GOALL"       
## [29] "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"      
## [33] "CDSID"        "CDSNAME"      "CDSCHROM"     "CDSSTRAND"   
## [37] "CDSSTART"     "CDSEND"       "EXONID"       "EXONNAME"    
## [41] "EXONCHROM"    "EXONSTRAND"   "EXONSTART"    "EXONEND"     
## [45] "GENEID"       "TXID"         "EXONRANK"     "TXNAME"      
## [49] "TXCHROM"      "TXSTRAND"     "TXSTART"      "TXEND"

columns(org.Hs.eg.db)

##  [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
##  [5] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"      
##  [9] "CHRLOCEND"    "ENZYME"       "MAP"          "PATH"        
## [13] "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"     
## [17] "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
## [21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"    
## [25] "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"        
## [29] "UCSCKG"

columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

##  [1] "CDSID"      "CDSNAME"    "CDSCHROM"   "CDSSTRAND"  "CDSSTART"  
##  [6] "CDSEND"     "EXONID"     "EXONNAME"   "EXONCHROM"  "EXONSTRAND"
## [11] "EXONSTART"  "EXONEND"    "GENEID"     "TXID"       "EXONRANK"  
## [16] "TXNAME"     "TXCHROM"    "TXSTRAND"   "TXSTART"    "TXEND"

## You might also want to look at this:
transcripts(Homo.sapiens, columns = c("SYMBOL", "CHRLOC"))

## Warning: 'select' resulted in 1:many mapping between keys and return rows

## Warning: 'select' resulted in 1:many mapping between keys and return rows

## GRanges with 80922 ranges and 3 metadata columns:
##           seqnames               ranges strand   |        CHRLOC
##              <Rle>            <IRanges>  <Rle>   | <IntegerList>
##       [1]     chr1     [ 11874,  14409]      +   |            NA
##       [2]     chr1     [ 11874,  14409]      +   |            NA
##       [3]     chr1     [ 11874,  14409]      +   |            NA
##       [4]     chr1     [ 69091,  70008]      +   |         69091
##       [5]     chr1     [321084, 321115]      +   |            NA
##       ...      ...                  ...    ... ...           ...
##   [80918]     chrY [27605645, 27605678]      -   |            NA
##   [80919]     chrY [27606394, 27606421]      -   |            NA
##   [80920]     chrY [27607404, 27607432]      -   |            NA
##   [80921]     chrY [27635919, 27635954]      -   |            NA
##   [80922]     chrY [59358329, 59360854]      -   |            NA
##                 CHRLOCCHR          SYMBOL
##           <CharacterList> <CharacterList>
##       [1]              NA              NA
##       [2]              NA              NA
##       [3]              NA              NA
##       [4]               1           OR4F5
##       [5]              NA              NA
##       ...             ...             ...
##   [80918]              NA              NA
##   [80919]              NA              NA
##   [80920]              NA              NA
##   [80921]              NA              NA
##   [80922]              NA              NA
##   ---
##   seqlengths:
##                    chr1                 chr2 ...       chrUn_gl000249
##               249250621            243199373 ...                38502

The key difference is that the TXSTART refers to the start of a transcript and originates in the TranscriptDb object from the TxDb.Hsapiens.UCSC.hg19.knownGene package, while the CHRLOC refers to the same thing but originates in the OrgDb object from the org.Hs.eg.db package. The point of origin is significant because the TranscriptDb object represents a transcriptome from UCSC and the OrgDb is primarily gene centric data that originates at NCBI. The upshot is that CHRLOC will not have as many regions represented as TXSTART, since there has to be an official gene for there to even be a record. The CHRLOC data is also locked in for org.Hs.eg.db as data for hg19, whereas you can swap in a different TranscriptDb object to match the genome you are using to make it hg18 etc. For these reasons, we strongly recommend using TXSTART instead of CHRLOC. Howeverm CHRLOC still remains in the org packages for historical reasons.

Exercise 8:

The 1st thing you need to do is look at the keytypes:

keytypes(ah)

##  [1] "AnnotationHubRoot"     "BiocVersion"          
##  [3] "Coordinate_1_based"    "DataProvider"         
##  [5] "Description"           "Genome"               
##  [7] "Maintainer"            "RDataClass"           
##  [9] "RDataDateAdded"        "RDataLastModifiedDate"
## [11] "RDataPath"             "RDataSize"            
## [13] "RDataVersion"          "Recipe"               
## [15] "RecipeArgs"            "SourceFile"           
## [17] "SourceSize"            "SourceUrl"            
## [19] "SourceVersion"         "Species"              
## [21] "Tags"                  "TaxonomyId"           
## [23] "Title"

Then you want to look at possible values for DataProvider and for Genome.

keys(ah, keytype = "DataProvider")

## [1] "EncodeDCC"                  "ftp.ensembl.org"           
## [3] "ftp://ftp.ncbi.nih.gov/snp" "hgdownload.cse.ucsc.edu"

head(keys(ah, keytype = "Genome"))

## [1] "ailMel1"   "anoCar1"   "anoCar2"   "AnoCar2.0" "anoGam1"   "apiMel1"



filters(ah) <- NULL
filters(ah) <- list(Species = "Homo sapiens", DataProvider = "hgdownload.cse.ucsc.edu", 
    Genome = "hg19")
length(ah)

## [1] 118

Exercise 9:

This pulls down the oreganno annotations. Which are described on the UCSC site thusly: "This track displays literature-curated regulatory regions, transcription factor binding sites, and regulatory polymorphisms from ORegAnno (Open Regulatory Annotation). For more detailed information on a particular regulatory element, follow the link to ORegAnno from the details page."

res <- ah$goldenpath.hg19.database.oreganno_0.0.1.RData

Exercise 10:

library("biomaRt")
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ids = c("1")
getBM(attributes = c("go_id", "entrezgene"), filters = "entrezgene", values = ids, 
    mart = ensembl)

##        go_id entrezgene
## 1 GO:0008150          1
## 2 GO:0005576          1
## 3 GO:0003674          1
## 4 GO:0005515          1

Exercise 11:

library(org.Hs.eg.db)
ids = c("1")
select(org.Hs.eg.db, keys = ids, columns = "GO", keytype = "ENTREZID")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

##   ENTREZID         GO EVIDENCE ONTOLOGY
## 1        1 GO:0003674       ND       MF
## 2        1 GO:0005576      IDA       CC
## 3        1 GO:0008150       ND       BP

When this exercise was written, there was a different number of GO terms returned from biomaRt than from org.Hs.eg.db. This may not always be true in the future though as both of these resources are updated. It is expected however that this web service, (which is updated continuously) will fall in and out of sync with the org.Hs.eg.db package (which is updated twice a year). This is an important difference as each approach has different advantages and disadvantages. The advantage to updating continuously is that you always have the very latest annotations which are frequently different for something like GO terms. The advantage to using a package is that the results are frozen to a release of Bioconductor. And this can help you to get the same answers that you get today (reproducibility), a few years from now.

[ Back to top ]

SessionInfo

sessionInfo()

## R version 3.0.1 (2013-05-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TxDb.Athaliana.BioMart.plantsmart16_2.9.0
##  [2] biomaRt_2.17.2                           
##  [3] AnnotationHub_1.1.8                      
##  [4] Homo.sapiens_1.1.1                       
##  [5] OrganismDbi_1.3.10                       
##  [6] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2  
##  [7] GenomicFeatures_1.13.21                  
##  [8] GenomicRanges_1.13.35                    
##  [9] XVector_0.1.0                            
## [10] IRanges_1.19.19                          
## [11] GO.db_2.9.0                              
## [12] hgu95av2.db_2.9.0                        
## [13] org.Hs.eg.db_2.9.0                       
## [14] RSQLite_0.11.4                           
## [15] DBI_0.2-7                                
## [16] AnnotationDbi_1.23.18                    
## [17] Biobase_2.21.6                           
## [18] BiocGenerics_0.7.3                       
## [19] knitr_1.2                                
## 
## loaded via a namespace (and not attached):
##  [1] BiocInstaller_1.11.4 Biostrings_2.29.14   bitops_1.0-5        
##  [4] BSgenome_1.29.1      digest_0.6.3         evaluate_0.4.4      
##  [7] formatR_0.8          graph_1.39.3         RBGL_1.37.2         
## [10] RCurl_1.95-4.1       rjson_0.2.12         Rsamtools_1.13.26   
## [13] rtracklayer_1.21.9   stats4_3.0.1         stringr_0.6.2       
## [16] tools_3.0.1          XML_3.98-1.1         zlibbioc_1.7.0

[ Back to top ]

Fred Hutchinson Cancer Research Center