22-26 July | CSAMA 2019

Description

The Bioconductor project provides various annotation packages that can be used to add information to the results of high-throughput experiments. This can range from simply mapping Ensembl IDs to the corresponding HUGO gene symbols to much more complex queries involving multiple data sources. We will briefly cover the various classes of annotation packages, what they contain, and how to use them efficiently.

Task

  1. Start with a set of identifiers that are measured

  2. Map to new identifiers.

Why:

  • more familiar to collaborators
  • can be used for further analyses.

As an example, RNA-Seq data may only have Entrez Gene IDs for each gene measured, and as part of the output you may want to include the gene symbols, which are more likely to be familiar to a biologist.
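
A minimal sketch of this task, assuming a small results table keyed by Entrez Gene ID. The data frame `res` and its log fold changes are invented for illustration; the lookup uses mapIds, one of the functions covered further on.

```r
library(org.Hs.eg.db)

# Hypothetical results table; the logFC values are made up for illustration
res <- data.frame(ENTREZID = c("672", "675"),
                  logFC = c(1.2, -0.8),
                  stringsAsFactors = FALSE)

# Look up one gene symbol per Entrez Gene ID and attach it as a new column
res$SYMBOL <- mapIds(org.Hs.eg.db, keys = res$ENTREZID,
                     column = "SYMBOL", keytype = "ENTREZID")
res
```

Because mapIds returns one value per key, the new column lines up with the existing rows.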

What do we mean by annotation?

Map a known ID to other functional or positional information

Annotation sources

Package type     Example
OrgDb            org.Hs.eg.db
TxDb/EnsDb       TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75
OrganismDb       Homo.sapiens
BSgenome         BSgenome.Hsapiens.UCSC.hg19
Others           GO.db
AnnotationHub    Online resource
biomaRt          Online resource
ChipDb           hugene20sttranscriptcluster.db

Interacting with AnnoDb packages

The main function is select:

AnnotationDbi::select(annopkg, keys, columns, keytype)

Where

  • annopkg is the annotation package

  • keys are the IDs that we know

  • columns are the values we want

  • keytype is the type of key used
    • if the keytype is the central key, it can remain unspecified

help: ?AnnotationDbi::select
other useful functions: columns, keytypes, mapIds
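
As a rough illustration of the keytype rule above, the sketch below runs the same query against org.Hs.eg.db twice, once with keytype given and once without. It assumes the keys are Entrez Gene IDs, which is the central key of this package; the variable names are arbitrary.

```r
library(org.Hs.eg.db)

entrez <- c("672", "675")  # Entrez Gene IDs for BRCA1 and BRCA2

# keytype given explicitly
with_keytype <- AnnotationDbi::select(org.Hs.eg.db, keys = entrez,
                                      columns = "SYMBOL", keytype = "ENTREZID")

# keytype omitted: the keys are taken to be the central key
without_keytype <- AnnotationDbi::select(org.Hs.eg.db, keys = entrez,
                                         columns = "SYMBOL")

# Compare the two results
identical(with_keytype, without_keytype)
```

Writing the call as AnnotationDbi::select avoids clashes with other packages that also export a function named select.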

Simple Example

The data in the airway package is a RangedSummarizedExperiment constructed from an RNA-Seq experiment. Let's map the Ensembl gene identifiers to gene symbols.

library(airway)
library(org.Hs.eg.db)
data(airway)
ids = head(rownames(airway))
ids
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
select(org.Hs.eg.db, ids, "SYMBOL", "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
##           ENSEMBL   SYMBOL
## 1 ENSG00000000003   TSPAN6
## 2 ENSG00000000005     TNMD
## 3 ENSG00000000419     DPM1
## 4 ENSG00000000457    SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938      FGR
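
In practice the mapped symbols are usually stored next to the data rather than just printed. The sketch below is one possible way to do that, assuming the symbols should go into the row metadata of the airway object; the column name `symbol` is an arbitrary choice, and the lookup uses mapIds so that exactly one value comes back per gene.

```r
library(airway)
library(org.Hs.eg.db)
data(airway)

# One symbol per Ensembl gene ID; IDs without a match are expected to be NA
symbols <- mapIds(org.Hs.eg.db, keys = rownames(airway),
                  column = "SYMBOL", keytype = "ENSEMBL")

# Store the symbols alongside the other per-gene metadata
rowData(airway)$symbol <- symbols
head(rowData(airway)$symbol)
```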

Questions!

How do you know what the central keys are?

  • If it's a ChipDb, the central keys are the manufacturer's probe IDs

  • It's sometimes in the name - org.Hs.eg.db, where 'eg' means Entrez Gene ID

  • You can see examples using e.g., head(keys(annopkg)), and infer from that

  • But note that it's never necessary to know the central key, as long as you specify the keytype
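
A short sketch of the keys() approach from the list above, using org.Hs.eg.db as the example package:

```r
library(org.Hs.eg.db)

# With no keytype, keys() returns the central keys (Entrez Gene IDs here)
central <- head(keys(org.Hs.eg.db))
central

# With a keytype, keys() returns keys of that type instead
head(keys(org.Hs.eg.db, keytype = "SYMBOL"))
```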

More questions!

What keytypes or columns are available for a given annotation package?

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

Another example

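There is one issue with select, however: when a key maps to more than one value, it returns one row per match, so the output can have more rows than there are keys.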

brca <- c("BRCA1", "BRCA2")
select(org.Hs.eg.db, brca, c("MAP", "ONTOLOGY"), "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
##   SYMBOL      MAP ONTOLOGY
## 1  BRCA1 17q21.31       BP
## 2  BRCA1 17q21.31       CC
## 3  BRCA1 17q21.31       MF
## 4  BRCA2  13q13.1       BP
## 5  BRCA2  13q13.1       CC
## 6  BRCA2  13q13.1       MF
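
A quick way to spot this situation is to compare the number of rows returned with the number of keys supplied. The sketch below repeats the query above and counts the rows per key:

```r
library(org.Hs.eg.db)

brca <- c("BRCA1", "BRCA2")
res <- select(org.Hs.eg.db, brca, c("MAP", "ONTOLOGY"), "SYMBOL")

# More rows come back than keys went in, so the result cannot simply be
# column-bound to a table that has one row per gene
nrow(res) > length(brca)

# Count how many rows each key produced
table(res$SYMBOL)
```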

The mapIds function

An alternative to select is mapIds, which gives control of duplicates

  • Same arguments as select with slight differences

    • The columns argument can only specify one column

    • The keytype argument must be specified

    • An additional argument, multiVals, is used to control duplicates

mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
## BRCA1 BRCA2 
##  "BP"  "BP"

Choices for multiVals

Default is first, where we just choose the first of the duplicates. Other choices are list, CharacterList, filter, asNA or a user-specified function.

mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL", multiVals = "list")
## 'select()' returned 1:many mapping between keys and columns
## $BRCA1
## [1] "BP" "CC" "MF"
## 
## $BRCA2
## [1] "BP" "CC" "MF"
mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL", multiVals = "CharacterList")
## 'select()' returned 1:many mapping between keys and columns
## CharacterList of length 2
## [["BRCA1"]] BP CC MF
## [["BRCA2"]] BP CC MF
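
Two of the remaining choices can be sketched in the same way. The helper `collapse_matches` below is a hypothetical example of a user-specified function, not something provided by AnnotationDbi; it is applied to the matches for each key and should return a single value.

```r
library(org.Hs.eg.db)

brca <- c("BRCA1", "BRCA2")

# "asNA": keys with more than one match are returned as NA
as_na <- mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL", multiVals = "asNA")

# A user-specified function: join all matches for a key into one string
collapse_matches <- function(x) paste(x, collapse = ";")
collapsed <- mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL",
                    multiVals = collapse_matches)

as_na
collapsed
```

Collapsing keeps every match while still returning one value per key, which is convenient when the result has to fit into a single table column.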

What about positional annotation?

TxDb packages

TxDb packages contain positional information; the contents can be inferred from the package name

TxDb.Species.Source.Build.Table

  • TxDb.Hsapiens.UCSC.hg19.knownGene

    • Homo sapiens

    • UCSC genome browser

    • hg19 (their version of GRCh37)

    • knownGene table

  • TxDb.Dmelanogaster.UCSC.dm3.ensGene

  • TxDb.Athaliana.BioMart.plantsmart22
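
The naming scheme can be taken apart mechanically. The sketch below uses a hypothetical helper, `parse_txdb_name`, written only for illustration; it splits a package name on dots and labels the pieces in the order given above. Names with fewer fields, such as the BioMart-based one, get only approximate labels.

```r
# Split a TxDb package name into its dot-separated fields
parse_txdb_name <- function(pkg) {
  parts <- strsplit(pkg, ".", fixed = TRUE)[[1]]
  fields <- c("class", "species", "source", "build", "table")
  # Label only as many fields as the name actually has
  setNames(parts, fields[seq_along(parts)])
}

parse_txdb_name("TxDb.Hsapiens.UCSC.hg19.knownGene")
parse_txdb_name("TxDb.Athaliana.BioMart.plantsmart22")
```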

Transcript packages

As with ChipDb and OrgDb packages, select and mapIds can be used to make queries

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)
##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"
select(TxDb.Hsapiens.UCSC.hg19.knownGene, c("1","10"),
       c("TXNAME","TXCHROM","TXSTART","TXEND"), "GENEID")
## 'select()' returned 1:many mapping between keys and columns
##   GENEID     TXNAME TXCHROM  TXSTART    TXEND
## 1      1 uc002qsd.4   chr19 58858172 58864865
## 2      1 uc002qsf.2   chr19 58859832 58874214
## 3     10 uc003wyw.1    chr8 18248755 18258723

But using select and mapIds is not how one normally uses TxDb objects…