Contents

Original Authors: Martin Morgan, Sonali Arora
Presenting Authors: Martin Morgan, Lori Shepherd
Date: 12 June, 2017
Back: Monday labs

Objective: Learn about Bioconductor resources for gene and genome annotation.

Lessons learned:

1 Gene annotation

1.1 Data packages

Organism-level (‘org’) packages contain mappings between a central identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form org.<Sp>.<id>.db (e.g. org.Sc.sgd.db) where <Sp> is a 2-letter abbreviation of the organism (e.g. Sc for Saccharomyces cerevisiae) and <id> is an abbreviation (in lower-case) describing the type of central identifier (e.g. sgd for gene identifiers assigned by the Saccharomyces Genome Database, or eg for Entrez gene ids). The “How to use the ‘.db’ annotation packages” vignette in the AnnotationDbi package (org packages are only one type of “.db” annotation packages) is a key reference. The ‘.db’ and most other Bioconductor annotation packages are updated every 6 months.

Annotation packages usually contain an object named after the package itself. These objects are collectively called AnnotationDb objects, with more specific classes named OrgDb, ChipDb or TranscriptDb objects. Methods that can be applied to these objects include cols(), keys(), keytypes() and select(). Common operations for retrieving annotations are summarized in the table.

Category Function Description
Discover columns() List the kinds of columns that can be returned
keytypes() List columns that can be used as keys
keys() List values that can be expected for a given keytype
select() Retrieve annotations matching keys, keytype and columns
Manipulate setdiff(), union(), intersect() Operations on sets
duplicated(), unique() Mark or remove duplicates
%in%, match() Find matches
any(), all() Are any TRUE? Are all?
merge() Combine two different based on shared keys
GRanges* transcripts(), exons(), cds() Features (transcripts, exons, coding sequence) as GRanges.
transcriptsBy() , exonsBy() Features group by gene, transcript, etc., as GRangesList.
cdsBy()

1.2 Internet resources

A short summary of select Bioconductor packages enabling web-based queries is in following Table.

Package Description
AnnotationHub Ensembl, Encode, dbSNP, UCSC data objects
biomaRt Ensembl and other annotations
PSICQUIC Protein interactions
uniprot.ws Protein annotations
KEGGREST KEGG pathways
SRAdb Sequencing experiments.
rtracklayer genome tracks.
GEOquery Array and other data
ArrayExpress Array and other data

1.3 Exercises

Exercise: This exercise illustrates basic use of the `select’ interface to annotation packages.

  1. What is the name of the org package for Homo sapiens? Load it. Display the OrgDb object for the org.Hs.eg.db package. Use the columns() method to discover which sorts of annotations can be extracted from it.
  2. Use the keys() method to extract ENSEMBL identifiers and then pass those keys in to the select() method in such a way that you extract the SYMBOL (gene symbol) and GENENAME information for each. Use the following ENSEMBL ids.
ensids <- c("ENSG00000130720", "ENSG00000103257", "ENSG00000156414", 
            "ENSG00000144644", "ENSG00000159307", "ENSG00000144485")

Solution The OrgDb object is named org.Hs.eg.db.

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
##  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [19] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
##  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [19] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
cols <- c("SYMBOL", "GENENAME")
select(org.Hs.eg.db, keys=ensids, columns=cols, keytype="ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
##           ENSEMBL SYMBOL                                                    GENENAME
## 1 ENSG00000130720 FIBCD1                            fibrinogen C domain containing 1
## 2 ENSG00000103257 SLC7A5                            solute carrier family 7 member 5
## 3 ENSG00000156414  TDRD9                                   tudor domain containing 9
## 4 ENSG00000144644  GADL1                              glutamate decarboxylase like 1
## 5 ENSG00000159307 SCUBE1 signal peptide, CUB domain and EGF like domain containing 1
## 6 ENSG00000144485   HES6                      hes family bHLH transcription factor 6

Exercise

Internet access required for this exercise

  1. Load the biomaRt package and list the available marts. Choose the ensembl mart and list the datasets for that mart. Set up a mart to use the ensembl mart and the hsapiens gene ensembl dataset.
  2. A biomaRt dataset can be accessed via getBM(). In addition to the mart to be accessed, this function takes filters and attributes as arguments. Use filterOptions() and listAttributes() to discover values for these arguments. Call getBM() using filters and attributes of your choosing.

Solution

## NEEDS INTERNET ACCESS !!
library(biomaRt)
head(listMarts(), 3)                      ## list the marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <-                                ## fully specified mart
    useMart("ensembl", dataset = "hsapiens_gene_ensembl")

head(listFilters(ensembl), 3)             ## filters
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3)          ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")

## assemble and query the mart
res <- getBM(attributes =  myAttributes, filters =  myFilter,
             values =  myValues, mart = ensembl)

Exercise

As an optional exercise to be completed after Tuesday’s lab, annotate the genes that are differentially expressed in the DESeq2 laboratory, e.g., find the GENENAME associated with the five most differentially expressed genes. Do these make biological sense? Can you merge() the annotation results with the `top table’ results to provide a statistically and biologically informative summary?

2 Genome annotation

There are a diversity of packages and classes available for representing large genomes. Several include:

2.1 Transcript annotation packages

Genome-centric packages are very useful for annotations involving genomic coordinates. It is straight-forward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.

2.2 rtracklayer

The rtracklayer package allows us to query the UCSC genome browser, as well as providing import() and export() functions for common annotation file formats like GFF, GTF, and BED. The exercise below illustrates some of the functionality of rtracklayer.

2.3 Exercises

Exercise

This exercise uses annotation resources to go from a gene symbol ‘BRCA1’ through to the genomic coordinates of each transcript associated with the gene, and finally to the DNA sequences of the transcripts. This can be achieved using an EnsDb package along with a BSgenome package, or with a combination of TxDb, Homo.sapiens and BSgenome packages. We will focus here on the former approach.

  1. Use the cdsBy() function to retrieve the genomic coordinates of all coding sequences for the gene ‘BRCA1’ from the EnsDb.Hsapiens.v75 package. To retrieve only data for the specified gene, submit either a GenenameFilter or a filter formula/expression to the function’s filter parameter. This avoids to extract the coding region for all genes, which takes a long time.

  2. Visualize the transcripts in genomic coordinates using the Gviz package to construct a GeneRegionTrack, and plotting it using plotTracks().

  3. Use the Bsgenome.Hsapiens.UCSC.hg19 package and extractTranscriptSeqs() function to extract the DNA sequence of each transcript.

Solution

Retrieve the coding sequences grouped by transcript for the gene of interest and verify that each coding sequence is a multiple of 3.

library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

brca1cds <- cdsBy(edb, by = "tx", filter = ~ genename == "BRCA1")

class(brca1cds)
## [1] "GRangesList"
## attr(,"package")
## [1] "GenomicRanges"
length(brca1cds)
## [1] 29
brca1cds[[1]]                           # exons in cds
## GRanges object with 14 ranges and 3 metadata columns:
##        seqnames               ranges strand |   gene_name         exon_id exon_rank
##           <Rle>            <IRanges>  <Rle> | <character>     <character> <integer>
##    [1]       17 [41243452, 41246659]      - |       BRCA1 ENSE00003549219         9
##    [2]       17 [41242961, 41243049]      - |       BRCA1 ENSE00003547126        10
##    [3]       17 [41234421, 41234592]      - |       BRCA1 ENSE00003527960        11
##    [4]       17 [41228505, 41228631]      - |       BRCA1 ENSE00003484886        12
##    [5]       17 [41226348, 41226538]      - |       BRCA1 ENSE00003537850        13
##    ...      ...                  ...    ... .         ...             ...       ...
##   [10]       17 [41209069, 41209152]      - |       BRCA1 ENSE00003458468        18
##   [11]       17 [41203080, 41203134]      - |       BRCA1 ENSE00003477922        19
##   [12]       17 [41201138, 41201211]      - |       BRCA1 ENSE00003628864        20
##   [13]       17 [41199660, 41199720]      - |       BRCA1 ENSE00003687053        21
##   [14]       17 [41197695, 41197819]      - |       BRCA1 ENSE00001312675        22
##   -------
##   seqinfo: 1 sequence from GRCh37 genome
cdswidth <- width(brca1cds)             # width of each exon
all((sum(cdswidth) %% 3) == 0)          # sum within cds, modulus 3
## [1] FALSE

The CDS for some transcripts is not of the expected length, how comes? Get the transcript ID of the first transcript that does have a CDS of the wrong size and look this transcript up in the Ensembl genome browser (http://www.ensembl.org).

tx_cds_fail <- names(brca1cds)[(sum(cdswidth) %% 3) != 0]

length(tx_cds_fail)
## [1] 8
tx_cds_fail[1]
## [1] "ENST00000412061"

In the description of the transcript it says CDS 5’ incomplete. Thus, in addition to known protein coding transcripts, Ensembl provides also annotations for transcripts known to be targeted for nonsense mediated mRNA decay or that have incomplete CDS. Such transcripts would however not be listed in e.g. the TxDb.Hsapiens.UCSC.hg19.knownGene package.

Next we visualize the BRCA1 transcripts using Gviz (this package has an excellent vignette, vignette("Gviz"))

library(Gviz)

## Use the function from the ensembldb package to extract the data in the
## format suitable for Gviz
grt <- getGeneRegionTrackForGviz(edb, filter = ~genename == "BRCA1")
plotTracks(list(GenomeAxisTrack(), GeneRegionTrack(grt)))

Extract the coding sequences of each transcript. EnsDb databases provide annotations from Ensembl and use hence Ensembl style chromosome names (such as “Y”) while the BSgenome package is based on UCSC annotations that use a naming style that prepends a “chr” to each chromosome name (e.g. “chrY”). Change thus the seqlevelsStyle from the default UCSC chromosome naming to Ensembl naming style.

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19

## Change the seqlevelsStyle from UCSC to Ensembl
seqlevelsStyle(genome) <- "Ensembl"
tx_seq <- extractTranscriptSeqs(genome, brca1cds)
tx_seq
##   A DNAStringSet instance of length 29
##      width seq                                                                  names               
##  [1]  4704 ATGAATGTAGAAAAGGCTGAATTCTGTAATAAA...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000309486
##  [2]  4875 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000346315
##  [3]  2043 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000351666
##  [4]  2166 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000352993
##  [5]  4797 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000354071
##  ...   ... ...
## [25]  1419 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...AAACCTATCGGAAGAAGGCAAGCCTCCCCAAC ENST00000494123
## [26]   531 ATGAATGTAGAAAAGGCTGAATTCTGTAATAAA...AAACCTATCGGAAGAAGGCAAGCCTCCCCAAC ENST00000497488
## [27]   522 ATGGATGCTGAGTTTGTGTGTGAACGGACACTG...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000586385
## [28]  1065 ATGCACAGTTGCTCTGGGAGTCTTCAGAATAGA...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000591534
## [29]   291 ATGAGTGACAGCAAGAAAACCTGGCTGCAATAT...TGATACCCCAGATCCCCCACAGCCACTACTGA ENST00000591849

We can also inspect the CDS sequence for the transcripts with incomplete CDS. Many of them do not start with a start codon hence indicating that the CDS is incomplete on their 5’ end.

tx_seq[tx_cds_fail]
##   A DNAStringSet instance of length 8
##     width seq                                                                   names               
## [1]  1312 GTTTGGATTCTGCAAAAAAGGCTGCTTGTGAAT...TTGTTCTAGCAGTGAAGAGATAAAGAAAAAAAA ENST00000412061
## [2]   958 TTCAGCTTGACACAGGTTTGGAGTATGCAAACA...AAGTGAAAGAGTTCACTCCAAATCAGTAGAGAG ENST00000473961
## [3]   667 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...GGGATGAAATCAGTTTGGATTCTGCAAAAAAGG ENST00000476777
## [4]  1867 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...AATTGCAAATTGATAGTTGTTCTAGCAGTGAAG ENST00000477152
## [5]  1870 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TCTGGGTGACCCAGTCTATTAAAGAAAGAAAAA ENST00000478531
## [6]  1495 GAGCTATTGAAAATCATTTGTGCTTTTCAGCTT...TCTGGGTGACCCAGTCTATTAAAGAAAGAAAAA ENST00000484087
## [7]   800 GAGCTATTGAAAATCATTTGTGCTTTTCAGCTT...TACCAGTAAAAATAAAGAACCAGGAGTGGAAAG ENST00000487825
## [8]   296 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TCCTGAACATCTAAAAGATGAAGTTTCTATCAT ENST00000489037

Intron coordinates can be identified by first calculating the range of the genome (from the start of the first exon to the end of the last exon) covered by each transcript, and then taking the (algebraic) set difference between this and the genomic coordinates covered by each exon

introns <- psetdiff(unlist(range(brca1cds)), brca1cds)

Retrieve the intronic sequences with getSeq() (these are not assembled, the way that extractTranscriptSeqs() assembles exon sequences into mature transcripts); note that introns start and end with the appropriate acceptor and donor site sequences. Unfortunately, UCSC and Ensembl do also use different names for the genome assembly. Change the genome name for the introns object to matche the one from the genome object.

unique(genome(genome))
## [1] "hg19"
genome(introns)
##       17 
## "GRCh37"
## Change the genome name on introns to macht the one from the
## BSgenome package
genome(introns) <- c(`17` = unique(genome(genome)))

seq <- getSeq(genome, introns)
names(seq)
##  [1] "ENST00000309486" "ENST00000346315" "ENST00000351666" "ENST00000352993" "ENST00000354071"
##  [6] "ENST00000357654" "ENST00000412061" "ENST00000461221" "ENST00000461574" "ENST00000461798"
## [11] "ENST00000468300" "ENST00000470026" "ENST00000471181" "ENST00000473961" "ENST00000476777"
## [16] "ENST00000477152" "ENST00000478531" "ENST00000484087" "ENST00000487825" "ENST00000489037"
## [21] "ENST00000491747" "ENST00000492859" "ENST00000493795" "ENST00000493919" "ENST00000494123"
## [26] "ENST00000497488" "ENST00000586385" "ENST00000591534" "ENST00000591849"
seq[["ENST00000352993"]]                     # 20 introns
##   A DNAStringSet instance of length 20
##      width seq
##  [1]  1840 GTAAGGTGCCTGCATGTACCTGTGCTATATGGGGTCCTTTTGC...GAATGAATTGACACTAATCTCTGCTTGTGTTCTCTGTCTCCAG
##  [2]  1417 GTAAGTATTGGGTGCCCTGTCAGAGAGGGAGGACACAATATTC...CTACTTTGACACTTTGAATGCTCTTTCCTTCCTGGGGATCCAG
##  [3]  1868 GTAAGAGCCTGGGAGAACCCCAGAGTTCCAGCACCAGCCTTTG...TGCAGATTACTGCAGTGATTTTACATCTAAATGTCCATTTTAG
##  [4]  5934 GTAAAGCTCCCTCCCTCAAGTTGACAAAAATCTCACCCCACCA...TCTCTCCATTCCCCTGTCCCTCTCTCTTCCTCTCTTCTTCCAG
##  [5]  6197 GTAAGTACTTGATGTTACAAACTAACCAGAGATATTCATTCAG...TCTCTTTCTCTTATCCTGATGGGTTGTGTTTGGTTTCTTTCAG
##  ...   ... ...
## [16]  4241 GTAAAACCATTTGTTTTCTTCTTCTTCTTCTTCTTCTTTTCTT...ACTGGCCAACAATTGCTTGACTGTTCTTTACCATACTGTTTAG
## [17]   606 GTAAGTGTTGAATATCCCAAGAATGACACTCAAGTGCTGTCCA...TTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAG
## [18]  1499 GTATATAATTTGGTAATGATGCTAGGTTGGAAGCAACCACAGT...ATAATCACTTGCTGAGTGTGTTTCTCAAACAATTTAATTTCAG
## [19]  9192 GTAAGTTTGAATGTGTTATGTGGCTCCATTATTAGCTTTTGTT...AGGAAGTAAATTAAATTGTTCTTTCTTTCTTTATAATTTATAG
## [20]  8237 GTAAGTCAGCACAAGAGTGTATTAATTTGGGATTCCTATGATT...TGAGCCTCATTTATTTTCTTTTTCTCCCCCCCTACCCTGCTAG

Exercise

Internet access required for this exercise

Here we use rtracklayer to retrieve estrogen receptor binding sites identified across cell lines in the ENCODE project. We focus on binding sites in the vicinity of a particularly interesting region of interest.

  1. Define our region of interest by creating a GRanges instance with appropriate genomic coordinates. Our region corresponds to 10Mb up- and down-stream of a particular gene.
  2. Create a session for the UCSC genome browser
  3. Query the UCSC genome browser for ENCODE estrogen receptor ERalphaa transcription marks; identifying the appropriate track, table, and transcription factor requires biological knowledge and detective work.
  4. Visualize the location of the binding sites and their scores; annotate the mid-point of the region of interest.

Solution

Define the region of interest

library(GenomicRanges)
roi <- GRanges("chr10", IRanges(92106877, 112106876, names="ENSG00000099194"))

Create a session

library(rtracklayer) 
session <- browserSession()

Query the UCSC for a particular track, table, and transcription factor, in our region of interest

trackName <- "wgEncodeRegTfbsClusteredV2"
tableName <- "wgEncodeRegTfbsClusteredV2"
trFactor <- "ERalpha_a"
ucscTable <- getTable(ucscTableQuery(session, track=trackName,
    range=roi, table=tableName, name=trFactor))

Visualize the result

plot(score ~ chromStart, ucscTable, pch="+")
abline(v=start(roi) + (end(roi) - start(roi) + 1) / 2, col="blue")

3 AnnotationHub

AnnotationHub is a data base of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms, and the NHLBI grasp2db data base of GWAS results. There are many interesting ways in which these resources can be used. Examples include

Unfortunately, AnnotationHub makes extensive use of internet resources and so we will not pursue it in this course; see the vignettes that come with the pacakge, for instance AnnotationHub HOW-TOs.

4 Annotating variants

Bioconductor provides facilities for reading VCF files. These work very well with the annotation resources described above, so for instance it is straight-forward to identify variants in coding or other regions of interest.

To develop a sense of the capabilities available, work through the VariantAnnotation vignette ‘Introduction to Variant Annotation’, and the VariantFiltering vignette.