--- title: "4. Adding Annotation To Your Analysis" author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{4. Adding Annotation To Your Analysis} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(AnnotationDbi) library(AnnotationHub) library(GenomicFeatures) library(biomaRt) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) }) ``` The material in this course requires R version 3.2 and Bioconductor version 3.2 ```{r configure-test} stopifnot( getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() == "3.2" ) ``` # Annotation ## Model organisms ### Gene model annotation resources -- `TxDb` packages e.g., `{r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene")} ```{r gene-model-discovery} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene txdb methods(class=class(txdb)) ``` `TxDb` objects - Curatated annotation resources -- http://bioconductor.org/packages/biocViews - Underlying sqlite database -- `dbfile(txdb)` - Make your own: `GenomicFeatures::makeTxDbFrom*()` Accessing gene models - `exons()`, `transcripts()`, `genes()`, `cds()` (coding sequence) - `promoters()` & friends - `exonsBy()` & friends -- exons by gene, transcript, ... - 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`, `mapIds()` ```{r txdb-exons} exons(txdb) exonsBy(txdb, "tx") ``` ### Whole-genome sequence -- `BSgenome` packages e.g., `{r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")} ```{r bsgenome} library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 getSeq(genome, exons(txdb)[1:100]) ``` ### Identifier mapping -- `OrgDb` ```{r org} library(org.Hs.eg.db) org.Hs.eg.db ``` `OrgDb` objects - Curated resources, underlying sqlite data base, like `TxDb` - make your own: [AnnotationForge][] (but see the AnnotationHub, below!) - 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`, `mapIds()` `select()` - Vector of keys, desired columns - Specification of key type ```{r select} select(org.Hs.eg.db, c("BRCA1", "PTEN"), c("ENTREZID", "GENENAME"), "SYMBOL") keytypes(org.Hs.eg.db) columns(org.Hs.eg.db) ``` Related functionality - `mapIds()` -- special case for mapping from 1 identifier to another - `OrganismDb` objects: combined `org.*`, `TxDb.*`, and other annotation resources for easy access ```{r organismdb} library(Homo.sapiens) select(Homo.sapiens, c("BRCA1", "PTEN"), c("TXNAME", "TXCHROM", "TXSTART", "TXEND"), "SYMBOL") ``` ## Other annotation resources -- `biomaRt`, `AnnotationHub` ### _biomaRt_ & friends http://biomart.org; _Bioconductor_ package [biomaRt][] ```{r biomart, eval=FALSE} ## NEEDS INTERNET ACCESS !! library(biomaRt) head(listMarts(), 3) ## list marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" substr(filterOptions(myFilter, ensembl), 1, 50) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl) ``` Other internet resources - [biomaRt](http://biomart.org) Ensembl and other annotations - [PSICQUIC](https://code.google.com/p/psicquic) Protein interactions - [uniprot.ws](http://uniprot.org) Protein annotations - [KEGGREST](http://www.genome.jp/kegg) KEGG pathways - [SRAdb](http://www.ncbi.nlm.nih.gov/sra) Sequencing experiments - [rtracklayer](http://genome.ucsc.edu) USCS genome tracks - [GEOquery](http://www.ncbi.nlm.nih.gov/geo/) Array and other data - [ArrayExpress](http://www.ebi.ac.uk/arrayexpress/) Array and other data - ... ### _AnnotationHub_ - _Bioconductor_ package [AnnotationHub][] - Meant to ease use of 'consortium' and other genome-scale resources - Simplify discovery, retrieval, local management, and import to standard _Bioconductor_ representations Example: Ensembl 'GTF' files to _R_ / _Bioconductor_ GRanges and TxDb ```{r annotationhub-gtf, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() hub query(hub, c("Ensembl", "80", "gtf")) ## ensgtf = display(hub) # visual choice hub["AH47107"] gtf <- hub[["AH47107"]] gtf txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf) ``` Example: non-model organism `OrgDb` packages ```{r annotationhub-orgdb, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() query(hub, "OrgDb") ``` Example: Map Roadmap epigenomic marks to hg38 - Roadmap BED file as _GRanges_ ```{r annotationhub-roadmap, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() query(hub , c("EpigenomeRoadMap", "E126", "H3K4ME2")) E126 <- hub[["AH29817"]] ``` - UCSC 'liftOver' file to map coordinates ```{r annotationhub-liftover, eval=FALSE} query(hub , c("hg19", "hg38", "chainfile")) chain <- hub[["AH14150"]] ``` - lift over -- possibly one-to-many mapping, so _GRanges_ to _GRangesList_ ```{r liftover, eval=FALSE} library(rtracklayer) E126hg38 <- liftOver(E126, chain) E126hg38 ``` # Annotating Variants Example: read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model library(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` # Resources Acknowledgements - Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum. - The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ``` [AnnotationHub]: http://bioconductor.org/packages/AnnotationHub [TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [ChIPseeker]: http://bioconductor.org/packages/ChIPseeker [VariantFiltering]: http://bioconductor.org/packages/VariantFiltering [AnnotationForge]: http://bioconductor.org/packages/AnnotationForge [biomaRt]: http://bioconductor.org/packages/biomaRt