--- title: B. Genomic Ranges author: Martin Morgan (mtmorgan@fredhutch.org) date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{B. Genomic Ranges} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r style, echo=FALSE, results='asis'} BiocStyle::markdown() suppressPackageStartupMessages({ library(GenomicAlignments) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) }) ``` # Motivation Genomic ranges describe... - Annotations, e.g., exons, genes, binding sites, .. - Data, e.g., aligned reads, called peaks, copy number regions - See Lawrence et al., Software for computing and annotating genomic ranges, PLoS Comput Biol 9(8): [e1003118](http://dx.doi.org/10.1371/journal.pcbi.1003118) Packages - `r Biocpkg("GenomicRanges")` for essential genomic ranges; depends on `r Biocpkg("IRanges")`, `r Biocpkg("S4Vectors")` and other packages - `r Biocpkg("GenomicAlignments")` for aligned reads using genomic range-based concepts. Depends on `r Biocpkg("Rsamtools")`, `r Biocpkg("Biostrings")` and other packages ```{r packages} library(GenomicRanges) library(GenomicAlignments) sessionInfo() ``` ## Use cases `GRanges`: simple genomic ranges - e.g., exons, binding sites - e.g., called peaks, reads aligned without gaps - _vector_ of genomic ranges - _metadata_ `mcols()` of associated data, e.g., 'score', 'id', ... ![](our_figures/GRanges.png) - `seqname()`, e.g., chromosome, but could be, e.g., contig, ... - `start()`, `end()`: 1-based, closed intervals - `strand()`: +, -, or * (does not matter) - `mcols()` `GRangesList`: nested genomic ranges - e.g., exons-within-transcripts - e.g., aligned reads with gaps; paired-end reads ![](our_figures/GRangesList.png) - Accessors resturn `*List` objects: lists, but all elements of the same type. E.g., `start()` returns an `IntegerList()`. - `unlist()` # Range-based operations ![](our_figures/RangeOperations.png) Intra-range operations - e.g., `range()`, `flank()` Inter-range operations - e.g., `reduce()`, `disjoin()` Between-object - e.g., `psetdiff()`, `findOverlaps()`, `countOverlaps()` ![](our_figures/journal.pcbi.1003118.t001.png) PLoS Comput Biol 9(8): [e1003118](http://dx.doi.org/10.1371/journal.pcbi.1003118) # Working with _Bioconductor_ classes and methods What can I do with my `GRanges` instance? ```{r methods-class} methods(class="GRanges") ``` What type of object(s) can I use `findOverlaps()` on (what _methods_ exist for the `findOverlaps()` _generic_)? ```{r methods-generic} methods(findOverlaps) ``` How can I get help on functions, generics, and methods? ```{r help, eval=FALSE} ?"findOverlaps" ## generic ?"findOverlaps," ## specific method ``` Other help? - Vignettes, e.g., `browseVignettes("GenomicRanges")` - Work flows, vidoes, training material on the Bioconductor [web site](http://bioconductor.org/) - Questions and answers on the Bioconductor [support site](https://support.bioconductor.org) # Important parts of the sequence class menagerie `GAlignments` and friends (`r Biocpkg("GenomicAlignments")`) - `GAlignments`: Single-end aligned reads, e.g., from BAM files - `GAlignmentPairs`, `GAlignmentsList`: paired-end aligned reads. `*Pairs` is more restrictive on what pairs can be represented `DNAString` and `DNAStringSet` (`r Biocpkg("Biostrings")`) `SummarizedExperiment` (`r Biocpkg("GenomicRanges")`) ![](our_figures/nmeth.3252-F2.jpg) - `assays` of rows (regions of interest; genomic ranges) x columns (samples, including integrated phenotypic information) `TxDb` (`r Biocpkg("AnnotationDb")`) - Coordinating transcript-level descriptions derived, e.g., from GTF, UCSC, Biomart - `transcripts()` interface - `select()` interface `VCF` (`r Biocpkg("VariantAnnotation")`) Lower-level classes - `DataFrame` (`r Biocpkg("S4Vectors")`) (like a `data.frame`, but can contain S4 objects) - `IRanges` (`r Biocpkg("IRanges")`), `Rle` (`r Biocpkg("S4Vectors")`) ![](our_figures/FilesToPackages.png) # Deeper understanding ## Classes and class hierarchies - 'Ideally', the user can remain ignorant of how classes are implemented. Actually, it sometimes helps! - Slots - Contained classes _R_ works efficiently on vectors - Represent `GRanges` as a collection of vectors, not as a collection of records ```{r getClass-GRanges} getClass("GRanges") ``` ![](our_figures/GRangesImplementation.png) ## `Vector` and `Annotated` - `[`, `length()`, `names()`, etc. - `mcols()` ## `List`-like - `[[` - `elementLengths()` Implementation: `Vector` plus partitioning - Consequence: `unlist()` and `relist()` are very inexpensive ![](our_figures/GRangesListImplementation.png) # Practical ## 1. Exon and transcript characterization Ingredients - `r Biocpkg("TxDb.Hsapiens.UCSC.hg19")` TxDb package - `exons()`, and `exonsBy()` functions - `width()`, `elementLengths()` accessors - `hist()` Goals - Histogram of exon widths - Histogram of transcript widths - Transcripts with exactly one exon - Transcripts with maximum number of exons ## 2. GC content Ingredients - `r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")` BSGenome package - `r Biocpkg("TxDb.Hsapiens.UCSC.hg19")` TxDb package - `?"getSeq,BSgenome-method"`, `letterFrequency()` Goapls - GC content of exons, transcripts - GC content of non-coding regions ## 3. CpG islands Ingredients - `DNAStringSet()` to construct CG island sequence - `matchPDict()` to find CG islands on BSgenome chromosomes - ?? `coverage()`, `tileGenome()`, `Views()`, following [Hints](https://support.bioconductor.org/p/65281/) - ?? `tileGenome()`, `findOverlaps()`, `splitAsList()`, `mean()` Goal - Average number of CpG islands in 'tiles' across chr 17 ## 4. Aligned reads Ingredients - `r Biocexptpkg("RNAseqData.HNRNPC.bam.chr14")` and `RNAseqData.HNRNPC.bam.chr14_BAMFILES` - `readGAlignments()`, `readGAlignmentPairs()`, and `readGAlignmentsList()` - `BamFile()` Goals - Compare three input methods - Use `ScanBamParam()` `which` and `what` to selective input data - `countOverlaps()` between reads and known genes - Use `BamFile()` `yieldSize` argument to iterate through file