Contents

The material in this course requires R version 3.3 and Bioconductor version 3.4

stopifnot(
    getRversion() >= '3.3' && getRversion() < '3.4',
    BiocInstaller::biocVersion() == "3.4"
)

1 Bioconductor ‘infrastructure’ for sequence analysis

1.1 Classes, methods, and packages

This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.

1.2 Motivation

Sequence analysis is specialized

Additional considerations

Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.

2 Core packages

                   VariantAnnotation
                           |
                           v
                    GenomicFeatures
                           |
                           v
                       BSgenome
                           |
                           v
                      rtracklayer
                           |
                           v
                    GenomicAlignments
                      |           |
                      v           v
     SummarizedExperiment   Rsamtools  ShortRead
                  |         |      |      |
                  v         v      v      v
                GenomicRanges     Biostrings
                        |          |
                        v          v
               GenomeInfoDb   (XVector)
                        |     |
                        v     v
                        IRanges
                           |
                           v 
                      (S4Vectors)

3 Core classes

3.1 Case study: IRanges and GRanges

The IRanges package defines an important class for specifying integer ranges, e.g.,

library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        10        14         5
##   [2]        20        24         5
##   [3]        30        34         5

There are many interesting operations to be performed on ranges, e.g, flank() identifies adjacent ranges

flank(ir, 3)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         7         9         3
##   [2]        17        19         3
##   [3]        27        29         3

The IRanges class is part of a class hierarchy. To see this, ask R for the class of ir, and for the class definition of the IRanges class

class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClass(class(ir))
## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                                                       
## Name:            start           width           NAMES     elementType elementMetadata
## Class:         integer         integer characterORNULL       character DataTableORNULL
##                       
## Name:         metadata
## Class:            list
## 
## Extends: 
## Class "Ranges", directly
## Class "GRangesOrIRanges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges", "GroupingIRanges"

Notice that IRanges extends the Ranges class. Show

Now try entering ?flank (if not using RStudio, enter ?"flank,<tab>" where <tab> means to press the tab key to ask for tab completion). You can see that there are help pages for flank operating on several different classes. Select the completion

?"flank,Ranges-method" 

and verify that you’re at the page that describes the method relevant to an IRanges instance. Explore other range-based operations.

The GenomicRanges package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a GRanges instance based on our IRanges instance, as follows

library(GenomicRanges)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The notion of flanking sequence has a more nuanced meaning in biology. In particular we might expect that flanking sequence on the + strand would precede the range, but on the minus strand would follow it. Verify that flank applied to a GRanges object has this behavior.

flank(gr, 3)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Discover what classes GRanges extends, find the help page documenting the behavior of flank when applied to a GRanges object,

It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current search() path

methods(class="GRanges")
##   [1] !=                  $                   $<-                 %in%               
##   [5] <                   <=                  ==                  >                  
##   [9] >=                  BamViews            NROW                Ops                
##  [13] ROWNAMES            ScanBamParam        ScanBcfParam        [                  
##  [17] [<-                 aggregate           anyNA               append             
##  [21] as.character        as.complex          as.data.frame       as.env             
##  [25] as.factor           as.integer          as.list             as.logical         
##  [29] as.matrix           as.numeric          as.raw              bamWhich<-         
##  [33] blocks              browseGenome        by                  c                  
##  [37] chrom               chrom<-             coerce              coerce<-           
##  [41] countOverlaps       coverage            disjoin             disjointBins       
##  [45] distance            distanceToNearest   duplicated          elementMetadata    
##  [49] elementMetadata<-   end                 end<-               eval               
##  [53] expand              expand.grid         export              extractROWS        
##  [57] extractUpstreamSeqs findOverlaps        flank               follow             
##  [61] gaps                getPromoterSeq      granges             head               
##  [65] high2low            intersect           is.unsorted         isDisjoint         
##  [69] length              lengths             liftOver            locateVariants     
##  [73] mapFromAlignments   mapFromTranscripts  mapToAlignments     mapToTranscripts   
##  [77] match               mcols               mcols<-             merge              
##  [81] metadata            metadata<-          mstack              names              
##  [85] names<-             narrow              nearest             order              
##  [89] overlapsAny         parallelSlotNames   pcompare            pgap               
##  [93] pintersect          pmapFromAlignments  pmapFromTranscripts pmapToAlignments   
##  [97] pmapToTranscripts   precede             predictCoding       promoters          
## [101] psetdiff            punion              range               ranges             
## [105] ranges<-            rank                readVcf             reduce             
## [109] relist              relistToClass       rename              rep                
## [113] rep.int             replaceROWS         resize              restrict           
## [117] rev                 rowRanges<-         scanFa              scanTabix          
## [121] scanVcf             score               score<-             selfmatch          
## [125] seqinfo             seqinfo<-           seqlevelsInUse      seqnames           
## [129] seqnames<-          setdiff             setequal            shift              
## [133] shiftApply          show                showAsCell          slidingWindows     
## [137] sort                split               split<-             start              
## [141] start<-             strand              strand<-            subset             
## [145] subsetByOverlaps    summarizeOverlaps   summary             table              
## [149] tail                tapply              tile                trim               
## [153] union               unique              update              updateObject       
## [157] values              values<-            width               width<-            
## [161] window              window<-            with                xtabs              
## [165] xtfrm              
## see '?methods' for accessing help and source code

Notice that the available flank() methods have been augmented by the methods defined in the GenomicRanges package, including those that are relevant (via inheritance) to the GRanges class.

grep("flank", methods(class="GRanges"), value=TRUE)
## [1] "flank,GenomicRanges-method"

Verify that the help page documents the behavior we just observed.

?"flank,GenomicRanges-method"

Use help() to list the help pages in the GenomicRanges package, and vignettes() to view and access available vignettes; these are also available in the Rstudio ‘Help’ tab.

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")

3.2 GenomicRanges

3.2.1 The GRanges and GRangesList classes

Aside: ‘TxDb’ packages provide an R representation of gene models

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

exons(): GRanges

exons(txdb)
## GRanges object with 289969 ranges and 1 metadata column:
##                  seqnames         ranges strand |   exon_id
##                     <Rle>      <IRanges>  <Rle> | <integer>
##        [1]           chr1 [11874, 12227]      + |         1
##        [2]           chr1 [12595, 12721]      + |         2
##        [3]           chr1 [12613, 12721]      + |         3
##        [4]           chr1 [12646, 12697]      + |         4
##        [5]           chr1 [13221, 14409]      + |         5
##        ...            ...            ...    ... .       ...
##   [289965] chrUn_gl000241 [35706, 35859]      - |    289965
##   [289966] chrUn_gl000241 [36711, 36875]      - |    289966
##   [289967] chrUn_gl000243 [11501, 11530]      + |    289967
##   [289968] chrUn_gl000243 [13608, 13637]      + |    289968
##   [289969] chrUn_gl000247 [ 5787,  5816]      - |    289969
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Alt Genomic Ranges

exonsBy(): GRangesList

exonsBy(txdb, "tx")
## GRangesList object of length 82960:
## $1 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand |   exon_id   exon_name exon_rank
##          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 [11874, 12227]      + |         1        <NA>         1
##   [2]     chr1 [12613, 12721]      + |         3        <NA>         2
##   [3]     chr1 [13221, 14409]      + |         5        <NA>         3
## 
## $2 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12595, 12721]      + |       2      <NA>         2
##   [3]     chr1 [13403, 14409]      + |       6      <NA>         3
## 
## $3 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12646, 12697]      + |       4      <NA>         2
##   [3]     chr1 [13221, 14409]      + |       5      <NA>         3
## 
## ...
## <82957 more elements>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome

Alt Genomic Ranges List

GRanges / GRangesList are incredibly useful

  • Represent annotations – genes, variants, regulatory elements, copy number regions, …
  • Represent data – aligned reads, ChIP peaks, called variants, …

3.2.2 Algebra of genomic ranges

Many biologically interesting questions represent operations on ranges

  • Count overlaps between aligned reads and known genes – GenomicRanges::summarizeOverlaps()
  • Genes nearest to regulatory regions – GenomicRanges::nearest(), [ChIPseeker][]
  • Called variants relevant to clinical phenotypes – VariantFiltering

GRanges Algebra

  • Intra-range methods
    • Independent of other ranges in the same object
    • GRanges variants strand-aware
    • shift(), narrow(), flank(), promoters(), resize(), restrict(), trim()
    • See ?"intra-range-methods"
  • Inter-range methods
    • Depends on other ranges in the same object
    • range(), reduce(), gaps(), disjoin()
    • coverage() (!)
    • see ?"inter-range-methods"
  • Between-range methods
    • Functions of two (or more) range objects
    • findOverlaps(), countOverlaps(), …, %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()

Alt Ranges Algebra

3.3 Biostrings (DNA or amino acid sequences)

Classes

Methods –

Related packages

Example

3.4 GenomicAlignments (Aligned reads)

Classes – GenomicRanges-like behaivor

Methods

Example

3.5 VariantAnnotation (Called variants)

Classes – GenomicRanges-like behavior

Functions and methods

Example

Related packages

Reference

3.6 rtracklayer (Genome annotations)

3.7 SummarizedExperiment

Alt SummarizedExperiment

Functions and methods

4 Input & representation of standard file formats

4.1 BAM files of aligned reads – GenomicAlignments

Recall: overall workflow

  1. Experimental design
  2. Wet-lab preparation
  3. High-throughput sequencing
  4. Alignment
    • Whole genome, vs. transcriptome
  5. Summary
  6. Statistical analysis
  7. Comprehension

BAM files of aligned reads

GenomicAlignments

4.2 Other formats and packages

Alt Files and the Bioconductor packages that input them

5 Resources

Acknowledgements

The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

5.1 sessionInfo()

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
## [10] base     
## 
## other attached packages:
##  [1] airway_0.107.2                          BioC2016Introduction_0.0.3             
##  [3] Homo.sapiens_1.3.1                      GO.db_3.3.0                            
##  [5] OrganismDbi_1.15.1                      AnnotationHub_2.5.4                    
##  [7] Gviz_1.17.4                             biomaRt_2.29.2                         
##  [9] org.Hs.eg.db_3.3.0                      BiocParallel_1.7.4                     
## [11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.25.14                
## [13] AnnotationDbi_1.35.3                    VariantAnnotation_1.19.2               
## [15] RNAseqData.HNRNPC.bam.chr14_0.11.0      GenomicAlignments_1.9.4                
## [17] Rsamtools_1.25.0                        SummarizedExperiment_1.3.5             
## [19] Biobase_2.33.0                          BSgenome.Hsapiens.UCSC.hg19_1.4.0      
## [21] BSgenome_1.41.2                         rtracklayer_1.33.7                     
## [23] GenomicRanges_1.25.8                    GenomeInfoDb_1.9.1                     
## [25] Biostrings_2.41.4                       XVector_0.13.2                         
## [27] IRanges_2.7.11                          S4Vectors_0.11.7                       
## [29] BiocGenerics_0.19.1                     ggplot2_2.1.0                          
## [31] BiocStyle_2.1.10                       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.2.0                    splines_3.3.0                 Formula_1.2-1                
##  [4] shiny_0.13.2                  interactiveDisplayBase_1.11.3 latticeExtra_0.6-28          
##  [7] RBGL_1.49.1                   yaml_2.1.13                   RSQLite_1.0.0                
## [10] lattice_0.20-33               biovizBase_1.21.0             chron_2.3-47                 
## [13] digest_0.6.9                  RColorBrewer_1.1-2            colorspace_1.2-6             
## [16] htmltools_0.3.5               httpuv_1.3.3                  Matrix_1.2-6                 
## [19] plyr_1.8.4                    XML_3.98-1.4                  zlibbioc_1.19.0              
## [22] xtable_1.8-2                  scales_0.4.0                  nnet_7.3-12                  
## [25] survival_2.39-4               magrittr_1.5                  mime_0.4                     
## [28] evaluate_0.9                  foreign_0.8-66                graph_1.51.0                 
## [31] BiocInstaller_1.23.4          tools_3.3.0                   data.table_1.9.6             
## [34] formatR_1.4                   matrixStats_0.50.2            stringr_1.0.0                
## [37] munsell_0.4.3                 cluster_2.0.4                 ensembldb_1.5.8              
## [40] RCurl_1.95-4.8                dichromat_2.0-0               bitops_1.0-6                 
## [43] labeling_0.3                  rmarkdown_0.9.6               gtable_0.2.0                 
## [46] codetools_0.2-14              DBI_0.4-1                     reshape2_1.4.1               
## [49] R6_2.1.2                      gridExtra_2.2.1               knitr_1.13                   
## [52] Hmisc_3.17-4                  stringi_1.1.1                 Rcpp_0.12.5                  
## [55] rpart_4.1-10                  acepack_1.3-3.3