Contents

Original Authors: Martin Morgan, Sonali Arora
Presenting Authors: Martin Morgan, Lori Shepherd
Date: 12 June, 2017
Back: Monday labs

Objective: Learn the essentials of Bioconductor data structures

Lessons learned:

1 Classes, methods, and packages

This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.

1.1 Motivation

Sequence analysis is specialized

Additional considerations

Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.

2 Case study: IRanges and GRanges

The IRanges package defines an important class for specifying integer ranges, e.g.,

library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        10        14         5
##   [2]        20        24         5
##   [3]        30        34         5

There are many interesting operations to be performed on ranges, e.g, flank() identifies adjacent ranges

flank(ir, 3)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         7         9         3
##   [2]        17        19         3
##   [3]        27        29         3

The IRanges class is part of a class hierarchy. To see this, ask R for the class of ir, and for the class definition of the IRanges class

class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClass(class(ir))
## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                             
## Name:              start             width             NAMES
## Class:           integer           integer character_OR_NULL
##                                                             
## Name:        elementType   elementMetadata          metadata
## Class:         character DataTable_OR_NULL              list
## 
## Extends: 
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges", "GroupingIRanges"

Notice that IRanges extends the Ranges class. Now try entering ?flank (?"flank,<tab>" if not using RStudio, where <tab> means to press the tab key to ask for tab completion). You can see that there are help pages for flank operating on several different classes. Select the completion

?"flank,Ranges-method" 

and verify that you’re at the page that describes the method relevant to an IRanges instance. Explore other range-based operations.

The GenomicRanges package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a GRanges instance based on our IRanges instance, as follows

library(GenomicRanges)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The notion of flanking sequence has a more nuanced meaning in biology. In particular we might expect that flanking sequence on the + strand would precede the range, but on the minus strand would follow it. Verify that flank applied to a GRanges object has this behavior.

flank(gr, 3)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Discover what classes GRanges extends, find the help page documenting the behavior of flank when applied to a GRanges object, and verify that the help page documents the behavior we just observed.

class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
getClass(class(gr))
## Class "GRanges" [package "GenomicRanges"]
## 
## Slots:
##                                                                       
## Name:         seqnames          ranges          strand elementMetadata
## Class:             Rle         IRanges             Rle       DataFrame
##                                       
## Name:          seqinfo        metadata
## Class:         Seqinfo            list
## 
## Extends: 
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
?"flank,GenomicRanges-method"

Notice that the available flank() methods have been augmented by the methods defined in the GenomicRanges package.

It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current search() path

showMethods(class="GRanges", where=search())

Use help() to list the help pages in the GenomicRanges package, and vignettes() to view and access available vignettes; these are also available in the RStudio ‘Help’ tab.

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")

3 High-throughput sequence data

The following sections briefly summarize some of the most important file types in high-throughput sequence analysis. Briefly review these, or those that are most relevant to your research, before starting on the section Data Representation in R / Bioconductor

Alt Files and the Bioconductor packages that input them

3.1 DNA / amino acid sequences: FASTA files

Input & manipulation: Biostrings

>NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736
gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt
gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg
...
atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt
>NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454
ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg
cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg
...

3.2 Reads: FASTQ files

Input & manipulation: ShortRead readFastq(), FastqStreamer(), FastqSampler()

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################
    

3.3 Aligned reads: BAM files (e.g., ERR127306_chr14.bam)

Input & manipulation: ‘low-level’ Rsamtools, scanBam(), BamFile(); ‘high-level’ GenomicAlignments

3.4 Called variants: VCF files

Input and manipulation: VariantAnnotation readVcf(), readInfo(), readGeno() selectively with ScanVcfParam().

3.5 Genome annotations: BED, WIG, GTF, etc. files

Input: rtracklayer import()

4 Data representation in R / Bioconductor

This section briefly illustrates how different high-throughput sequence data types are represented in R / Bioconductor. Select relevant data types for your area of interest, and work through the examples. Take time to consult help pages, understand the output of function calls, and the relationship between standard data formats (summarized in the previous section) and the corresponding R / Bioconductor representation.

4.1 Background: Ranges

Alt Ranges Algebra

Ranges - IRanges - start() / end() / width() - List-like – length(), subset, etc. - ‘metadata’, mcols() - GRanges - ‘seqnames’ (chromosome), ‘strand’ - Seqinfo, including seqlevels and seqlengths

Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - shift(), narrow(), flank(), promoters(), resize(), restrict(), trim() - See ?"intra-range-methods"

Inter-range methods - Depends on other ranges in the same object - range(), reduce(), gaps(), disjoin() - coverage() (!) - see ?"inter-range-methods"

Between-range methods - Functions of two (or more) range objects - findOverlaps(), countOverlaps(), …, %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()

Example

library(GenomicRanges)
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
shift(gr, 1)                            # 1-based coordinates!
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [11, 15]      +
##   [2]        A  [21, 25]      +
##   [3]        A  [23, 27]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
range(gr)                               # intra-range
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 26]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
reduce(gr)                              # inter-range
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 14]      +
##   [2]        A  [20, 26]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
coverage(gr)
## RleList of length 1
## $A
## integer-Rle of length 26 with 6 runs
##   Lengths: 9 5 5 2 3 2
##   Values : 0 1 0 1 2 1
setdiff(range(gr), gr)                  # 'introns'
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [15, 19]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

IRangesList, GRangesList - List: all elements of the same type - Many *List-aware methods, but a common ‘trick’: apply a vectorized function to the unlisted representaion, then re-list

    grl <- GRangesList(...)
    orig_gr <- unlist(grl)
    transformed_gr <- FUN(orig)
    transformed_grl <- relist(transformed_gr, grl)
    

Reference

4.2 Biostrings (DNA or amino acid sequences)

Classes

Methods –

Related packages

Example

4.3 GenomicAlignments (Aligned reads)

Classes – GenomicRanges-like behaivor

Methods

Example

4.4 VariantAnnotation (Called variants)

Classes – GenomicRanges-like behavior

Functions and methods

Example

Related packages

Reference

4.5 rtracklayer (Genome annotations)

5 Big data

Much Bioinformatic data is very large. The discussion so far has assumed that the data can be read into memory. Here we mention two important general strategies for working with large data; we will explore these in greater detail in a later lab, but feel free to ask questions and explore this material now.

Restriction

Iteration

Compression

Parallel processing

Reference

  1. (2014), 214-226.

6 Exercises

6.1 Summarize overlaps

The goal is to count the number of reads overlapping exons grouped into genes. This type of count data is the basic input for RNASeq differential expression analysis, e.g., through DESeq2 and edgeR.

  1. Identify the regions of interest. We use a ‘TxDb’ package with gene models alreaddy defined

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
    ## only chromosome 14
    seqlevels(exByGn, pruning.mode="coarse") = "chr14"
  2. Identify the sample BAM files.

    library(RNAseqData.HNRNPC.bam.chr14)
    length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
    ## [1] 8
  3. Summarize overlaps, optionally in parallel

    ## next 2 lines optional; non-Windows
    library(BiocParallel)
    register(MulticoreParam(workers=detectCores()))
    olaps <- summarizeOverlaps(exByGn, RNAseqData.HNRNPC.bam.chr14_BAMFILES)
  4. Explore our handiwork, e.g., library sizes (column sums), relationship between gene length and number of mapped reads, etc.

    olaps
    ## class: RangedSummarizedExperiment 
    ## dim: 779 8 
    ## metadata(0):
    ## assays(1): counts
    ## rownames(779): 10001 100113389 ... 9950 9985
    ## rowData names(0):
    ## colnames(8): ERR127306 ERR127307 ... ERR127304 ERR127305
    ## colData names(0):
    head(assay(olaps))
    ##           ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303
    ## 10001           103       139       109       125       152       168
    ## 100113389         0         0         0         0         0         0
    ## 100113391         0         0         0         0         0         0
    ## 100124539         0         0         0         0         0         0
    ## 100126297         0         0         0         0         0         0
    ## 100126308         0         0         0         0         0         0
    ##           ERR127304 ERR127305
    ## 10001           181       150
    ## 100113389         0         0
    ## 100113391         0         0
    ## 100124539         0         0
    ## 100126297         0         0
    ## 100126308         0         0
    colSums(assay(olaps))                # library sizes
    ## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 
    ##    340646    373268    371639    331518    313800    331135    331606 
    ## ERR127305 
    ##    329647
    plot(sum(width(olaps)), rowMeans(assay(olaps)), log="xy")
    ## Warning in xy.coords(x, y, xlabel, ylabel, log): 252 y values <= 0 omitted
    ## from logarithmic plot

  5. As an advanced exercise, investigate the relationship between GC content and read count

    library(BSgenome.Hsapiens.UCSC.hg19)
    sequences <- getSeq(BSgenome.Hsapiens.UCSC.hg19, rowRanges(olaps))
    gcPerExon <- letterFrequency(unlist(sequences), "GC")
    gc <- relist(as.vector(gcPerExon), sequences)
    gc_percent <- sum(gc) / sum(width(olaps))
    plot(gc_percent, rowMeans(assay(olaps)), log="y")
    ## Warning in xy.coords(x, y, xlabel, ylabel, log): 252 y values <= 0 omitted
    ## from logarithmic plot