`r knitr::opts_chunk$set(tidy=FALSE)`

For an overview of the design principles and use of Bioconductor sequence classes, see Lawrence et al., 2013, Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): [doi:10.1371/journal.pcbi.1003118][]

For an overview of select high-throughput sequence packages in Bioconductor, see [Intermediate Sequence Analysis 2013][SeqAnal] section 3.3.

# Ranges

[IRanges][] & [GenomicRanges][]

Classes

- `IRanges()`, `GRanges()`
- Metadata on the object as a whole (`metadata()`), and on individual elements (e.g., `mcols()`)
- Behavior inherited from a base class `Vector()`, e.g., `length()`, subset, etc.
- `*List()`, e.g., `IntegerList()`, `GRangesList()`
    - _Behave_ like lists where elements are restricted to be of a common type, e.g., `IntegerList()` is a list where all elements are integer vectors
    - _Implemented_ as a single instance, e.g., `integer()` with partitioning vector.
    - Very effective representation for many operations
- `DataFrame()`
- `Rle()`
    - Especially useful in ChIP and other regulatory contexts, in QA, and in visualization

Methods

- See [Intermediate Sequence Analysis 2013][SeqAnal] section 6.1.

Users

- Used extensively in high-throughput sequence packages
- Example: [GenomicFeatures][] and related [TranscriptDb][TxDb] packages representing UCSC and other genome annotation tracks.

# Strings and sequences

[Biostrings][]

- `DNAString()`, `DNAStringSet()` classes
    - Derived from `XString`, `XStringSet` classes.
- Extensive methods, see [Intermediate Sequence Analysis 2013][SeqAnal] section 5.1

Users

- [ShortRead][] uses `XStringSet` as basis for coordinating short reads and quality scores.
- [BSgenome][] uses `DNAString` to represent whole genome sequences.
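
The range and string classes listed in the two sections above can be tried out with a few constructor calls. The following is a minimal sketch, assuming the [GenomicRanges][] and [Biostrings][] packages are installed; all coordinates, scores, names, and sequences are made-up illustrative values, not data from the course.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors
library(Biostrings)

## Illustrative values only; none of these come from real data.
ir <- IRanges(start=c(10, 20, 30), width=5)
gr <- GRanges(seqnames=c("chr1", "chr1", "chr2"), ranges=ir,
              strand=c("+", "-", "*"))
mcols(gr)$score <- c(1.5, 2.5, 3.5)       # metadata on individual elements
metadata(gr) <- list(note="toy example")  # metadata on the object as a whole

## Vector-like behavior: length and subsetting
length(gr)
gr[2:3]

## A *List behaves like a list whose elements share a common type
il <- IntegerList(a=1:3, b=4:5)
lengths(il)
sum(il)        # per-element sums: a = 6, b = 9

## Rle stores runs of repeated values compactly
r <- Rle(c(0L, 0L, 0L, 5L, 5L, 0L))
runLength(r)   # 3 2 1
runValue(r)    # 0 5 0

## DNAStringSet holds several DNA sequences, with methods such as
## reverseComplement()
dna <- DNAStringSet(c(read1="ACGTTG", read2="GGCATA"))
rc <- reverseComplement(dna)
rc
```

Printing `gr` shows the ranges together with their `score` column, which is one way to see how element-wise metadata travels with the object through subsetting.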
# Data containers

[GenomicRanges][]

- `GAlignments()`, `GAlignmentsList()`, `GAlignmentPairs()`
- `SummarizedExperiment()`

[VariantAnnotation][]

- `VCF()`

[ShortRead][] -- FASTQ files

- `ShortReadQ()` -- Reads and their quality scores

# I/O

- [rtracklayer][]
    - Data input from common genomics formats, e.g., bed, wig, gtf, gff, bw
    - Result is usually a [GenomicRanges][]-derived class
    - Single interface, `import()`, so complexity is hidden from the user
    - Can also drive (update and retrieve) a UCSC genome browser session
- [VariantAnnotation][] `readVcf()`, `filterVcf()`. Strategies for managing large data:
    - Restrict input to fields of interest with `ScanVcfParam()`.
    - Select ranges of interest for input from indexed Tabix files with `ScanVcfParam()`
    - For specialized uses, select INFO and GENO fields for input with `readInfo()`, `readGeno()`
    - Iterate through large files using `TabixFile(<...>, yieldSize=10000)` and a paradigm like

        ```{r tabix-iter, eval=FALSE}
        ## 'fl' is the path to a Tabix-indexed VCF file
        tbx <- open(TabixFile(fl, yieldSize=10000))
        repeat {
            vcf <- readVcf(tbx, "hg19")   ## up to 10000 records
            if (length(vcf) == 0)
                break                     ## all done
            ## do work
        }
        close(tbx)
        ```
    - Filter large files to small files using `filterVcf()`
- [Rsamtools][] `BamFile()` and `TabixFile()` to open and iterate through BAM and Tabix files
    - Strategies like those in VariantAnnotation to manage large data: restrict fields and ranges of interest with `ScanBamParam()`; iterate through large files using the `yieldSize` argument of `BamFile()`.
- [Rsamtools][]
    - `readGAlignmentsFromBam()`, `readGAlignmentsListFromBam()`
- [ShortRead][] `FastqStreamer()`, `FastqSampler()`, `readFastq()`
    - Examples of reference classes
    - Iterate with `yield()` on an instance created by `FastqStreamer()`
    - Sample randomly from an entire file, e.g., for QA purposes, with `yield()` on an instance created with `FastqSampler()`

# Approximate data class hierarchy

    Annotated
       o metadata
    -- Vector
       o many methods (showMethods(class="Vector", where=search()))
       -- Rle
       -- List
          -- SimpleList
             -- DataFrame
             -- Simple*List, e.g., SimpleNumericList
          -- CompressedList (IRanges package)
             -- Compressed*List, e.g., CompressedNumericList
             -- Ranges
                -- IRanges
          -- ... *StringSet, e.g., DNAStringSet
       -- GenomicRanges
          -- GRanges (GenomicRanges package)
       -- ... *String, e.g., DNAString (Biostrings package)
          o transcribe, reverseComplement, pairwiseAlignment
    SummarizedExperiment (GenomicRanges package)
    -- VCF (VariantAnnotation package; readVcf)
    ShortReadQ

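
Some of the relationships in this approximate hierarchy can be checked interactively with `is()` from the methods package. The sketch below assumes [GenomicRanges][] and [Biostrings][] are installed; exact class relationships may differ between Bioconductor releases, so treat the checks as illustrative rather than definitive.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors
library(Biostrings)

## Empty instances are enough to ask about class membership.
is(IRanges(), "Vector")
is(GRanges(), "Vector")
is(Rle(), "Vector")
is(GRanges(), "Annotated")

## *List and *StringSet classes sit below List
is(IntegerList(), "List")
is(DNAStringSet(), "List")

## List the methods available for Vector, as noted in the hierarchy
## (long output, so left commented out here):
## showMethods(class="Vector", where=search())
```

Each `is()` call returns a single logical value, so the same checks can be used in scripts to confirm that an object supports `Vector` operations such as `length()` and subsetting before relying on them.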

[IRanges]: http://bioconductor.org/packages/devel/bioc/html/IRanges.html
[GenomicRanges]: http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html
[Biostrings]: http://bioconductor.org/packages/devel/bioc/html/Biostrings.html
[BSgenome]: http://bioconductor.org/packages/devel/bioc/html/BSgenome.html
[GenomicFeatures]: http://bioconductor.org/packages/devel/bioc/html/GenomicFeatures.html
[VariantAnnotation]: http://bioconductor.org/packages/devel/bioc/html/VariantAnnotation.html
[Rsamtools]: http://bioconductor.org/packages/devel/bioc/html/Rsamtools.html
[rtracklayer]: http://bioconductor.org/packages/devel/bioc/html/rtracklayer.html
[ShortRead]: http://bioconductor.org/packages/devel/bioc/html/ShortRead.html
[TxDb]: http://bioconductor.org/packages/devel/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html
[SeqAnal]: http://bioconductor.org/help/course-materials/2013/SeattleMay2013/IntermediateSequenceAnalysis2013.pdf
[doi:10.1371/journal.pcbi.1003118]: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118