# Introduction to Bioconductor

Martin Morgan, Hervé Pagès
February 4, 2015

## Background: R

• Vectors: logical(), integer(), numeric(), character(), …, matrix(), array()
• list(), data.frame(), …, new.env()
• functions – arguments, named arguments, argument matching, default values
• Statistical concepts: NA, factor(), ~ formula, …
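
The concepts above can be sketched in a few lines of base R. This is only an illustration; the function `scale_by()` and its arguments are hypothetical names invented for the example.

```r
# Atomic vectors are typed; matrix() arranges a vector in rows and columns.
x <- c(1.5, NA, 3)                 # numeric vector with a missing value
m <- matrix(1:6, nrow = 2)         # integer matrix, 2 rows x 3 columns
dim(m)

# NA propagates through computations unless explicitly removed.
mean(x)                            # NA
mean(x, na.rm = TRUE)              # 2.25

# factor() stores categorical data as integer codes plus a set of levels.
group <- factor(c("ctrl", "trt", "ctrl"))
levels(group)

# A formula is an unevaluated description of a model.
fm <- y ~ x
all.vars(fm)

# Arguments are matched by exact name, then partial name, then position;
# arguments that are not supplied take their default values.
scale_by <- function(value, by = 2, offset = 0) value * by + offset
scale_by(10)                       # defaults: 10 * 2 + 0
scale_by(10, offset = 1)           # named argument
scale_by(off = 1, 10)              # 'off' partially matches 'offset'
```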

S3 classes

• Informal class system; list() with class() attribute; linear class hierarchy, single-dispatch.
• Generic foo (body: UseMethod()) and methods foo.A
• Help: ?foo, ?foo.A
• Discovery: methods(), methods(class=<...>)
• Example
x <- rnorm(1000)
y <- x + rnorm(1000, .5)
df <- data.frame(x=x, y=y)
fit <- lm(y ~ x, df)
class(fit)
## [1] "lm"
methods(class=class(fit))
##  [4] case.names.lm*     confint.lm         cooks.distance.lm*
##  [7] deviance.lm*       dfbeta.lm*         dfbetas.lm*
## [10] drop1.lm*          dummy.coef.lm      effects.lm*
## [13] extractAIC.lm*     family.lm*         formula.lm*
## [16] hatvalues.lm*      influence.lm*      kappa.lm
## [19] labels.lm*         logLik.lm*         model.frame.lm*
## [22] model.matrix.lm    nobs.lm*           plot.lm*
## [25] predict.lm         print.lm*          proj.lm*
## [28] qr.lm*             residuals.lm       rstandard.lm*
## [31] rstudent.lm*       simulate.lm*       summary.lm
## [34] variable.names.lm* vcov.lm*
##
##    Non-visible functions are asterisked
methods(anova)
## [1] anova.glm*     anova.glmlist* anova.lm*      anova.lmlist*
## [5] anova.loess*   anova.mlm*     anova.nls*
##
##    Non-visible functions are asterisked
plot(y ~ x, df)
abline(fit, col="red", lwd=2)
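
To see how the pieces fit together, here is a minimal sketch of defining an S3 class hierarchy, a generic, and its methods from scratch. The classes `A` and `B`, the constructors, and the generic `describe()` are hypothetical names chosen for illustration; they are not part of any package.

```r
# An S3 object is a list with a class attribute.
new_A <- function(value) structure(list(value = value), class = "A")

# The class vector c("B", "A") encodes a linear hierarchy: B extends A.
new_B <- function(value, label) {
    obj <- new_A(value)
    obj$label <- label
    class(obj) <- c("B", class(obj))
    obj
}

# The generic's body only calls UseMethod(), which dispatches on class(x).
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>.
describe.A <- function(x, ...) paste("A with value", x$value)
describe.B <- function(x, ...) {
    # NextMethod() invokes the method for the next class in the hierarchy
    paste0(NextMethod(), " and label '", x$label, "'")
}

a <- new_A(1)
b <- new_B(2, "two")
describe(a)
describe(b)
inherits(b, "A")
```

Dispatch is on the first argument only (single dispatch); `describe(b)` finds `describe.B` first and reaches `describe.A` through `NextMethod()`.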

S4 classes

• Formal classes via setClass(), multiple inheritance, multiple dispatch
• Generic foo and associated methods (showMethods("foo"))
• Help: ?foo, method?foo,A, class?A
• Discovery: showMethods("foo"), showMethods(classes="A", where=search())

• Example

suppressPackageStartupMessages({
library(IRanges)
})
start <- as.integer(runif(1000, 1, 1e4))
width <- as.integer(runif(length(start), 50, 100))
ir <- IRanges(start, width=width)
coverage(ir)
## integer-Rle of length 10092 with 1743 runs
##   Lengths:  7  8  6  9 10  4  2  1 18 13 ...  2  3 11  2 16 12 11  9 16 17
##   Values :  0  1  2  3  4  5  6  7  8  7 ...  8  9  8  7  6  5  4  3  2  1
findOverlaps(ir)
## Hits object with 15638 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]         1         693
##       [2]         1         594
##       [3]         1         814
##       [4]         1         229
##       [5]         1         178
##       ...       ...         ...
##   [15634]      1000         204
##   [15635]      1000         748
##   [15636]      1000         291
##   [15637]      1000          14
##   [15638]      1000         821
##   -------
##   queryLength: 1000
##   subjectLength: 1000
showMethods("coverage")
## Function: coverage (package IRanges)
## x="IRanges"
##     (inherited from: x="Ranges")
## x="RangedData"
## x="Ranges"
## x="RangesList"
## x="Views"
showMethods(classes=class(ir), where=search())
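
For comparison with S3, the following is a minimal sketch of formal S4 classes, a generic, and multiple dispatch using only the methods package. The classes `Shape`, `Circle`, and `Square` and the generics `area()` and `fitsInside()` are hypothetical, invented for this example.

```r
library(methods)

# Formal class definitions: slots are typed and checked at construction.
setClass("Shape", slots = c(name = "character"))
setClass("Circle", contains = "Shape", slots = c(r = "numeric"))
setClass("Square", contains = "Shape", slots = c(side = "numeric"))

# A generic and one method per class.
setGeneric("area", function(shape) standardGeneric("area"))
setMethod("area", "Circle", function(shape) pi * shape@r^2)
setMethod("area", "Square", function(shape) shape@side^2)

# Multiple dispatch: the method is selected on the classes of both arguments.
setGeneric("fitsInside",
    function(inner, outer) standardGeneric("fitsInside"))
setMethod("fitsInside", signature("Circle", "Square"),
    function(inner, outer) 2 * inner@r <= outer@side)
setMethod("fitsInside", signature("Square", "Circle"),
    function(inner, outer) inner@side * sqrt(2) <= 2 * outer@r)

circ <- new("Circle", name = "unit circle", r = 1)
sq <- new("Square", name = "square", side = 2)

area(circ)
area(sq)
fitsInside(circ, sq)       # circle of diameter 2 in a square of side 2
fitsInside(sq, circ)       # the square's diagonal exceeds the diameter
is(circ, "Shape")
showMethods("area")
```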

Notes

• Package authors are at liberty to document classes and methods as they see fit, e.g., all methods on the same page as their class
• Methods are defined independently of classes, so the methods available depend on which packages are loaded; e.g., compare the output below, after loading GenomicRanges, with the earlier showMethods("coverage") output
suppressPackageStartupMessages({
library(GenomicRanges)
})
showMethods("coverage")
## Function: coverage (package IRanges)
## x="GRangesList"
## x="GenomicRanges"
## x="RangedData"
## x="Ranges"
## x="RangesList"
## x="SummarizedExperiment"
## x="Views"

## Principles

1. Statistical
• Volume, technology, experimental design
2. Extensive
• Software, annotation
• Core and community contributions
3. Interoperable
• Common data structures, e.g., GRanges
4. Reproducible
• Integrated data containers, e.g., SummarizedExperiment
• Vignettes & “old school” scripts
5. Accessible – affordable, transparent, usable
• example(findOverlaps)
• browseVignettes("IRanges")

## Infrastructure

Sequences

• DNAString / DNAStringSet

suppressPackageStartupMessages({
library(Biostrings)
})
data(phiX174Phage)
m <- consensusMatrix(phiX174Phage)[1:4,]
polymorphic <- colSums(m > 0) > 1
endoapply(phiX174Phage, `[`, polymorphic)

##   A DNAStringSet instance of length 6
##     width seq                                          names
## [1]     9 GGAACCAGC                                    Genbank
## [2]     9 AAAGCTAGC                                    RF70s
## [3]     9 AAAGCTAGC                                    SS78
## [4]     9 GAGACTAAT                                    Bull
## [5]     9 AAGACTGAC                                    G97
## [6]     9 AAAGTTAGC                                    NEB03
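
A few other common operations on sequence containers can be sketched as follows. The sequences and their names (`seq1`, `seq2`) are made up for illustration; the functions are from Biostrings.

```r
suppressPackageStartupMessages({
    library(Biostrings)
})

# A small DNAStringSet built from hypothetical sequences
dna <- DNAStringSet(c(seq1 = "ACGTAAGC", seq2 = "GGGCCCAT"))

width(dna)                        # length of each sequence
reverseComplement(dna)            # reverse complement, element by element
letterFrequency(dna, "GC")        # count of G or C in each sequence
subseq(dna, start = 1, end = 3)   # first three letters of each sequence
```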

Genomic Ranges

• GRanges

• GRangesList
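
As a rough sketch of these two containers, the example below builds a small GRanges, queries it for overlaps, and splits it into a GRangesList. The coordinates and the `gene` metadata column are hypothetical values chosen for illustration.

```r
suppressPackageStartupMessages({
    library(GenomicRanges)
})

# GRanges: ranges with sequence name, strand, and optional metadata columns
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(100, 150, 300), width = 100),
    strand = c("+", "+", "-"),
    gene = c("a", "a", "b")        # hypothetical metadata column
)

# Overlap queries respect sequence name and strand ("*" matches any strand)
query <- GRanges("chr1", IRanges(start = 120, end = 160), strand = "*")
findOverlaps(query, gr)
countOverlaps(query, gr)

# GRangesList: a list of GRanges, e.g., the ranges grouped by gene
grl <- split(gr, gr$gene)
grl
```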

Integrating sample, range and assay data

• SummarizedExperiment
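
The sketch below shows how a SummarizedExperiment keeps assay, range, and sample data coordinated under subsetting. It assumes a recent Bioconductor release, where the class is provided by the SummarizedExperiment package and the range accessor is `rowRanges()`; in early 2015 the class was part of GenomicRanges. All data values and names are made up.

```r
suppressPackageStartupMessages({
    library(SummarizedExperiment)
})

# Assay: features (rows) x samples (columns); hypothetical counts
counts <- matrix(1:12, nrow = 3,
    dimnames = list(c("r1", "r2", "r3"), c("s1", "s2", "s3", "s4")))

# Range data: one genomic range per row of the assay
ranges <- GRanges("chr1", IRanges(start = c(1, 101, 201), width = 50))
names(ranges) <- rownames(counts)

# Sample data: one row per column of the assay
samples <- DataFrame(treatment = c("ctrl", "ctrl", "trt", "trt"),
    row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
    rowRanges = ranges, colData = samples)

# Subsetting by sample annotation keeps assay and colData in sync
trt <- se[, se$treatment == "trt"]
assay(trt, "counts")

# Subsetting by genomic location keeps assay and rowRanges in sync
roi <- GRanges("chr1", IRanges(start = 90, end = 160))
near <- se[overlapsAny(rowRanges(se), roi), ]
rownames(near)
```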

## Key packages

Biostrings – Sequences

GenomicRanges – Ranges

BiocParallel – Parallel processing

GenomicFiles – Collections of 'genomic' (e.g., BAM, BED, WIG, …) files
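
As an illustration of the BiocParallel interface, here is a minimal sketch that applies a function over a vector with `bplapply()`. The function `slow_square()` is a hypothetical stand-in for real work. `SerialParam()` runs the tasks sequentially so the example works anywhere; another back-end such as `MulticoreParam()` or `SnowParam()` could be substituted without changing the call.

```r
suppressPackageStartupMessages({
    library(BiocParallel)
})

# Hypothetical task standing in for an expensive computation
slow_square <- function(i) {
    Sys.sleep(0.1)
    i^2
}

# The back-end is chosen by the BPPARAM argument, not by the calling code
param <- SerialParam()
res <- bplapply(1:4, slow_square, BPPARAM = param)
unlist(res)
```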

## Work flows

biocViews for discovery.

RNA-seq

ChIP-seq

Variants

Copy number

• 45 packages tagged with “CopyNumberVariation” in biocViews; also terms “DNASeq”, “ExomeSeq”, “WholeGenome”
• Represent duplicated regions as genomic ranges, which integrate easily into Bioconductor annotation work flows.

Methylation

• Bump hunting – minfi
• Visualization – epivizR (much more than epigenomics!)

Expression and other arrays

• Pre-processing – oligo
• Differential representation – limma
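
A typical limma analysis can be sketched as below. The expression matrix is simulated, with an artificial shift added to the first five features, so the results carry no biological meaning. The two-group design and all object names are assumptions made for illustration.

```r
suppressPackageStartupMessages({
    library(limma)
})

set.seed(1)

# Simulated log-expression values: 100 features x 6 samples
expr <- matrix(rnorm(100 * 6), nrow = 100,
    dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
group <- factor(rep(c("ctrl", "trt"), each = 3))

# Add an artificial treatment effect to the first five features
expr[1:5, group == "trt"] <- expr[1:5, group == "trt"] + 3

# Fit a linear model per feature, then moderate the variance estimates
design <- model.matrix(~ group)
fit <- eBayes(lmFit(expr, design))

# Rank features by evidence for a treatment effect
top <- topTable(fit, coef = "grouptrt", number = 5)
top
```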