Introduction to Bioconductor

useR! 2014
Authors: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora
Date: 30 June, 2014

R

Language and environment for statistical computing and graphics

Vector, class, object

Function, generic, method

Introspection

Help

Example

x <- rnorm(1000)                   # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y)         # object of class 'data.frame'
plot(y ~ x, df)                    # generic plot, method plot.formula

[Figure: scatter plot of y against x]

fit <- lm(y ~ x, df)               # object of class 'lm'
methods(class=class(fit))          # introspection
##  [1] add1.lm*           alias.lm*          anova.lm*         
##  [4] case.names.lm*     confint.lm         cooks.distance.lm*
##  [7] deviance.lm*       dfbeta.lm*         dfbetas.lm*       
## [10] drop1.lm*          dummy.coef.lm      effects.lm*       
## [13] extractAIC.lm*     family.lm*         formula.lm*       
## [16] hatvalues.lm*      influence.lm*      kappa.lm          
## [19] labels.lm*         logLik.lm*         model.frame.lm*   
## [22] model.matrix.lm    nobs.lm*           plot.lm*          
## [25] predict.lm         print.lm*          proj.lm*          
## [28] qr.lm*             residuals.lm       rstandard.lm*     
## [31] rstudent.lm*       simulate.lm*       summary.lm        
## [34] variable.names.lm* vcov.lm*          
## 
##    Non-visible functions are asterisked
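
The relationship between class, generic, and method in the example above can be sketched with a small S3 class of our own. This is an illustration only: the class `temperature` and the generic `toFahrenheit` are hypothetical names invented here, not part of R or Bioconductor.

```r
# Constructor for a hypothetical S3 class 'temperature' (illustration only)
temperature <- function(celsius) {
    structure(list(celsius = celsius), class = "temperature")
}

# A new generic: UseMethod() dispatches on the class of the first argument
toFahrenheit <- function(x, ...) UseMethod("toFahrenheit")

# The method selected when 'x' has class 'temperature'
toFahrenheit.temperature <- function(x, ...) x$celsius * 9 / 5 + 32

# A method for the existing generic print()
print.temperature <- function(x, ...) {
    cat("temperature:", format(x$celsius), "degrees C\n")
    invisible(x)
}

t1 <- temperature(c(0, 100))
class(t1)                          # "temperature"
print(t1)                          # dispatches to print.temperature
toFahrenheit(t1)                   # 32 212
methods(class = "temperature")     # introspection, as for class 'lm' above
```

As with `plot()` and `plot.formula`, the user calls the generic and R selects the method from the class of the object.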

Bioconductor

Analysis and comprehension of high-throughput genomic data

Packages, vignettes, workflows

[Figure: Sequencing Ecosystem]

Objects

Example

library(Biostrings)                     # biological sequences
data(phiX174Phage)                      # sample data, see ?phiX174Phage
phiX174Phage
##   A DNAStringSet instance of length 6
##     width seq                                          names               
## [1]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Genbank
## [2]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA RF70s
## [3]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA SS78
## [4]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Bull
## [5]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA G97
## [6]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA NEB03
m <- consensusMatrix(phiX174Phage)[1:4,] # nucl. x position counts
polymorphic <- which(colSums(m != 0) > 1)
m[, polymorphic]
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## A    4    5    4    3    0    0    5    2    0
## C    0    0    0    0    5    1    0    0    5
## G    2    1    2    3    0    0    1    4    0
## T    0    0    0    0    1    5    0    0    1
showMethods(class=class(phiX174Phage), where=search())
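
The counting step in the example can be mimicked in base R on made-up data, which may make the logic easier to follow without Biostrings installed. This is a rough sketch: the three toy sequences are invented, and the matrix is built by hand rather than with `consensusMatrix()`, so it only imitates the nucleotide-by-position layout.

```r
# Toy data (made up for illustration): three aligned sequences of equal length
seqs <- c(s1 = "ACGT", s2 = "ACGA", s3 = "GCGT")

# Split into single letters; rows are sequences, columns are positions
letters_by_pos <- do.call(rbind, strsplit(seqs, ""))

# Count each nucleotide at each position: a nucleotide x position matrix
nucleotides <- c("A", "C", "G", "T")
m <- t(sapply(nucleotides, function(nt) colSums(letters_by_pos == nt)))
m

# A position is polymorphic when more than one nucleotide has a non-zero count
polymorphic <- which(colSums(m != 0) > 1)
polymorphic                        # positions 1 and 4
m[, polymorphic]
```

Here `m != 0` is a logical matrix, so `colSums()` counts how many distinct nucleotides occur at each position.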

Exercise

  1. Load the Biostrings package and phiX174Phage data set. What class is phiX174Phage? Find the help page for the class, and identify interesting functions that apply to it.
  2. Discover vignettes in the Biostrings package with vignette(package="Biostrings"). Add another argument to the vignette function to view the 'BiostringsQuickOverview' vignette.
  3. Navigate to the Biostrings landing page on http://bioconductor.org. Do this by visiting the biocViews page. Can you find the BiostringsQuickOverview vignette on the web site?
  4. The following code loads some sample data, six versions of the phiX174 phage genome, as a DNAStringSet object.

    library(Biostrings)
    data(phiX174Phage)
    

    Explain what the following code does, and how it works:

    m <- consensusMatrix(phiX174Phage)[1:4,]
    polymorphic <- which(colSums(m != 0) > 1)
    mapply(substr, polymorphic, polymorphic, MoreArgs=list(x=phiX174Phage))
    
    ##         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    ## Genbank "G"  "G"  "A"  "A"  "C"  "C"  "A"  "G"  "C" 
    ## RF70s   "A"  "A"  "A"  "G"  "C"  "T"  "A"  "G"  "C" 
    ## SS78    "A"  "A"  "A"  "G"  "C"  "T"  "A"  "G"  "C" 
    ## Bull    "G"  "A"  "G"  "A"  "C"  "T"  "A"  "A"  "T" 
    ## G97     "A"  "A"  "G"  "A"  "C"  "T"  "G"  "A"  "C" 
    ## NEB03   "A"  "A"  "A"  "G"  "T"  "T"  "A"  "G"  "C"
    

Summary

Bioconductor is a large collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor relies on formal classes to represent genomic data, so it is important to develop a rudimentary comfort with classes, including seeking help for classes and methods. Bioconductor uses vignettes to augment traditional help pages; these can be very valuable in illustrating overall package use.
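
To make "formal classes" concrete, here is a minimal sketch using the S4 system in the methods package that ships with R. The class `Interval` and the generic `span` are hypothetical names chosen for illustration; they are not Bioconductor classes or functions.

```r
library(methods)                   # S4 system; ships with R

# A hypothetical formal class with typed slots (illustration only)
setClass("Interval", representation(start = "numeric", end = "numeric"))

# A generic, and a method that dispatches on the class of 'x'
setGeneric("span", function(x, ...) standardGeneric("span"))
setMethod("span", "Interval", function(x, ...) x@end - x@start + 1)

iv <- new("Interval", start = c(1, 10), end = c(5, 12))
is(iv, "Interval")                 # TRUE
span(iv)                           # 5 3
existsMethod("span", "Interval")   # TRUE
showMethods("span")                # list the methods defined on the generic
```

The same introspection tools, `showMethods()` and `is()`, apply to objects such as the DNAStringSet used earlier.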