1 About
- 1.1 Bioconductor: Analysis and comprehension of high-throughput genomic data
- 1.2 Packages, vignettes, work flows
- 1.3 Package installation and use
2 Key concepts
3 High-throughput sequence analysis work flows
4 Bioconductor sequencing ecosystem

1 About

1.1 Bioconductor: Analysis and comprehension of high-throughput genomic data

Statistical analysis: large data, technological artifacts, designed experiments; rigorous
Comprehension: biological context, visualization, reproducibility
High-throughput
- Sequencing: RNASeq, ChIPSeq, variants, copy number, …
- Microarrays: expression, SNP, …
- Flow cytometry, proteomics, images, …

1.2 Packages, vignettes, work flows

1296 software packages; also…
- ‘Annotation’ packages – static data bases of identifier maps, gene models, pathways, etc; e.g., TxDb.Hsapiens.UCSC.hg19.knownGene
- ‘Experiment’ packages – data sets used to illustrate software functionality, e.g., airway
Discover and navigate via biocViews
Package ‘landing page’
- Title, author / maintainer, short description, citation, installation instructions, …, download statistics
All user-visible functions have help pages, most with runnable examples
‘Vignettes’ are an important feature in Bioconductor – narrative documents illustrating how to use the package, with integrated code
‘Release’ (every six months) and ‘devel’ branches
Support site; videos, recent courses

1.3 Package installation and use

A package needs to be installed once, using the instructions on the package landing page (e.g., DESeq2).
```
source("https://bioconductor.org/biocLite.R")
biocLite(c("DESeq2", "org.Hs.eg.db"))
```
biocLite() installs Bioconductor, CRAN, and GitHub packages.
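From Bioconductor 3.8 (released in late 2018) onward, biocLite() has been replaced by the BiocManager package, which is distributed on CRAN. A minimal sketch of the equivalent installation, assuming a current R installation and network access:

```r
# Install the BiocManager helper from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install the same two packages as in the biocLite() call above
BiocManager::install(c("DESeq2", "org.Hs.eg.db"))
```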

Once installed, the package can be loaded into an R session

```
library(GenomicRanges)
```

and the help system queried interactively, as outlined above:

```
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
```

2 Key concepts

2.1 Goals

Reproducibility
Interoperability
Use

2.2 What a few lines of R has to say

```
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)
plot(Y ~ X, df)
fit <- lm(Y ~ X, df)
anova(fit)
```

```
## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X           1 967.46  967.46  971.31 < 2.2e-16 ***
## Residuals 998 994.04    1.00                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
abline(fit)
```

2.3 Classes and methods – “S3”

data.frame()
- Defines class to coordinate data
- Creates an instance or object
plot(), lm(), anova(), abline(): methods defined on generics to transform instances

Discovery and help

```
class(fit)
methods(class=class(fit))
methods(plot)
?"plot"
?"plot.formula"
```

tab completion!
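
To make the S3 ideas concrete, here is a minimal sketch in base R. The class name `reading`, the constructor `new_reading()`, and the generic `describe()` are hypothetical names invented for this illustration; they are not part of R or Bioconductor.

```r
# Constructor: an S3 object is an ordinary list with a 'class' attribute
new_reading <- function(x, y) {
    stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
    structure(list(x = x, y = y), class = "reading")
}

# Generic: dispatches on the class of its first argument
describe <- function(obj, ...) UseMethod("describe")

# Method for objects of class 'reading'
describe.reading <- function(obj, ...) {
    sprintf("reading with %d points, mean y = %.2f", length(obj$x), mean(obj$y))
}

# Fallback method for any other class
describe.default <- function(obj, ...) {
    sprintf("object of class '%s'", class(obj)[1])
}

r <- new_reading(x = c(1, 2, 3), y = c(2, 4, 6))
class(r)
describe(r)      # dispatches to describe.reading()
describe(1:3)    # dispatches to describe.default()
```

The same discovery tools apply to such a class: in an interactive session, `methods(class="reading")` and `methods(describe)` should list the methods defined above.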

2.4 Bioconductor classes and methods – “S4”

Example: working with DNA sequences

```
library(Biostrings)
dna <- DNAStringSet(c("AACAT", "GGCGCCT"))
reverseComplement(dna)
```

```
##   A DNAStringSet instance of length 2
##     width seq
## [1]     5 ATGTT
## [2]     7 AGGCGCC
```

```
data(phiX174Phage)
phiX174Phage
```

```
##   A DNAStringSet instance of length 6
##     width seq                                            names               
## [1]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA Genbank
## [2]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA RF70s
## [3]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA SS78
## [4]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA Bull
## [5]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA G97
## [6]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA NEB03
```

```
letterFrequency(phiX174Phage, "GC", as.prob=TRUE)
```

```
##            G|C
## [1,] 0.4476420
## [2,] 0.4472707
## [3,] 0.4472707
## [4,] 0.4470850
## [5,] 0.4472707
## [6,] 0.4470850
```

Discovery and help

```
class(dna)
?"DNAStringSet-class"
?"reverseComplement,DNAStringSet-method"
```
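
Unlike S3, S4 classes and generics are declared formally. The sketch below uses only the `methods` package that ships with R, so it runs without Biostrings. The class `ToyDNA` and the generic `gcContent()` are hypothetical names for illustration and are far simpler than the real `DNAStringSet` class.

```r
library(methods)

# Formal class definition with a typed slot and a validity check
setClass("ToyDNA",
    slots = c(seq = "character"),
    validity = function(object) {
        ok <- grepl("^[ACGT]*$", object@seq)
        if (all(ok)) TRUE else "sequences may contain only A, C, G, T"
    }
)

# Generic function, plus a method for the ToyDNA class
setGeneric("gcContent", function(x, ...) standardGeneric("gcContent"))

setMethod("gcContent", "ToyDNA", function(x, ...) {
    chars <- strsplit(x@seq, "")
    # Fraction of letters in each sequence that are G or C
    vapply(chars, function(s) mean(s %in% c("G", "C")), numeric(1))
})

toy <- new("ToyDNA", seq = c("AACAT", "GGCGCCT"))
gcContent(toy)

# Discovery: list the methods defined on the generic
showMethods("gcContent")
```

Because the validity function runs when the object is created, a call such as `new("ToyDNA", seq = "AXC")` signals an error instead of returning a malformed object.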

3 High-throughput sequence analysis work flows

Step 1. Experimental design
- Simple, replication, sufficient power, covariates and batch effects, …
Step 2. Wet-lab sequence preparation
- Figure from http://rnaseq.uoregon.edu/

Step 3. (Illumina) Sequencing
- Bentley et al., 2008, doi:10.1038/nature07517
- Primary output: FASTQ files of short reads and their quality scores.
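
A FASTQ record has four lines: an identifier starting with `@`, the read sequence, a separator line starting with `+`, and a quality string with one character per base. The base R sketch below decodes such a quality string, assuming the Phred+33 encoding; the record itself is made up for illustration.

```r
# A made-up FASTQ record (four lines), for illustration only
record <- c(
    "@read1",
    "GATTACA",
    "+",
    "IIIIH5#"
)

read_seq <- record[2]
quality  <- record[4]

# Phred+33: quality score = ASCII code of the character minus 33
phred <- utf8ToInt(quality) - 33L
phred

# Probability that each base call is wrong: 10^(-Q / 10)
p_error <- 10^(-phred / 10)
round(p_error, 4)
```

For real files, the Bioconductor ShortRead package provides `readFastq()`, which is likely a better starting point than parsing records by hand.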

Step 4. Alignment
- Choose the aligner to match the task: e.g., Rsubread and Bowtie2 are good for ChIPSeq and some forms of RNASeq; BWA and GMAP are better for variant calling
- Primary output: BAM files of aligned reads
- More recently: kallisto and similar programs that produce tables of reads aligned to transcripts
Step 5. Reduction
- e.g., RNASeq ‘count table’ (simple spreadsheets), DNASeq called variants (VCF files), ChIPSeq peaks (BED, WIG files)
Step 6. Analysis
- Differential expression, peak identification, differential binding, …
Step 7. Comprehension
- Biological context; annotation, gene set analysis, …
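
Steps 5 and 6 can be sketched with simulated data in base R. Everything here is an assumption made for illustration: the number of genes, the two-group design with three samples each, the Poisson counts, and the per-gene t-test on log counts per million. A real analysis would more likely use a dedicated package such as DESeq2, which models count data more appropriately.

```r
set.seed(123)

n_genes <- 200
group <- factor(rep(c("control", "treated"), each = 3))

# Step 5 (Reduction): a simulated count table, genes in rows, samples in columns
mu <- matrix(100, nrow = n_genes, ncol = length(group))
mu[1:10, group == "treated"] <- 400    # first 10 genes are truly up-regulated
counts <- matrix(rpois(length(mu), mu), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_along(group))))

# Step 6 (Analysis): normalize for library size, then test each gene
cpm <- t(t(counts) / colSums(counts)) * 1e6
log_cpm <- log2(cpm + 1)

p_value <- apply(log_cpm, 1, function(y) t.test(y ~ group)$p.value)
log_fc <- rowMeans(log_cpm[, group == "treated"]) -
          rowMeans(log_cpm[, group == "control"])

# Adjust for multiple testing (Benjamini-Hochberg) and rank genes
results <- data.frame(log_fc = log_fc, p_value = p_value,
                      p_adj = p.adjust(p_value, method = "BH"))
head(results[order(results$p_value), ])
```

The row names of `results` are gene identifiers, which is where Step 7 would begin: mapping identifiers to annotation and looking for patterns across gene sets.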

4 Bioconductor sequencing ecosystem

Figure: Sequencing ecosystem

B.1 – Introduction to Bioconductor

11 - 12 September 2017
