---
title: "B.4 -- Next Steps"
author: Martin Morgan
date: "16 - 17 May, 2016"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{B.4 -- Next Steps}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
options(width=100)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE}
suppressPackageStartupMessages({
library(BiocParallel)
library(microbenchmark)
})
```
# Common _Bioconductor_ work flows
This very open-ended topic points to some of the most prominent
Bioconductor packages for sequence analysis. Use the opportunity in
this lab to explore the package vignettes and help pages highlighted
below; many of the material will be covered in greater detail in
subsequent labs and lectures.
Domain-specific analysis -- explore the landing pages, vignettes, and
reference manuals of two or three of the following packages.
- Important packages for analysis of differential expression include
[edgeR][] and [DESeq2][]; both have excellent vignettes for
exploration. Additional research methods embodied in Bioconductor
packages can be discovered by visiting the [biocViews][] web page,
searching for the 'DifferentialExpression' view term, and narrowing
the selection by searching for 'RNA seq' and similar.
- Popular ChIP-seq packages include [csaw][] and [DiffBind][] for
comparison of peaks across samples, [ChIPQC][] for quality
assessment, and [ChIPseeker][] for annotating results (e.g.,
discovering nearby genes). What other ChIP-seq packages are listed
on the [biocViews][] page?
- Working with called variants (VCF files) is facilitated by packages
such as [VariantAnnotation][], [VariantFiltering][], [ensemblVEP][],
and [SomaticSignatures][]; packages for calling variants include,
e.g., [h5vc][] and [VariantTools][].
- Several packages identify copy number variants from sequence data,
including [cn.mops][]; from the [biocViews][] page, what other copy
number packages are available? The [CNTools][] package provides some
useful facilities for comparison of segments across samples.
- Microbiome and metagenomic analysis is facilitated by packages such
as [phyloseq][] and [metagenomeSeq][].
- Metabolomics, chemoinformatics, image analysis, and many other
high-throughput analysis domains are also represented in
Bioconductor; explore these via biocViews and title searches.
Working with sequences, alignments, common web file formats, and raw
data; these packages rely very heavily on the [IRanges][] /
[GenomicRanges][] infrastructure.
- The [Biostrings][] package is used to represent DNA and other
sequences, with many convenient sequence-related functions. Check
out the functions documented on the help page `?consensusMatrix`,
for instance. Also check out the [BSgenome][] package for working
with whole genome sequences, e.g., `?"getSeq,BSgenome-method"`
- The [GenomicAlignments][] package is used to input reads aligned to
a reference genome. See for instance the `?readGAlignments` help
page and `vigentte(package="GenomicAlignments",
"summarizeOverlaps")`
- [rtracklayer][]'s `import` and `export` functions can read in many
common file types, e.g., BED, WIG, GTF, ..., in addition to querying
and navigating the UCSC genome browser. Check out the `?import` page
for basic usage.
- The [ShortRead][] and [Rsamtools][] packages can be used for
lower-level access to FASTQ and BAM files, respectively. Explore the
[ShortRead vignette](https://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf)
and Scalable Genomics labs to see approaches to effectively
processing the large files.
Visualization
- The [Gviz][] package provides great tools for visualizing local
genomic coordinates and associated data.
- [epivizr][] drives the [epiviz](http://epiviz.cbcb.umd.edu/) genome
browser from within R; [rtracklayer][] provides easy ways to
transfer data to and manipulate UCSC browser sessions.
- Additional packages include [ggbio][], [OmicCircos][], ...
![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png)
# Computation: Big Data
## Efficient _R_ code
The goal of this section is to highlight practices for writing correct, robust
and efficient R code.
## Priorities
1. Correct: consistent with hand-worked examples (`identical()`, `all.equal()`)
2. Robust: supports realistic inputs, e.g., 0-length vectors, `NA`
values, ...
3. Simple: easy to understand next month; easy to describe what it
does to a colleague; easy to spot logical errors; easy to enhance.
4. Fast, or at least reasonable given the speed of modern computers.
## Strategies
1. Profile
- _Look_ at the script to understand in general terms what it is doing.
- _Step_ through the code to see how it is executed, and to gain an
understanding of the speed of each line.
- _Time_ evaluation of select lines or simple chunks of code with
`system.time()` or the `r CRANpkg("microbenchmark")` package.
- _Profile_ the code with a tool that indicates how much time is
spent in each function call or line -- the built-in `Rprof()`
function, or packages such as `r CRANpkg("lineprof")` or
`r CRANpkg("aprof")`
2. Vectorize -- operate on vectors, rather than explicit loops
```{r vectorize}
x <- 1:10
log(x) ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
```
3. Pre-allocate memory, then fill in the result
```{r pre-allocate}
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
result[i] <- runif(1) * result[i - 1]
result
```
4. Hoist common sub-expressions outside of repeated calculations, so
that the sub-expression is only calculated once
- Simple, e.g., 'hoist' constant multiplications from a `for` loop
- Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting
the same design matrix.
5. Re-use existing, tested code
- Efficient implementations of common operations -- `tabulate()`,
`rowSums()` and friends, `%in%`, ...
- Efficient domain-specific implementations, e.g.,
`r Biocpkg("snpStats")` for GWAS linear models; `r Biocpkg("limma")`
for microarray linear models; `r Biocpkg("edgeR")`,
`r Biocpkg("DESeq2")` for negative binomial GLMs relevant to
RNASeq.
- Reuse others' work -- `r Biocpkg("DESeq2")`,
`r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ...,
`r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`
### Case Study: Pre-allocate and vectorize
Here's an obviously inefficient function:
```{r inefficient}
f0 <- function(n, a=2) {
## stopifnot(is.integer(n) && (length(n) == 1) &&
## !is.na(n) && (n > 0))
result <- numeric()
for (i in seq_len(n))
result[[i]] <- a * log(i)
result
}
```
Use `system.time()` to investigate how this algorithm scales with `n`,
focusing on elapsed time.
```{r system-time}
system.time(f0(10000))
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```
Remember the current 'correct' value, and an approximate time
```{r correct-init}
n <- 10000
system.time(expected <- f0(n))
head(expected)
```
Revise the function to hoist the common multiplier, `a`, out of the
loop. Make sure the result of the 'optimization' and the original
calculation are the same. Use the `r CRANpkg("microbenchmark")`
package to compare the two versions
```{r hoist}
f1 <- function(n, a=2) {
result <- numeric()
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f1(n))
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
```
Adopt a 'pre-allocate and fill' strategy
```{r preallocate-and-fill}
f2 <- function(n, a=2) {
result <- numeric(n)
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f2(n))
microbenchmark(f0(n), f2(n), times=5)
```
Use an `*apply()` function to avoid having to explicitly pre-allocate,
and make opportunities for vectorization more apparent.
```{r use-apply}
f3 <- function(n, a=2)
a * sapply(seq_len(n), log)
identical(expected, f3(n))
microbenchmark(f0(n), f2(n), f3(n), times=10)
```
Now that the code is presented in a single line, it is apparent that
it could be easily vectorized. Seize the opportunity to vectorize it:
```{r use-vectorize}
f4 <- function(n, a=2)
a * log(seq_len(n))
identical(expected, f4(n))
microbenchmark(f0(n), f3(n), f4(n), times=10)
```
`f4()` definitely seems to be the winner. How does it scale with `n`?
(Repeat several times)
```{r vectorized-scale}
n <- 10 ^ (5:8) # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
```
Any explanations for the different pattern of response?
Lessons learned:
1. Vectorizing offers a huge improvement over iteration
2. Pre-allocate-and-fill is very helpful when explicit iteration is
required.
3. `*apply()` functions help avoid need for explicit pre-allocation
and make opportunities for vectorization more apparent. This may
come at a small performance cost, but is generally worth it
4. Hoisting common sub-expressions can be helpful for improving
performance when explicit iteration is required.
## Parallel evaluation
When data are too large to fit in memory, we can iterate through files in
chunks or subset the data by fields or genomic positions.
Iteration
- Chunk-wise
- `open()`, read chunk(s), `close()`.
- e.g., `yieldSize` argument to `Rsamtools::BamFile()`
- Framework: `GenomicFiles::reduceByYield()`
Restriction
- Limit to columns and / or rows of interest
- Exploit domain-specific formats
- BAM files and `Rsamtools::ScanBamParam()`
- BAM files and `Rsamtools::PileupParam()`
- VCF files and `VariantAnnotation::ScanVcfParam()`
- Use a data base
Parallel evalutation
- _After_ writing efficient _R_ code, adopting iteration and restriction
- [BiocParallel][] for common, cross-platform operation
[BiocParallel][] provides a standardized interface for simple parallel
evaluation. The package builds provides access to the `snow` and
`multicore` functionality in the `parallel` package as well as
`BatchJobs` for running cluster jobs.
General ideas:
- Use `bplapply()` instead of `lapply()`
- Argument `BPPARAM` influences how parallel evaluation occurs
- `MulticoreParam()`: threads on a single (non-Windows) machine
- `SnowParam()`: processes on the same or different machines
- `BatchJobsParam()`: resource scheduler on a cluster
- `DoparParam()`: parallel execution with `foreach()`
### Exercise: Sleeping serially and in parallel
This small example motivates the use of parallel execution and demonstrates how
`bplapply()` can be a drop in for `lapply`.
Use `system.time()` to explore how long this takes to execute as `n`
increases from 1 to 10. Use `identical()` and
`r CRANpkg("microbenchmark")` to compare alternatives `f0()` and `f1()`
for both correctness and performance.
`fun` sleeps for 1 second, then returns `i`.
```{r parallel-sleep}
library(BiocParallel)
fun <- function(i) {
Sys.sleep(1)
i
}
## serial
f0 <- function(n)
lapply(seq_len(n), fun)
## parallel
f1 <- function(n)
bplapply(seq_len(n), fun)
```
[biocViews]: https://bioconductor.org/packages/
[AnnotationDbi]: https://bioconductor.org/packages/AnnotationDbi
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[BSgenome.Hsapiens.UCSC.hg19]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
[BSgenome]: https://bioconductor.org/packages/BSgenome
[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[Biostrings]: https://bioconductor.org/packages/Biostrings
[CNTools]: https://bioconductor.org/packages/CNTools
[ChIPQC]: https://bioconductor.org/packages/ChIPQC
[ChIPseeker]: https://bioconductor.org/packages/ChIPseeker
[DESeq2]: https://bioconductor.org/packages/DESeq2
[DiffBind]: https://bioconductor.org/packages/DiffBind
[GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[Gviz]: https://bioconductor.org/packages/Gviz
[Homo.sapiens]: https://bioconductor.org/packages/Homo.sapiens
[IRanges]: https://bioconductor.org/packages/IRanges
[KEGGREST]: https://bioconductor.org/packages/KEGGREST
[OmicCircos]: https://bioconductor.org/packages/OmicCircos
[PSICQUIC]: https://bioconductor.org/packages/PSICQUIC
[Rsamtools]: https://bioconductor.org/packages/Rsamtools
[Rsubread]: https://bioconductor.org/packages/Rsubread
[ShortRead]: https://bioconductor.org/packages/ShortRead
[SomaticSignatures]: https://bioconductor.org/packages/SomaticSignatures
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[TxDb.Hsapiens.UCSC.hg19.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation
[VariantFiltering]: https://bioconductor.org/packages/VariantFiltering
[VariantTools]: https://bioconductor.org/packages/VariantTools
[airway]: https://bioconductor.org/packages/airway
[biomaRt]: https://bioconductor.org/packages/biomaRt
[cn.mops]: https://bioconductor.org/packages/cn.mops
[csaw]: https://bioconductor.org/packages/csaw
[edgeR]: https://bioconductor.org/packages/edgeR
[ensemblVEP]: https://bioconductor.org/packages/ensemblVEP
[epivizr]: https://bioconductor.org/packages/epivizr
[ggbio]: https://bioconductor.org/packages/ggbio
[h5vc]: https://bioconductor.org/packages/h5vc
[limma]: https://bioconductor.org/packages/limma
[metagenomeSeq]: https://bioconductor.org/packages/metagenomeSeq
[org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db
[org.Sc.sgd.db]: https://bioconductor.org/packages/org.Sc.sgd.db
[phyloseq]: https://bioconductor.org/packages/phyloseq
[rtracklayer]: https://bioconductor.org/packages/rtracklayer
[snpStats]: https://bioconductor.org/packages/snpStats
[dplyr]: https://cran.r-project.org/package=dplyr
[data.table]: https://cran.r-project.org/package=data.table
[Rcpp]: https://cran.r-project.org/package=Rcpp