# Contents

Original Authors: Martin Morgan, Sonali Arora
Presenting Authors: Martin Morgan, Lori Shepherd
Date: 16 June, 2017

# Efficient R code

The goal of this section is to highlight practices for writing correct, robust and efficient R code.

## Priorities

1. Correct: consistent with hand-worked examples (identical(), all.equal())
2. Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
3. Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
4. Fast, or at least reasonable given the speed of modern computers.

## Strategies

1. Profile
• Look at the script to understand in general terms what it is doing.
• Step through the code to see how it is executed, and to gain an understanding of the speed of each line.
• Time evaluation of select lines or simple chunks of code with system.time() or the microbenchmark package.
• Profile the code with a tool that indicates how much time is spent in each function call or line – the built-in Rprof() function, or packages such as lineprof or aprof
1. Vectorize – operate on vectors, rather than explicit loops

x <- 1:10
log(x)     ## NOT for (i in seq_along) x[i] <- log(x[i])
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851
2. Pre-allocate memory, then fill in the result

result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
result[i] <- runif(1) * result[i - 1]
result
##  [1] 9.215951e-02 5.919572e-02 2.999709e-02 2.296483e-02 1.191122e-02
##  [6] 1.604789e-03 9.858193e-04 7.812575e-04 3.053767e-04 7.543298e-05
3. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
• Simple, e.g., ‘hoist’ constant multiplications from a for loop
• Higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix.
1. Re-use existing, tested code
• Efficient implementations of common operations – tabulate(), rowSums() and friends, %in%, …
• Efficient domain-specific implementations, e.g., snpStats for GWAS linear models; limma for microarray linear models; edgeR, DESeq2 for negative binomial GLMs relevant to RNASeq.
• Reuse others’ work – DESeq2, GenomicRanges, Biostrings, …, dplyr, data.table, Rcpp

### Case Study: Pre-allocate and vectorize

Here’s an obviously inefficient function:

f0 <- function(n, a=2) {
## stopifnot(is.integer(n) && (length(n) == 1) &&
##           !is.na(n) && (n > 0))
result <- numeric()
for (i in seq_len(n))
result[[i]] <- a * log(i)
result
}

Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.

system.time(f0(10000))
##    user  system elapsed
##   0.011   0.000   0.011
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

Remember the current ‘correct’ value, and an approximate time

n <- 10000
system.time(expected <- f0(n))
##    user  system elapsed
##   0.006   0.000   0.005
head(expected)
## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the microbenchmark package to compare the two versions

f1 <- function(n, a=2) {
result <- numeric()
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval cld
##  f0(n) 5.414217 5.669268 6.079984 5.707558 5.727682 7.881197     5   a
##  f1(n) 4.901185 5.023124 5.658701 5.119968 5.218913 8.030314     5   a

Adopt a ‘pre-allocate and fill’ strategy

f2 <- function(n, a=2) {
result <- numeric(n)
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval cld
##  f0(n) 5.234639 5.912153 6.354878 6.106759 6.209482 8.311359     5   b
##  f2(n) 1.871773 1.911535 1.929604 1.932355 1.945145 1.987210     5  a

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.

f3 <- function(n, a=2)
a * sapply(seq_len(n), log)

identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval cld
##  f0(n) 4.961980 5.166718 6.417935 5.726985 8.653193 8.787703    10   b
##  f2(n) 1.796029 1.840470 1.895657 1.871912 1.922014 2.152608    10  a
##  f3(n) 6.582897 6.848146 7.263491 7.100540 7.236031 9.816066    10   b

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

f4 <- function(n, a=2)
a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
##   expr      min       lq     mean   median       uq      max neval cld
##  f0(n) 5019.834 5146.660 5862.152 5318.873 5646.767 8299.264    10   b
##  f3(n) 6605.260 6678.556 6959.684 6905.622 7093.273 7805.310    10   b
##  f4(n)  676.751  701.697 1386.289  704.247  735.588 4604.705    10  a

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times)

n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")

Any explanations for the different pattern of response?

Lessons learned:

1. Vectorizing offers a huge improvement over iteration
2. Pre-allocate-and-fill is very helpful when explicit iteration is required.
3. *apply() functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it
4. Hoisting common sub-expressions can be helpful for improving performance when explicit iteration is required.

# Iteration and restriction to manage memory

When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.

Iteration - Chunk-wise - open(), read chunk(s), close(). - e.g., yieldSize argument to Rsamtools::BamFile() - Framework: GenomicFiles::reduceByYield()

Restriction - Limit to columns and / or rows of interest - Exploit domain-specific formats – BAM files and Rsamtools::ScanBamParam() – BAM files and Rsamtools::PileupParam() – VCF files and VariantAnnotation::ScanVcfParam() - Use a data base

## Exercise: Counting overlaps

Iterate through files: GenomicFiles::reduceByYield()

1. yield a chunk
2. map from the input chunk to a possibly transformed representation
3. reduce mapped chunks
suppressPackageStartupMessages({
library(GenomicFiles)
library(GenomicAlignments)
library(Rsamtools)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(RNAseqData.HNRNPC.bam.chr14)
})

yield <-     # how to input the next chunk of data
function(X, ...)
{
}

map <-       # what to do to each chunk
function(VALUE, ..., roi)
{
olaps <- findOverlaps(VALUE, roi, type="within", ignore.strand=TRUE)
count <- tabulate(subjectHits(olaps), subjectLength(olaps))
setNames(count, names(roi))
}

reduce <- +   # how to combine mapped chunks

Improvement: “yield factory” to keep track of how many records input

yieldFactory <-  # return a function with local state
function()
{
n_records <- 0L
function(X, ...) {
n_records <<- n_records + length(aln)
message(n_records)
aln
}
}

Regions of interest, named like the chromosomes in the bam file.

exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")

A function to iterate through a bam file.

count1 <- function(filename, roi) {
message(filename)
## Create and open BAM file
bf <- BamFile(filename, yieldSize=1000000)
reduceByYield(bf, yieldFactory(), map, reduce, roi=roi)
}

In action

## path to example bam files
library(RNAseqData.HNRNPC.bam.chr14)
bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
count <- count1(bam[[1]], exByTx)
## /home/csama/R/x86_64-pc-linux-gnu-library/3.4/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam
## 800484
## 800484
hist(log10(count)) # drops 0's

# File management

## File classes

Type Example use Name Package
.bed Range annotations BedFile() rtracklayer
.wig Coverage WigFile(), BigWigFile() rtracklayer
.gtf Transcript models GTFFile() rtracklayer
makeTxDbFromGFF() GenomicFeatures
.2bit Genomic Sequence TwoBitFile() rtracklayer
.fastq Reads & qualities FastqFile() ShortRead
.bam Aligned reads BamFile() Rsamtools
.tbx Indexed tab-delimited TabixFile() Rsamtools
.vcf Variant calls VcfFile() VariantAnnotation
## rtracklayer menagerie
suppressPackageStartupMessages(library(rtracklayer))
names(getClass("RTLFile")@subclasses)
##  [1] "UCSCFile"         "GFFFile"          "BEDFile"
##  [4] "WIGFile"          "BigWigFile"       "ChainFile"
##  [7] "TwoBitFile"       "FastaFile"        "TabSeparatedFile"
## [10] "CompressedFile"   "GFF1File"         "GFF2File"
## [13] "GFF3File"         "BEDGraphFile"     "BED15File"
## [16] "BEDPEFile"        "BWFile"           "2BitFile"
## [19] "GTFFile"          "GVFFile"          "GZFile"
## [22] "BGZFile"          "BZ2File"          "XZFile"

Notes

• Not a consistent interface
• open(), close(), import() / yield() / read*()
• Some: selective import via index (e.g., .bai, bam index); selection (‘columns’); restriction (‘rows’)

## Managing a collection of files

*FileList() classes

• reduceByYield() – iterate through a single large file
• bplapply() (BiocParallel) – perform independent operations on several files, in parallel

GenomicFiles()

• ‘rows’ as genomic range restrictions, ‘columns’ as files
• Each row x column is a map from file data to useful representation in R
• reduceByRange(), reduceByFile(): collapse maps into summary representation
• see the GenomicFiles vignette Figure 1

VcfStack()

• Common practice of spliting VCF files into one-per-chromosome.
• Easy way to treat as a signle entitty
• see ?VcfStack

# Parallel evaluation

A couple of caveats -

Iteration / restriction techniques keep the memory requirements under control while parallel evaluation distributes computational load across nodes. Keep in mind that parallel computations are still restricted by the amount of memory available on each node.

There is overhead in setting up and tearing down a cluster and more so when computing in distributed memory. For small calculations, the parallel overhead may outweigh the benefits with no improvement in performance.

Jobs that benefit the most from parallel execution are CPU-intensive and operate on data chunks that fits into memory.

## BiocParallel

BiocParallel provides a standardized interface for parallel evaluation and supports the major parallel computing styles: forks and processes on a single computer, ad hoc clusters, batch schedulers and cloud computing. By default, BiocParallel chooses a parallel back-end appropriate for the OS and is supported across Unix, Mac and Windows.

General ideas:

• Use bplapply() instead of lapply()
• Argument BPPARAM influences how parallel evaluation occurs

• MulticoreParam(): threads on a single (non-Windows) machine
• SnowParam(): processes on the same or different machines
• BatchJobsParam(): resource scheduler on a cluster

### Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates how bplapply() can be a drop in for lapply.

Use system.time() to explore how long this takes to execute as n increases from 1 to 10. Use identical() and microbenchmark to compare alternatives f0() and f1() for both correctness and performance.

fun sleeps for 1 second, then returns i.

library(BiocParallel)

fun <- function(i) {
Sys.sleep(1)
i
}

## serial
f0 <- function(n)
lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
bplapply(seq_len(n), fun)

### Exercise: error handling and BPREDO

BiocParallel “catches and returns” errors along with successful results. This exercise demonstrates how to access the traceback() of a failed task and how to re-run the failed tasks with ‘BPREDO’. Full details on error handling, logging and debugging are in the Errors, Logs and Debugging vignette.

param <- MulticoreParam(workers = 3) # Windows: SnowParam(workers=3)

Call the sqrt() function across ‘X’; the second element is a character and will throw and error.

X <- list(1, "2", 3)
res <- bplapply(X, sqrt, BPPARAM = param)
## Error: BiocParallel errors
##   element index: 2
##   first error: non-numeric argument to mathematical function

It’s also possible to catch the error and partially evaluated result

res <- bptry(bplapply(X, sqrt, BPPARAM=param))
res
## [[1]]
## [1] 1
##
## [[2]]
## <remote_error in FUN(...): non-numeric argument to mathematical function>
## traceback() available as 'attr(x, "traceback")'
##
## [[3]]
## [1] 1.732051

Re-run the failed results by repeating the call to bplapply() this time with corrected input data and the partial results as ‘BPREDO’. Only the failed values are re-run.

X.redo <- list(1, 2, 3)
bplapply(X.redo, sqrt, BPREDO = res)
## resuming previous calculation ...
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051

Alternatively, switch to a SerialParam() and debug the specific element that caused the error.

> fun = function(i) { browser(); sqrt(i) }
> bplapply(X, fun, BPREDO=res, BPPARAM=SerialParam())
resuming previous calculation ...
Called from: FUN(...)
Browse[1]>
debug at #1: sqrt(i)
Browse[2]> i
[1] "2"
Browse[2]> i = 2
Browse[2]> c
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

### Exercise: logging

BiocParallel uses the futile.logger package for logging. The package has a flexible system for filtering messages of varying severity thresholds such as INFO, DEBUG, ERROR etc. (For a list of all thresholds see the ?bpthreshold man page.) BiocParallel captures messages written in futile.logger format as well as messages written to stdout and stderr.

This function does some argument checking and has DEBUG, WARN and INFO-level log messages.

FUN <- function(i) {
flog.debug(paste0("value of 'i': ", i))

if (!length(i)) {
flog.warn("'i' is missing")
NA
} else if (!is(i, "numeric")) {
flog.info("coercing to numeric")
as.numeric(i)
} else {
i
}
}

Turn logging on in the param and set the threshold to WARN.

param <- SnowParam(3, log = TRUE, threshold = "WARN")
bplapply(list(1, "2", integer()), FUN, BPPARAM = param)

Lower the threshold to INFO and DEBUG (i.e., use bpthreshold<-) to see how messages are filtered on severity.

### Exercise: Worker timeout

For long running jobs or untested code it can be useful to set a time limit. The timeout field is the time, in seconds, allowed for each worker to complete a task. If a task takes longer than timeout the worker returns an error.

timeout can be set during param construction,

param <- SnowParam(timeout = 20)
param
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 1; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bptimeout: 20; bpprogressbar: FALSE
##   bpRNGseed:
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK

or with the setter:

bptimeout(param) <- 2
param
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 1; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bptimeout: 2; bpprogressbar: FALSE
##   bpRNGseed:
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK

Use this function to explore different _timeout_s over a numeric vector of ‘X’ values with bplapply(). ‘X’ values less than timeout return successfully while those over threshold return an error.

fun <- function(i) {
Sys.sleep(i)
i
}

### Exercise: Counting overlaps in parallel

Distribute files over workers: GenomicFiles::reduceByFile()

The previous counting example used GenomicFiles::reduceByYield() which operates on a single file and implements a yield, map, reduce paradigm. In this exercise we’ll use GenomicFiles::reduceByFile() which uses bplaply() under the hood to operate on multiple files in parallel.

Primary arguments to reduceByFile() are a set of files and a set of ranges. Files are sent to the workers and data subsets are extracted based on the ranges. The bulk of the work is done in the MAP function and an optional REDUCE function combines the output on each worker.

suppressPackageStartupMessages({
library(BiocParallel)
library(GenomicFiles)
library(GenomicAlignments)
library(Rsamtools)
})

On Unix or Mac, configure a MulticoreParam() with 4 workers. Turn on logging and set a timeout of 60 seconds.

param <- MulticoreParam(4, log = TRUE, timeout = 60)

On Windows do the same with SnowParam():

param <- SnowParam(4, log = TRUE, timeout = 60)

Point to the collection of bam files.

library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bfl <- BamFileList(fls)

Defining ranges (region of interest) restricts the amount of data on the workers and keeps memory requirements under control.

ranges <- GRanges("chr14", IRanges(c(28477797, 29527797, 32448354),
c(29477797, 30527797, 33448354)))

The MAP function reads in records and counts overlaps. readGAlignments() reads in bam records that overlap with any portion of the ranges defined in the scanBamParam (i.e., they could be overlapping the start or the end). Once we’ve got the records in R, we want to count only those that fall ‘within’ the ranges.

MAP <- function(range, file, ...) {
param = ScanBamParam(which=range)  ## restriction
## overlaps
olaps <- findOverlaps(gal, range, type="within", ignore.strand=TRUE)
tabulate(subjectHits(olaps), subjectLength(olaps))
} 

Count …

cts <- reduceByFile(ranges, fls, MAP, BPPARAM = param)
## ############### LOG OUTPUT ###############
## Node: 1
## Timestamp: 2017-06-16 13:29:04
## Success: TRUE
##    user  system elapsed
##   0.650   0.051   2.044
## Memory used:
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 5233773 279.6    8273852 441.9  8273852 441.9
## Vcells 7090701  54.1   28119955 214.6 35046959 267.4
## Log messages:
## stderr and stdout:
## ############### LOG OUTPUT ###############
## Node: 2
## Timestamp: 2017-06-16 13:29:06
## Success: TRUE
##    user  system elapsed
##   0.594   0.079   2.009
## Memory used:
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 5234109 279.6    8273852 441.9  8273852 441.9
## Vcells 7091850  54.2   22495964 171.7 35046959 267.4
## Log messages:
## stderr and stdout:
## ############### LOG OUTPUT ###############
## Node: 3
## Timestamp: 2017-06-16 13:29:06
## Success: TRUE
##    user  system elapsed
##   0.619   0.059   1.882
## Memory used:
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 5234142 279.6    8273852 441.9  8273852 441.9
## Vcells 7091920  54.2   22495964 171.7 35046959 267.4
## Log messages:
## stderr and stdout:
## ############### LOG OUTPUT ###############
## Node: 4
## Timestamp: 2017-06-16 13:29:07
## Success: TRUE
##    user  system elapsed
##   0.606   0.082   1.927
## Memory used:
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 5234175 279.6    8273852 441.9  8273852 441.9
## Vcells 7091988  54.2   22495964 171.7 35046959 267.4
## Log messages:
## stderr and stdout:

The result is a list the same length as the number of files.

length(cts)
## [1] 8

Each list element is the length of the number of ranges.

lengths(cts)
## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
##         3         3         3         3         3         3         3
## ERR127305
##         3

Tables of counts for each range are extracted with [[; use simplifyToArray() to get a matrix of counts

cts[[1]]
## [[1]]
## [1] 429
##
## [[2]]
## [1] 22
##
## [[3]]
## [1] 1338
simplify2array(cts)
##      ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
## [1,] 429       477       508       452       516       721       713
## [2,] 22        18        37        24        14        26        30
## [3,] 1338      1938      1618      1683      1988      2526      2372
##      ERR127305
## [1,] 544
## [2,] 20
## [3,] 1785`

## Other resources

• Bioconductor Amazon AMI

• easily ‘spin up’ 10’s of instances
• Pre-configured with Bioconductor packages and StarCluster management