Authors: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora (sarora@fredhutch.org)

Date: 19 June, 2015

The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.

- Correct: consistent with hand-worked examples (`identical()`, `all.equal()`; see the short example below)
- Robust: supports realistic inputs, e.g., 0-length vectors, `NA` values, …
- Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
- Fast, or at least reasonable given the speed of modern computers.
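
As a quick illustration of 'correct', `identical()` requires exact (bit-for-bit) agreement, while `all.equal()` allows for small numerical differences. The toy values below are invented for illustration.

```
x <- sqrt(2) ^ 2
identical(x, 2)   ## FALSE: the floating-point result differs in the last bits
all.equal(x, 2)   ## TRUE: equal within numerical tolerance
```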

- Profile
  - *Look* at the script to understand in general terms what it is doing.
  - *Step* through the code to see how it is executed, and to gain an understanding of the speed of each line.
  - *Time* evaluation of select lines or simple chunks of code with `system.time()` or the *microbenchmark* package.
  - *Profile* the code with a tool that indicates how much time is spent in each function call or line – the built-in `Rprof()` function, or packages such as *lineprof* or *aprof* (a brief sketch follows below).
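
As a minimal sketch of the profiling step, the built-in `Rprof()` samples the call stack while code runs and `summaryRprof()` reports where the time went. The helper function and temporary file name below are made up purely for illustration.

```
## A deliberately slow stand-in for a script we want to understand
slow_fn <- function(n) {
    x <- numeric()
    for (i in seq_len(n))
        x[i] <- log(i)
    x
}

prof_file <- tempfile()
Rprof(prof_file)                  # start sampling the call stack
invisible(slow_fn(1e6))
Rprof(NULL)                       # stop profiling
head(summaryRprof(prof_file)$by.self)
```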

Vectorize – operate on vectors, rather than explicit loops

```
x <- 1:10
log(x)     ## NOT: for (i in seq_along(x)) x[i] <- log(x[i])
```

```
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
```

Pre-allocate memory, then fill in the result

```
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
    result[i] <- runif(1) * result[i - 1]
result
```

```
## [1] 0.7386850875 0.3696589691 0.3542664926 0.0510763549 0.0254611460
## [6] 0.0070750491 0.0038783618 0.0011608073 0.0007160663 0.0006206700
```

- Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
  - Simple, e.g., ‘hoist’ constant multiplications from a `for` loop
  - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting the same design matrix
- Re-use existing, tested code
  - Efficient implementations of common operations – `tabulate()`, `rowSums()` and friends, `%in%`, …
  - Efficient domain-specific implementations, e.g., *snpStats* for GWAS linear models; *limma* for microarray linear models; *edgeR*, *DESeq2* for negative binomial GLMs relevant to RNASeq
- Re-think how to attack the problem
  - Different implementations
  - Alternative algorithms

- Compile your script with the byte compiler
- Use parallel evaluation
- Speak in tongues – ‘foreign’ languages like C, Fortran

Here’s an obviously inefficient function:

```
f0 <- function(n, a=2) {
## stopifnot(is.integer(n) && (length(n) == 1) &&
## !is.na(n) && (n > 0))
result <- numeric()
for (i in seq_len(n))
result[[i]] <- a * log(i)
result
}
```

Use `system.time()` to investigate how this algorithm scales with `n`, focusing on elapsed time.

`system.time(f0(10000))`

```
## user system elapsed
## 0.119 0.004 0.122
```

```
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```

Remember the current ‘correct’ value, and an approximate time

```
n <- 10000
system.time(expected <- f0(n))
```

```
## user system elapsed
## 0.121 0.000 0.120
```

`head(expected)`

`## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519`

Revise the function to hoist the common multiplier, `a`, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the *microbenchmark* package to compare the two versions.

```
f1 <- function(n, a=2) {
result <- numeric()
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f1(n))
```

`## [1] TRUE`

```
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
```

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## f0(n) 109.7719 109.9072 130.5134 141.9948 142.5390 148.3542 5
## f1(n) 108.5630 139.3563 133.2842 139.4147 139.4894 139.5979 5
```

Adopt a ‘pre-allocate and fill’ strategy

```
f2 <- function(n, a=2) {
result <- numeric(n)
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f2(n))
```

`## [1] TRUE`

`microbenchmark(f0(n), f2(n), times=5)`

```
## Unit: milliseconds
## expr min lq mean median uq max
## f0(n) 121.201141 121.739866 134.761569 143.617813 143.620317 143.628707
## f2(n) 7.684828 7.849474 8.559228 8.415322 8.888803 9.957714
## neval
## 5
## 5
```

Use an `*apply()` function to avoid having to explicitly pre-allocate, and to make opportunities for vectorization more apparent.

```
f3 <- function(n, a=2)
a * sapply(seq_len(n), log)
identical(expected, f3(n))
```

`## [1] TRUE`

`microbenchmark(f0(n), f2(n), f3(n), times=10)`

```
## Unit: milliseconds
## expr min lq mean median uq max
## f0(n) 112.120796 143.855378 142.577849 145.567746 146.700210 178.674194
## f2(n) 7.655573 7.698159 8.250884 8.177678 8.452503 9.566792
## f3(n) 3.579709 3.677523 4.083759 4.048497 4.455364 4.736191
## neval
## 10
## 10
## 10
```

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

```
f4 <- function(n, a=2)
a * log(seq_len(n))
identical(expected, f4(n))
```

`## [1] TRUE`

`microbenchmark(f0(n), f3(n), f4(n), times=10)`

```
## Unit: microseconds
## expr min lq mean median uq max
## f0(n) 112256.477 145432.576 149337.9525 146027.0340 169576.336 179843.968
## f3(n) 3583.222 3925.399 4090.4899 3987.5965 4017.998 5427.407
## f4(n) 364.700 378.439 395.9539 402.8765 407.481 422.962
## neval
## 10
## 10
## 10
```

Returning to our explicit iteration in `f2()`: in situations like this it can be helpful to compile the code to a more efficient representation. Do this using the *compiler* package.

```
library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))
```

`## [1] TRUE`

`microbenchmark(f2(n), f2c(n), times=10)`

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## f2(n) 7.957343 8.582533 8.721356 8.747018 9.005959 9.45067 10
## f2c(n) 2.000680 2.045304 2.157064 2.164375 2.216226 2.38820 10
```

`f4()` definitely seems to be the winner. How does it scale with `n`? (Repeat several times.)

```
n <- 10 ^ (5:8) # much larger than the n used with f0()
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
```

Any explanations for the different pattern of response?

Lessons learned:

- Vectorizing offers a huge improvement over iteration.
- Pre-allocate-and-fill is very helpful when explicit iteration is required.
- `*apply()` functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.
- Hoisting common sub-expressions and using the *compiler* package can be helpful for improving performance when explicit iteration is required.

It can be very helpful to reason about an algorithm in an abstract sense, to gain understanding about how an operation might scale. Here’s an interesting problem, taken from StackOverflow: Suppose one has a very long **sorted** vector

`vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))`

with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing

`f0 <- function(v) sum(v < 0)`

This algorithm compares each element of `vec` to zero, creating a logical vector as long as the original, `length(v)`. This logical vector is then scanned by `sum()` to count the number of elements satisfying the condition.

Questions:

- How many vectors as long as `v` need to be allocated by this algorithm?
- Based on the number of comparisons that need to be performed, how would you expect this algorithm to scale with the length of `v`? Verify this with a simple figure.

```
N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")
```

Is there a better algorithm, i.e., an approach that arrives at the same answer but takes less time (and / or space)? The vector is sorted, and we can take advantage of that by doing a *binary search*. The algorithm is surprisingly simple: create an index to the minimum (first) element, and the maximum (last) element. Check to see if the element half way between is greater than or equal to zero. If so, move the maximum index to that point. Otherwise, make that point the new minimum. Repeat this procedure until the minimum index is no longer less than the maximum index.

```
f3 <- function(x, threshold=0) {
imin <- 1L
imax <- length(x)
while (imax >= imin) {
imid <- as.integer(imin + (imax - imin) / 2)
if (x[imid] >= threshold)
imax <- imid - 1L
else
imin <- imid + 1L
}
imax
}
```

Approximately half of the possible values are discarded at each iteration, so we expect on average to arrive at the answer after about `log2(length(v))` iterations – the algorithm scales with the log of the length of `v`, rather than with the length of `v`, and no long vectors are created. For `vec` above, that is about 21 iterations instead of 2,000,020 element-wise comparisons. These differences become increasingly important as `v` becomes longer.

Questions:

- Verify with simple data that `f3()` and `f0()` result in `identical()` answers. Compare how the timing of `f3()` scales with vector length.

```
## identity
stopifnot(
    identical(f0((-2):2), f3((-2):2)),
    identical(f0(2:4), f3(2:4)),
    identical(f0(-(4:2)), f3(-(4:2))),
    identical(f0(vec), f3(vec)))

## scale
N <- 10^(1:7)
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")
```

- Use the *microbenchmark* package to compare the performance of `f0()` and `f3()` with the original data, `vec`.
- R code can be compiled, and compilation helps most when doing non-vectorized operations like those in `f3()`. Use `compiler::cmpfun()` to compile `f3()`, and compare the result using *microbenchmark*.

```
## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))
```

```
## Unit: microseconds
##     expr      min         lq        mean    median        uq       max
##  f0(vec) 15659.97 16503.9970 17429.61386 17468.603 17947.157 23199.554
##  f3(vec)    28.08    30.9575    47.23053    47.764    52.141   113.544
##  neval
##    100
##    100
```

```
library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))
```

```
## Unit: microseconds
##      expr    min      lq     mean  median      uq    max neval
##   f3(vec) 28.470 29.8355 33.27442 34.6335 36.2500 48.503   100
##  f3c(vec)  6.578  7.0270  7.85183  7.3085  7.6945 18.156   100
```

We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won’t do that!

It is useful to ask what is lost by `f3()` compared to `f0()`. For instance, does the algorithm work on character vectors? What about when the vector contains `NA` values? How are ties at 0 treated?
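
A quick, informal way to probe these questions with made-up inputs (not part of the original exercise):

```
## Ties at zero: both give the count of strictly negative values
v <- c(-2, -1, 0, 0, 0, 1)
c(f0(v), f3(v))          ## 2 2

## NA values: f0() propagates NA; here the binary search happens to
## probe the NA, and 'if (NA)' raises an error
v <- c(-2, -1, NA, 1, 2)
f0(v)                    ## NA
## f3(v)                 ## Error: missing value where TRUE/FALSE needed
```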

`findInterval()` is probably a better built-in way to solve the original problem, and it generalizes to additional situations. The idea is to take the query that we are interested in, `0`, and find the interval specified by `vec` in which it occurs.

```
f4 <- function(v, query=0)
findInterval(query - .Machine$double.eps, v)
identical(f0(vec), f4(vec))
```

`## [1] TRUE`

`microbenchmark(f0(vec), f3(vec), f4(vec))`

```
## Unit: microseconds
## expr min lq mean median uq max
## f0(vec) 15408.806 16331.254 16768.227 16472.9370 16934.660 19656.259
## f3(vec) 28.265 30.392 44.537 43.5655 48.658 97.324
## f4(vec) 13645.076 13698.410 13946.125 13777.6360 14172.281 15026.055
## neval
## 100
## 100
## 100
```

The fact that it is flexible and well tested means that it would often be preferred to `f3()`, even though it is less speedy. For instance, compare the time it takes to query 10000 different points using `f3()` and iteration, versus `findInterval()` and vectorization.

```
threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))
```

`## [1] TRUE`

`microbenchmark(sapply(threshold, f3, x=vec), f4(vec, threshold))`

```
## Unit: microseconds
## expr min lq mean median uq
## sapply(x, f3) 38.121 40.8785 76.45288 83.7155 89.028
## f4(vec, x) 13650.604 13695.7095 13811.30469 13728.5565 13830.765
## max neval
## 154.216 100
## 14673.133 100
```

Some R functions that implement efficient algorithms are `sort()` (including radix sort), `match()` (hash-table look-up), and `tabulate()`; these can be useful in your own code.
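
For example, `%in%` (which is built on `match()`, and so on a hash table) can replace repeated linear scans; the vectors and the `naive()` helper below are invented purely for illustration.

```
library(microbenchmark)
universe <- sample(1e6)                 # 'table' to search in
queries <- sample(1e6, 100)

## naive: scan the whole table once per query
naive <- function(queries, tbl)
    vapply(queries, function(q) any(tbl == q), logical(1))

stopifnot(identical(naive(queries, universe), queries %in% universe))
microbenchmark(naive(queries, universe), queries %in% universe, times = 5)
```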

Lessons learned:

- Choice of algorithm can be very important
- Implementing classical algorithms (like binary search) can be a rewarding learning experience even when, at the end of the day, it may be better to use existing functions.
- The built-in R functions that implement efficient algorithms can be important building-blocks for more complicated code.

This is an advanced exercise; proceed with enthusiastic caution.

This extended example illustrates how one might calculate the distribution of GC content of aligned reads across several BAM files. We start by processing one BAM file sequentially, and then process many BAM files in parallel.

Find paths to the following sample BAM files (these are small, but large enough to illustrate the principle).

```
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
```

BAM files are large, so typically cannot fit into memory. In addition, we will eventually process several BAM files in parallel, so we need to further manage the amount of memory we consume while processing each BAM file. We take two approaches.

The first is to *iterate* through the BAM file in chunks that are large enough to benefit from R's efficient vectorized calculations but not so large as to consume excessive memory. We do this by using `BamFileList()` to indicate that we would like to input aligned reads in chunks of size 100,000.

```
library(Rsamtools)
bfls <- BamFileList(fls, yieldSize=100000)
```

Each time we read from a BAM file, we'll input the next 100,000 records. We'll adopt our second strategy for managing memory by *restricting* the data read from the BAM file to that necessary to calculate GC content, specifically the DNA sequence of each read, in addition to its alignment coordinates. We'll do this by writing a function `yield()` that uses `GenomicAlignments::readGAlignments()` to input the required data; see the help pages for functions that we use but you do not understand, e.g., `?ScanBamParam`.

```
library(GenomicAlignments)
yield <- function(bfl) {
## input a chunk of alignments
library(GenomicAlignments)
readGAlignments(bfl, param=ScanBamParam(what="seq"))
}
```

Next we'll transform our aligned reads to GC content. We will do this using `Biostrings::letterFrequency()` to count the number of G or C nucleotides in each read, and tabulate these counts into bins, one bin for each possible count from 0 through 72 (the maximum read length); summing the per-chunk tabulations will give the cumulative number of reads in each bin.

```
library(Biostrings)
map <- function(aln) { # GC content, bin & cumulate
gc <- letterFrequency(mcols(aln)$seq, "GC")
tabulate(1 + gc, 73) # max. read length: 72
}
```

`map()` will be applied to each chunk of data returned by `yield()`; we'll write a function `reduce()` that combines the results of two calls to `map()` into a single summary. In our case, `reduce` is simply the addition of the return values of two successive calls to `map()`.

```
reduce <- `+`
```
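
Conceptually, these three pieces fit together as a simple loop. The sketch below is only meant to convey the idea, using the `yield()`, `map()`, and `reduce()` functions defined above; the real `reduceByYield()` is more general and is the one to use in practice.

```
## Conceptual sketch of the iterate / map / reduce pattern
sketchReduceByYield <- function(bf, yield, map, reduce) {
    open(bf)
    on.exit(close(bf))
    result <- NULL
    repeat {
        aln <- yield(bf)             # next chunk of alignments
        if (length(aln) == 0L)
            break                    # end of file reached
        tab <- map(aln)              # per-chunk GC tabulation
        result <-
            if (is.null(result)) tab else reduce(result, tab)
    }
    result
}
```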

The *GenomicFiles* package provides a way to stitch these pieces together, specifically the `reduceByYield()` function, illustrated in the following code chunk.

```
library(GenomicFiles)
bf <- BamFile(fls[1], yieldSize=100000)
reduceByYield(bf, yield, map, reduce)
```

```
## [1] 0 0 0 0 0 0 0 0 0 0 0
## [12] 0 1 1 4 24 41 87 238 490 907 1397
## [23] 2127 3208 4706 6220 8737 11052 14680 17020 19360 21804 27047
## [34] 29212 31249 35395 39807 40259 41722 42304 44108 44073 42317 41260
## [45] 38372 35689 32447 27815 22153 18960 14188 10768 7887 6182 4817
## [56] 3332 2101 1652 1455 860 459 235 116 73 34 22
## [67] 6 4 0 0 0 0 0
```

The result printed out above is the number of aligned reads with 0, 1, …, 72 G or C nucleotides. There are never more than 100,000 BAM records in memory at any one time, so memory consumption is modest. Nonetheless, we have processed the entire file.

Now that we can iterate through a single file to generate GC content in a modest amount of memory, it is very easy to process all files in parallel: use `bplapply()` to invoke `reduceByYield()` on each file, passing the additional arguments `yield`, `map`, and `reduce`.

```
library(BiocParallel)
gc <- bplapply(bfls, reduceByYield, yield, map, reduce)
```

The result is a list of GC-count vectors, one element for each file.
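
`bplapply()` uses whatever parallel back-end is currently registered. A minimal sketch of choosing one explicitly; the number of workers here is arbitrary, and `SnowParam()` would be the usual choice on Windows.

```
library(BiocParallel)
register(MulticoreParam(workers = 4))   # forked workers; not available on Windows
gc <- bplapply(bfls, reduceByYield, yield, map, reduce)
```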

The result can be transformed to a `data.frame()`

```
library(ggplot2)
df <- stack(as.data.frame(lapply(gc, cumsum)))
df$GC <- 0:72
```

and visualized, e.g.,

```
library(ggplot2)
ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) +
xlab("Number of GC Nucleotides per Read") +
ylab("Number of Reads")
```