---
title: "Evening 2.1: Efficient _R_"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Evening 2.1: Efficient R}
  %\VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
```{r setup, echo=FALSE}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
```
# Inspiration: correlated coin tosses
Wolfgang mentioned on Monday that observing 50 heads followed by 50
tails would not be evidence of a 'fair' coin, even though the observed
number of heads (50) is exactly that expected of a fair coin.
I wonder: how do you generate correlated coin tosses?
Criteria (most to least important)
1. Correct
2. Understandable
3. Robust
4. Fast
# First approach
## Implementation
Suppose we represent heads as '0' and tails as '1'. Let `p` be the
probability that a coin currently showing heads remains heads on the
next toss, and similarly for tails. If `p == 0.5`, then the tosses are
uncorrelated. We implement this idea as a simple function
```{r}
f1 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- NULL
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome <- c(outcome, current)
    }
    outcome
}
```
## Assessment
Is it correct?
```{r}
f1(10, .5)
table(f1(1000, .5))
res <- f1(1000, .9)
mean(rle(res)$length) # expectation: 1 / (1 - p)
```
Is it understandable? Pretty much
Is it robust? We'll come back to this.
Is it fast? Seems so initially, but it scales poorly!
```{r}
system.time(f1(5000, .5))
system.time(f1(10000, .5))
system.time(f1(20000, .5))
```
Doubling the problem size (`n`) causes a 4-fold increase in execution time.
What's the problem? `c(outcome, current)` causes `outcome` to be
copied each time through the loop!
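The quadratic growth can be demonstrated in isolation with a minimal
append loop (a sketch; absolute timings will vary by machine):

```r
## Each `c(x, i)` copies the existing elements of `x`, so building a
## length-n vector by appending does about n^2 / 2 units of work
grow <- function(n) {
    x <- NULL
    for (i in seq_len(n))
        x <- c(x, i)
    x
}
## doubling n should roughly quadruple the elapsed time
system.time(grow(10000))
system.time(grow(20000))
```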
# Second approach
## Implementation
Pre-allocate and fill the outcome vector; `outcome` is updated in
place, without copying.
```{r}
f2 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}
```
## Assessment
Is it correct?
```{r}
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
```
Is it understandable? As understandable as `f1()`.
Is it robust? In a minute...
Is it fast?
```{r}
system.time(f2(5000, .5))
system.time(f2(10000, .5))
system.time(f2(20000, .5))
```
`f2()` seems to scale linearly, so it performs increasingly well
compared to `f1()`.
# Third approach
## Implementation
Note that `runif()` is called `n` times, but it could be called just
once without harming readability -- _hoist_ the `runif(1) > p` test
out of the loop, vectorized as `runif(n) > p`.
```{r}
f3 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in 1:n) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}
```
## Assessment
1. Correct?
```{r}
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
```
2. Understandable? Yes.
3. Robust? None of them have been, and `f3()` really isn't!
```{r}
set.seed(123)
f1(0, .5)
set.seed(123)
try(f3(0, .5))
```
4. Fast? Yes, about 10 times faster than `f2()`
```{r}
n <- 100000
system.time(f2(n, .5))
system.time(f3(n, .5))
```
... with linear scaling.
```{r}
system.time(f3(n * 10, .5))
```
# Fourth approach
## Implementation
The problem is that `1:n` is not robust: in particular, `1:0`
generates the sequence `c(1, 0)`, whereas we were expecting a
zero-length sequence!
Solution: use `seq_len(n)`
```{r}
lapply(3:1, seq_len)
```
```{r}
f4 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in seq_len(n)) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}
```
## Assessment
1. Correct? Yes
2. Understandable? Yes
3. Robust? Yes
```{r}
set.seed(123)
f4(3, .5)
f4(2, .5)
f4(1, .5)
f4(0, .5)
```
4. Fast? Yes
# Fifth approach
## Implementation
Use `cumsum()` (cumulative sum) to produce sequential groups that
have the same head or tail status. Use `%%` on the cumulative sum to
translate those groups into heads (`cumsum() %% 2 == 0`) or tails
(`cumsum() %% 2 == 1`).
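A small worked example with made-up values: treat `test` as marking
the tosses at which the coin changes state.

```r
test <- c(FALSE, TRUE, FALSE, FALSE, TRUE)
cumsum(test)       ## 0 1 1 1 2 -- group index, incremented at each flip
cumsum(test) %% 2  ## 0 1 1 1 0 -- alternating head / tail status
```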
```{r}
f5 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    test <- runif(n) > p
    cumsum(current + test) %% 2
}
```
## Assessment
1. Correct?
```{r}
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
```
2. Understandable? Harder to understand...
3. Robust? Yes
```{r}
f5(0, .8)
```
4. Fast?
```{r}
n <- 1000000
system.time(f4(n, .5))
system.time(f5(n, .5))
system.time(f5(n * 10, .5))
```
About 4x faster than `f4()`; it scales linearly and remains fast even
for very large `n`. `f5()` could be used to generate a large data set
for developing methods for correlated samples, along the lines of
```{r}
correlated_tosses_expts <- function(m, n, p) {
    ## m tosses (rows) in each of n experiments (columns)
    start0 <- rep(rbinom(n, 1, .5), each = m)  # one random start per experiment
    group0 <- cumsum(runif(m * n) > p)
    toss <- (start0 + group0) %% 2
    matrix(toss, m)
}
system.time({
    expt <- correlated_tosses_expts(1000, 10000, .8)
})
hist(colSums(expt))
```
# Further approaches
We have probably reached the point of diminishing returns; we've
already spent far more time developing `f5()` than we'll ever save by
further optimization... However,
- Avoid adding `current` to the `cumsum()` vector.
- Use `rgeom()` to generate change points.
- 'Is there a package for that?'
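As a sketch of the `rgeom()` idea (with a hypothetical `f6()`,
assuming `0 < p < 1`): runs of this process have geometric lengths,
since each toss stays in the same state with probability `p`, so we
can draw whole runs at once instead of testing every toss.

```r
f6 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    ## rgeom() counts 'stay' outcomes before a flip, so a run length is
    ## 1 + rgeom(, prob = 1 - p); draw in batches until n tosses are covered
    lengths <- integer()
    while (sum(lengths) < n)
        lengths <- c(lengths, 1L + rgeom(max(1L, n %/% 10L), prob = 1 - p))
    ## runs alternate between the two states, starting from `current`
    states <- (current + seq_along(lengths) - 1) %% 2
    rep(states, lengths)[seq_len(n)]
}
```

Whether this beats `f5()` in practice is left unmeasured; it mainly
illustrates trading `n` uniform draws for roughly `n * (1 - p)`
geometric draws.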
Other tools
- [microbenchmark][] for comparing fine-scale performance differences
(but do we really care about speed when, e.g., timing differences
are less than a couple of seconds for large-scale data?)
- [testthat][] for writing 'unit tests' that allow easy implementation
of tests for correct and robust code.
[microbenchmark]: https://cran.r-project.org/package=microbenchmark
[testthat]: https://cran.r-project.org/package=testthat
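For example, the correctness and robustness checks above could be
captured as unit tests, sketched here with `testthat` (assuming the
package is installed and that `f4()` from earlier is in scope):

```r
library(testthat)
test_that("f4() is correct and robust", {
    x <- f4(100, .5)
    expect_length(x, 100)
    expect_true(all(x %in% c(0, 1)))
    ## the n == 0 edge case that broke f3()
    expect_length(f4(0, .5), 0)
})
```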
# Session info
```{r}
sessionInfo()
```