--- title: "Evening 2.1: Efficient _R_" output: BiocStyle::html_document: toc: true vignette: > % \VignetteIndexEntry{Evening 2.1: Efficient R} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r setup, echo=FALSE} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) ``` # Inspiration: correlated coin tosses Wolfgang mentioned on Monday that observing 50 heads and then 50 tails would not be evidence of a 'fair' coin, even though the observed number of head (50) was exactly that expected of a fair coin. I wonder, how do you generate correlated coin tosses? Criteria (most to least important) 1. Correct 2. Understandable 3. Robust 4. Fast # First approach ## Implementation Suppose we represent heads as '0' and tails as '1'. Let `p` be the probability that a coin that is currently a head is tossed and remains a head, and similarly for tails. If `p == 0.5`, then the tosses would seem to be uncorrelated. We implement this idea as a simple function ```{r} f1 <- function(n, p) { current <- rbinom(1, 1, 0.5) outcome <- NULL for (i in 1:n) { if (runif(1) > p) current <- (current + 1) %% 2 outcome <- c(outcome, current) } outcome } ``` ## Assessment Is it correct? ```{r} f1(10, .5) table(f1(1000, .5)) res <- f1(1000, .9) mean(rle(res)$length) # expectation: 1 / (1 - p) ``` Is it understandable? Pretty much Is it robust? We'll come back to this. Is it fast? Seems so initially, but it scales poorly! ```{r} system.time(f1(5000, .5)) system.time(f1(10000, .5)) system.time(f1(20000, .5)) ``` Doubling the problem size (`n`) causes a 4-fold increase in execution time. What's the problem? `c(outcome, current)` causes `outcome` to be copied each time through the loop! # Second approach ## Implementation Pre-allocate and fill the outcome vector; `outome` updated in place, without copying. ```{r} f2 <- function(n, p) { current <- rbinom(1, 1, 0.5) outcome <- numeric(n) for (i in 1:n) { if (runif(1) > p) current <- (current + 1) %% 2 outcome[i] <- current } outcome } ``` ## Assessment Is it correct? ```{r} set.seed(123) res1 <- f1(100, .9) set.seed(123) res2 <- f2(100, .9) identical(res1, res2) ``` Is it understandable? As understandable as `f1()`. Is it robust? In a minute... Is it fast? ```{r} system.time(f2(5000, .5)) system.time(f2(10000, .5)) system.time(f2(20000, .5)) ``` `f2()` seems to scale linearly, so performs increasingly well compared to `f1()`. # Third approach ## Implementation Note that `runif()` is called `n` times, but the program could be modified, and still be understandable, if it were called just once -- _hoist_ `runif(1) > p` out of the loop. ```{r} f3 <- function(n, p) { current <- rbinom(1, 1, 0.5) outcome <- numeric(n) test <- runif(n) > p for (i in 1:n) { if (test[i]) current <- (current + 1) %% 2 outcome[i] <- current } outcome } ``` ## Assessment 1. Correct? ```{r} set.seed(123) res1 <- f1(100, .9) set.seed(123) res3 <- f3(100, .9) identical(res1, res3) ``` 2. Understandable? Yes. 3. Robust? None of them have been, and `f3()` really isn't! ```{r} set.seed(123) f1(0, .5) set.seed(123) try(f3(0, .5)) ``` 4. Fast? Yes, about 10 times faster than `f3()` ```{r} n <- 100000 system.time(f2(n, .5)) system.time(f3(n, .5)) ``` ... with linear scaling. 
Timing a 10-fold larger `n` illustrates the linear scaling:

```{r}
system.time(f3(n * 10, .5))
```

# Fourth approach

## Implementation

The problem is that `1:n` is not robust: in particular, `1:0`
generates the sequence `c(1, 0)`, whereas we were expecting a
zero-length sequence! Solution: use `seq_len(n)`.

```{r}
lapply(3:1, seq_len)
```

```{r}
f4 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in seq_len(n)) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}
```

## Assessment

1. Correct? Yes

2. Understandable? Yes

3. Robust? Yes

```{r}
set.seed(123)
f4(3, .5)
f4(2, .5)
f4(1, .5)
f4(0, .5)
```

4. Fast? Yes

# Fifth approach

## Implementation

Use `cumsum()` (cumulative sum) of the flip indicator to assign each
toss to a group of consecutive tosses with the same head-or-tail
status. Use `%%` on the cumulative sum to translate those groups into
heads (`cumsum() %% 2 == 0`) or tails (`cumsum() %% 2 == 1`).

```{r}
f5 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    test <- runif(n) > p
    (current + cumsum(test)) %% 2
}
```

## Assessment

1. Correct?

```{r}
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
```

2. Understandable? Harder to understand...

3. Robust? Yes

```{r}
f5(0, .8)
```

4. Fast?

```{r}
n <- 1000000
system.time(f4(n, .5))
system.time(f5(n, .5))
system.time(f5(n * 10, .5))
```

About 4x faster than `f4()`, scales linearly, and is fast even for
very large `n`. It could be used to generate a large data set for
developing methods for correlated samples, along the lines of

```{r}
correlated_tosses_expts <- function(m, n, p) {
    ## m tosses (rows) in each of n experiments (columns)
    start0 <- rep(rbinom(n, 1, .5), each = m)
    group0 <- cumsum(runif(m * n) > p)
    toss <- (start0 + group0) %% 2
    matrix(toss, m)
}
system.time({
    expt <- correlated_tosses_expts(1000, 10000, .8)
})
hist(colSums(expt))
```

# Further approaches

We have probably reached the point of diminishing returns; we've
already spent far more time developing `f5()` than we'll ever save by
further optimization... However, we could still

- Avoid adding `current` to the `cumsum()` vector.
- Use `rgeom()` to generate change points (a sketch appears in the
  appendix at the end of this document).
- Ask, 'Is there a package for that?'

Other tools

- [microbenchmark][] for comparing fine-scale performance differences
  (but do we really care about speed when, e.g., timing differences
  are less than a couple of seconds for large-scale data?)
- [testthat][] for writing 'unit tests' that make it easy to check
  that code is correct and robust.

[microbenchmark]: https://cran.r-project.org/package=microbenchmark
[testthat]: https://cran.r-project.org/package=testthat

# Session info

```{r}
sessionInfo()
```
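
# Appendix: a sketch using `rgeom()`

The 'Further approaches' section suggested generating change points
with `rgeom()`: runs of identical tosses have geometric lengths with
mean `1 / (1 - p)`, so whole runs can be drawn at once instead of
testing every toss. The sketch below is one way this might look; the
function name `f6()`, the batch size, and the sanity checks are
illustrative rather than from the original material. It assumes
`0 <= p < 1` and, because it consumes the random number stream
differently, will not reproduce `f1()` through `f5()` under a common
seed.

```{r}
f6 <- function(n, p) {
    ## sketch: draw whole runs of identical tosses; run lengths are
    ## geometric with mean 1 / (1 - p). Assumes 0 <= p < 1.
    stopifnot(p >= 0, p < 1)
    current <- rbinom(1, 1, 0.5)
    len <- integer(0)
    while (sum(len) < n) {
        ## rgeom() counts 'failures', so add 1 to get run lengths >= 1
        len <- c(len, rgeom(max(1L, ceiling(n * (1 - p))), 1 - p) + 1L)
    }
    state <- (current + seq_along(len) - 1L) %% 2  # alternate 0 / 1 across runs
    head(rep(state, len), n)
}
## quick sanity checks
f6(0, .8)                          # robust for n == 0
table(f6(1000, .5))
mean(rle(f6(10000, .9))$lengths)   # expectation: 1 / (1 - p) = 10
```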