<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{B2: Efficient R -- Common Pitfalls}
-->
`r knitr::opts_chunk$set(tidy=FALSE)`

# Common pitfalls

# Identity and performance

Remember
- More important that the calculation is correct than fast
- Programmer time is more valuable than computer time
- Others have likely been there before you

But also
- The [VisiCalc][] revolution: _interactive_ computation can be
  transformative. 
  
Anecdote: Fitting GLM's to SNPs.

- Original throughput using `glm()`: 10's of SNPs per second, days for
  full analysis
- Avoid re-specifying model and other standard speed-ups: about 1000
  SNPs per second, hours for total calculation.
- Improved data management via [ncdf][] (nowadays: [rhdf5][]) allowing
  input of data slices. No immediate speed-up, but allows...
- Use on a large cluster (100 nodes x 12 cores / node x 1000 SNPs /
  sec): complete analysis in seconds.
- Transformed from asynchronous, plodding exploration to **interactive
  analysis**
- Epilogue: [snpStats][] and especially [MatrixEQTL][] perform almost
  as well without need for complicated cluster

## `identical` and `all.equal`

- physical vs. numerical identity
```{r identity-all.equal}
identical(1, 1L)
all.equal(1, 1L)
```
- Attributes
```{r all.equal-attributes}
v0 <- c(1, 2); v1 <- c(a=1, b=2)
identical(v0, v1)
all.equal(v0, v1)
all.equal(v0, v1, check.attributes=FALSE)
```

Gotcha
- Environments (hence, reference classes)
```{r all.equal-env}
x <- new.env(); y <- new.env()
all.equal(x, y)
x[["a"]] <- 1
all.equal(x, y)
```
- External pointers

## `system.time` and the [microbenchmark][] package

`system.time()`
- Informal / crude benchmarks: `system.time()`
- First argument can be an expression
```{r system.time-expression}
system.time({
    Sys.sleep(1)
    runif(1000)
})
```
- Use 'gets' (`<-`) to simultaneously time and assign the result
```{r system.time-gets}
system.time(x <- runif(10000000))
length(x)
mean(x)
```

[microbenchmark][]
- Systematically compare timing of different functions
- Automatic replication and randomization of evaluation order
```{r microbenchmark}
library(microbenchmark)
f0 <- function(n) runif(n)
f1 <- function(n) replicate(n, runif(1))
times <- microbenchmark(f0(10), f0(100), f0(1000), f0(2000), f1(10), times=100)
times
```
- Pretty pictures
```{r microbenchmark-times-plot}
plot(times)
```

## `Rprof`

FIXME

# Common solutions

- See [Intermediate Sequence Analysis 2013][SeqAnal] section 2.3.2.

## Exercises

1. **Do no more than needed** From
   [StackOverflow](http://stackoverflow.com/questions/18912805):
   transform SNP genotypes from `A_T` to `AT`. Data:
   ```{r}
   a <- c("A", "C", "G", "T")
   n <- 10000000
   geno <- paste(sample(a, n, TRUE), sample(a, n, TRUE), sep="_")
   ```
   Possible help pages: `?grep`, `?substring`. Compare possibilities
   for error, performance.
   
2. **Hoist common operations from loops, or better...**: 

   Improve the performance of the imperative clamp algorithm, while
   retaining an imperative programming style
   ```{r, eval=FALSE}
   n <- 3
   for (i in seq_along(x))
       if (abs(x[i]) > n * sd(x))
           x[i] <- n * sd(x)
   ```
   and the following data
   ```{r, eval=FALSE}
   set.seed(123)
   x <- runif(10000)
   ```

3. **...Avoid loops entirely by writing in a declaritve manner**:

   Compare the performance of the imperative clamp with a declarative
   alternative
   ```{r, eval=FALSE}
   x[abs(x) > n * sd(x)] <- n * sd(x)
   ```
   How can the performance of the declarative clamp be improved?

# Unit tests

Unit tests
- Write simple functions ('unit tests') that describe expected behavior
- Write unit tests that replicate bugs before being fixed
- Write unit tests that catch common edge cases, e.g., zero-length arguments
- _Very_ useful as your code becomes more complicated; helps think
  about how edge cases should behave.

[RUnit][]
- A framework for writing unit tests
- Helpers to test outcome of tests, e.g., `checkEquals()`,
  `checkIdentical()`, `checkException()`.
- Naming convention for test functions and files containing test functions
- Example: expected behavior of `log`
```{r test-log0, eval=FALSE}
library(RUnit)
test_log0 <- function() {
   checkIdentical(-Inf, log(0))
   checkIdentical(0, log(1))
   checkIdentical(0, exp(log(0)))
   checkIdentical(1, exp(log(1)))
   old_opts <- options(warn=2)
   on.exit(options(old_opts))
   checkException(log(-1), silent=TRUE)
   checkException(log(), silent=TRUE)
}
test_log0()
```
- [Bioconductor best practices][biocunit] for implementing unit tests to be
  evaluated when packages are checked.
- Also: [testthat][] package as alternative test framework

[VisiCalc]: http://wikipedia.org/wiki/VisiCalc
[microbenchmark]: http://cran.fhcrc.org/web/packages/microbenchmark/index.html
[RUnit]: http://cran.fhcrc.org/web/packages/RUnit/index.html
[testthat]: http://cran.fhcrc.org/web/packages/testthat/index.html
[ncdf]: http://cran.r-project.org/web/packages/ncdf/index.html
[rhdf5]: http://bioconductor.org/packages/devel/bioc/html/rhdf5.html
[snpStats]: http://bioconductor.org/packages/devel/bioc/html/snpStats.html
[MatrixEQTL]: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL
[SeqAnal]: http://bioconductor.org/help/course-materials/2013/SeattleMay2013/IntermediateSequenceAnalysis2013.pdf
[biocunit]: http://bioconductor.org/developers/how-to/unitTesting-guidelines/