`r knitr::opts_chunk$set(tidy=FALSE)` # Common pitfalls # Identity and performance Remember - More important that the calculation is correct than fast - Programmer time is more valuable than computer time - Others have likely been there before you But also - The [VisiCalc][] revolution: _interactive_ computation can be transformative. Anecdote: Fitting GLM's to SNPs. - Original throughput using `glm()`: 10's of SNPs per second, days for full analysis - Avoid re-specifying model and other standard speed-ups: about 1000 SNPs per second, hours for total calculation. - Improved data management via [ncdf][] (nowadays: [rhdf5][]) allowing input of data slices. No immediate speed-up, but allows... - Use on a large cluster (100 nodes x 12 cores / node x 1000 SNPs / sec): complete analysis in seconds. - Transformed from asynchronous, plodding exploration to **interactive analysis** - Epilogue: [snpStats][] and especially [MatrixEQTL][] perform almost as well without need for complicated cluster ## `identical` and `all.equal` - physical vs. numerical identity ```{r identity-all.equal} identical(1, 1L) all.equal(1, 1L) ``` - Attributes ```{r all.equal-attributes} v0 <- c(1, 2); v1 <- c(a=1, b=2) identical(v0, v1) all.equal(v0, v1) all.equal(v0, v1, check.attributes=FALSE) ``` Gotcha - Environments (hence, reference classes) ```{r all.equal-env} x <- new.env(); y <- new.env() all.equal(x, y) x[["a"]] <- 1 all.equal(x, y) ``` - External pointers ## `system.time` and the [microbenchmark][] package `system.time()` - Informal / crude benchmarks: `system.time()` - First argument can be an expression ```{r system.time-expression} system.time({ Sys.sleep(1) runif(1000) }) ``` - Use 'gets' (`<-`) to simultaneously time and assign the result ```{r system.time-gets} system.time(x <- runif(10000000)) length(x) mean(x) ``` [microbenchmark][] - Systematically compare timing of different functions - Automatic replication and randomization of evaluation order ```{r microbenchmark} library(microbenchmark) f0 <- function(n) runif(n) f1 <- function(n) replicate(n, runif(1)) times <- microbenchmark(f0(10), f0(100), f0(1000), f0(2000), f1(10), times=100) times ``` - Pretty pictures ```{r microbenchmark-times-plot} plot(times) ``` ## `Rprof` FIXME # Common solutions - See [Intermediate Sequence Analysis 2013][SeqAnal] section 2.3.2. ## Exercises 1. **Do no more than needed** From [StackOverflow](http://stackoverflow.com/questions/18912805): transform SNP genotypes from `A_T` to `AT`. Data: ```{r} a <- c("A", "C", "G", "T") n <- 10000000 geno <- paste(sample(a, n, TRUE), sample(a, n, TRUE), sep="_") ``` Possible help pages: `?grep`, `?substring`. Compare possibilities for error, performance. 2. **Hoist common operations from loops, or better...**: Improve the performance of the imperative clamp algorithm, while retaining an imperative programming style ```{r, eval=FALSE} n <- 3 for (i in seq_along(x)) if (abs(x[i]) > n * sd(x)) x[i] <- n * sd(x) ``` and the following data ```{r, eval=FALSE} set.seed(123) x <- runif(10000) ``` 3. **...Avoid loops entirely by writing in a declaritve manner**: Compare the performance of the imperative clamp with a declarative alternative ```{r, eval=FALSE} x[abs(x) > n * sd(x)] <- n * sd(x) ``` How can the performance of the declarative clamp be improved? # Unit tests Unit tests - Write simple functions ('unit tests') that describe expected behavior - Write unit tests that replicate bugs before being fixed - Write unit tests that catch common edge cases, e.g., zero-length arguments - _Very_ useful as your code becomes more complicated; helps think about how edge cases should behave. [RUnit][] - A framework for writing unit tests - Helpers to test outcome of tests, e.g., `checkEquals()`, `checkIdentical()`, `checkException()`. - Naming convention for test functions and files containing test functions - Example: expected behavior of `log` ```{r test-log0, eval=FALSE} library(RUnit) test_log0 <- function() { checkIdentical(-Inf, log(0)) checkIdentical(0, log(1)) checkIdentical(0, exp(log(0))) checkIdentical(1, exp(log(1))) old_opts <- options(warn=2) on.exit(options(old_opts)) checkException(log(-1), silent=TRUE) checkException(log(), silent=TRUE) } test_log0() ``` - [Bioconductor best practices][biocunit] for implementing unit tests to be evaluated when packages are checked. - Also: [testthat][] package as alternative test framework [VisiCalc]: http://wikipedia.org/wiki/VisiCalc [microbenchmark]: http://cran.fhcrc.org/web/packages/microbenchmark/index.html [RUnit]: http://cran.fhcrc.org/web/packages/RUnit/index.html [testthat]: http://cran.fhcrc.org/web/packages/testthat/index.html [ncdf]: http://cran.r-project.org/web/packages/ncdf/index.html [rhdf5]: http://bioconductor.org/packages/devel/bioc/html/rhdf5.html [snpStats]: http://bioconductor.org/packages/devel/bioc/html/snpStats.html [MatrixEQTL]: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL [SeqAnal]: http://bioconductor.org/help/course-materials/2013/SeattleMay2013/IntermediateSequenceAnalysis2013.pdf [biocunit]: http://bioconductor.org/developers/how-to/unitTesting-guidelines/