`r knitr::opts_chunk$set(tidy=FALSE)` # Parallel evaluation and memory managment for high performance computing ## Parallel evaluation ### Approaches - Parallel linear algebra libraries - See case study, below - (non-Windows) forked processes - Socket and other _ad hoc_ clusters - High-performance MPI clusters ### Interfaces ATLAS, ACML BLAS library - Great for _efficient_ and _parallel_ linear algebra routines - Transparent use - Building R with [user BLAS][BLAS]; tricky but not impossible. 'parallel' package - multi-core - `mclapply` and friends: parallel computation on lists (`lapply(...)`) - `pvec`: parallel computations on vectors (`fun(...)`) - `mcparallel`, `mccollect`: fork and retrieve - Shared memory, copy-on-write - Simple Network of Workstations (SNOW) - Communication via sockets - Manager-worker model: single R process spawns workers - Distinct processes, so each worker needs to be made the same - Easy to use; cross-platform [Rmpi][] - MPI interface - Manager / worker model - Spawn, `lapply`-like, etc - Easy to use interactively - Pool of available hosts can be managed by mpi (via e.g., `mpirun`, a [StackOverflow example][Rmpieg]), which makes sense (responsibility for managing cluster structure should not be R's responsibility). - Also 'single instruction / multiple data' style - Better than manager worker: MPI and cluster management software manage jobs - Example: [pbdR][] [foreach][] - Iteration (yuck!) with plug-in back-ends (yum!) [BatchJobs][] - Managing queues [BiocParallel][] - Common interface, e.g., `bplapply` for any back-end - Registration of current back end, so automatic selection - More sensible defaults and implementations - Additionl evaluation models ### Case study: parallel linear algebra Motivation: [StackOverflow][18964837] question about calculating correlation coefficients between columns in a large (1M x 400) numeric matrix. - Plain R: How does `cor()` scale? Some set-up: ```{r cor-scale-setup} set.seed(123) timer <- function(dim, FUN, nrep=3) { print(dim) m <- matrix(runif(dim[1] * dim[2]), dim[1]) mean(replicate(nrep, system.time(FUN(m))["elapsed"])) } parm <- expand.grid(m=10^(4:5), n=as.integer(seq(20, 300, length.out=3))) ``` And then calculation: ```{r cor-scale, eval=FALSE} parm$cor <- apply(parm[,1:2], 1, timer, cor) xtabs(cor ~ m + n, parm) ``` - A little math: (1) center; (2) standardize; (3) cross-product. Step (3) is the slow part, and is performed by R's math library. Implementation: ```{r correlation-impl} fastcor <- function(m) { m <- t(m) m <- m - rowMeans(m) # center m <- m / sqrt(rowSums(m^2)) # scale tcrossprod(m) # cross-product } ``` How are we doing? ```{r correlation-setup} ## 'small' data set initially m <- 100000; n <- 50 mat <- matrix(runif(m * n), m) ``` Timing and correctness: ```{r fastcor-timing-identity} system.time(c0 <- cor(mat)) system.time(c1 <- fastcor(mat)) all.equal(c0, c1) # why not identical()? ``` Scaling: ```{r fastcor-scale, eval=FALSE} parm$fastcor <- apply(parm[,1:2], 1, timer, fastcor) parm$crossprod <- apply(parm[,1:2], 1, timer, crossprod) ``` Visualizing: ```{r cor-scale-plot,eval=FALSE} library(lattice) xyplot(sqrt(cor) + sqrt(fastcor) + sqrt(crossprod) ~ n, group=m, parm, type="b", pch=20, cex=2, layout=c(3, 1), xlab="Columns", ylab="sqrt(Time)", main="Native", key=simpleKey(text=sprintf("%d", unique(parm$m)), lines=TRUE, points=FALSE, x=.02, y=.95, title="Rows", cex.title=1)) ``` ![Native library](./NATIVE.png) **Same** scaling, **worse** coefficient, **worse** algorithm! - But wait! Use a parallel BLAS library: ![ATLAS parallel BLAS library](./ATLAS.png) - Initial diminishing gains -- multiple processor - Linear end -- implementation (configurable?); better coefficient 'Easy' to arrange for parallel evaluation on Linux / Mac: (1) install appropriate library using a package manager, e.g., `apt get install libatlas3gf-base`; (2) download R source and configure to use installed library (see [R Installation and Administration][BLAS]); (3) `make -j`; (4) Use. ## Memory management Basic observations - *Embarassingly parallel* problems require parallel solution because of volume of data, rather than complexity of algorithm - On a single machine, using more cores implies each cores has access to less memory - Expensive to move data from manager to worker - **Conclusion**: memory management is an integral part of high performance computing Approach - Use file system as a passive way to make data available across a cluster - Organize data in a way that allows slices to be drawn in to memory - data base (especially _relational_ data resources) - [rhdf5][], especially for array data common in scientific programming - R-specific solutions, e.g., [ff][], [bigmemory][]. Used by, e.g., [oligo][] - Domain-specific solutions, e.g., indexed BAM files Common solutions 1. Restrict data input to just that required 2. Draw a sample and infer statistical properties if appropriate, e.g., QA 3. Iterate through large data - In chunks of many 100k's rows, to use R's efficient vector operations - Examples of use: `Rsamtools::filterBam`, `GenomicRanges::summarizeOverlaps` 4. Combine efficient code + 1-3 with parallel evaluation **Case study**: Counting reads overlapping regions of interest, [Intermediate Sequence Analysis 2013][SeqAnal] Chapter 7. [BLAS]: http://cran.r-project.org/doc/manuals/r-devel/R-admin.html#BLAS [SeqAnal]: http://bioconductor.org/help/course-materials/2013/SeattleMay2013/IntermediateSequenceAnalysis2013.pdf [Rmpi]: http://cran.fhcrc.org/web/packages/Rmpi [foreach]: http://cran.fhcrc.org/web/packages/foreach [pbdR]: https://rdav.nics.tennessee.edu/2012/09/pbdr [BatchJobs]: http://cran.fhcrc.org/web/packages/BatchJobs [BiocParallel]: http://bioconductor.org/packages/devel/bioc/html/BiocParallel.html [Streamer]: http://bioconductor.org/packages/devel/bioc/html/Streamer.html [bigmemory]: http://cran.fhcrc.org/web/packages/bigmemory [ff]: http://cran.fhcrc.org/web/packages/ff [oligo]: http://bioconductor.org/packages/devel/bioc/html/oligo.html [rhdf5]: http://bioconductor.org/packages/devel/bioc/html/rhdf5.html [18964837]: http://stackoverflow.com/questions/18964837/fast-correlation-in-r-using-c-and-parallelization [Rmpieg]: http://stackoverflow.com/questions/19066606/r-error-in-rmpi-with-snow/19067500#19067500