`r knitr::opts_chunk$set(tidy=FALSE)`
# Parallel evaluation and memory managment for high performance computing
## Parallel evaluation
### Approaches
- Parallel linear algebra libraries
- See case study, below
- (non-Windows) forked processes
- Socket and other _ad hoc_ clusters
- High-performance MPI clusters
### Interfaces
ATLAS, ACML BLAS library
- Great for _efficient_ and _parallel_ linear algebra routines
- Transparent use
- Building R with [user BLAS][BLAS]; tricky but not impossible.
'parallel' package
- multi-core
- `mclapply` and friends: parallel computation on lists (`lapply(...)`)
- `pvec`: parallel computations on vectors (`fun(...)`)
- `mcparallel`, `mccollect`: fork and retrieve
- Shared memory, copy-on-write
- Simple Network of Workstations (SNOW)
- Communication via sockets
- Manager-worker model: single R process spawns workers
- Distinct processes, so each worker needs to be made the same
- Easy to use; cross-platform
[Rmpi][]
- MPI interface
- Manager / worker model
- Spawn, `lapply`-like, etc
- Easy to use interactively
- Pool of available hosts can be managed by mpi (via e.g., `mpirun`,
a [StackOverflow example][Rmpieg]), which makes sense
(responsibility for managing cluster structure should not be R's
responsibility).
- Also 'single instruction / multiple data' style
- Better than manager worker: MPI and cluster management software
manage jobs
- Example: [pbdR][]
[foreach][]
- Iteration (yuck!) with plug-in back-ends (yum!)
[BatchJobs][]
- Managing queues
[BiocParallel][]
- Common interface, e.g., `bplapply` for any back-end
- Registration of current back end, so automatic selection
- More sensible defaults and implementations
- Additionl evaluation models
### Case study: parallel linear algebra
Motivation: [StackOverflow][18964837] question about calculating
correlation coefficients between columns in a large (1M x 400) numeric
matrix.
- Plain R: How does `cor()` scale? Some set-up:
```{r cor-scale-setup}
set.seed(123)
timer <- function(dim, FUN, nrep=3) {
print(dim)
m <- matrix(runif(dim[1] * dim[2]), dim[1])
mean(replicate(nrep, system.time(FUN(m))["elapsed"]))
}
parm <- expand.grid(m=10^(4:5),
n=as.integer(seq(20, 300, length.out=3)))
```
And then calculation:
```{r cor-scale, eval=FALSE}
parm$cor <- apply(parm[,1:2], 1, timer, cor)
xtabs(cor ~ m + n, parm)
```
- A little math: (1) center; (2) standardize; (3) cross-product. Step
(3) is the slow part, and is performed by R's math library.
Implementation:
```{r correlation-impl}
fastcor <- function(m) {
m <- t(m)
m <- m - rowMeans(m) # center
m <- m / sqrt(rowSums(m^2)) # scale
tcrossprod(m) # cross-product
}
```
How are we doing?
```{r correlation-setup}
## 'small' data set initially
m <- 100000; n <- 50
mat <- matrix(runif(m * n), m)
```
Timing and correctness:
```{r fastcor-timing-identity}
system.time(c0 <- cor(mat))
system.time(c1 <- fastcor(mat))
all.equal(c0, c1) # why not identical()?
```
Scaling:
```{r fastcor-scale, eval=FALSE}
parm$fastcor <- apply(parm[,1:2], 1, timer, fastcor)
parm$crossprod <- apply(parm[,1:2], 1, timer, crossprod)
```
Visualizing:
```{r cor-scale-plot,eval=FALSE}
library(lattice)
xyplot(sqrt(cor) + sqrt(fastcor) + sqrt(crossprod) ~ n,
group=m, parm, type="b", pch=20, cex=2, layout=c(3, 1),
xlab="Columns", ylab="sqrt(Time)", main="Native",
key=simpleKey(text=sprintf("%d", unique(parm$m)), lines=TRUE,
points=FALSE, x=.02, y=.95, title="Rows", cex.title=1))
```
![Native library](./NATIVE.png)
**Same** scaling, **worse** coefficient, **worse** algorithm!
- But wait! Use a parallel BLAS library:
![ATLAS parallel BLAS library](./ATLAS.png)
- Initial diminishing gains -- multiple processor
- Linear end -- implementation (configurable?); better coefficient
'Easy' to arrange for parallel evaluation on Linux / Mac: (1)
install appropriate library using a package manager, e.g., `apt get
install libatlas3gf-base`; (2) download R source and configure to
use installed library (see [R Installation and
Administration][BLAS]); (3) `make -j`; (4) Use.
## Memory management
Basic observations
- *Embarassingly parallel* problems require parallel solution because
of volume of data, rather than complexity of algorithm
- On a single machine, using more cores implies each cores has access
to less memory
- Expensive to move data from manager to worker
- **Conclusion**: memory management is an integral part of high
performance computing
Approach
- Use file system as a passive way to make data available across a cluster
- Organize data in a way that allows slices to be drawn in to memory
- data base (especially _relational_ data resources)
- [rhdf5][], especially for array data common in scientific programming
- R-specific solutions, e.g., [ff][], [bigmemory][]. Used by, e.g., [oligo][]
- Domain-specific solutions, e.g., indexed BAM files
Common solutions
1. Restrict data input to just that required
2. Draw a sample and infer statistical properties if appropriate, e.g.,
QA
3. Iterate through large data
- In chunks of many 100k's rows, to use R's efficient vector
operations
- Examples of use: `Rsamtools::filterBam`,
`GenomicRanges::summarizeOverlaps`
4. Combine efficient code + 1-3 with parallel evaluation
**Case study**: Counting reads overlapping regions of interest,
[Intermediate Sequence Analysis 2013][SeqAnal] Chapter 7.
[BLAS]: http://cran.r-project.org/doc/manuals/r-devel/R-admin.html#BLAS
[SeqAnal]: http://bioconductor.org/help/course-materials/2013/SeattleMay2013/IntermediateSequenceAnalysis2013.pdf
[Rmpi]: http://cran.fhcrc.org/web/packages/Rmpi
[foreach]: http://cran.fhcrc.org/web/packages/foreach
[pbdR]: https://rdav.nics.tennessee.edu/2012/09/pbdr
[BatchJobs]: http://cran.fhcrc.org/web/packages/BatchJobs
[BiocParallel]: http://bioconductor.org/packages/devel/bioc/html/BiocParallel.html
[Streamer]: http://bioconductor.org/packages/devel/bioc/html/Streamer.html
[bigmemory]: http://cran.fhcrc.org/web/packages/bigmemory
[ff]: http://cran.fhcrc.org/web/packages/ff
[oligo]: http://bioconductor.org/packages/devel/bioc/html/oligo.html
[rhdf5]: http://bioconductor.org/packages/devel/bioc/html/rhdf5.html
[18964837]: http://stackoverflow.com/questions/18964837/fast-correlation-in-r-using-c-and-parallelization
[Rmpieg]: http://stackoverflow.com/questions/19066606/r-error-in-rmpi-with-snow/19067500#19067500