---
title: I. Working with Large Data
author: Martin Morgan (mtmorgan@fredhutch.org)
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{I. Working with Large Data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
suppressPackageStartupMessages({
    library(rtracklayer)
    library(BiocParallel)
    library(GenomicFiles)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
```

# Scalable computing

Efficient _R_ code

- Vectorize!
- Reuse others' work -- `r Biocpkg("DESeq2")`, `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ..., `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`
- Useful tools: `system.time()`, `Rprof()`, `r CRANpkg("microbenchmark")`
- More detail in the [deadly sins](http://bioconductor.org/help/course-materials/2014/CSAMA2014/1_Monday/labs/IntermediateR.html#efficient-code) section of a previous course

Iteration

- Chunk-wise
    - `open()`, read chunk(s), `close()`
    - e.g., the `yieldSize` argument to `Rsamtools::BamFile()`

Restriction

- Limit to columns and / or rows of interest
- Exploit domain-specific formats, e.g., BAM files and `Rsamtools::ScanBamParam()`
- Use a database

Sampling

- Iterate through large data, retaining a manageable sample, e.g., `ShortRead::FastqSampler()`

Parallel evaluation

- **After** writing efficient code
- Typically, `lapply()`-like operations
- Cores on a single machine ('easy'); clusters (more tedious); clouds

# File management

## File classes

| Type   | Example use           | Name                        | Package                          |
|--------|-----------------------|-----------------------------|----------------------------------|
| .bed   | Range annotations     | `BedFile()`                 | `r Biocpkg("rtracklayer")`       |
| .wig   | Coverage              | `WigFile()`, `BigWigFile()` | `r Biocpkg("rtracklayer")`       |
| .gtf   | Transcript models     | `GTFFile()`                 | `r Biocpkg("rtracklayer")`       |
|        |                       | `makeTxDbFromGFF()`         | `r Biocpkg("GenomicFeatures")`   |
| .2bit  | Genomic sequence      | `TwoBitFile()`              | `r Biocpkg("rtracklayer")`       |
| .fastq | Reads & qualities     | `FastqFile()`               | `r Biocpkg("ShortRead")`         |
| .bam   | Aligned reads         | `BamFile()`                 | `r Biocpkg("Rsamtools")`         |
| .tbx   | Indexed tab-delimited | `TabixFile()`               | `r Biocpkg("Rsamtools")`         |
| .vcf   | Variant calls         | `VcfFile()`                 | `r Biocpkg("VariantAnnotation")` |

```{r rtracklayer-file-classes}
## the rtracklayer menagerie of file classes
library(rtracklayer)
names(getClass("RTLFile")@subclasses)
```

Notes

- The interface is not fully consistent across classes
- `open()`, `close()`, `import()` / `yield()` / `read*()`
- Some classes offer selective import via an index (e.g., `.bai`, the BAM index), selection ('columns'), and restriction ('rows'); see the sketch below
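A minimal sketch of restriction and selection with `Rsamtools::ScanBamParam()`, assuming an indexed BAM file is at hand (`example.bam` and the region queried here are hypothetical placeholders, not files used elsewhere in this document):

```{r scanbam-restriction, eval=FALSE}
library(Rsamtools)
## restriction ('rows'): only alignments overlapping chr1:1-1,000,000
which <- GRanges("chr1", IRanges(1, 1000000))
## selection ('columns'): only these BAM fields
what <- c("rname", "pos", "cigar")
param <- ScanBamParam(which=which, what=what)
## 'example.bam' is hypothetical; using 'which' requires its .bai index
aln <- scanBam("example.bam", param=param)
```

Only the requested region and fields are read into memory, so peak memory use is governed by the restriction rather than by the size of the file.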
## Managing a collection of files

`*FileList()` classes

- `reduceByYield()` -- iterate through a single large file
- `bplapply()` (`r Biocpkg("BiocParallel")`) -- perform independent operations on several files, in parallel

`GenomicFiles()`

- 'rows' as genomic range restrictions, 'columns' as files
- Each row x column is a _map_ from file data to a useful representation in _R_
- `reduceByRange()`, `reduceByFile()`: collapse maps into a summary representation
- See [Figure 1](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicFiles/inst/doc/GenomicFiles.pdf) of the GenomicFiles vignette

# Parallel evaluation with BiocParallel

A standardized interface for simple parallel evaluation (a minimal sketch follows the list):

- `bplapply()` instead of `lapply()`
- The `BPPARAM` argument influences how parallel evaluation occurs
    - `MulticoreParam()`: forked processes on a single (non-Windows) machine
    - `SnowParam()`: processes on the same or different machines
    - `BatchJobsParam()`: resource scheduler on a cluster
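As a toy illustration (the computation here is arbitrary; any `lapply()`-style task works the same way), the `BPPARAM` argument selects the back-end without changing the call:

```{r bpparam-sketch, eval=FALSE}
library(BiocParallel)
param <- MulticoreParam(workers=4)          # forked processes; not on Windows
res <- bplapply(1:8, sqrt, BPPARAM=param)   # parallel evaluation
identical(res, lapply(1:8, sqrt))           # same result as serial lapply()
```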
Other resources

- The [Bioconductor Amazon AMI](http://bioconductor.org/help/bioconductor-cloud-ami/) -- easily 'spin up' tens of instances, pre-configured with Bioconductor packages and StarCluster management
- `r Biocpkg("GoogleGenomics")` to interact with the Google compute cloud and resources

# Practical

## Efficient code

Define the following function:

```{r benchmark-f0}
f0 <- function(n) {
    ## inefficient! 'ans' is copied and re-allocated on each iteration
    ans <- numeric()
    for (i in seq_len(n))
        ans <- c(ans, exp(i))
    ans
}
```

Use `system.time()` to explore how long this takes to execute as `n` increases from 100 to 10000. Use `identical()` and `r CRANpkg("microbenchmark")` to compare the alternatives `f1()`, `f2()`, and `f3()` for correctness and performance. What strategy does each function use?

```{r benchmark}
f1 <- function(n) {
    ans <- numeric(n)                      # pre-allocate the result
    for (i in seq_len(n))
        ans[[i]] <- exp(i)
    ans
}
f2 <- function(n) sapply(seq_len(n), exp)  # apply-style iteration
f3 <- function(n) exp(seq_len(n))          # vectorized
```

## Sleeping serially and in parallel

Each call to `fun()` sleeps for 1 second, then returns `i`. Evaluating 8 such calls serially therefore takes about 8 seconds; in parallel it takes only about as long as the slowest worker.

```{r parallel-sleep}
library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n) lapply(seq_len(n), fun)

## parallel
f1 <- function(n) bplapply(seq_len(n), fun)
```

## Reads overlapping windows

This exercise uses the following packages:

```{r csaw-packages}
library(GenomicAlignments)
library(GenomicFiles)
library(BiocParallel)
library(Rsamtools)
library(GenomeInfoDb)
```

It is a re-implementation of the basic `r Biocpkg("csaw")` binned-counts algorithm. It supposes that ChIP fragment lengths are 110 nt, and that we bin coverage in windows of width 50. We focus on chr1.

```{r olaps-chr}
frag.len <- 110
spacing <- 50
chr <- "chr1"
```

Here we point to the BAM files, indicating that we'll process each file in chunks of 1,000,000 records.

```{r olaps-tileGenome}
fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full.names=TRUE)
names(fls) <- sub(".fastq.*", "", basename(fls))
bfl <- BamFileList(fls, yieldSize=1000000)
```

We create the counting bins with `tileGenome()`, restricting to the 'standard' chromosomes:

```{r csaw-tiles}
len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr]
tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)
```

We'll use `reduceByYield()` to iterate through a single file. We need to tell this function how to `YIELD` a chunk of the file, how to `MAP` the chunk from its input representation to per-window counts, and finally how to `REDUCE` successive chunks into a final representation.

`YIELD` is a function of one argument, the input source, returning a chunk of records:

```{r yield}
yield <- function(x, ...)
    readGAlignments(x)
```

`MAP` must take the output of `YIELD`, and perhaps additional arguments, and return a vector of counts. We resize the genomic ranges describing each alignment so that they have width equal to the fragment length:

```{r map}
map <- function(x, tiles, frag.len, ...) {
    gr <- keepStandardChromosomes(granges(x))
    countOverlaps(tiles, resize(gr, frag.len))
}
```

`REDUCE` takes two results from `MAP` (here, vectors of counts) and combines them into a single result. We simply add our vectors (`+` is itself a function!):

```{r reduce}
reduce <- `+`
```

To process one file, we call `reduceByYield()` with the file and the yield, map, and reduce functions. Our 'wrapper' function passes any additional arguments through to `reduceByYield()` using `...`:

```{r reduceByYield}
count1file <- function(bf, ...)
    reduceByYield(bf, yield, map, reduce, ...)
```

Using `yieldSize` and `reduceByYield()` keeps the memory needed to process each file modest, so we can process several files in parallel with `r Biocpkg("BiocParallel")`. `simplify2array()` transforms the resulting list of vectors into a matrix.

```{r count-overlaps-parallel, eval=FALSE}
counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len)
counts <- simplify2array(counts)
dim(counts)
colSums(counts)
```

# Resources

- Lawrence, M., and Morgan, M. (2014). Scalable genomics with _R_ and Bioconductor. _Statistical Science_ 29(2), 214-226. http://arxiv.org/abs/1409.2864v1