--- title: "5. Counting Reads And Working With Large Files" author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{5. Counting Reads And Working With Large Files} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(GenomicFiles) library(BiocParallel) }) ``` The material in this course requires R version 3.2 and Bioconductor version 3.2 ```{r configure-test} stopifnot( getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() == "3.2" ) ``` # Large data -- `BiocParallel`, `GenomicFiles` ## Restriction - Input only the data necessary, e.g., `ScanBamParam()` - `which`: genomic ranges of interest - `what`: 'columns' of BAM file, e.g., 'seq', 'flag' ## Iteration - Read entire file, but in chunks - Chunk size small enough to fit easily in memory, - Chunk size large enough to benefit from _R_'s vectorized operations -- 10k to 1M records at at time - e.g., `BamFile(..., yieldSize=100000)` Iterative programming model - _yield_ a chunk of data - _map_ input data to convenient representation, often summarizing input to simplified form - E.g., Aligned read coordinates to counts overlapping regions of interest - E.g., Aligned read sequenced to GC content - _reduce_ across mapped chunks - Use `GenomicFiles::reduceByYield()` ```{r iteration} library(GenomicFiles) yield <- function(bfl) { ## input a chunk of alignments library(GenomicAlignments) readGAlignments(bfl, param=ScanBamParam(what="seq")) } map <- function(aln) { ## Count G or C nucleotides per read library(Biostrings) gc <- letterFrequency(mcols(aln)$seq, "GC") ## Summarize number of reads with 0, 1, ... G or C nucleotides tabulate(1 + gc, 73) # max. read length: 72 } reduce <- `+` ``` - Example ```{r iteration-doit} library(RNAseqData.HNRNPC.bam.chr14) fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES bf <- BamFile(fls[1], yieldSize=100000) gc <- reduceByYield(bf, yield, map, reduce) plot(gc, type="h", xlab="GC Content per Aligned Read", ylab="Number of Reads") ``` ## Parallel evaluation - Cores, computers, clusters, clouds - Generally, requires memory management techniques like restriction or iteration -- parallel processes competing for shared memory - Many problems are _embarassingly parallel_ -- `lapply()`-like -- especially in bioinformatics where parallel evaluation is across files - Example: GC content in several BAM files ```{r parallel-doit} library(BiocParallel) gc <- bplapply(BamFileList(fls), reduceByYield, yield, map, reduce) library(ggplot2) df <- stack(as.data.frame(lapply(gc, cumsum))) df$GC <- 0:72 ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) + xlab("Number of GC Nucleotides per Read") + ylab("Number of Reads") ``` # Resources Acknowledgements - Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum. - The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ```