Introduction

This course is directed at intermediate R / Bioconductor users who, in an effort to get the most out of high-throughput sequence and other analyses, want to understand more about how R and Bioconductor work.

The course begins by reviewing R data types, memory management, and other aspects of internal computation. We use this as a basis for understanding how to write, debug, and assess the performance of efficient R code, including straightforward approaches to iteration, vectorization, and parallel evaluation.
We then explore R objects, especially the S4 object system. We learn how to specify simple and more complicated S4 classes, and how to implement essential methods for single and multiple dispatch. We use insights from performance and the S4 class system to explore strategies for efficient representation of large structured data, especially the classes in the IRanges, GenomicRanges, VariantAnnotation, and Biostrings packages.
Availability of programming libraries (such as samtools) or performance needs may sometimes point to use of C or C++ code integrated into R. We develop some simple C functions, and explore use of Rcpp as a relatively painless way to incorporate C or C++ code. We take a brief look at R's internal data representations, and explore how to debug and profile C code.
Finally, we investigate how R can be used to interact with other important resources: databases, web sites, and visualization facilities like shiny. Use of some of these facilities is illustrated by packages such as AnnotationDbi and biomaRt.

A tentative schedule is below:

Monday
  Morning (9:00 - 12:30)   Efficient R
  Afternoon (1:30 - 5:00)  Objects
Tuesday
  Morning (9:00 - 12:30)   C
  Afternoon (1:30 - 5:00)  Databases, XML, shiny, ...

Resources

Intermediate Sequence Analysis 2013 manual
Lawrence et al., 2013, Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): doi:10.1371/journal.pcbi.1003118
Hadley Wickham's maturing ebook