CSAMA 2014

Author: Martin Morgan (mtmorgan@fhcrc.org)

Date: 23 June, 2014

- Refresher: working with data
- Case study: ALL phenotypic data
- Case study: Weighty matters

- Making progress: classes, methods, and packages
- Case study: IRanges and GRanges

- A sequence analysis package tour
- Avoiding deadly sins: efficient code
- Case study: From iteration to vectorization
- Case study: Choosing algorithms

This case study serves as a refresher on basic input and manipulation of data.

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

```
fname <- file.choose() ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)
```

Check out the help page `?read.delim` for input options, and explore
basic properties of the object you've created, for instance…

```
class(pdata)
```

```
## [1] "data.frame"
```

```
colnames(pdata)
```

```
## [1] "id" "diagnosis" "sex" "age"
## [5] "BT" "remission" "CR" "date.cr"
## [9] "t.4.11." "t.9.22." "cyto.normal" "citog"
## [13] "mol.biol" "fusion.protein" "mdr" "kinet"
## [17] "ccr" "relapse" "transplant" "f.u"
## [21] "date.last.seen"
```

```
dim(pdata)
```

```
## [1] 127 21
```

```
head(pdata)
```

```
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 6 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 6 FALSE complex alt. NEG <NA> NEG hyperd. FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 6 TRUE FALSE REL 12/15/1997
```

```
summary(pdata$sex)
```

```
## F M NA's
## 42 83 2
```

```
summary(pdata$cyto.normal)
```

```
## Mode FALSE TRUE NA's
## logical 69 24 34
```

Remind yourselves about various ways to subset and access columns of a data.frame

```
pdata[1:5, 3:4]
```

```
## sex age
## 1 M 53
## 2 M 19
## 3 F 52
## 4 M 38
## 5 M 57
```

```
pdata[1:5, ]
```

```
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
```

```
head(pdata[, 3:5])
```

```
## sex age BT
## 1 M 53 B2
## 2 M 19 B2
## 3 F 52 B4
## 4 M 38 B1
## 5 M 57 B2
## 6 M 17 B1
```

```
tail(pdata[, 3:5], 3)
```

```
## sex age BT
## 125 M 19 T2
## 126 M 30 T3
## 127 M 29 T2
```

```
head(pdata$age)
```

```
## [1] 53 19 52 38 57 17
```

```
head(pdata$sex)
```

```
## [1] M M F M M M
## Levels: F M
```

```
head(pdata[pdata$age > 21,])
```

```
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 10 8001 1/15/1997 M 40 B2 CR CR 3/26/1997 FALSE FALSE
## 11 8011 8/21/1998 M 33 B3 CR CR 10/8/1998 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 10 FALSE del(p15) BCR/ABL p190 NEG <NA> FALSE
## 11 FALSE del(p15/p16) BCR/ABL p190/p210 NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 10 TRUE FALSE REL 7/11/1997
## 11 FALSE TRUE BMT / DEATH IN CR <NA>
```

It seems from the table below that there are 17 females over 40 in the data set,
but when sub-setting `pdata` to contain just those individuals, 19 rows
are selected. Why? What can we do to correct this?

```
idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
```

```
## idx
## FALSE TRUE
## 108 17
```

```
dim(pdata[idx,])
```

```
## [1] 19 21
```
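
The comparison `pdata$age > 40` is `NA` where `age` is missing (likewise for `sex`), and a logical subscript keeps `NA` positions as rows of NAs – hence the two extra rows. One possible correction, as a sketch:

```
dim(pdata[which(idx), ])                   ## which() keeps only TRUE indices
dim(subset(pdata, sex == "F" & age > 40))  ## subset() also drops NA rows
```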

Use the `mol.biol` column to subset the data to contain just
individuals with 'BCR/ABL' or 'NEG', e.g.,

```
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
```

The `mol.biol` column is a factor, and retains all levels even after
subsetting. How might you drop the unused factor levels?

```
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
```
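
Base R's `droplevels()` is an equivalent, slightly more general alternative; a quick sketch:

```
bcrabl <- droplevels(bcrabl)  ## drops unused levels in all factor columns
```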

The `BT` column is a factor describing B- and T-cell subtypes

```
levels(bcrabl$BT)
```

```
## [1] "B" "B1" "B2" "B3" "B4" "T" "T1" "T2" "T3" "T4"
```

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so that there are only two subtypes, B and T?

```
table(bcrabl$BT)
```

```
##
## B B1 B2 B3 B4 T T1 T2 T3 T4
## 4 9 35 22 9 4 1 15 9 2
```

```
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
```

```
##
## B T
## 79 31
```

Use `xtabs()` (cross-tabulation) to count the number of samples with
B- and T-cell types in each of the BCR/ABL and NEG groups

```
xtabs(~ BT + mol.biol, bcrabl)
```

```
## mol.biol
## BT BCR/ABL NEG
## B 37 42
## T 0 31
```

Use `aggregate()` to calculate the average age of males and females in
the BCR/ABL and NEG treatment groups.

```
aggregate(age ~ mol.biol + sex, bcrabl, mean)
```

```
## mol.biol sex age
## 1 BCR/ABL F 39.94
## 2 NEG F 30.42
## 3 BCR/ABL M 40.50
## 4 NEG M 27.21
```

Use `t.test()` to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using `boxplot()`. In both cases,
use the `formula` interface. Consult the help page `?t.test` and re-do
the test assuming that variance of ages in the two groups is
identical. What parts of the test output change?

```
t.test(age ~ mol.biol, bcrabl)
```

```
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 4.817, df = 68.53, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.135 17.224
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40.25 28.07
```

```
boxplot(age ~ mol.biol, bcrabl)
```
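
As a sketch of the follow-up, the `var.equal` argument of `t.test()` requests the pooled-variance test; the test label, degrees of freedom, p-value, and confidence interval change, while the group means do not:

```
t.test(age ~ mol.biol, bcrabl, var.equal=TRUE)
```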

This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.

Input the data using `read.csv()`, creating a variable `brfss` to hold
it. Use `file.choose()` to locate the data file BRFSS-subset.csv

```
fname <- file.choose() ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)
```

Explore the data using `class()`, `dim()`, `head()`, `summary()`, etc. Use `xtabs()` to summarize the number of males and females in the study, in each of the two years. Use `aggregate()` to summarize the average weight in each sex and year.

Create a scatterplot showing the relationship between the square root of weight and height, using the `plot()` function and the `main` argument to annotate the plot. Note the transformed Y-axis. Experiment with different plotting symbols (try the command `example(points)` to view different points).

```
plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")
```

Color the female and male points differently. To do this, use the `col` argument to `plot()`. Provide as a value to that argument a vector of colors, subset by `brfss$Sex`.

Create a subset of the data containing only observations from 2010.

```
brfss2010 <- brfss[brfss$Year == "2010", ]
```

Create the figure below (two panels in a single figure). Do this by using the `par()` function with the `mfcol` argument before calling `plot()`. You'll need to create two more subsets of data, perhaps when you are providing the data to the `plot` function.

```
opar <- par(mfcol=c(1, 2))
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Female", ],
     main="2010, Female")
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Male", ],
     main="2010, Male")
par(mfcol=c(1, 1))
```

Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to `plot()` to address over-plotting, e.g., `pch='.'` or `alpha=.4`. Try using the `smoothScatter()` function (the data have to be presented as `x` and `y`, rather than as a formula). Try adding the hexbin library to your R session (using `library()`) and creating a `hexbinplot()`.
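
As a sketch of two of these approaches (the hexbin package is assumed to be installed):

```
## density-based shading instead of individual points
smoothScatter(brfss2010$Height, sqrt(brfss2010$Weight),
              main="2010, Both Sexes")

## hexagonal binning of points
library(hexbin)
hexbinplot(sqrt(Weight) ~ Height, brfss2010)
```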

R has a number of additional plotting facilities, both as part of the
'base' distribution and user-contributed packages. The *lattice*
package adopts a particular philosophy about the presentation of data,
and can be a very effective visualization tool.

Use `library()` to load the *lattice* package.

```
library(lattice)
```

Create the following figure using the `xyplot()` function with a formula and the `brfss2010` data. The formula is `sqrt(Weight) ~ Height | Sex`, which can be read as 'square root of Weight as a function of Height, conditioned on Sex'.

```
xyplot(sqrt(Weight) ~ Height | Sex, brfss2010)
```

Add a background grid and a regression line to each panel using the argument `type=c('p', 'g', 'r')`; change the width (`lwd`) and color (`col.line`) of the regression line. For some hints on other arguments, see the help pages `?xyplot` (for overall structure of the plot) and `?panel.xyplot` (for how each 'panel' of data is plotted).

Create the following figure. Use the `densityplot()` function with the formula `~ sqrt(Weight)`. The `group=Sex` function argument creates separate lines for each sex. Be sure to use `plot.points=FALSE` to avoid a 'rug' of points at the base of the figure. Can you add a key (legend)?

```
densityplot(~sqrt(Weight), brfss2010, group=Sex, plot.points=FALSE)
```

Create the figure below using the `bwplot()` function. The formula requires that `Year` be coerced to a `factor`, `factor(Year)`.

```
bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss)
```

Create the figure below, a *violin* plot, using `bwplot()` and the `panel` argument set to `panel.violin`. From `?bwplot` we learn that `panel` is a function that determines how each panel is drawn; the details for controlling the violin panel plot are described on the help page `?panel.violin`.

```
bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss, panel=panel.violin)
```

(Advanced) We can write our own `panel` argument to *lattice* functions to influence how each panel is displayed. Here we add a point at the median height and weight.

```
xyplot(sqrt(Weight) ~ Height | Sex, brfss2010,
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.points(median(x, na.rm=TRUE), median(y, na.rm=TRUE),
                        cex=2, pch=20, col="red")
       },
       type=c("p", "g", "r"), lwd=2, col.line="red", xlim=c(120, 210))
```

This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.

Sequence analysis is specialized

- Large data needs to be processed in a memory- and time-efficient manner
- Specific algorithms have been developed for the unique characteristics of sequence data

Additional considerations

- Re-use of existing, tested code is easier to do and less error-prone than re-inventing the wheel.
- Interoperability between packages is easier when the packages share similar data structures.

Solution: use well-defined *classes* to represent complex data;
*methods* operate on the classes to perform useful functions. Classes
and methods are placed together and distributed as *packages* so that
we can all benefit from the hard work and tested code of others.

The IRanges package defines an important class for specifying integer ranges, e.g.,

```
library(IRanges)
```

```
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, as.vector, cbind, colnames, do.call,
## duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unlist
```

```
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
```

```
## IRanges of length 3
## start end width
## [1] 10 14 5
## [2] 20 24 5
## [3] 30 34 5
```

There are many interesting operations to be performed on ranges, e.g.,
`flank()` identifies adjacent ranges

```
flank(ir, 3)
```

```
## IRanges of length 3
## start end width
## [1] 7 9 3
## [2] 17 19 3
## [3] 27 29 3
```

Consult the help page for flank, `?flank`, and explore other
range-based operations.
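
For instance, a few of the many range operations (a quick sketch; output omitted):

```
shift(ir, 2)          ## shift each range two positions to the right
reduce(ir)            ## merge overlapping or adjacent ranges
restrict(ir, 12, 32)  ## clip ranges to the interval [12, 32]
```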

The `IRanges` class is part of a class hierarchy. To see this, ask R for
the class of `ir`, and for the class definition of the `IRanges` class

```
class(ir)
```

```
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
```

```
getClassDef(class(ir))
```

```
## Class "IRanges" [package "IRanges"]
##
## Slots:
##
## Name: start width NAMES elementType
## Class: integer integer characterORNULL character
##
## Name: elementMetadata metadata
## Class: DataTableORNULL list
##
## Extends:
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
##
## Known Subclasses: "NormalIRanges"
```

Notice that `IRanges` extends the `Ranges` class. Now try entering
`?"flank,` and then pressing the tab key to ask for tab completion
(this may not be necessary in RStudio). You can see that there are
help pages for several different classes. Tab-complete to

```
?"flank,Ranges-method"
```

and verify that you're at the page that describes the method relevant
to an `IRanges` instance.

The GenomicRanges package extends the notion of ranges to include
features relevant to the application of ranges in sequence analysis,
particularly the ability to associate a range with a sequence name
(e.g., chromosome) and a strand. Create a `GRanges` instance based on
our `IRanges` instance, as follows

```
library(GenomicRanges)
```

```
## Loading required package: GenomeInfoDb
```

```
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
```

```
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [10, 14] +
## [2] chr1 [20, 24] -
## [3] chr2 [30, 34] +
## ---
## seqlengths:
## chr1 chr2
## NA NA
```
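
The components of a `GRanges` object can be retrieved with accessor functions, e.g. (output omitted):

```
seqnames(gr)  ## chromosome (sequence name) of each range
strand(gr)    ## strand of each range
width(gr)     ## width of each range
```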

The notion of flanking sequence has a more nuanced meaning in
biology. In particular we might expect that flanking sequence on the
`+` strand would precede the range, but on the minus strand would
follow it. Verify that `flank` applied to a `GRanges` object has this
behavior.

```
flank(gr, 3)
```

```
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [ 7, 9] +
## [2] chr1 [25, 27] -
## [3] chr2 [27, 29] +
## ---
## seqlengths:
## chr1 chr2
## NA NA
```

Discover what classes `GRanges` extends, find the help page
documenting the behavior of `flank` when applied to a `GRanges` object,
and verify that the help page documents the behavior we just observed.

```
class(gr)
```

```
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
```

```
getClassDef(class(gr))
```

```
## Class "GRanges" [package "GenomicRanges"]
##
## Slots:
##
## Name: seqnames ranges strand elementMetadata
## Class: Rle IRanges Rle DataFrame
##
## Name: seqinfo metadata
## Class: Seqinfo list
##
## Extends:
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
```

```
?"flank,GenomicRanges-method"
```

Notice that the available `flank()` methods have been augmented by the
methods defined in the *GenomicRanges* package.

It seems like there might be a number of helpful methods available for
working with genomic ranges; we can discover some of these from the
command line, indicating that the methods should be on the current
`search()` path

```
showMethods(class="GRanges", where=search())
```

Use `help()` to list the help pages in the `GenomicRanges` package,
and `vignette()` to view and access available vignettes; these are
also available in the RStudio 'Help' tab.

```
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
```

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of this material will be covered in greater detail in subsequent labs and lectures.

Basics

- Bioconductor packages are listed on the biocViews page. Each package has 'biocViews' (tags from a controlled vocabulary) associated with it; these can be searched to identify appropriately tagged packages, as can the package title and author.
- Each package has a 'landing page', e.g., for GenomicRanges. Visit this landing page, and note the description, authors, and installation instructions. Packages are often written up in the scientific literature, and if available the corresponding citation is present on the landing page. Also on the landing page are links to the vignettes and reference manual and, at the bottom, an indication of cross-platform availability and download statistics.
- A package needs to be installed once, using the instructions on the landing page. Once installed, the package can be loaded into an R session

```
library(GenomicRanges)
```

and the help system queried interactively, as outlined above:

```
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
```

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

- Important packages for analysis of differential expression include edgeR and DESeq2; both have excellent vignettes for exploration. Additional research methods embodied in Bioconductor packages can be discovered by visiting the biocViews web page, searching for the 'DifferentialExpression' view term, and narrowing the selection by searching for 'RNA seq' and similar.
- Popular ChIP-seq packages include DiffBind for comparison of peaks across samples, ChIPQC for quality assessment, and ChIPpeakAnno for annotating results (e.g., discovering nearby genes). What other ChIP-seq packages are listed on the biocViews page?
- Working with called variants (VCF files) is facilitated by packages such as VariantAnnotation, VariantFiltering, and ensemblVEP; packages for calling variants include, e.g., h5vc and VariantTools.
- Several packages identify copy number variants from sequence data, including cn.mops; from the biocViews page, what other copy number packages are available? The CNTools package provides some useful facilities for comparison of segments across samples.
- Microbiome and metagenomic analysis is facilitated by packages such as phyloseq and metagenomeSeq.
- Metabolomics, chemoinformatics, image analysis, and many other high-throughput analysis domains are also represented in Bioconductor; explore these via biocViews and title searches.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

- The Biostrings package is used to represent DNA and other sequences, with many convenient sequence-related functions. Check out the functions documented on the help page `?consensusMatrix`, for instance. Also check out the BSgenome package for working with whole genome sequences, e.g., `?"getSeq,BSgenome-method"`.
- The GenomicAlignments package is used to input reads aligned to a reference genome. See for instance the `?readGAlignments` help page and `vignette(package="GenomicAlignments", "summarizeOverlaps")`.
- The rtracklayer package's `import` and `export` functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the `?import` page for basic usage.
- The ShortRead and Rsamtools packages can be used for lower-level access to FASTQ and BAM files, respectively. Explore the ShortRead vignette and Scalable Genomics labs to see approaches to effectively processing the large files.

Annotation: Bioconductor provides extensive access to 'annotation' resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday's lab, but some interesting examples to explore during this lab include:

- biomaRt, PSICQUIC, KEGGREST and other packages for querying on-line resources; each of these have informative vignettes.
- AnnotationDbi is a cornerstone of the Annotation Data packages provided by Bioconductor.
  - **org** packages (e.g., org.Hs.eg.db) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page `?select`.
  - **TxDb** packages (e.g., TxDb.Hsapiens.UCSC.hg19.knownGene) contain gene models (exon coordinates, exon / transcript relationships, etc.) derived from common sources such as the hg19 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the `?exonsBy` page, to retrieve all exons grouped by gene or transcript.
  - **BSgenome** packages (e.g., BSgenome.Hsapiens.UCSC.hg19) contain whole genomes of model organisms.
- VariantAnnotation and ensemblVEP provide access to sequence annotation facilities, e.g., to identify coding variants; see the Introduction to VariantAnnotation vignette for a brief introduction; we'll re-visit this during the Thursday lab.
- Take a quick look (we'll do more of this in Thursday's lab) at the annotation work flow on the Bioconductor web site.
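
As a sketch of the `?select` interface mentioned above (assuming the org.Hs.eg.db package is installed; the keys here are illustrative):

```
library(org.Hs.eg.db)
## map gene symbols to ENTREZ identifiers and gene names
select(org.Hs.eg.db, keys=c("BRCA1", "TP53"),
       columns=c("ENTREZID", "GENENAME"), keytype="SYMBOL")
```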

The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.

- Correct: consistent with hand-worked examples (`identical()`, `all.equal()`)
- Robust: supports realistic inputs, e.g., 0-length vectors, `NA` values, …
- Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
- Fast, or at least reasonable given the speed of modern computers.

- Profile
  - *Look* at the script to understand in general terms what it is doing.
  - *Step* through the code to see how it is executed, and to gain an understanding of the speed of each line.
  - *Time* evaluation of select lines or simple chunks of code with `system.time()` or the microbenchmark package.
  - *Profile* the code with a tool that indicates how much time is spent in each function call or line – the built-in `Rprof()` function, or packages such as lineprof or aprof.

- Vectorize – operate on vectors, rather than explicit loops

```
x <- 1:10
log(x)  ## NOT: for (i in seq_along(x)) x[i] <- log(x[i])
```

```
## [1] 0.0000 0.6931 1.0986 1.3863 1.6094 1.7918 1.9459 2.0794 2.1972 2.3026
```

- Pre-allocate memory, then fill in the result

```
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
    result[i] <- runif(1) * result[i - 1]
result
```

```
## [1] 0.72601 0.64980 0.44125 0.25742 0.09304 0.08861 0.07351 0.07070
## [9] 0.05750 0.02301
```

- Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
  - Simple, e.g., 'hoist' constant multiplications from a `for` loop
  - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting the same design matrix
- Re-use existing, tested code
- Re-think how to attack the problem
  - Different implementations
  - Alternative algorithms
- Compile your script with the byte compiler
- Use parallel evaluation (a minimal sketch follows this list)
- Speak in tongues – 'foreign' languages like C, Fortran
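
A minimal sketch of parallel evaluation with the parallel package (part of base R; `mclapply()` relies on forking, so on Windows use `parLapply()` with a cluster instead):

```
library(parallel)
## apply a function to each element, distributing work across two cores
res <- mclapply(1:8, function(i) sqrt(i), mc.cores=2)
unlist(res)
```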

Here's an obviously inefficient function:

```
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}
```

Use `system.time()` to investigate how this algorithm scales with `n`,
focusing on elapsed time.

```
system.time(f0(10000))
```

```
## user system elapsed
## 0.256 0.003 0.258
```

```
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```

Remember the current 'correct' value, and an approximate time

```
n <- 10000
system.time(expected <- f0(n))
```

```
## user system elapsed
## 0.252 0.003 0.254
```

```
head(expected)
```

```
## [1] 0.000 1.386 2.197 2.773 3.219 3.584
```

Revise the function to hoist the common multiplier, `a`, out of the
loop. Make sure the result of the 'optimization' and the original
calculation are the same. Use the microbenchmark package to
compare the two versions

```
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))
```

```
## [1] TRUE
```

```
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
```

```
## Unit: milliseconds
## expr min lq median uq max neval
## f0(n) 231.6 234.1 235.2 349.7 352.2 5
## f1(n) 229.6 232.3 348.2 348.8 357.9 5
```

Adopt a 'pre-allocate and fill' strategy

```
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
```

```
## [1] TRUE
```

```
microbenchmark(f0(n), f2(n), times=5)
```

```
## Unit: milliseconds
## expr min lq median uq max neval
## f0(n) 235.23 236.08 236.31 351.28 363.17 5
## f2(n) 16.68 16.81 17.15 17.91 18.37 5
```

Use an `*apply()` function to avoid having to explicitly pre-allocate,
and to make opportunities for vectorization more apparent.

```
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))
```

```
## [1] TRUE
```

```
microbenchmark(f0(n), f2(n), f3(n), times=10)
```

```
## Unit: milliseconds
## expr min lq median uq max neval
## f0(n) 234.50 237.66 298.54 359.67 365.47 10
## f2(n) 16.84 17.24 17.32 18.30 21.26 10
## f3(n) 20.37 21.04 22.21 22.55 27.46 10
```

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

```
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
```

```
## [1] TRUE
```

```
microbenchmark(f0(n), f3(n), f4(n), times=10)
```

```
## Unit: microseconds
## expr min lq median uq max neval
## f0(n) 233304.5 237095.1 298124 359998.1 364520.8 10
## f3(n) 20219.3 20356.3 20679 21235.9 22307.4 10
## f4(n) 473.4 474.3 488 488.7 527.9 10
```

Returning to our explicit iteration `f2()`, in these situations it
can be helpful to compile the code to a more efficient
representation. Do this using the compiler package.

```
library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))
```

```
## [1] TRUE
```

```
microbenchmark(f2(n), f2c(n), times=10)
```

```
## Unit: milliseconds
## expr min lq median uq max neval
## f2(n) 17.214 17.670 18.173 19.025 22.358 10
## f2c(n) 7.272 7.374 7.533 7.835 9.165 10
```

`f4()` definitely seems to be the winner. How does it scale with `n`?
(Repeat several times)

```
n <- 100000 * seq(1, 20, 2) # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, type="b")
```

Any explanations for the different pattern of response?

Lessons learned:

- Vectorizing offers a huge improvement over iteration
- Pre-allocate-and-fill is very helpful when explicit iteration is required.
- `*apply()` functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.
- Hoisting common sub-expressions and using the *compiler* package can be helpful for improving performance when explicit iteration is required.

It can be very helpful to reason about an algorithm in an abstract
sense, to gain understanding about how an operation might
scale. Here's an interesting problem, taken from
StackOverflow: Suppose
one has a very long **sorted** vector

```
vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
```

with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing

```
f0 <- function(v) sum(v < 0)
```

This algorithm compares each element of `vec` to zero, creating a
logical vector as long as the original, `length(v)`. This logical
vector is then scanned by `sum()` to count the number of elements
satisfying the condition.

Questions:

- How many vectors as long as `v` need to be allocated by this algorithm?
- Based on the number of comparisons that need to be performed, how would you expect this algorithm to scale with the length of `v`? Verify this with a simple figure.

```
N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
v <- sort(rnorm(n))
system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")
```

Is there a better algorithm, i.e., an approach that arrives at the
same answer but takes less time (and / or space)? The vector is
sorted, and we can take advantage of that by doing a *binary
search*. The algorithm is surprisingly simple: create an index to the
minimum (first) element, and the maximum (last) element. Check to
see if the element half way between is greater than or equal to
zero. If so, move the maximum index to that point. Otherwise, make
that point the new minimum. Repeat this procedure until the minimum
index is no longer less than the maximum index.

```
f3 <- function(x, threshold=0) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= threshold)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}
```

Approximately half of the possible values are discarded each
iteration, so we expect on average to arrive at the end after about
`log2(length(v))` iterations – the algorithm scales with the log of
the length of `v`, rather than with the length of `v`, and no long
vectors are created. These differences become increasingly important as
the length of `v` grows.

Questions:

- Verify with simple data that `f3()` and `f0()` result in `identical()` answers.
- Compare how timing of `f3()` scales with vector length.

```
## identity
identical(f0((-2):2), f3((-2):2))
```

```
## [1] TRUE
```

```
identical(f0(2:4), f3(2:4))
```

```
## [1] TRUE
```

```
identical(f0(-(4:2)), f3(-(4:2)))
```

```
## [1] TRUE
```

```
identical(f0(vec), f3(vec))
```

```
## [1] TRUE
```

```
## scale
N <- 10^(1:7)
Time <- sapply(N, function(n) {
v <- sort(rnorm(n))
system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")
```

- Use the microbenchmark package to compare performance of `f0()` and `f3()` with the original data, `vec`.
- R code can be compiled, and compilation helps most when doing non-vectorized operations like those in `f3()`. Use `compiler::cmpfun()` to compile `f3()`, and compare the result using microbenchmark.

```
## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))
```

```
## Unit: microseconds
## expr min lq median uq max neval
## f0(vec) 26688.44 27822.29 27924.6 28301.45 32728.0 100
## f3(vec) 52.44 55.19 72.3 79.73 121.5 100
```

```
library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))
```

```
## Unit: microseconds
## expr min lq median uq max neval
## f3(vec) 53.10 55.28 59.17 60.93 69.20 100
## f3c(vec) 23.11 23.85 24.83 25.71 35.18 100
```

We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won't do that!

It is useful to ask what is lost by `f3()` compared to `f0()`. For
instance, does the algorithm work on character vectors? What about
when the vector contains `NA` values? How are ties at 0 treated?
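
A quick empirical check of the `NA` question (a sketch; similar experiments answer the others):

```
v <- c(-1, NA, 1)
f0(v)       ## NA -- sum() propagates the NA comparison
try(f3(v))  ## error -- 'if' cannot evaluate the NA comparison
```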

`findInterval()` is probably a better built-in way to solve the
original problem, and generalizes to additional situations. The idea
is to take the query that we are interested in, `0`, and find the
interval specified by `vec` in which it occurs.

```
f4 <- function(v, query=0)
    findInterval(query - .Machine$double.eps, v)
identical(f0(vec), f4(vec))
```

```
## [1] TRUE
```

```
microbenchmark(f0(vec), f3(vec), f4(vec))
```

```
## Unit: microseconds
## expr min lq median uq max neval
## f0(vec) 26911.29 27841.17 27956.62 28343.8 31953.2 100
## f3(vec) 52.59 58.17 70.79 78.8 137.9 100
## f4(vec) 21972.92 22035.44 22062.42 22158.6 23855.8 100
```

The fact that it is flexible and well tested means that it would often
be preferred to `f3()`, even though it is less speedy. For instance,
compare the time it takes to query 10000 different points using `f3`
and iteration, versus `findInterval` and vectorization.

```
threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))
```

```
## [1] TRUE
```

```
microbenchmark(sapply(threshold, f3, x=vec), f4(vec, threshold))
```

```
## Unit: microseconds
## expr min lq median uq max neval
## sapply(threshold, f3, x = vec) 128.2 138.2 181.7 257.9 291.2 100
## f4(vec, threshold) 22000.2 22037.3 22091.4 22308.3 23871.0 100
```

Some R functions that implement efficient algorithms are `sort()`
(including radix sort), `match()` (hash table look-up), and
`tabulate()`; these can be useful in your own code.
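
For instance (a quick sketch of each):

```
x <- c(2L, 3L, 5L, 3L, 2L)
match(c(3L, 5L), x)      ## hash-based lookup of first matches: 2 3
tabulate(x)              ## counts for bins 1..max(x): 0 2 2 0 1
sort(x, method="radix")  ## radix sort, efficient for integer vectors
```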

Lessons learned:

- Choice of algorithm can be very important
- Implementing classical algorithms (like binary search) can be a rewarding learning experience even when, at the end of the day, it may be better to use existing functions.
- The built-in R functions that implement efficient algorithms can be important building-blocks for more complicated code.