derfinderHelper 1.12.0

`R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. *derfinderHelper* is an `R` package available via the Bioconductor repository for packages. `R` can be installed on any operating system from CRAN, after which you can install *derfinderHelper* by using the following commands in your `R` session:

```
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("derfinderHelper")
## Check that you have a valid Bioconductor installation
biocValid()
```

*derfinderHelper* is based on many other packages, in particular on those that have implemented the infrastructure needed for dealing with RNA-seq data. A *derfinderHelper* user is not expected to deal with those packages directly but will need to be familiar with *derfinder*.

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve, so it is critical to learn where to ask for help. The blog post quoted above mentions some options, but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the `derfinder` or `derfinderHelper` tags and check the older posts. Other alternatives are available, such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

We hope that *derfinderHelper* will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

```
## Citation info
citation('derfinderHelper')
```

```
##
## Collado-Torres L, Jaffe AE and Leek JT (2017). _derfinderHelper:
## derfinder helper package_. doi: 10.18129/B9.bioc.derfinderHelper
## (URL: http://doi.org/10.18129/B9.bioc.derfinderHelper),
## https://github.com/leekgroup/derfinderHelper - R package version
## 1.12.0, <URL: http://www.bioconductor.org/packages/derfinderHelper>.
##
## Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead
## B, Irizarry RA, Leek JT and Jaffe AE (2017). "Flexible expressed
## region analysis for RNA-seq with derfinder." _Nucl. Acids Res._.
## doi: 10.1093/nar/gkw852 (URL: http://doi.org/10.1093/nar/gkw852),
## <URL:
## http://nar.oxfordjournals.org/content/early/2016/09/29/nar.gkw852>.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
```

*derfinderHelper* (Collado-Torres, Jaffe, and Leek, 2017) is a small package that was created to speed up the F-statistics approach implemented in the parent package *derfinder*. It contains a single function, `fstats.apply()`, which is used to calculate the F-statistics for a given data matrix, a null model, and an alternative model.

The data is generally arranged in a matrix where the rows (\(n\)) are the genomic features of interest (gene-level summaries, exon-level summaries, or base-level data) and the columns (\(m\)) represent the samples. The other two main arguments for `fstats.apply()` are the null and alternative model matrices, which are \(m \times p_0\) and \(m \times p\) respectively, where \(p_0\) is the number of covariates in the null model and \(p\) is the number of covariates in the alternative model. The models have to be nested and thus by definition \(p > p_0\). The end result is a vector of F-statistics with length \(n\), which is run-length encoded to save memory.
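The dimensions described above can be checked with a small base `R` sketch. The design below (two groups, a `depth` covariate, Poisson counts) is entirely hypothetical and only meant to show how nested \(m \times p_0\) and \(m \times p\) model matrices line up with an \(n \times m\) data matrix.

```r
## Hypothetical toy design: m = 6 samples in two groups plus a depth covariate
group <- factor(rep(c("A", "B"), each = 3))
depth <- c(10, 12, 11, 20, 22, 21)

## Null model: intercept + depth (m x p0)
mod0 <- model.matrix(~ depth)
## Alternative model: adds the group effect (m x p), so mod0 is nested in mod
mod <- model.matrix(~ depth + group)

## Data matrix: n = 4 features (rows) by m = 6 samples (columns)
set.seed(1)
dat <- matrix(rpois(4 * 6, lambda = 5), nrow = 4, ncol = 6)

dim(dat)   # n x m
dim(mod0)  # m x p0
dim(mod)   # m x p

## Nested models: p > p0, and the models have one row per data column
stopifnot(
    ncol(mod) > ncol(mod0),
    nrow(mod) == ncol(dat),
    nrow(mod0) == ncol(dat)
)
```

Every column of `mod0` also appears in `mod`, which is a simple way to see that the two models are nested.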

Other arguments of `fstats.apply()` are related to the flow in *derfinder*, such as the scaling factor used (`scalefac`), whether to subset the data (`index`), and whether the data was separated into chunks and saved to disk to lower the memory load (`lowMemDir`).

Implementation-wise, `adjustF` is useful when the denominator of the F-statistic calculation is too small. Finally, `method` controls how the F-statistics are calculated:

- `Matrix` is the recommended option because it uses around half the memory of `regular` and can be faster, especially if the data was previously saved in this format by *derfinder*.
- `Rle` uses the least amount of memory but gets very slow as the number of samples increases, which makes it less than ideal in several cases.
- `regular` uses base `R` to calculate the F-statistics and can require a large amount of memory. This is noticeable when using several cores to run `fstats.apply()` on different portions of the data.
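To give a feel for why run-length encoding saves memory on coverage-like data, here is a minimal sketch using base `R`'s `rle()`. This is an illustration only: it does not use the `Rle` class from the Bioconductor infrastructure that the package relies on, and the vector `x` is made up.

```r
## Base R illustration of run-length encoding (not the Bioconductor Rle class)
## Made-up coverage-like vector with long runs of identical values
x <- c(rep(0, 500), rep(3, 200), rep(0, 300))

enc <- rle(x)
enc$lengths  # 500 200 300
enc$values   # 0 3 0

## The encoded form stores 3 runs instead of 1000 values
length(enc$values)
object.size(enc) < object.size(x)

## Decoding recovers the original vector exactly
identical(inverse.rle(enc), x)
```

The saving depends on how long the runs are; data with few repeated consecutive values gains little from this representation.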

The F-statistics for each feature \(i\) are calculated using the following formula:

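\[ F_i = \frac{ (\text{RSS0}_i - \text{RSS1}_i)/(\text{df}_1 - \text{df}_0) }{ \text{adjustF} + (\text{RSS1}_i / (m - \text{df}_1))} \]

Here \(\text{RSS0}_i\) and \(\text{RSS1}_i\) are the residual sums of squares of feature \(i\) under the null and alternative models, \(\text{df}_0 = p_0\) and \(\text{df}_1 = p\) are the numbers of columns of the two model matrices, and \(m\) is the number of samples.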
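As a rough sketch of this calculation in base `R`, the hypothetical helper `row_fstats()` below computes the textbook nested-model F-statistic for every row of a data matrix, with the residual mean square taken as \(\text{RSS1}_i / (m - p)\). It is not the package's implementation, and the simulated data and two-group design are assumptions made for illustration. The first feature is cross-checked against `lm()` and `anova()`.

```r
## Sketch: nested-model F-statistics for every row of a data matrix.
## `row_fstats` is a hypothetical helper, not part of derfinderHelper.
row_fstats <- function(dat, mod, mod0, adjustF = 0) {
    m <- ncol(dat)    # number of samples
    p <- ncol(mod)    # covariates in the alternative model
    p0 <- ncol(mod0)  # covariates in the null model
    ## Residual sum of squares per row after projecting onto a model
    rss <- function(X) {
        H <- X %*% solve(crossprod(X)) %*% t(X)  # hat matrix (m x m)
        resid <- dat %*% (diag(m) - H)
        rowSums(resid^2)
    }
    rss1 <- rss(mod)
    rss0 <- rss(mod0)
    ((rss0 - rss1) / (p - p0)) / (adjustF + rss1 / (m - p))
}

## Simulated example: 5 features, 8 samples in two groups
set.seed(20140923)
group <- factor(rep(c("A", "B"), each = 4))
dat <- matrix(rnorm(5 * 8), nrow = 5)
mod <- model.matrix(~ group)
mod0 <- model.matrix(~ 1, data = data.frame(group))

f <- row_fstats(dat, mod, mod0)

## Cross-check the first feature against lm() + anova()
fit1 <- lm(dat[1, ] ~ group)
fit0 <- lm(dat[1, ] ~ 1)
f_lm <- anova(fit0, fit1)$F[2]
all.equal(f[1], f_lm)
```

Setting `adjustF` to a positive value adds a constant to the denominator, which shrinks the F-statistics of features whose residual variance is close to zero.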

The following section walks through an example. However, in practice, you will probably not use this package directly and it will be used via *derfinder*.

First, let's create an example data set with information for 1000 features and 16 samples, where samples 1 to 4 are from group A, 5 to 8 from group B, 9 to 12 from group C, and 13 to 16 from group D.

```
## Create some toy data
suppressPackageStartupMessages(library('IRanges'))
set.seed(20140923)
toyData <- DataFrame(
    'sample1' = Rle(sample(0:10, 1000, TRUE)),
    'sample2' = Rle(sample(0:10, 1000, TRUE)),
    'sample3' = Rle(sample(0:10, 1000, TRUE)),
    'sample4' = Rle(sample(0:10, 1000, TRUE)),
    'sample5' = Rle(sample(0:15, 1000, TRUE)),
    'sample6' = Rle(sample(0:15, 1000, TRUE)),
    'sample7' = Rle(sample(0:15, 1000, TRUE)),
    'sample8' = Rle(sample(0:15, 1000, TRUE)),
    'sample9' = Rle(sample(0:20, 1000, TRUE)),
    'sample10' = Rle(sample(0:20, 1000, TRUE)),
    'sample11' = Rle(sample(0:20, 1000, TRUE)),
    'sample12' = Rle(sample(0:20, 1000, TRUE)),
    'sample13' = Rle(sample(0:100, 1000, TRUE)),
    'sample14' = Rle(sample(0:100, 1000, TRUE)),
    'sample15' = Rle(sample(0:100, 1000, TRUE)),
    'sample16' = Rle(sample(0:100, 1000, TRUE))
)
## Let's say that we have 4 groups
group <- factor(rep(toupper(letters[1:4]), each = 4))
## Note that some groups have higher coverage; we can adjust for this in the model
sampleDepth <- sapply(toyData, sum)
sampleDepth
```

```
##  sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
##     4753     5009     4829     4969     7470     7624     7304     7380
##  sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16
##    10387     9644     9795     9748    49419    50509    48726    50448
```