Building packages for Bioconductor

The aim of this guide is to be a primer for new package authors who aspire to write packages for Bioconductor. It will take you through the necessary steps using tools available in the widely used IDE RStudio. It will also attempt to explain the most fundamental aspects of package anatomy, with the aim of helping your package to eventually meet or exceed the Bioconductor Package Guidelines. For those not interested in this very introductory guide, there are other developer resources on the Bioconductor website, as well as other how-to documents that answer various questions about developing for Bioconductor.

Bioconductor packages are really just R packages. So the final word on what can be done with them can be found in the ‘Writing R Extensions’ manual.

But because that document is exhaustively detailed, it can be difficult for a beginner to know what they need to get started with writing R packages for Bioconductor. This document will attempt to simplify the discussion by focusing on the bare minimum of what is needed to make R code run as a Bioconductor package.

Package Types

The Bioconductor project recognizes that there are many different general uses for R packages. Speaking very broadly, some packages are used primarily for holding annotations or examples of data sets while other packages are primarily for distributing software. For the purposes of this document, we are focusing on software packages since they are by far the most complex to create and maintain. In other words, these are packages that wrap some functionality up so that other R users can also easily reproduce and build on your work.

Creating a package using RStudio

Today a lot of people use RStudio, and this is a great place to get started because RStudio has already added tools to make package creation easier for end users. But before we get started let’s configure a couple of things so that your code will be formatted the way that we prefer it for Bioconductor.

To make your first package with RStudio, go to the ‘File’ menu and choose ‘New Project’. From here you probably want to choose ‘New Directory’ and then you definitely want to choose ‘R Package’ to set up a new R package. This will take you to the final step, which just asks you what you want to call your new package and where to put it. It is worth taking a moment to choose a name that you think you will like and to put the package in a location where you are comfortable working. These things can be easily changed later but it’s still best to get them correct right now.

For this example let’s call our package ‘MyPkg’. Now go ahead and fill this information in and push the ‘Create Project’ button.

If you look at the screen now you will see several files listed in the files tab. These are files that are needed to make a package (and which RStudio conveniently creates for you). A couple of these files are ‘extras’ that are put there by RStudio to help it manage your package building (‘.Rbuildignore’, ‘Read-and-delete-me’ and ‘MyPkg.Rproj’). You can just ignore these files, but you should understand that they are not part of ‘required’ R package anatomy.
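
If you would rather start from the R console, a roughly similar skeleton can be generated with base R. The sketch below assumes that utils::package.skeleton() is an acceptable stand-in for the RStudio wizard; the scratch directory and the ‘pkgParent’ name are made up for illustration, and the exact set of files RStudio creates may differ slightly.

```r
# Console sketch using base R's utils::package.skeleton().
# A temporary directory is used so nothing lands in your working directory.
myFun <- function(arg1) {
    arg1 + 1
}

pkgParent <- tempfile("pkgdemo")   # hypothetical scratch location
dir.create(pkgParent)

# 'list' names the objects to place in the new package;
# 'environment' says where to find them.
utils::package.skeleton(name = "MyPkg", list = "myFun",
                        environment = environment(), path = pkgParent)

# Show what was generated for the new package
created <- list.files(file.path(pkgParent, "MyPkg"))
print(created)
```

The listing should include the DESCRIPTION and NAMESPACE files and the ‘R’ and ‘man’ directories discussed in the rest of this guide.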

Basic package Anatomy

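Now let’s look at the core files and directories that RStudio has placed in this directory: the DESCRIPTION file, the NAMESPACE file, the ‘R’ directory and the ‘man’ directory. Each of these is described in its own section below.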

These are just the parts that RStudio will put in for you. Some other important things you will need to know about are the ‘vignettes’ directory, the ‘data’ directory and the ‘inst/extdata’ directory.

In the next several sections we will go over most of these components in a little bit more detail. But remember, this is meant to be a high level guide for just getting you started, so not everything is covered here. For that level of detail you should see the official guide at CRAN called ‘Writing R Extensions’.

The DESCRIPTION file

The DESCRIPTION file has a LOT of roles in an R package. Technically it is a ‘.dcf’ file, which is a plain text file made up of ‘Field: value’ entries. It does its job by detailing a series of fields that tell R what is in a package and how to use that information. The one that was auto generated for you should look something like this:

    Package: MyPkg
    Type: Package
    Title: What the package does (short line)
    Version: 1.0
    Date: 2014-06-04
    Author: Who wrote it
    Maintainer: Who to complain to <yourfault@somewhere.net>
    Description: More about what it does (maybe more than one line)
    License: What license is it under?

This is a good start for almost any DESCRIPTION file. You will see that some fields give you hints about what to put there, while others have already been filled in. You will need to fill in at least that many fields (and a few more) in order to make a package for Bioconductor. Let’s start by describing what goes into the fields that are already present in our example, and then we will talk a bit about adding some of the fields that are not.

Those are the fields that RStudio will have created for you. There are others that you should also think about, such as ‘Imports’, ‘Suggests’ and ‘VignetteBuilder’, which come up later in this guide.

Now that we have discussed the DESCRIPTION file let’s fill ours out so that it can actually work for our test package:

    Package: MyPkg
    Type: Package
    Title: A hello world package
    Version: 0.99.1
    Date: 2014-06-04
    Author: Some person <somePerson@someplace.com>
    Maintainer: Some person <somePerson@someplace.com>
    Description: We learn best how to program by practicing it.  
        So in order to practice creating packages, we are going to 
        copy this stuff into an actual DESCRIPTION file and test it 
        out for ourselves.
    License: Artistic-2.0

Once you have done this, build the package for yourself by hitting ‘Ctrl+Shift+B’ (or you can find the ‘Build and Reload’ command on the ‘Build’ menu).
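
Because DESCRIPTION is a ‘.dcf’ file, base R can read it back with read.dcf(), which is a handy way to confirm that the fields parse the way you expect. The following is only a sketch: it writes a shortened copy of the example above to a temporary file (the ‘descFile’ name is made up) rather than touching your real package.

```r
# Write a shortened copy of the example DESCRIPTION to a scratch file
descLines <- c(
    "Package: MyPkg",
    "Type: Package",
    "Title: A hello world package",
    "Version: 0.99.1",
    "License: Artistic-2.0"
)
descFile <- tempfile("DESCRIPTION")   # hypothetical scratch file
writeLines(descLines, descFile)

# read.dcf() returns a character matrix with one column per field
desc <- read.dcf(descFile)
print(colnames(desc))

# Versions can be compared once converted with package_version()
ver <- package_version(desc[1, "Version"])
print(ver >= "0.99.0")
```

In your own package you would point read.dcf() at the real DESCRIPTION file instead of a temporary copy.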

The 'R' directory

The R directory is used for holding your ‘.R’ source files. The example we have has created a little file for you called ‘MyPkg.R’.

Let’s create a little test function in that file like this:

    myFun <- function(arg1){
        arg1 + 1
    }

Now that you have code for a simple R function in your ‘R’ directory, hit ‘Ctrl+Shift+B’ to build and reload. You will see that not only does your package load, but the function you just wrote is available if you list the contents of your package like this:

    library("MyPkg")
    ls("package:MyPkg")

And you can even test call your new function just like this:

    myFun(5)

The NAMESPACE file

The NAMESPACE file is used to control which objects are exposed by a package (to end users) as well as which ones are available to it (as imported from other packages). Unlike the DESCRIPTION file the NAMESPACE file is not a ‘.dcf’ file. It is actually meant to contain R code. You can see a simple example of this by looking at the NAMESPACE file that was generated for you. It contains the following line of code:

    exportPattern("^[[:alpha:]]+")

This tells R to export every object in your package whose name starts with a letter, which in practice usually means every single thing. And this is usually a bad idea. In fact, this is such a consistently bad idea that if you send us a package that does this we will probably ask you to change it.

Doing this will expose all the contents of your package to end users, including any helper functions or utilities that you may have meant to only be used internally. This means that your users may have to sift through many functions that they don’t need to know about just to learn about the ones that they do. It also means that you will have to spend a lot of time documenting all these utilities (even ones that you don’t intend for anyone else to ever use). This results in man pages that say things like ‘for internal use only’. There are times when documenting something that is used internally can be a good idea. But you have to weigh the benefits of having everything documented against the consequences of everyone having to read documentation that they don’t really need. In general, it is best to only document the things that you intend end users to make use of.
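
To see what that pattern would sweep up, you can try the same regular expression on a few object names with grepl(). This is only an illustration of the matching: ‘helperFun’ and ‘.hiddenHelper’ are made-up names for internal utilities, and ‘genomeStyleNames’ is the function you will write in Exercise 1.

```r
# The regular expression used by the generated NAMESPACE file
exportRegex <- "^[[:alpha:]]+"

# A mix of functions meant for users and (hypothetical) internal helpers
objs <- c("myFun", "genomeStyleNames", "helperFun", ".hiddenHelper")

# Names that match would be exported by exportPattern()
exported <- objs[grepl(exportRegex, objs)]
print(exported)
```

Only the name that starts with a dot escapes the pattern, so an ordinary helper such as ‘helperFun’ would be exposed to users whether you meant it to be or not.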

There are a lot of commands that are normally put into a NAMESPACE file. The first big class consists of commands that selectively export the things you want to document for end users. The most common of these is export() which is used like this:

    export('myFun')

Now try this for yourself. Replace the exportPattern() command with the call to export() that only exposes your function. Then hit ‘Ctrl+Shift+B’ and make sure that it works as expected.

The other big use for NAMESPACE files is to import functions from other packages. When you import a function you don’t need to attach that package to the search path. This is good because having a lot of things on the search path can slow R down and makes name clashes between packages more likely. So by importing you allow your package to have access to this code without paying that penalty. Let’s suppose that you wanted to make use of the genomeStyles function from the GenomeInfoDb package. In this case you need to do two things. First you need to add GenomeInfoDb to the Imports field in the DESCRIPTION file. This means you need to add a line like this to your DESCRIPTION file:

    Imports: GenomeInfoDb

And then add something like this to your NAMESPACE file.

    import('GenomeInfoDb')

Alternatively you could also add this to your NAMESPACE file:

    importFrom('GenomeInfoDb', 'genomeStyles')

Either of these approaches will get you access to the genomeStyles function from the GenomeInfoDb package without attaching the GenomeInfoDb package to the search path. The first case is more general and will make all the exported variables in the GenomeInfoDb package available. The second case will only get the ‘genomeStyles’ function from the ‘GenomeInfoDb’ package.
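
The difference between a namespace being loaded and a package being attached can be seen from the console. The sketch below uses the ‘tools’ package that ships with R purely as a stand-in for GenomeInfoDb, so that it runs without any Bioconductor packages installed.

```r
# Load the namespace without attaching it, which is roughly what an
# import in your NAMESPACE file arranges for your package
loadNamespace("tools")

# The namespace is now loaded ...
print(isNamespaceLoaded("tools"))

# ... but it only appears on the search path if something attached it
print("package:tools" %in% search())

# Its functions are still reachable through the namespace
ext <- tools::file_ext("human.txt")
print(ext)
```

In a fresh R session the second value is normally FALSE, since ‘tools’ is not one of the packages attached at startup.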

For more details on how to use the NAMESPACE file, you should see the section on package namespaces in the ‘Writing R Extensions’ manual.

Exercise 1:

Now create another new function called genomeStyleNames. This time, have your function return just the names of the object returned by the imported genomeStyles function.

Documenting manual pages

The R manual pages are written in ‘Rd format’. The style is mostly self explanatory and the example file ‘MyPkg-package.Rd’ that is provided for you is worth taking a moment to look at. The topic of how to write these kinds of files, what the different fields mean and how to fill them out has already been very well documented in its own section of the ‘Writing R Extensions’ manual. There are a few specific things worth mentioning though.

Now look again at the manual page that RStudio generated for your new package. You will notice that (among other things) the example section is not filled in. If you were to try and run check on your package right now you would get an error from this. Go ahead and try that to see for yourself what will happen. From the ‘Build’ menu, choose ‘Check Package’ (‘Ctrl+Shift+E’). You should get an error.

To correct this put a simple example into your manual page. This is just the page that describes the package so there is not much for it to demonstrate. For now just add the following simple example section as a replacement for what is currently there:

    myFun(10)

Then run check again. You will still have warnings, but it should at least complete the process.

Now let’s look at another one of RStudio’s nicer features. Let’s add a manual page for our new function.

To do this go to the ‘File’ menu, choose ‘New File’ and ‘Rd File’. Be sure to choose a man page that is pre-configured for a ‘Function’ and be sure to name it ‘myFun’ after our new function. This should drop a new ‘pre-fabricated’ manual page into your package’s existing ‘man’ directory.

Now obviously you need to fill in your manual pages with actual descriptions and working examples etc. But for now fill in a value for the new manual page title so that R will be able to build and check the package again.
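
If you want a feel for what a filled-in page for myFun might contain, the sketch below writes a minimal Rd file to a temporary location and parses it with tools::parse_Rd() to confirm that it is well formed. The wording of each field is made up for illustration; the page RStudio generates for you will contain more sections than this.

```r
# A minimal, hypothetical manual page for myFun written as Rd markup
rdLines <- c(
    "\\name{myFun}",
    "\\alias{myFun}",
    "\\title{Add one to a number}",
    "\\description{Returns its argument plus one.}",
    "\\usage{myFun(arg1)}",
    "\\arguments{\\item{arg1}{A numeric vector.}}",
    "\\value{A numeric vector the same length as \\code{arg1}.}",
    "\\examples{myFun(10)}"
)
rdFile <- tempfile(fileext = ".Rd")   # scratch file, not your 'man' directory
writeLines(rdLines, rdFile)

# parse_Rd() fails on malformed markup, so a clean parse is a quick sanity check
rd <- tools::parse_Rd(rdFile)

# List the top-level sections that were recognised
tags <- unlist(lapply(rd, attr, "Rd_tag"))
print(tags)
```

A clean parse only tells you the markup is valid. ‘Check Package’ is still what decides whether the page agrees with your code.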

Writing vignettes

You might be wondering why you would need a vignette if you already have a series of manual pages. The answer is that end users need a high level guide to show them how to connect the dots. It’s not enough to just show how all the pieces work. You also need to know how they are supposed to work together. A good vignette should first explain what the package is for, and then it should provide a common workflow for how that package could be used to accomplish that. In doing so, it puts all the functions and classes into context so that new users can find their way.

Structurally, vignettes go in the ‘vignettes’ directory. These consist of either an ‘.Rnw’ or an ‘.Rmd’ file that R will later render into a pdf or html document. It used to be that most vignettes were written in Sweave, but now more and more users are choosing to write them using the knitr package. Either form of vignette is acceptable, but writing an ‘.Rmd’ document is a lot easier to do.

To get RStudio to create a vignette for you simply go to the ‘File’ menu, and look under ‘New File’. From here you could choose to create either a new ‘R Sweave’ or ‘R Markdown’ file. But for our purposes in this demo, choose an ‘R Markdown’ file. Now at this point RStudio will create a sample R Markdown file for you. But unlike the case with the man pages, it will not put it in the correct place for you (at least not as of the time of this writing). So go again to the ‘File’ menu and choose ‘Save As’. This will open up a little widget for you to specify how you want this markdown file to be saved. From here click the button at the bottom left called ‘New Folder’ and create a sub directory in your MyPkg directory called ‘vignettes’. Then tell the widget to save the markdown file there under the name ‘MyPkg.Rmd’.

Finally you must add the following lines to your DESCRIPTION file:

    Suggests: knitr
    VignetteBuilder: knitr

These lines are required because markdown vignettes are built by knitr rather than by Sweave, and knitr is not part of base R.

You now have the beginnings of a vignette. Now there are really two things that a vignette must always do: 1) it must explain sufficient background information so that new users can determine if a package is going to help them solve a problem or not (even if they are new to the topic) and 2) it must demonstrate how to use the various functions from the package in the expected way. In order to do the latter you will sometimes need to repeat examples that you may have already shown in the example sections of your manual pages. Running these examples twice can be inefficient. How can you avoid this problem? Well you can always mark your code chunks as eval=FALSE. Here is an example of a markdown code chunk that is marked to not be evaluated:

    ```{r chunkName, eval=FALSE}
        myFun(3)
    ```

If you were using Sweave, it would have looked like this instead:

    <<chunkName, eval=FALSE>>=
        myFun(3)
    @

Exercise 2:

Now document your genomeStyleNames function with a manual page and also update your vignette. Be sure to include examples in each case that demonstrate how to use it. When you put the example into your vignette mark the code chunk as eval=FALSE.

Storing data in your package

As mentioned above, there are basically two places where R packages can store helpful user data. The first case is for when you have your data already in an R object. In that case you probably want to just save the R object in the ‘data’ directory. This directory is where you can put ‘.rda’ files that you want to document and use in examples by calling data().

The second case is the ‘inst/extdata’ directory. The ‘inst’ directory is just a place where you can put things that will be stored in your installed package for retrieval later. You can think of it a little bit like a storage place for miscellaneous items. For example, some users will put an ‘inst/scripts’ directory in their package as a place to store scripts. But the ‘inst/extdata’ sub-directory gets special mention here because so many packages use it as a place to store extra data like a database or ‘.csv’ files. Basically you should think of using it for anything data oriented that is not an ‘.rda’ file. This comes up a lot because not everything that you may want to load into your R session as an example of your package’s abilities will necessarily start as an ‘.rda’ file.

Finally, we put a maximum value on how big the built tarball for a software package can be (five megabytes). If you need more data than this, it is time to consider making a companion package. A companion package can be either an experiment data package or an annotation package. In either case its primary use will be to hold data for an end user. The biggest pragmatic difference between an experiment data package and an annotation package is how often they are updated. Annotations are data that people need to rely upon to be current, so these data need to be updated (or checked to make sure they are still current) for every release. In contrast, most experimental data never really changes once the experiment has been run, and so it is used primarily so that people can have an authentic data set for running their algorithms on.
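
Files stored under ‘inst’ are located at run time with system.file(). As a small sketch of how that function behaves, the code below looks inside the ‘stats’ package that ships with R (used here only as a stand-in for your own package) and shows what comes back when a file does not exist; the missing file name is made up.

```r
# Every installed package has a DESCRIPTION file at its top level
found <- system.file("DESCRIPTION", package = "stats")
print(file.exists(found))

# When the requested file is absent, system.file() returns an empty string
notFound <- system.file("extdata", "no-such-file.txt", package = "stats")
print(identical(notFound, ""))
```

Note that the ‘inst’ part of the path is dropped on installation, so a file saved as ‘inst/extdata/human.txt’ in your source tree is requested as system.file('extdata', 'human.txt', package='MyPkg').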

Exercise 3:

Part 1: genomeStyles() returns a list object. Save this list object out to the file system and add it to a ‘data’ directory so that it can be loaded by end users.

Part 2: Now call genomeStyles('Homo_sapiens') and trap the result. Then use write.table() to create a tab delimited file on disk with the contents of that data.frame. Finally, store that file in inst/extdata and write an extractor function using system.file that pulls that data back into your R session.

BiocCheck

If you look in our package guidelines you will see that not only must a package pass R CMD build and R CMD check without ERRORs or WARNINGs, but it must also pass a bunch of other requirements intended to make sure that your code is well documented and that it will integrate well into the rest of the project. Fortunately we have a tool to help you with this. From RStudio you can run build and check from the ‘Build’ menu. But you can also test your own package for BiocCheck compliance like this:

    library(BiocCheck)
    pathToPkg <- file.path("..","MyPkg")
    BiocCheck(pathToPkg)

If you have followed the instructions up to this point, then calling the function above will produce a list object as output to tell you which things are needed for your package to be compliant with the Bioconductor package guidelines. The BiocCheck output is broken into three distinct categories.

Tips for writing good packages

Entire books have been written about good coding practices. But here are just a few tips for writing good Bioconductor packages.

Answers for exercises:

Exercise 1:

You should have created a function that looks like this:

    genomeStyleNames <- function(){
        names(genomeStyles())
    }

And you should have exported it so that your NAMESPACE should now look something like this:

    export("myFun","genomeStyleNames")  
    import('GenomeInfoDb')

Exercise 2:

You can document this function in the manual pages either by adding a new alias to your older manual page or by creating a new page entirely. Either way there is not really a ‘correct’ answer for how to document things (or at least not one that R will not already tell you about).

Exercise 3:

Part 1: You should have used simple code like this to save the R object to file:

    gs <- genomeStyles()
    save(gs, file='gs.rda')

Then, with the file ‘gs.rda’ stored in ‘data’, you could do this once your package was loaded:

    data(package='MyPkg')
    data(gs)

Part 2: You should have used code like this to trap the result of genomeStyles('Homo_sapiens') and write it to a tab delimited file on disk:

    res <- genomeStyles('Homo_sapiens')
    write.table(res, sep="\t",row.names=FALSE, col.names=FALSE, 
                file='human.txt')

Then if you move this to inst/extdata you can write a simple function to extract it like this:

    getData <- function(){
        read.delim(file=
                   system.file('extdata','human.txt', package="MyPkg"), 
                   header=FALSE)
    }
    getData()
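
The answer above drops the column names on both the write and the read. If you would rather keep them, one possible variant is sketched below. The small data.frame is a made-up stand-in for the genomeStyles() result so that the code runs without GenomeInfoDb installed, and the file is written to a temporary location rather than to inst/extdata.

```r
# Made-up stand-in for the data.frame returned by genomeStyles('Homo_sapiens')
res <- data.frame(circular = c(FALSE, FALSE, TRUE),
                  NCBI = c("1", "2", "MT"),
                  UCSC = c("chr1", "chr2", "chrM"),
                  stringsAsFactors = FALSE)

# Write a tab delimited file, keeping the header row but not the row names
outFile <- tempfile(fileext = ".txt")   # scratch file for the demonstration
write.table(res, file = outFile, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() expects a header row by default, so the names come back too
back <- read.delim(outFile, stringsAsFactors = FALSE)
print(all.equal(res, back))
```

Keeping the header means the extractor function no longer needs header=FALSE, and users get meaningful column names instead of V1, V2 and so on.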