posted by Valerie Obenchain, July 2015
Bioconductor newsletter is a quarterly review of core infrastructure
developments, community projects and future directions. We cover topics of
general interest as well as those with the greatest impact on the software.
This issue covers the package landing page ‘shields’ that track use and
usability metrics and the newly minted
Bioconductor Git/GitHub mirrors. We
discuss improved tools for navigating package repositories, revisit the
‘OrganismDb’ packages as a framework for managing multiple annotations and Cole
Trapnell shares his thoughts on software development in the context of the
Cole Trapnell is an Assistant Professor of Genome Sciences at the University of
Washington. He received his PhD in Computer Science from the University of
Maryland where he was co-advised by researchers at University of California at
Berkeley. He spent several years in Berkeley as a visiting student where
he wrote TopHat and Cufflinks and co-wrote Bowtie with Ben Langmead.
It was during a postdoc at Harvard in the Department of Stem Cell and
Regenerative Biology where he started to work with stem cells and became
interested in single-cell analysis. This led to the development of the Monocle
and eventually the
Bioconductor package. Cole
will be speaking at the BioC 2015 conference in Seattle this year and will talk
in detail about the biological questions behind his research interests.
In this interview he shares his perspective as a software developer in the
context of past projects, e.g., TopHat, Cufflinks, as well as
the recent ‘monocle’ package.
Q: What motivated the transition from read mapping and transcript assembly tools to single-cell analysis?
I’ve worked on developing technology to study gene regulation for several years now, and often when I build something new, I test it out on a model system of cell differentiation or some other cell state transition. Before long, I found myself as interested in the biology of differentiation and reprogramming as I was in algorithm or assay development. There are many, many questions in differentiation that are just inherently hard to answer with bulk assays. You need to be able to look at single cells. Single-cell genomics is really just getting started - there are so many open problems and technological challenges attached to these experiments. So trying to push the technological envelop on single cell genomics was doubly appealing to me.
Q: What scientific question was behind development of the Monocle algorithm?
Davide Cacchiarelli and I had hypothesized for some time that a differentiating cell goes through a continuous process of gene regulatory changes, as opposed to passing through a sequence of discrete intermediate stages, which is the classical view. We figured that by looking at single cells as they differentiate, we’d see cells along a continuous path from the starting state to the terminal state. Monocle finds that path, places cells along it at the correct positions, and tells you what genes change as a cell moves along the path. We also speculated that if a cell has to make a choice between two different cell fates, there would be a branch in the path. That’s what we see in the data, and Monocle can reconstruct differentiation trajectories with branches in them.
Q: At what point did you decide to create the ‘monocle’ package vs using a collection of personal scripts?
There were essentially no tools dedicated to single cell RNA-Seq or qPCR out there at the time we started doing our experiments. What was unclear at the time was whether tools for bulk RNA-Seq could be used for single cell. We found that some parts of the bulk analysis workflow, like read mapping and quantification of expression in individual cells worked just fine. For example, we use TopHat and Cufflinks to do those parts, but other groups use RSEM, Sailfish, etc. However, when we got to differential expression, we found the existing tools for differential expression like Cuffdiff or DESeq really didn’t work well on single-cell data. And there was nothing out there to do the more advanced analyses like trajectory reconstruction. So I knew I would have to write a new tool.
I try to make the code I write available. I tend to keep my plotting code to myself, but release almost everything else, for a couple of reasons. I figure if I have a problem, somebody else probably has that same problem, so why not save them some time? Secondly, knowing I’m going to release something that is to be used by a busy scientist who may or may not be a wizard with computational work forces me to think harder about making the code robust, straightforward, and easy to use. It also forces me to more clearly define the problem I am trying to solve. Finally, I feel pretty strongly that while we all try to do our best in writing up methods in our papers, there is just no substitute for seeing the code used to perform the analysis in that paper. Releasing it makes the paper more accessible and transparent, and is the right thing to do scientifically.
Q: How has the
Bioconductor framework fit into your research? For
example, TopHat, Cufflinks, CuffDiff etc. are are not in
but ‘cummmeRbund’ and ‘monocle’ are.
TopHat and Cufflinks were both written before
Bioconductor really had
good support for sequencing data analysis. It wouldn’t have been as easy back
then to make the tools play nice with other Bioconductor packages. Also, both
TopHat and Cufflinks are not really tools that you want to interact with in the
way a lot of
R packages are. They don’t need APIs. You typically just run
them with default parameters, get your output, and move on to the next stage.
Finally, read mapping and transcript assembly are very, very computationally
expensive and involve large memory and I/O loads. C++ is much faster than
has great support for multi-core programming, and gives you precise control
over how memory is managed. Bowtie, TopHat, and Cufflinks all take advantage
of C++’s performance advantages.
‘CummeRbund’ and ‘monocle’ are both highly interactive tools and neither need
to do a crazy amount of CPU- or memory-intensive work, so
R was a much better
fit. I also wanted to try to leverage all the hard work that the
expression community has put into ExpressionSet and related objects to make
single cell analysis as familiar and easy to folks as possible. So ‘monocle’
tries wherever possible to do things “the
Q: You have developed highly respected and heavily used software over the years. Can you share some lessons learned with respect to design: what worked, what would you do differently?
As many of you know the ‘SummarizedExperiment’ class is under construction.
This class holds numeric and other data types derived from a sequencing
experiment. The structure is rectangular with annotations on the rows and
columns. Historically the class held a ‘GRanges’ or ‘GRangesList’ accessible as
rowData(). Based on requests to add support for non range-based row data
(e.g., DataFrame) Hervé, Jim, Martin and myself have been working on a
redesign with credit going to Hervé for the majority of the work so far.
The goals are to support both range-based and non range-based row data in a modular design that is easy to extend. Other feedback from discussions on the bioc-devel mailing list will also be incorporated.
This is a work in progress and will require a couple of devel cycles to completely finish. Currently we have created a new package, SummarizedExperiment, that contains the new classes. ‘SummarizedExperiment’ in the GenomicRanges package has been deprecated and will no longer be available as of BioC 3.3.
Summary of new classes in the new SummarizedExperiment package:
New parent class supports non range-based row data only (i.e., DataFrame accessable with mcols()). Once the old ‘SummarizedExperiment’ class in GenomicRanges has gone through the deprecation cycle, this will be renamed ‘SummarizedExperment’.
This class extends ‘SummarizedExperiment0’ by adding range-based row data to it (i.e., GRanges or GRangesList accessible with rowRanges()). Old ‘SummarizedExperiment’ objects must be updated to ‘RangedSummarizedExperiment’. This class is functionally equivalent to the old ‘SummarizedExperiment’ class in the ‘GenomicRanges’ package.
Bioconductor project is maintained in a Subversion (SVN) source control
system and has been since the early days of the project. As Git gained
popularity we experimented with making packages available via a
commits in Git were propagated to SVN and visa versa. The bridge was very
popular hosting 221 packages, however, there were a few drawbacks. The bridge
did not contain complete commit history, other users must be granted permission
to access your repository and there were some reliability issues - the bridge
did go down.
Recognizing the strengths and popularity of Git/GitHub we looked for a more complete solution. The decision was to mirror the full package repository and hence the creation of the Bioconductor Git Mirrors. These are read-only repositories for every software package and are synchronized with the SVN repository. These mirrors replace the Git-SVN bridge, which is now deprecated. Maintainers who are using the bridge should migrate to the Git mirrors. See the HOWTO section of the website for migration instructions.
Thanks to Jim and Dan for their work on this. See the official announcement on bioc-devel.
Nate and Jim have offered up some of their favorite (and fun!) Git resources:
Git Internals from the Pro Git book
Git and GitHub with R packages from Hadleys book
interactive game to learn Git commands
video game to master Git branching and basic commands
Bioconductor aims to provide stable and relevant software to users across the community. The number of packages continues to grow and as of June 2015 the repository hosts 1024 software, 241 experimental data and 883 annotation packages. Software and experimental data packages are built and checked regularly, ensuring that functions operate as expected and necessary dependencies are available. Before each release the Seattle team contacts maintainers of packages that are failing build or check so the new release is as clean as possible.
Inevitably, packages become abandoned as the maintainer moves on or as one technology is exchanged for another. Whether it is an active decision on the part of the maintainer or the maintainer cannot be found, we needed a way to mark such packages for retirement. Along these lines we recently adopted a package end of life policy. This applies to all software, experimental data and annotation packages that no longer pass build or check and do not have an active maintainer. Packages are considered for end of life deprecation prior to each Bioconductor release. When a package enters the deprecation phase an announcement is made on the support site to offer the package up for adoption. The first successful adoption was of the triform package by Tom Caroll. Thanks Tom!
With over 2000 packages across the 3 repositories effective tools for package qc and management are essential. We aim to provide packages that meet the standard criteria in the package review process, build across supported platforms, are readily available (svn, biocLite() and now Git) and discoverable (i.e., best package for the job).
Over the years Marc, and now Sonali, have established a number of systems for the tracking, quality control and categorization of packages. Through their own innovation and community feedback they create tools that improve the way we interact with these repositories. Much of this work is used internally and is not public-facing. This section highlights recent additions that are visible (and hopefully useful) to all.
‘biocViews’ terms allows for powerful topic-specific searches across the package repositories. These terms are revised and updated as technologies change and new data formats become available. The newest addition is the term ‘ReproducibleResearch’. This is reserved for packages containing reproducible data and results from a publication and is intended to distinguish them from the typical experimental data package.
Sonali has added a new function, ‘recommendPackages()’, to the ‘biocViews’ package. ‘biocViews’ terms are arranged in a hierarchy where each term can have one or more child terms. ‘recommendPackages()’ identifies packages tagged with a particular term(s) as well as packages tagged with child terms.
The function is currently in the development branch (
3.2) only, however, both release and development repositories can be searched
with the ‘use.release’ argument.
For example, this call returns all packages tagged with both “Cheminformatics” and “Clustering” in the release branch using release ‘biocViews’ terms:
library(biocViews) terms <- c("Cheminformatics", "Clustering") ## > recommendPackages(terms, use.release = TRUE, intersect.views = TRUE) ##  "ChemmineOB" "ChemmineR" "eiR" "fmcsR"
‘use.release = FALSE’ will search the devel branch with devel ‘biocViews’.
Packages with a particular ‘biocViews’ term can also be identified by using the web interface. The advantage of ‘recommendPackages()’ is the ability to intersect terms thereby searching more than one view and their children.
A new status term, “pre-accepted”, has been added to the package tracker. This term is used for new packages that have been approved by a reviewer but have not yet had a successful build report. Once the package has a clean build and check the status is changed to “accepted”. This status applies to new packages only.
Quite a bit going on with the build report this past quarter. Here are the highlights.
Package landing pages (e.g., http://bioconductor.org/packages/BiocGenerics) now includes shields (aka badges) that summarize use and usability metrics of each package. Motivation was to provide a quick and easy way for users to assess the package. If a search on ‘biocViews’ terms returns a number of packages suitable for an analysis the shield metrics can provide an additional level of assessment and help further narrow the field.
The shields provide the following metrics:
To follow the complete discussion see the post on Bioc-devel.
Thanks to Jim and Dan for the work on the shields.
There was a request to have the Bioconductor website resolve shortened package urls. The use case was in the context of grants and publications where referring to a package with the full length url can be cumbersome and distracting.
Previously the full path to a package had to be specified:
The generic short form takes you to the release page unless the package is only available in devel:
Also supported are short forms that specify release or devel or a numbered Bioconductor version:
http://bioconductor.org/packages/devel/BiocGenerics http://bioconductor.org/packages/release/BiocGenerics http://bioconductor.org/packages/3.2/BiocGenerics http://bioconductor.org/packages/3.1/BiocGenerics http://bioconductor.org/packages/3.0/BiocGenerics
The short URLs can be used in publications and other sites and they are useful as a permanent link to the current release version of a package. Information about these short URLs has been added to the email we send to new package developers so they can use them in publications that reference their packages.
To follow the full discussion on short urls see the bioc-devel post. Thanks goes to Dan.
The Bioconductor support site sees quite a bit of traffic. There is an average of 900 posts per month if we count top-level questions, answers and comments. Tracking and filtering pertinent messages can be time-consuming. Dan has implemented a few scripts to make sure messages get to the right pair of eyes
To help get messages to the right pair of eyes, Dan implemented a few scripts to alert maintainers to posts that pertain to their packages.
The first script looks at the “watched tags” field of the user profile. Adding a package to “watched tags” triggers an email to be sent every time someone posts a question tagged with the (lowercase) package name. The script checks to make sure all package maintainers who are registered on the support site have all the packages they maintain in their “watched tags” field.
A second script identifies maintainers who do not have logins on the support site. The email address in the Maintainer field of the package should be the same used to register on the support site.
Finally, BiocCheck has been modified to make sure the maintainer is registered on the support site. For new packages, which are required to pass BiocCheck, this is now a requirement.
The Bioconductor TxDb object is a container for storing transcript annotations. Many methods have been written to query and subset the underlying database making the class a useful format to store information from UCSC, Ensembl and BioMart resources.
Functions for converting data to and from a TxDb have been around for some
time and greatly improve the ability to wrangle data between online sources
Bioconductor. A few new functions were added recently and I think they
are worth knowing about. Thanks to Marc and Herve for their work on these.
More details are on the man page in the ‘GenomicFeatures’ package.
The ‘OrganismDbi’ package provides classes and methods for interfacing with a collection of annotation packages. The ‘OrganismDb’ class is a composite object composed of multiple annotation packages linked through a common key. The prototype packages contain a TxDb, GO.db and org.*.eg.db combination for a specific organism and genome build. The intent is that any collection of existing annotation packages could be bundled to make an organism package as long as they have a common mapping.
Currently there are 3 ‘OrganismDbi’ packages, Homo.sapiens, Mus.musculus and Rattus.norvegicus. These can be found by searching the annotation repository with the biocViews term “OrganismDb”.
For more on creating your own package see the man page in the ‘OrganismDbi’ package.
Accessing data in these objects can be done with the ‘select()’ interface or more recently with one of Marc’s new range accessor functions.
selectByRanges(x, ranges, columns, overlaps, ignore.strand)
Overlaps a set of user-supplied ranges with the annotations in an ‘OrganismDb’. The return value is a ‘GRanges’ with mcols that contain the data specified in the ‘columns’ argument. The ‘overlaps’ argument dictates the gene region for overlap such as ‘tx’, ‘exons’, ‘cds’ etc.
selectRangesByID(x, keys, columns, keytype, feature)
Matches the ID specified by ‘keys’ and ‘keytype’ to the annotations in ‘OrganismDb’. The return value is a ‘GRangesList’ where each list element is defined by the gene region in ‘feature’ such as ‘gene’, ‘tx’, ‘exon’ or ‘cds’. Metadata columns on the individual ‘GRanges’ elements contain the data specified in ‘columns’.
Resources available in ‘AnnotationHub’ have been downloaded from external
sources and saved as pre-parsed
R objects or as raw data files. Each
resource in the hub has associated metadata organized into 10 categories.
library(AnnotationHub) hub <- AnnotationHub() ## > names(mcols(hub)) ##  "title" "dataprovider" "species" "taxonomyid" "genome" ##  "description" "tags" "rdataclass" "sourceurl" "sourcetype"
This is useful information to keep with the object once extracted from the hub.
Sonali is working on propagating these fields as metadata in the
objects returned from the hub. For instance, if resource returns a GRanges, the fields will are
metadata(). This is a work in progress and is scheduled to be complete
for the fall
Ensembl release 80 files (both GTF and FASTA):
query(hub, c('gtf', '80', 'ensembl')) query(hub, c('fasta', '80', 'ensembl'))
Epigenomics RoadMap Project files:
The following compares the number of sessions and new users from the second quarter of 2015 (April 1 - June 25) with the second quarter of 2014. Sessions are broken down by new and returning visitors. New visitors correspond to the total new users.
|Sessions: Total||28.52% increase||(375,901 vs 292,490)|
|Sessions: Returning Visitor||31.25% increase||(246,683 vs 187,956)|
|Sessions: New Visitor||23.61% increase||(129,218 vs 104,534)|
Statistics generated with Google Analytics.
The number of unique IP downloads of software packages for April, May and June of 2015 were 38853, 37784 and 35505 respectively. For the same time period in 2014, numbers were 39818, 38348 and 36179. Numbers must be compared by month (vs sum) because some IPs are the same between months. See the web site for a full summary of download stats.
A total of 63 software packages were added in the second quarter of 2015
bringing counts to 1042 in devel (
Bioconductor 3.2) and 1024 in release
Bioconductor support site now hosts a
tutorials page. Sean Davis
has contributed one on using dplyr and GEOmetadb to mine NCBI GEO data and
another on accessing public genomic data from
tutorial gives an overview of the new Epigenomics RoadMap Project files
available in ‘AnnotationHub’.
See the events page for a listing of all courses and conferences.
useR! Conference 2015 30 June - 03 July 2015 — Aalborg, Denmark
BioC2015: 20 - 22 July in Seattle, WA, USA.
GIW/InCoB 2015 Satellite Workshop 07 September 2015 — Tokyo, Japan
Asia-Pacific Bioconductor Developers’ Meeting 08 September 2015 — Tokyo, Japan
SMODIA2015 Statistical Methods for Omics Data Integration and Analysis 14 - 16 September 2015 — Valencia, Span
Thanks to Cole Trapnell for sharing his thoughts on software development and
the ‘monocle’ package and to the
Bioconductor team in Seattle for project
updates and editorial review.
Send comments or questions to Valerie at email@example.com.