posted by Valerie Obenchain, January 2015
Welcome to the January 2015 newsletter. This is a quarterly look behind the
scenes at recent developments and exploratory work going on in the core group
Bioconductor community. This issue covers Docker containers, work on
coordinate mapping, changes to the algorithm underlying overlap operations and
an overview of
csaw, the most recent package contribution from the Smyth
group. We want this newsletter to be valuable to you so please share comments
Anyone who has attended the annual BioC conference or participated in a
class at the Hutchinson Center is familiar with the
Machine Images (AMI). Dan configures these images with the current
R and all necessary package dependencies and non-R software.
Providing a pre-configured environment to participants has eliminated
many “day of” problems such as internet overload due to concurrent downloads,
installation of the wrong package version and the inability to access common
sample data used in the course.
Recently Dan has been looking into the Docker software as another approach to providing pre-configured environments. The Docker containers operate using process level isolation and are therefore more lightweight than a virtual machine. Additionally the containers have access to the full physical machine.
These containers are isolated, reproducible environments useful for development or production. In contrast to the AMI, a Docker container could be used on a desktop or laptop instead of the cloud or other remote hardware. We envision them being useful for
reproducibility: providing identical, reproducible environments
convenience: deploying on any (Linux, Mac, or Windows) machine, or in the cloud via Amazon’s Elastic Container Service (ECS)
time saver; instead of waiting for packages to compile on Linux, you can download a container in which packages are already installed
This work is exploratory and not yet supported. There will be an announcement on the mailing list when containers are ready for use. Those interested in following the development can visit the bioc_docker GitHub repository.
One of Hervé’s recent projects was replacing the interval tree-based algorithm used in finding and counting overlaps. The change was motivated by decreased performance observed when comparing many different chromosomes or many overlapping ranges. He decided on an approach based on the Nested Containment List algorithm by Alexander V. Alekseyenko and Christopher J. Lee.
In BioC 3.1 (current devel) overlaps operations on GRanges and/or GRangesList objects are approximately 3x to 10x faster than in BioC 3.0 for a medium size data set (e.g. 25 million reads). Memory usage is also reduced by ~ 25% or more. Numbers will vary depending on the size of the data; the larger the data the greater the improvement.
The user-visible change is the ‘algorithm’ argument added to most overlap-based operations. This allows a choice between the new (algorithm=”nclist”) and the old (algorithm=”intervaltree”).
At the time of this writing, there are 3 situations where “nclist” and “intervaltree” produce different output:
With ‘nclist’, zero-width ranges are interpreted as insertion points and are considered to overlap with ranges that contain them.
When using ‘select=”arbitrary”’, the ‘nclist’ algorithm will generally not select the same hits as the old algorithm.
When using ‘intervaltree’, the ‘maxgap’ argument has a special meaning if ‘type’ is “start”, “end”, or “within”. This is not yet the case with the new ‘nclist’.
For a complete description of changes and future activity please see this post on the bioc-devel mailing list.
Translating (mapping) coordinates between genome, transcript and protein space is a common task in bioinformatics. Over the past quarter a group of us have been working to expand and harmonize mapping capabilities in the infrastructure.
Functions mapping between genome and the transcriptome will be added to
GenomicFeatures and methods for mapping via a CIGAR alignment will be added to
Pbase package has a mapping
that demonstrates the steps involved when mapping from protein to genomic
position. Once the API for the transcriptome and alignment mapping methods is
stable, similar functions can be added to
Pbase to round out the suite of
Others involved in the project are Michael Lawrence, Hervé Pagès, Laurent Gatto, Robert Castelo and Martin Morgan. Discussions and progress can be followed at the biocCoordinateMapping Google Group.
A related mapping task is migrating data from one assembly to another
either via alignment tool or by converting assembly coordinates. The
implementation of the UCSC LiftOver tool in
(thanks Michael) and the availability of the UCSC chain files in
AnnotationHub (thanks Sonali) make this a straightforward operation.
library(AnnotationHub) hub <- AnnotationHub()
The chain file format describes a pairwise alignment that allows gaps in both
AnnotationHub currently hosts 1113 chain files.
allChains <- query(hub, 'chain') ## > length(allChains) ##  1113
Search for the conversion from hg38 to hg19:
query(hub, 'hg38ToHg19') ## > query(hub, 'hg38ToHg19') ## class: AnnotationHub ## hub: https://annotationhub.bioconductor.org ## cache: /home/vobencha/.AnnotationHub ## display()ing 1 of 1 records on 6 mcols() ## title dataprovider species ## AH14108 hg38ToHg19.over.chain.gz hgdownload.cse.ucsc.edu Homo sapiens ## taxonomyid genome description ## AH14108 9606 hg38 UCSC liftOver chain file from hg38 to hg19 ## tags ## AH14108 liftOver, chain, UCSC, genome, homology
rtracklayer to read
R. The return object is a
Chain class where data for each
chromosome are parsed into a
chain <- query(hub, 'hg38ToHg19')[] chain ## > chain ## Chain of length 25 ## names(25): chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX ## chrY chrM class(chain$chr6) ## > class(chain$chr6) ##  "ChainBlock" ## attr(,"package") ##  "rtracklayer"
liftOver() translates coordinates and outputs a
gr <- GRanges(c("chr7", "chr2"), IRanges(c(75625897, 68010781), width=1)) res <- liftOver(gr, chain) ## > res ## GRangesList object of length 2: ## $1 ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ##  chr7 [75255215, 75255215] * ## ## $2 ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ##  chr2 [68237913, 68237913] *
Another example of using
liftOver() is in the
Changing genomic coordinate systems workflow. SNPS from the NHGRI GWAS catalogue are mapped from hg38
to hg19 resulting in a few lost loci.
Gordon Smyth is a professor at Walter and Eliza Hall Institute of Medical
Research in Victoria and has been active member in the
since inception. Many know Gordon from his detailed responses on the
support site where he provides
statistical guidance and thoughtful discussion to the novice and advanced
user alike. The Smyth group has authored a number of
Bioconductor packages including cornerstone contributions such as
limma is consistently in the top 10 and
edgeR in the
top 25 of the
package download stats.
A trademark of a Smyth group package is a well-written vignette with detailed
scientific and statistical background. These documents are an excellent starting
point for anyone new to microarray or RNA-seq analysis. The most recent
contribution from the group is the
csaw package (ChIP-seq analysis with
windows) by Aaron Lun. I asked Aaron and Gordon if they would answer a few
questions about the package.
Aaron Lun and Gordon Smyth on the csaw package:
Q: What is differential binding (DB) in ChIP-seq and how does this differ from differential expression (DE) in RNA-seq?
As most readers will know, ChIP-seq sequences genomic DNA that is bound to a target protein whereas RNA-seq sequences RNA transcripts. From a scientific point of view, RNA-seq measures gene expression whereas ChIP-seq is used to examine the regulation of gene expression. ChIP-seq is often used for example to find the binding sites of a transcription factor (TF) or to examine the positioning of an epigenetic histone mark across the genome. Many people analyze ChIP-seq data by calling peaks in each individual ChIP-seq library. These peaks are then taken to identify where the TF binds or where epigenetic marking is active. In the csaw package, however, we envisage ChIP-seq experiments with multiple experimental conditions and with multiple biological replicates within each condition, in other words with a structure much like a typical gene expression experiment. We focus on testing for DB in a quantitative way between the experimental conditions rather than on making present/absent calls. ChIP-seq and RNA-seq experiments both produce short sequence reads that can be aligned to a reference genome. In principle, the two types of experiments can be analyzed in broadly similar ways. In each case, we can choose a set of genomic locations of interest, count the number of reads mapping to those regions, and then test for differential coverage between the experimental conditions relative to biological variability. The key difference between ChIP-seq and RNA-seq is how the genomic locations are chosen and combined.
Q: What scientific question does a DB ChIP-seq analysis help answer?
A conventional ChIP-seq analysis produces a list of peaks in each library, with the implication that the protein is bound to DNA within the peak regions and not bound elsewhere. We take an alternative view that binding coverage is quantitative, not just present or absent. DB analyses identify a list of genomic regions where the strength of binding changes between biological conditions. We believe that the regions where the binding is subject to change are the most likely to be biologically important, even if they are not necessarily the regions with highest coverage. DB regions are necessarily linked with the phenotype or treatment of interest being studied in the experiment. We can therefore start investigating mechanisms through which DB can lead to observed differences in biology. One of our favorite analyses is to relate DB to differential expression for the same set of genes. These insights are harder to obtain with the conventional analyses, as the identified regions are only associated with each condition in isolation.
Q: What was the motivation for creating the package? Has your group recently started doing DB ChIP-seq or have you been doing it for some time?
Our group has been doing DB analyses for a while. However, these analyses have mostly been gene-orientated. We would count reads into pre-defined intervals like the promoters or gene bodies and then test for whether these intervals were DB using the edgeR package. This type of analysis can be done very similarly for ChIP-seq as for RNA-seq. The csaw package is our first attempt at performing de novo DB analyses where the regions of interest are not known in advance. We wanted to be able to identify novel DNA elements like new enhancers or promoters and we wanted to avoid potential headaches regarding the regions of interest, for example misspecifying the promoter region of a gene. We were motivated to take a new approach to de novo DB analysis by our observation that some commonly used approaches do not control the error rates correctly. A simple method is to call peaks in each condition separately, then simply compare the regions to identify unique peaks in each condition. This ad hoc approach does not provide any statistical error rate control and tends to overstate the differences between the conditions. A more sophisticated version of this approach is to conduct statistical tests between conditions for each of the regions called as peaks in either condition. Again, this over-estimates the differences between the conditions. Our approach gives similar flexibility to peak calling but with rigorous error rate control. The idea is to slide windows across the genome, count reads into those windows, and then use those counts to test for significant differences between conditions. Adjacent regions are then merged and the p-values combined in a statistically rigorous way. This provides the same level of statistical rigor as for our previous gene-orientated analysis but without the need to specify regions of interest beforehand. Using small windows also provides excellent spatial resolution. There are a number of other Bioc packages that can do DB analyses, diffBind and DBChIP for example, but these require a set of peaks identified with external software like MACS. The motivation for the csaw package is to avoid having to specific the genomic regions externally . We wanted to generate and discover the DB completely de novo as an integral part of the analysis. csaw is the first Bioc package to take a windowing approach.
Q: How are the statistical methods from edgeR leveraged or extended in csaw? Please describe from a statistical perspective, why methods used in RNA-seq are appropriate for DB ChIP-seq?
Once a set of genomic regions has been chosen, a DB analysis is very similar to an RNA-seq experiment. So it was logical to leverage the statistical methods provided by edgeR. edgeR accounts for biological variability between replicates, which is critical for DB analyses where the counts for each region are typically over-dispersed. Failure to account for such variability will lead to spurious DB calls. The generalized linear model functionality of edgeR provides a rich framework with which to analyze complex experiments with multiple experimental factors or covariate and batch effects. We chose edgeR over limma and voom because the ChIP-seq counts for each window can be quite small. The reads can be sparsely distributed across the whole genome. limma and voom are very effective for moderate to large counts, but edgeR is able to more accurately model the distribution of very small integer counts. csaw uses the quasi-likelihood functionality of edgeR because of its rigorous error rate control and because it gives access to the adaptive empirical Bayes functionality of limma within the edgeR negative binomial framework. We made many improvements to edgeR as part of the csaw project. Our sliding window approach to ChIP-seq generates a much larger number of regions than is typical for an RNA-seq DE analysis, so it was important to be as computationally efficient as possible. One of us (AL) converted many of the edgeR functions into C++ for speed and memory efficiency. There are a number of statistical extensions specific to csaw. These include an implementation of Simes’ method for combining the p-values of adjacent windows within a region; a non-linear normalization procedure adapted to low counts; and a method to calculate the average abundance of a region scaled to its width.
A number of core and community members have been working on a “perspective”
article that provides a project overview for potential users and developers.
Focus is on reproducibility, interoperability and data access. The collective
work, “Orchestrating high-throughput genomic analysis with
scheduled to appear in Nature Methods in early 2015.
Other recent project-level (vs package-level) publications are “Scalable
Bioconductor” by Michael Lawrence and Martin Morgan
which reviews strategies for processing, summarizing and visualizing large
genomic data. “Programming tools: Adventures with R” by Sylvia Tippman
Bioconductor as the analysis tool of choice in genomics,
oceanography and ecology applications. Links are available on
the publications page.
It can be challenging to get an accurate count of articles that cite
Bioconductor project or the individual software packages. The
Publications page of the web site lists results of a full text
search for the single term
bioconductor for each of
The number of hits range from 591 (PubMed) to 27,000 (Google Scholar).
Taking a package-level approach, a PubMed query for titles or IDs found in CITATION files returned 22838 references. For packages with no CITATION file a PubMed search of the package title returned 1281 references.
The following tables compare traffic from the fourth quarter of 2014 with the fourth quarter of 2013 (October 1 - December 28).
In the fourth quarter we saw an increase in total sessions, new sessions and the number of total users.
|Sessions||23.28% (311,731 vs 252,873)|
|% New Sessions||0.62% (36.01% vs 35.79%)|
|Users||24.32% (133,839 vs 107,655)|
The greatest increase (percent change) in total sessions was seen in China followed by Spain then the United States and Italy.
|United States||24.95% (101011 vs 80840)|
|China||28.43% (27593 vs 21485)|
|United Kingdom||11.19% (22627 vs 20350)|
|Germany||20.75% (21138 vs 17506)|
|France||20.47% (10446 vs 8671)|
|Canada||21.96% (9808 vs 8042)|
|Japan||19.21% (9406 vs 7890)|
|Spain||27.30% (8444 vs 6633)|
|India||4.15% (8048 vs 7727)|
|Italy||24.68% (7494 vs 6010)|
Statistics were generated with Google Analytics.
There was a 9% increase in the number of (distinct IP) downloads of software packages in the fourth quarter of 2014 (106212 total) as compared to the fourth quarter of 2013 (97462 total). See the website for a full summary of package download stats.
A total of 63 software packages were added in the fourth quarter of 2014
bringing the count to 954 in devel (
Bioconductor 3.1) and 934 in release
On average, 20-30 software packages are submitted to
quarter resulting in 100+ new packages each year. New package guidelines are
posted on the
developers often have additional questions or special case situations. We
thought a more interactive exchange would help avoid common errors and
encourage questions before a package was submitted for review.
In mid December Marc and Sonail hosted a Google Hangout to share some key
development tips and solicit questions from a wider audience. Topics covered
were package organization, documentation and code reuse. Questions were posted
via a YouTube chat window and answered after Marc finished the slide
presentation. If you are considering submitting a package you may want to
check out the Webinar posted on the
The events page is updated regularly
with new courses and conference announcements. In January 2015, EMBL is hosting
both the European Developer’s conference and an Advanced
R Programming course.
European Bioconductor Developer’s Meeting European Molecular Biology Laboratory Heidelberg, Germany January 12-13, 2015
Advanced R Programming and Development European Molecular Biology Laboratory Heidelberg, Germany January 15-16, 2015
Send comments or questions to Valerie at firstname.lastname@example.org.