Bioconductor Newsletter
posted by Valerie Obenchain, July 2014
Contents
- Software Infrastructure
- Education and Outreach
- Build System
- Quarterly Project Statistics
- Resources, Courses and Conferences
Software Infrastructure
BSgenome packages 2bit conversion
Current use cases for BSgenome packages emphasize querying many smaller regions across the entire genome, e.g., assembling many transcript sequences from the underlying coding sequence genomic ranges. To support this use case efficiently, many BSgenome data packages now use the UCSC 2bit format to store the sequences on disk.
One limitation of the 2bit format is that it does not support genomes that contain letters other than A, C, G, T or N. These genomes (e.g., hg17, hg18, GRCh38, Ecoli, TAIR.04232008, and TAIR.TAIR9) still use the previous .rda storage format, which offers fast whole-chromosome access and somewhat slower, but still very usable, random access.
The 2bit format is currently available in devel and will be part of the next release of Bioconductor. Hervé and Martin worked on this project, with thanks to Michael for supporting the 2bit format in rtracklayer.
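To make the letter limitation concrete, here is a minimal Python sketch of 2-bit packing. It is only an illustration, not the BSgenome or rtracklayer code: the function names `pack` and `fetch` are hypothetical, the bit codes follow the UCSC convention (T=0, C=1, A=2, G=3), and the separate index that the real format keeps for runs of N is left out.

```python
# Conceptual sketch of 2-bit nucleotide packing (not the BSgenome implementation).
# Bit codes follow the UCSC 2bit convention: T=0, C=1, A=2, G=3.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        if base not in CODES:
            # Two bits give only four states, so any other letter has no code.
            raise ValueError(f"cannot store {base!r} in 2 bits")
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def fetch(packed, start, end):
    """Decode the bases in [start, end) by indexing straight into the bytes."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(start, end)
    )
```

Because every base sits at a fixed bit offset, `fetch(pack("ACGTTGCA"), 2, 6)` reads only the bytes covering the requested range, which is what makes many small random queries cheap. An ambiguity letter such as R has no 2-bit code, so `pack("ACGR")` raises an error.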
S4Vectors package
In April Hervé started to split out the low-level functions from IRanges and move them to the new S4Vectors package. IRanges had grown to 90 classes, 157 generics and 844 methods and was becoming difficult to maintain. The plan is to move code that does not involve ranges, e.g., the Vector and List virtual classes, and the DataFrame, Rle and Hits concrete classes. This is a work in progress and estimated to be about 30% done.
pileups
Nate has been working on a pileup() function which computes pileup statistics for BAM files. Design goals were versatile record filtering and flexible presentation of results for downstream analyses. Filtering is achieved through the ScanBamParam and PileupParam objects; output is a data.frame with variable columns based on the filtering applied.
pileup() is available in the Rsamtools package in devel. Other Bioconductor packages that offer pileup-like functions include gmapR (bam_tally), deepSNV (bam2R) and Rsubread (featureCounts). All have slightly different input requirements and output formats; see the man pages for details.
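For readers new to the term, the Python sketch below shows what a pileup computes. It is a conceptual illustration only and does not reproduce the Rsamtools interface: the input layout (ungapped reads given as start position, bases and base qualities), the `min_base_quality` argument and the output column names are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def pileup(reads, min_base_quality=0):
    """Count nucleotides per reference position.

    reads: iterable of (start, bases, qualities) for ungapped alignments on a
    single reference sequence, with `start` as a 1-based position.
    """
    counts = defaultdict(Counter)
    for start, bases, quals in reads:
        for offset, (base, qual) in enumerate(zip(bases, quals)):
            if qual >= min_base_quality:  # base-level quality filter
                counts[start + offset][base] += 1
    # One row per (position, nucleotide), similar in spirit to a long data.frame.
    return [
        {"pos": pos, "nucleotide": base, "count": n}
        for pos in sorted(counts)
        for base, n in sorted(counts[pos].items())
    ]
```

Raising `min_base_quality` drops low-quality bases before counting, so the same reads can yield different rows depending on the filter, which mirrors the idea of output that varies with the filtering applied.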
Git-SVN Bridge
The Bioconductor project uses a Subversion (SVN) source control system. SVN is effective for version control but does not offer social coding features such as GitHub’s issue tracking, pull requests or ease in granting permissions.
In response to popular request, Dan created the Git-SVN Bridge to allow GitHub repositories to sync with the Bioconductor SVN repository. Commits made in SVN are propagated to GitHub and vice versa. The service has been well received, with 73 bridges created as of June 2014.
To create a bridge see the Git-SVN Bridge HOWTO.
Bioconductor Amazon Machine Image (AMI)
The Bioconductor AMI has been overhauled and is now compatible with StarCluster. These enhancements make it straightforward to spin up a cluster with nodes that communicate via MPI, SSH or Sun Grid Engine. Details are available on the AMI page.
The process of creating the AMI has been automated using Vagrant and Chef. Our scripts are publicly available and can be used to provision an AMI, a virtual machine (using VirtualBox or VMware) or even a physical machine.
Education and Outreach
Web site re-design
The Bioconductor home page now has what we hope is a more intuitive and user-friendly interface. The ‘Install’, ‘Learn’, ‘Use’ and ‘Develop’ fields organize resources for the novice through the advanced developer.
Have a look at the new design.
biocViews
Sonali continued her work on biocViews this quarter. A new function, recommendBiocViews(), is available in the biocViews package. The function looks at words in the DESCRIPTION, man pages and vignette, and suggests possible terms for use in the biocViews: field of package DESCRIPTION files. The function also identifies invalid biocViews terms (e.g., misspellings) present in the package DESCRIPTION file.
recommendBiocViews() has been incorporated into the Single Package Builder that checks new package submissions; new package authors are encouraged to run it before submitting a package. Remember that biocViews terms are case-sensitive and branch-sensitive (i.e., terms for a Software package must come from the Software branch of biocViews).
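The case and branch rules can be pictured with a short Python sketch. This is not how recommendBiocViews() is implemented: the helper names `check_views` and `suggest_views` are hypothetical, the vocabulary is a tiny illustrative subset of the real hierarchy, and the simple word matching stands in for whatever the function actually does.

```python
import re

# Tiny illustrative vocabulary; the real biocViews hierarchy is far larger.
VOCAB = {
    "Software": {"Sequencing", "Alignment", "Coverage"},
    "ExperimentData": {"RNASeqData"},
}


def check_views(declared, branch):
    """Split declared terms into (valid, invalid) for one branch, case-sensitively."""
    allowed = VOCAB[branch]
    valid = [term for term in declared if term in allowed]
    invalid = [term for term in declared if term not in allowed]
    return valid, invalid


def suggest_views(text, branch):
    """Suggest terms from one branch whose name appears as a word in free text."""
    words = {word.lower() for word in re.findall(r"[A-Za-z]+", text)}
    return sorted(term for term in VOCAB[branch] if term.lower() in words)
```

With this vocabulary, `check_views(["Sequencing", "sequencing", "RNASeqData"], "Software")` accepts only `Sequencing`: the lower-case spelling fails the case check and `RNASeqData` belongs to a different branch.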
Sonali distributed recommended views for all devel software packages to the mailing list in June. For the complete list see this post.
Instructional videos
We are looking into short, single-topic videos as an interactive complement to traditional vignettes and workflows.
The plan is to create a series of 5-minute videos that each encapsulate a HOWTO skill or give an overview of one aspect of the project. You can tour the website with Dan or do a ‘quick start’ with Martin’s overview of key packages and classes. Watch Marc slice and dice an AnnotationDb object or read BAM and VCF files with Sonali and Valerie.
Sonali and Martin have led this effort and plan to unveil the first videos at BioC 2014 in Boston.
Build System
Branching the experimental data Subversion repository
Historically, only the Subversion repository for the Bioconductor software packages had a distinct branch for each release. Subversion repositories for experimental data and annotation packages had a trunk with no branches.
Starting with the Spring 2014 release, a branch was created in the Subversion repository for the experimental data. The motivation was to allow software and experimental data packages to evolve together in release and devel build environments. It was often the case that updates to a software package broke the companion experimental data. Changes made to the experimental data were committed to trunk and propagated to both release and devel builds, creating incompatibilities in one place or the other.
A consequence of creating the new branch is the need to bump ‘y’ of the ‘x.y.z’ version numbering scheme at release time (as we do for software). The Annotations have not changed; they are not under Subversion, do not go through automated builds and do not have a version policy.
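As a rough illustration of the bump, the Python sketch below advances ‘y’ and resets ‘z’. The function names are hypothetical, and the pairing of versions assumes that experimental data packages follow the software convention of an even ‘y’ in release and an odd ‘y’ in devel.

```python
def bump_y(version):
    """Return x.(y+1).0 for an 'x.y.z' version string."""
    x, y, _z = (int(part) for part in version.split("."))
    return f"{x}.{y + 1}.0"


def versions_at_release(devel_version):
    """Versions for the new release branch and the continuing devel line.

    Assumes the software convention of odd 'y' in devel and even 'y' in release.
    """
    release = bump_y(devel_version)
    return release, bump_y(release)
```

Under that assumption, a package at 1.5.3 in devel would become 1.6.0 in the new release branch and 1.7.0 in devel.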
Changes in how the experimental data Subversion repository is managed are relevant for developers only. The public repositories remain the same, with a separate repository for each Bioconductor version. There is no visible change for users accessing the public repositories.
New Mac OS X Mavericks build machines
An R 3.1.0 binary for Mac OS X 10.9 (Mavericks) is now available from R Core. This R has been built with Xcode 5 to leverage new compilers and functionalities in Mavericks not available in earlier OS X versions.
To provide compatible Bioconductor package binaries we needed new build machines. Dan has configured two new Mavericks machines, one in release (morelia) and one in devel (oaxaca).
The introduction of Xcode and the clang compiler resulted in new errors for packages with C and C++ code. Nate and Dan spent many hours troubleshooting with package authors and came up with a list of common problems and solutions. Lessons learned were distilled into the C++/Mavericks Best Practices document.
Quarterly Project Statistics
There were 86,953 downloads of Bioconductor software packages over the past quarter (April - June). During this time 41 new software packages were accepted. A full summary of package download stats is available here.
The web site saw approximately 119,000 visitors (a 26% increase from the previous year) from 180 countries, with the US, China, United Kingdom, Germany, and Canada at the head of the pack.
Resources, Courses and Conferences
Data Analysis for Genome Biology (CSAMA)
This one-week intensive course is offered each year in Brixen-Bressanone, Italy and focuses on statistical and computational analysis of large-scale biological experiments. The course is intended for researchers who have basic familiarity with the experimental technologies and are interested in developing their own advanced data analyses.
Topics this year included RNASeq differential expression, variant calling and ChIP-Seq as well as the essentials of statistical testing, machine learning, visualization and of course using R. Michael Lawrence presented a Scalable Genomics lab which covered limiting resource consumption, using iteration when appropriate, and scaling genome graphics. Much of the material is based on a manuscript currently in press at the journal Statistical Science.
Materials from the June 2014 course are available on the web.
Community Resources
New Community Resources links include the book and lab from MOOC: PH525x Data Analysis for Genomics. This online course was offered in April 2014 by Rafael Irizarry and Michael Love. Course goals were to enable students to analyze and interpret data generated by modern genomics technology, specifically microarray and next generation sequencing. Applications included gene expression, association of genomic variants to disease, and measuring epigenetic marks.
Also on the Community page are links to YouTube videos made by community members, tips on getting started with R/Bioconductor by Thomas Girke, an analysis of 23andMe data by Vince Buffalo and Sean Davis’ R/Bioconductor blog.
BioC 2014
The annual meeting is in Boston this year (July 30 - August 1). See the web site for a list of speakers and workshops.
Please send comments or questions to Valerie at vobencha@fhcrc.org.