Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language.
The Bioconductor release version is updated twice each year and is appropriate for most users. There is also a development version, to which new features and packages are added before they are incorporated into the release. A large number of metadata packages provide pathway, organism, microarray and other annotations.
The Bioconductor project started in 2001 and is overseen by a core team, based primarily at the Fred Hutchinson Cancer Research Center, and by other members coming from US and international institutions. It gained widespread exposure in a 2004 Genome Biology paper.
Most Bioconductor components are distributed as R packages. The functional scope of Bioconductor packages includes the analysis of DNA microarray, sequence, flow, SNP, and other data.
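Because the components are ordinary R packages, they can be installed from within an R session. The following is a minimal sketch rather than an official recipe: it assumes network access and uses the BiocManager package from CRAN. The helper name `install_bioc_package` is hypothetical, and "limma" is used only as an example package name.

```r
# Hypothetical helper: install a Bioconductor package if it is not yet available.
install_bioc_package <- function(pkg) {
  # BiocManager itself is distributed on CRAN.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Install the requested package only when it cannot already be loaded.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    BiocManager::install(pkg)
  }
  # Report (invisibly) whether the package is now available.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Example calls, commented out so that sourcing this sketch has no side effects:
# install_bioc_package("limma")
# BiocManager::version()   # reports the Bioconductor version in use
```

Wrapping the calls in a function keeps the installation step explicit and lets a script skip it when the package is already present.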
The broad goals of the Bioconductor project are:
The R Project for Statistical Computing. Using R provides a broad range of advantages to the Bioconductor project, including:
Documentation and reproducible research. Each Bioconductor package contains one or more vignettes, documents that provide a textual, task-oriented description of the package's functionality. Vignettes come in several forms. Many are "HowTo"s that demonstrate how a particular task can be accomplished with that package's software. Others provide a more thorough overview of the package or discuss general issues related to the package.
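Vignettes can be listed and opened from an R session with the standard `vignette()` and `browseVignettes()` functions. The sketch below lists the vignettes of the `grid` package, chosen only because it ships with R; the same calls would apply to an installed Bioconductor package, and the "limma" example in the comments assumes that package is installed.

```r
# List the vignettes that ship with an installed package.
vigs <- vignette(package = "grid")

# The result holds a character matrix; keep the vignette name and title columns.
vig_table <- as.data.frame(vigs$results[, c("Item", "Title"), drop = FALSE])
print(vig_table)

# To open a single vignette by name (names come from the "Item" column):
# vignette("grid", package = "grid")

# To browse the vignettes of a Bioconductor package in a web browser
# (assumption: the package is installed):
# browseVignettes("limma")
```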
Statistical and graphical methods. The Bioconductor project provides access to powerful statistical and graphical methods for the analysis of genomic data. Analysis packages address workflows for the analysis of oligonucleotide arrays, sequence data, flow cytometry, and other high-throughput genomic data. The R package system itself provides implementations of a broad range of state-of-the-art statistical and graphical techniques, including linear and non-linear modeling, cluster analysis, prediction, resampling, survival analysis, and time-series analysis.
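As an illustration of the general-purpose techniques that R supplies, the sketch below fits a linear model to each row of a simulated expression matrix and clusters the samples. It uses only base R; the data, group labels and effect size are invented for the example, and a real analysis would more likely rely on a dedicated Bioconductor package.

```r
set.seed(1)

# Simulated data (assumption): 50 genes measured in 12 samples, two groups of six.
n_genes <- 50
n_samples <- 12
group <- factor(rep(c("control", "treated"), each = n_samples / 2))
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("sample", seq_len(n_samples))))

# Add an artificial shift to the first five genes in the treated samples.
expr[1:5, group == "treated"] <- expr[1:5, group == "treated"] + 3

# Linear modeling: fit one model per gene and keep the p-value of the group effect.
p_values <- apply(expr, 1, function(y) {
  fit <- lm(y ~ group)
  summary(fit)$coefficients["grouptreated", "Pr(>|t|)"]
})

# Adjust for multiple testing with the Benjamini-Hochberg procedure.
adj_p <- p.adjust(p_values, method = "BH")

# Cluster analysis: hierarchical clustering of samples, cut into two clusters.
sample_clusters <- cutree(hclust(dist(t(expr))), k = 2)

head(sort(adj_p))
table(group, sample_clusters)
```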
Annotation. The Bioconductor project provides software for associating microarray and other genomic data in real time with biological metadata from web databases such as GenBank, Entrez Gene and PubMed (annotate package). Functions are also provided for incorporating the results of statistical analysis in HTML reports with links to annotation web resources. Software tools are available for assembling and processing genomic annotation data from databases such as GenBank, the Gene Ontology Consortium, Entrez Gene, UniGene, and the UCSC Human Genome Project (AnnotationDbi package). Annotation data packages are distributed to provide mappings between different identifiers (e.g. Affy IDs, Entrez Gene IDs, PubMed IDs). Customized annotation libraries can also be assembled.
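A mapping between identifier types can be sketched with the `select()` interface of AnnotationDbi. The example assumes that both AnnotationDbi and the human annotation data package org.Hs.eg.db are installed; the helper name `map_symbols_to_entrez` and the gene symbols in the commented call are illustrative only.

```r
# Hypothetical helper: map gene symbols to Entrez Gene IDs and gene names.
map_symbols_to_entrez <- function(symbols) {
  # Both packages are assumed to be installed; stop with a clear message if not.
  needed <- c("AnnotationDbi", "org.Hs.eg.db")
  absent <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
  if (length(absent) > 0) {
    stop("Please install first: ", paste(absent, collapse = ", "))
  }
  # select() returns a data frame with one row per matching key/column combination.
  AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                        keys = symbols,
                        columns = c("ENTREZID", "GENENAME"),
                        keytype = "SYMBOL")
}

# Example call, commented out because it needs the annotation packages:
# map_symbols_to_entrez(c("TP53", "BRCA1"))
```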
Bioconductor short courses. The Bioconductor project has developed a program of short courses on software and statistical methods for the analysis of genomic data. Courses have been given for audiences with backgrounds in either biology or statistics. All course materials (lectures and computer labs) are available on the Bioconductor website.
Open source. The Bioconductor project has a commitment to full open source discipline, with distribution via a public Subversion (version control) server. All contributions exist under an open source license such as Artistic 2.0, GPL2, or BSD. There are many different reasons why open source software is beneficial to the analysis of microarray data and to computational biology in general. The reasons include:
Open development. Users are encouraged to become developers by contributing either Bioconductor-compliant packages or documentation. Additionally, Bioconductor provides a mechanism for linking together different groups with common goals to foster collaboration on software, often at the level of shared development.