What is Bioconductor?
Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.
The project was started in the fall of 2001. The Bioconductor core team is based primarily at the Fred Hutchinson Cancer Research Center. Other members come from various US and international institutions.
Bioconductor is primarily based on the R programming language, but we do accept contributions in any programming language. There are two releases of Bioconductor every year (they appear shortly after the corresponding R release). At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs. In addition, a large number of metadata packages are available. They are mainly, but not solely, oriented towards different types of microarrays.
You can read the annual reports for further project details.
Bioconductor Packages.
Although initial efforts focused primarily on DNA microarray data analysis, many of the software tools are general and can be used broadly for the analysis of genomic data, such as SAGE, sequence, or SNP data.
Goals of the Bioconductor Project.
The broad goals of the project are to
- provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data;
- facilitate the integration of biological metadata in the analysis of experimental data: e.g. literature data from PubMed, annotation data from LocusLink;
- allow the rapid development of extensible, scalable, and interoperable software;
- promote high-quality documentation and reproducible research;
- provide training in computational and statistical methods for the analysis of genomic data.
Main Features of the Bioconductor Project
- Use of R. R and the R package system are the main vehicles for designing and releasing software. R (www.r-project.org) is a widely used open source language and environment for statistical computing and graphics; it is a GNU implementation of the S language, comparable to S-Plus. It provides a high-level programming environment together with a sophisticated packaging and testing paradigm. It has a number of mechanisms that allow it to interact directly with software written in many different languages (see Omega Project). These tools allow users to incorporate modules based on other work. Viewed in that context, adopting R as a vehicle does not exclude other development environments and paradigms. In those cases R can act as glue, connecting what might otherwise be separate products. Finally, R is under very active development by a dedicated team of researchers with a strong commitment to good documentation and software design.
- Documentation and reproducible research. One of the goals of the project is to provide high-quality documentation and encourage reproducible research.
Each package contains at least one vignette, which is a document that provides a textual, task-oriented description of the package's functionality and that can be used interactively. Package vignettes come in several forms. Many are simple "HowTo"s, that is, they are designed to demonstrate how a particular task can be accomplished with that package's software. Others provide a more thorough overview of the package, or might even discuss general issues related to the package. In the future, we hope to provide vignettes that are not tied to a specific package but instead demonstrate more complex concepts. As with all aspects of the Bioconductor project, users are encouraged to participate in this effort.
The vignettes are generated using the Sweave function from the R package tools. They are documents that intermix text, code, and output (textual and graphical) and can be regenerated automatically whenever the data or analyses change. Additional supporting software for vignettes will aid users in obtaining data and sample code, stepping through specific analyses, and applying these analyses to their own data (reposTools package).
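The weaving step described above can be sketched with base R alone. The example below is a minimal illustration, not the project's own build process: it writes a small Sweave source file (the file name demo.Rnw and the chunk label are made up for this example), runs it through Sweave, and reads back the generated LaTeX. It assumes a current R release, where Sweave is exported from the utils package.

```r
# Minimal sketch: write a tiny Sweave (.Rnw) source into a temporary
# directory and weave it. All file names here are illustrative only.
workdir <- tempfile("sweave-demo-")
dir.create(workdir)
rnw <- file.path(workdir, "demo.Rnw")

writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of four values is computed below.",
  "<<mean-chunk, echo=TRUE>>=",
  "x <- c(2, 4, 6, 8)",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Sweave() evaluates each code chunk and writes demo.tex into the
# current working directory, interleaving text, code, and output.
oldwd <- setwd(workdir)
utils::Sweave(rnw, quiet = TRUE)
setwd(oldwd)

# Inspect the woven document; the chunk's code and its result appear
# between the surrounding lines of text.
woven <- readLines(file.path(workdir, "demo.tex"))
cat(woven, sep = "\n")
```

Because the output is regenerated from the source each time, changing the values assigned to x and weaving again updates the reported result without any manual editing of the document.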
- Statistical and graphical methods. The Bioconductor project aims to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data. Analysis packages are available for: pre-processing Affymetrix and cDNA array data; identifying differentially expressed genes; graph theoretical analyses; plotting genomic data. In addition, the R package system itself provides implementations for a broad range of state-of-the-art statistical and graphical techniques, including linear and non-linear modeling, cluster analysis, prediction, resampling, survival analysis, and time-series analysis.
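As a rough illustration of the kind of analysis listed above, the following sketch screens for differentially expressed genes using only functions from base R's stats package. It is not taken from any Bioconductor package: the expression matrix is simulated, the gene names and group labels are invented, and a per-gene t-test with Benjamini-Hochberg adjustment is just one simple choice among many.

```r
# Simulated expression matrix: 100 genes (rows) x 6 samples (columns).
# The data are random numbers, not real measurements.
set.seed(1)
expr <- matrix(rnorm(100 * 6), nrow = 100,
               dimnames = list(paste0("gene", 1:100), NULL))
group <- factor(rep(c("control", "treated"), each = 3))

# Shift the first five genes in the treated samples so that the
# simulated data contain some signal to detect.
expr[1:5, group == "treated"] <- expr[1:5, group == "treated"] + 4

# Two-sample t-test for each gene, then Benjamini-Hochberg adjustment
# of the p-values to account for testing many genes at once.
pvals <- apply(expr, 1, function(y) t.test(y ~ group)$p.value)
padj <- p.adjust(pvals, method = "BH")

# Genes with the smallest adjusted p-values.
head(sort(padj))
```

Dedicated analysis packages typically go further than this, for example by handling pre-processing and by using variance estimates that are more stable with few samples.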
- Annotation. The Bioconductor project provides software for associating microarray and other genomic data in real time with biological metadata from web databases such as GenBank, LocusLink and PubMed (annotate package). Functions are also provided for incorporating the results of statistical analysis in HTML reports with links to annotation WWW resources.
Software tools are available for assembling and processing genomic annotation data from databases such as GenBank, the Gene Ontology Consortium, LocusLink, UniGene, and the UCSC Human Genome Project (AnnBuilder package).
Data packages are distributed to provide mappings between different probe identifiers (e.g. Affy IDs, LocusLink, PubMed). Customized annotation libraries can also be assembled.
- Bioconductor short courses. The Bioconductor project has developed a program of short courses on software and statistical methods for the analysis of genomic data. Courses have been given for audiences with backgrounds in either biology or statistics. All course materials (lectures and computer labs) are available on the WWW. Customized short courses may also be designed for interested parties.
- Open source. Bioconductor has a commitment to full open source discipline, with distribution via a SourceForge-like platform. All contributions are expected to exist under an open source license such as GPL2 or BSD. There are many different reasons why open-source software is beneficial to the analysis of microarray data and to computational biology in general. The reasons include:
- providing full access to algorithms and their implementation
- giving users the ability to fix bugs and to extend and improve the supplied software
- encouraging good scientific computing and statistical practice by providing appropriate tools and instruction
- providing a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data
- ensuring that the international scientific community is the owner of the software tools needed to carry out research
- leading and encouraging commercial support and development of those tools that are successful
- promoting reproducible research by providing open and accessible tools with which to carry out that research [reproducible research is distinct from independent verification]
- Open development. Users are encouraged to become developers by contributing either Bioconductor-compliant packages or documentation.
- New Users. If you are new to Bioconductor, you might consider buying the book Bioinformatics and Computational Biology Solutions Using R and Bioconductor.