Google Summer of Code Ideas List for 2013

This page is the main information point for Bioconductors participation in the Google summer of Code project this year (2013).

Overview

Each selected student (mentee) will be paid USD$5000 to work on a Bioconductor project for 3 months during the summer.

Students should look at the list of projects and see if any project interests them. Email the project mentors to express your interest, and describe any prior experience.

Students with ideas for Bioconductor projects not listed below are encouraged to email any of the mentors listed below with project ideas.

Students will submit project applications directly to Google.

Google will award a certain number of student slots to the Bioconductor project.

The Bioconductor administrators and mentors will rank projects in order of importance to the project, and the top projects will be funded.

Any selected students will be expected to register with the bioconductor and bioc-devel mailing lists.

There is a timeline posted at Google explaining how this works. Students are encouraged to look at this and make sure that they can commit to this. There is also a FAQ in case people have other questions that are not addressed here.

Here are our suggested ideas:

ExperimentHub project

Background/Motivation: As very large genomic data sets become more and more common, computational biologists are spending inordinate time transforming data from the format of the original resource to a format amenable to computation in their programming language of choice. The R / Bioconductor community needs programmatic access to cloud-based experimental data resources that can be readily incorporated into their own work flows.

Goal

AnnotationHub and its supporting packages are primed to support such a project. AnnotationHub provides infrastructure to make well-curated resources available to R software clients, but it needs the addition of a web interface to allow addition of user-supplied resources, including transformation of data into formats amenable to direct use by R clients.

Specific Aim

Work with us to create a web accessible interface that does the following:

  1. allows the user to add large genomic resources and associated metadata to a NOSQL back-end database.
  2. transform genomic resources into R / Bioconductor data structures to simplify end-user access via existing AnnotationHub infrastructure. For instance, the user might select from a list of 'recipes' (functions which transform raw data into a serialized R object) and/or contribute their own recipes for data transformation. In addition, the project might
  3. allow users to upload literate programming documents illustrating the use of their data, and the documents' code would be evaluated by the web application, reporting any errors, and 4) implement social networking concepts like tags and rating systems to help guide users to the most useful resources.

Skills required

Familiarity with R and with AJAX Web 2.0 programming.

Test

Using the R packages Rook and rmongodb, write a web form that asks the user for their name and age and stores this information in a MongoDB collection. A second page asks the user for an age and then displays all records for people that age or older. This should be sent to us as an R package so that all we have to do (once we've installed MongoDb) is install the package and run:

library(testPkg)
run()

..and we'll see the desired functionality.

Mentors

Dan Tenenbaum dtenenba@fhcrc.org Marc Carlson mcarlson@fhcrc.org

Backup Mentor

Martin Morgan mtmorgan@fhcrc.org

Shiny Bioconductor Objects Project

Background/Motivation

The Shiny package allows for easy creation of interactive web graphics from R objects. Bioconductor packages have many objects that represent biological data or results. For each of these Bioconductor objects, there exists a typical set of visualizations to help users explore their data. Normally, these visuals are replotted several times until certain parameters are tweaked to show the image in a way that conveys a specific insight. This project pairs these standard Bioconductor objects with more user-friendly Shiny visualizations via new display() methods.

Tasks

You will also explore the following topics:

Create display() methods for each of the four object types mentioned above.
These methods should use Shiny to display a multi-tabbed version of each of the objects in question where each tab offers an alternate view of the object in question. Each method should allow the caller to view a table of annotations in one tab and to also visualize the data using either a heat map or a Gviz plot ( as appropriate) in another tab.

Make sure that your R code to draws these 4 very similar methods is written in such a way so that you don't repeat yourself all the time.

Integrate these tests into the appropriate bioconductor packages along with proper documentation. Which package is appropriate will be determined later as the project develops.

Alternate Tasks

Skills required

Familiarity with R. Understanding of basic computational biology or an willingness/ability to learn about such things as needed.

Test

Find the example from the manual page for Biobase::ExpressionSet, and run it. And then plot the relevant data contained in the generated ExpressionSet object as a heatmap and save the output. Then send me the plot. BONUS: make sure that all the labels in your plot are fully visible.

Mentors

Marc Carlson mcarlson@fhcrc.org, Michael Lawrence lawrence.michael@gene.com

Backup Mentor

Martin Morgan mtmorgan@fhcrc.org

BiocParallel / BatchJobs integration

Background / Motivation

High-throughput sequencing generates data sets consisting of hundreds of millions of sequence reads per sample. As with any large data, timely processing depends on parallel computing. The Bioconductor project has developed the BiocParallel package, an abstraction around several parallel implementations in R. The API is tailored to typical use cases in biological data analysis and integrates with existing Bioconductor data structures. Another package, BatchJobs, executes R functions as scheduled cluster jobs, through an abstraction that has been implemented for several popular schedulers, including LSF, PBS and SGE.

As sequencing data pipelines are typically executed on managed clusters, there is a need for BiocParallel to interact with cluster schedulers. We aim to add a new backend to BiocParallel that delegates to BatchJobs for this interaction.

Tasks

Skills required

Tests

Mentors

Source Code & Build Reports »

Source code is stored in svn (user: readonly, pass: readonly).

Software packages are built and checked nightly. Build reports:

 

Development Version»

Bioconductor packages under development:


Developer Resources:

Fred Hutchinson Cancer Research Center