Conference Survey: http://goo.gl/forms/NOi0dz8Wmx
This conference highlights current developments within and beyond Bioconductor. Morning scientific talks and afternoon practicals provide conference participants with insights and tools required for the analysis and comprehension of high-throughput genomic data. ‘Developer Day’ precedes the main conference on July 20, providing developers and would-be developers an opportunity to gain insights into project direction and software development best practices.
Morning – 8:30 - 12:00, Arnold Building M1-A303/5/7 (times approximate)
Afternoon – 1:00 - 5:00, Arnold Building M1-A303/5/7
Morning: Invited Scientific and Community Presentations – Pelton Auditorium
Coffee Break, 10:00 - 10:30
Lunch, 12:00 - 1:00
Afternoon: Concurrent Workshops – Pelton Auditorium and Arnold Building M1-A303, 5, 7.
Session 1, 1:00 - 2:45
Coffee Break, 2:45 - 3:15
Session 2, 3:15 - 5:00
Evening, 5:15 - 6:30: Poster session; social hour
Morning: Invited Scientific and Community Presentations – Pelton Auditorium
Coffee Break, 10:00 - 10:30
Lunch, 12:00 - 1:00
Afternoon: Concurrent Workshops – Pelton Auditorium and Arnold Building M1-A303, 5, 7.
Session 3, 1:00 - 2:45
Coffee Break, 2:45 - 3:15
Session 4, 3:15 - 5:00
Evening, 5:15 - 6:30: Poster session; social hour
Presentations are made available after the conference; see course material.
Aaron Lun. Detecting differentially bound regions in ChIP-seq data with csaw.
This workshop will use the csaw package to perform a differential binding analysis of a public ChIP-seq data set. Participants will run through all steps of the csaw analysis pipeline, including counting reads into windows, abundance-based filtering, normalization of library-specific biases and statistical modelling. A basic understanding of the ChIP-seq procedure is expected.
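As a rough illustration of the first two steps named above (counting reads into windows, then abundance-based filtering), here is a minimal Python sketch. It is not the csaw API: the function names, the fixed non-overlapping windows, and the raw-count threshold are simplifying assumptions made for this example only.

```python
from collections import Counter

def count_reads_in_windows(read_starts, window_width):
    """Count reads per fixed-width, non-overlapping window, keyed by window index.

    Assumption: each read is represented only by its start coordinate.
    """
    return Counter(pos // window_width for pos in read_starts)

def filter_by_abundance(window_counts, min_count):
    """Keep only windows holding at least min_count reads (hypothetical threshold)."""
    return {w: n for w, n in window_counts.items() if n >= min_count}

# Toy read start positions on a single chromosome.
read_starts = [5, 12, 18, 103, 110, 111, 119, 250]
counts = count_reads_in_windows(read_starts, window_width=50)
kept = filter_by_abundance(counts, min_count=3)
print(sorted(kept.items()))
```

Low-abundance windows are dropped before any testing, which reduces the number of hypotheses considered downstream.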
Andrzej Oles, Wolfgang Huber. Basics of image data and spatial patterns analysis in R.
We demonstrate how image data can be processed in R, how quantitative information can be extracted from images, and how statistical methods can be used to explore and understand the data. In particular, we use EBImage to load and display images in R, and to manipulate, transform and apply filters to them. Using microscopy of cellular assays as an example, we show how to perform image segmentation followed by the extraction of quantitative object characteristics. Such information can then be used for downstream analyses, for example, analysis of spatial point patterns with the help of spatstat.
Presenter: Andrzej Oles
Bioconductor Core Team. An introduction to R and Bioconductor.
This lab provides an introduction to R / Bioconductor for high-throughput sequence analysis. It is designed for those who have some but not a lot of familiarity with R and Bioconductor. The first part of the lab focuses on R data types, functions, classes, methods, the package and help systems, and the Bioconductor web site. The second part of the lab takes a quick tour of essential packages, classes, and methods for sequence analysis. We will make brief stops at essential Bioconductor packages like GenomicRanges, Biostrings, GenomicAlignments, GenomicFeatures, and AnnotationHub.
Dan Tenenbaum, Tengfei Yin, Nan Xiao. Working with Docker containers for R, Bioconductor, and common workflow language / Rabix integration.
Part I (Dan Tenenbaum): An introduction to Docker and the Bioconductor Docker containers
What is Docker and how can it help you? We’ll introduce Docker and its idea of containerized applications, and how they can help with reproducibility and aspects of development (identical testing environments, a blank slate, pre-installed dependencies). Bioconductor provides pre-built containers which already contain sets of packages used for various workflows. These containers can be used directly or you can in turn build upon them to make your own containers. We’ll discuss and demonstrate a number of use cases, and end with an exercise where you create your own container.
Part II (Tengfei Yin, Nan Xiao): We will introduce the Common Workflow Language (CWL), the R package cwl, and the Rabix implementation. We will then demonstrate how to write an R command-line tool with docopt, how to convert an R command-line tool to CWL, how to use the rabix R package’s R interface to describe your tool, and how to use Rabix to develop, deploy and run it on the AWS cloud with the SBG platform or locally. We will also demonstrate dockerizing R Markdown documents with Rabix support using the liftr package, and automating a workflow (raw data upload, pipeline execution, and report retrieval) with the sbgr API package.
Greg Finak. Gating Designed Experiments With openCyto.
This workshop will cover flow cytometry and CyTOF data analysis in Bioconductor using the OpenCyto framework. Users will learn to read in raw data, perform compensation and transformation and gating, and generate figures using the new ggcyto visualization framework based on ggplot. The focus will be on designed experiments where multiple samples are matched (e.g. as treatment and control). Users will learn how to use OpenCyto to leverage the positive and negative control samples to derive data-driven gates.
Hervé Pagès, Michael Lawrence. Practical introduction to Bioconductor foundational data structures for high throughput sequencing analysis.
James MacDonald. FAQ Live! Common questions and expert solutions, delivered in person.
Mystified by design matrices? Confused by contrasts? Bugged by errors? Then this is the workshop for you! In this workshop we will cover (at the very least) how to specify and understand design matrices and contrasts (critical for analyzing data using limma, DESeq/DESeq2, edgeR, etc.), and how to interpret coefficients of a linear model. We will also show how to debug your (and others’) code. This is a critical skill, not only for ensuring that your own code runs correctly, but for being able to provide feedback to package authors.
These are two of the most frequently asked questions on the Bioconductor support site, but you might have questions of your own. If you plan on attending this workshop, and have a particular question you would like to discuss, please email me: jmacdon at uw dot edu, and I will add your question to the list. There will be opportunities to ask questions during the workshop as well, but giving me time to prepare will increase the number of questions we can cover.
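To make the design matrix and contrast vocabulary concrete, here is a minimal Python sketch of a two-group comparison with treatment coding. It is an illustration only and does not use limma, DESeq2 or edgeR; the helper names and the toy expression values are assumptions made for this example.

```python
def design_matrix(groups, reference):
    """Treatment-coded design: an intercept column plus a 0/1 indicator
    for samples that are not in the reference group."""
    return [[1.0, 0.0 if g == reference else 1.0] for g in groups]

def fit_two_column_ols(X, y):
    """Least-squares fit for a two-column design via the normal equations."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * v for r, v in zip(X, y))
    w = sum(r[1] * v for r, v in zip(X, y))
    det = a * d - b * b
    return [(d * u - b * w) / det, (a * w - b * u) / det]

def apply_contrast(contrast, coefficients):
    """A contrast is a linear combination of the fitted coefficients."""
    return sum(c * beta for c, beta in zip(contrast, coefficients))

groups = ["control", "control", "treated", "treated"]
expression = [1.0, 3.0, 6.0, 8.0]  # toy values for one gene

X = design_matrix(groups, reference="control")
beta = fit_two_column_ols(X, expression)

treated_vs_control = apply_contrast([0.0, 1.0], beta)  # difference of group means
treated_mean = apply_contrast([1.0, 1.0], beta)        # intercept + treatment effect
print(beta, treated_vs_control, treated_mean)
```

With this coding the intercept is the mean of the reference group and the second coefficient is the treated-minus-control difference, which is why changing the reference level changes what each coefficient means.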
Kasper Hansen. Methylation and the analysis of Illumina 450k microarrays.
In this tutorial we will present tools for analysis of DNA methylation data. We will focus on using the minfi package for analysis of the Illumina 450k methylation microarray. Time permitting, we will also discuss using the bsseq package for analysis of whole genome bisulfite sequencing.
Leonard Goldstein. SGSeq and alternative splicing.
The SGSeq package provides a framework for analyzing annotated and novel splice events from RNA-seq data. SGSeq predicts exons and splice junctions from reads aligned against a reference genome and assembles them into a genome-wide splice graph. Splice events are identified from the graph and quantified using reads spanning event boundaries. This workshop provides an introduction to SGSeq functionality, including splice event detection, quantification, annotation and visualization.
Levi Waldron, Tim Triche, Aedin Culhane. Challenges and opportunities from integrative analysis of multi-omic data sets.
The falling cost of genomic assays facilitates collection of multi-assay genomic data (e.g., gene and transcript expression, structural variation, copy number, methylation, and microRNA data) from a set of biological specimens. This workshop takes participants through acquisition of multi-assay -omics data from major cancer genomics projects, representation of these complex data in Bioconductor, and integrative analysis for quality control, visualization, and statistical inference spanning data types.
Marc Carlson, Sonali Arora. Bioconductor annotation resources.
Annotation is the process of adding valuable contextual information to your experimental findings, and the Bioconductor project has always had a large number of resources to assist with this. There have long been hundreds of annotation packages and many web services like biomaRt, and now there are also thousands of other valuable resources in the AnnotationHub.
The first part of this lab will provide an overview of these different kinds of resources, and include demonstrations and exercises to help students learn both what is available and how these resources are commonly used. Topics covered will include accessing resources from the AnnotationHub, how to make use of the select interface with AnnotationDb objects and how to get range based data from TxDb or OrganismDb objects. This part of the lab is primarily aimed at new users who need to learn what sort of annotations are available and how to make use of them. It may also be useful for old timers who are still using outdated ways of accessing data like Bimap objects.
The second part of this lab will focus on the new AnnotationHub, and on how to write recipes so that new resources can be added to the hub. The AnnotationHub has changed dramatically over the past year and we are interested in members of the community who might feel motivated to create recipes that can be used to expand the available content. This second part of the lab is therefore aimed at more senior users who are interested in seeing newer resources made available in the hub.
Mike Love. Differential expression, manipulation, and visualization of RNA-seq reads.
We will cover basic steps in RNA-seq analysis using a variety of Bioconductor packages, including: loading gene annotations from a variety of sources (GenomicFeatures), creating a count table which can be used by a variety of statistical packages within Bioconductor (GenomicAlignments, Rsubread), exploratory analysis, visualization and differential expression testing (DESeq2), annotation of results tables (AnnotationDbi), generation of HTML reports (ReportingTools), and tools for examining alignments (GenomicAlignments).
Nicole Deflaux, Siddhartha Bagaria, Craig Citro. GoogleGenomics.
Google has some pretty amazing big data computational “hammers” that they have been applying to search and video data for a long time. In this workshop we take those same hammers and apply them to whole genome sequences.
We do this all from the comfort of the R prompt using common packages including VariantAnnotation, ggbio, ggplot2, dplyr, bigrquery, and the new Bioconductor package GoogleGenomics which provides an R interface to Google’s implementation of the Global Alliance for Genomics and Health API.
Thomas Girke. Automated NGS workflows with systemPipeR running on clusters or single machines, with a focus on VAR-seq.
This tutorial introduces systemPipeR, an R package designed for building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications. The package also provides support for running command-line software, such as NGS aligners and variant callers, on both single machines and compute clusters. The first part of the tutorial introduces the basic design of the package, and the second part gives an overview of a typical VAR-Seq analysis workflow including read QC/preprocessing, variant aware read alignments, variant calling, and annotations of SNPs and indels with genomic context information.
Valerie Obenchain, Martin Morgan, et al. Management and analysis of large genomic data.
This lab will cover strategies for managing large genomic data. Scalable computing techniques such as iteration, data restriction, file management and parallel evaluation will be discussed and demonstrated in analysis examples.
The parallel section will include an orientation to the BiocParallel package. BiocParallel offers a unified API to the parallel, snow, BatchJobs and foreach packages. Each of these is a solid package in its own right; BiocParallel aims to provide easy and consistent access to the different parallel back-ends while preserving their individual strengths. Topics covered will include logging, error handling, monitoring progress of long running jobs, and setting timeout limits on workers.
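The strategies listed for this lab (iteration, data restriction and parallel evaluation) can be sketched in a few lines of Python. This is a language-neutral illustration, not BiocParallel: the record layout, the chunk size, and the use of a thread pool as a stand-in for a parallel back-end are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def chunks(iterable, size):
    """Iteration: yield successive lists of at most `size` records."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def summarize(chunk):
    """Data restriction: read only the 'score' field and return (sum, count)."""
    scores = [rec["score"] for rec in chunk]
    return sum(scores), len(scores)

# Hypothetical record stream standing in for a large file.
records = ({"id": i, "score": i % 5, "payload": "x" * 10} for i in range(1000))

# Parallel evaluation: each chunk is summarized by a worker.
# Note: pool.map submits every chunk up front, so this sketch bounds the
# work per task rather than the total memory in use.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize, chunks(records, 100)))

total = sum(s for s, _ in partials)
n = sum(c for _, c in partials)
mean_score = total / n
print(len(partials), mean_score)
```

Reducing small per-chunk summaries, rather than collecting the raw records, is what keeps the final step cheap regardless of input size.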
Jenny Drnevich, University of Illinois. An online platform for NGS trainers to share teaching experience and materials.
Introduction. Over the last decade, the exponential increase in the use of next generation sequencing (NGS) applications has led to a high demand for researchers who are capable of analysing such data. Consequently, demand for training in this area has increased; researchers with experience in this analysis are often tasked with training other scientists and have to dedicate a significant portion of their time to prepare lectures and practicals, time which is already limited by their research projects.
Aim. To tackle this issue, we set out to create a system for NGS trainers to exchange teaching experience and materials. Such a system not only greatly reduces the time required to put together lectures and practicals, but also promotes exchange between NGS trainers across the globe. The goal is a community of NGS trainers who share best training practices and improve their teaching, in order to equip scientists with the skill set needed to analyse and interpret their data.
Results. We here present a platform for trainers to upload their course materials using the “Git” versioning software as a back-end, according to a common set of descriptors. The descriptions are uploaded as markdown files using a set of pre-defined keywords to categorise the materials. A simple user interface allows trainers to search for the descriptive files using such keywords and retrieve specific modules from the materials repository. Information about the repository is available here and content will be accessible through the Goblet training portal.
Shilin Zhao, Vanderbilt University. An R package to identify context-dependent functional transcription factor pairs.
Background. Transcription factors (TFs) are fundamental regulators of gene expression and generally function in a complex and cooperative manner. Identifying context-dependent cooperative TFs is essential for understanding how cells respond to environmental change. The huge amount of various omics data currently available, providing genome-wide physical binding and functional effect information about transcription, holds a great opportunity to study TF cooperativity.
Results. Here we developed an R package, FunTFPair, which provides an easy and powerful way to identify condition-specific TF pairs by integrating transcription factor binding sites from ENCODE and gene expression profiles from GEO. Users only need to provide the GEO ID that they are interested in. FunTFPair will automatically retrieve the expression profiles of the input GEO ID, get TF targets from ENCODE, and identify TF pairs whose common targets show statistically significant differences under changed experimental conditions or exhibit coordinated transcription in a specific condition. The functional TF pairs and their relative importance will be reported in a TF cooperation network. Two datasets from GEO are used as examples to demonstrate the usage and reliability of the package.
Conclusions. FunTFPair provides a simple and powerful way to explore potential cooperative TFs under specific conditions that users are interested in. The package is under development and currently available in a github repository. We plan to submit to Bioconductor in the near future.
Ning Leng, Morgridge Institute for Research. Oscope: a statistical pipeline for identifying oscillatory genes in unsynchronized single cell RNA-seq experiments.
Oscillatory gene expression is fundamental to mammalian development, but technologies to monitor expression oscillations are limited. We have developed a statistical approach called Oscope to identify and characterize the transcriptional dynamics of oscillating genes in single-cell RNA-seq data from an unsynchronized cell population. Applications to a number of data sets demonstrate the utility of the approach and also identify a potential artifact in the Fluidigm C1 platform.
Sebastiano Battaglia, Roswell Park Cancer Institute. Genomic profile of nuclear receptors-mediated transcription in prostate cancer.
Prostate cancer (PCa) is the most commonly diagnosed cancer in the US and the second most common cause of cancer death. Initial therapeutic approaches aim to reduce tumor burden through androgen deprivation therapy (ADT), but about half of patients will recur, developing castration recurrent PCa (CRPCa), which is often clinically lethal. The androgen receptor (AR) is a key mediator of PCa growth in physiological and malignant conditions, and AR inhibition is proven to be effective as a first line of therapy for PCa patients; however, it is insufficient for patients with CRPCa. The vitamin D receptor (VDR) mediates the intracellular effects of 1,25(OH)2D3 (vitamin D), and numerous studies have evaluated the in vitro and in vivo effects of vitamin D; however, the antitumor effect of vitamin D is lost in advanced cancers. The retinoic acid receptors (RARs) modulate cell differentiation and proliferation, and retinoid metabolism is altered in PCa. AR, VDR and RARs belong to the nuclear receptor (NR) superfamily, and their activity is strictly regulated by coregulatory complexes. The lysine specific demethylase 1A (LSD1) is a transcriptional regulator whose expression correlates with cancer aggressiveness, and LSD1 inhibition has been proposed as an antineoplastic approach in clinical settings. Here we demonstrate the effects of a novel combinatorial therapy targeting LSD1 and multiple NRs. Furthermore, by integrating RNASeq and ChIPSeq data, we describe the effects of LSD1 as a dual coregulator for AR, VDR and RARs and show that LSD1 modulates unique NR-activated transcriptional pathways, including MTORC1 and MYC signaling networks. We conclude that LSD1 is a locus specific coactivator and corepressor and that, by modulating key oncogenic pathways, it regulates cancer progression and therapeutic response.
Mark Dane, Oregon Health and Science University. Computational Pipelines for High Content Screening of Microenvironment Microarrays.
The OHSU MEP-LINCS project is developing a dataset and computational strategy to elucidate how signals from the microenvironment affect observable cellular phenotypes and intracellular transcriptional and proteomic networks. We are using microenvironment microarrays (MEMAs) and immunofluorescence imaging to capture phenotypes of cells grown on pairwise combinations of extracellular matrix proteins and ligands. We have developed two computational pipelines that create fully annotated datasets from the pilot MEP-LINCS high throughput screening experiments. The first pipeline processes population-level data that can be acquired and analyzed within an hour of completing an experiment. This QA pipeline provides feedback on experiment processes immediately after they occur.
The cell-level pipeline is designed to handle high-content imaging data in an automated fashion. The essential tasks are to pre-process, normalize, and assess the quality of the image feature data. Finally, we have developed computational approaches that prioritize microenvironment perturbagens (MEPs) for follow-on validation experiments. The pipeline is based on open source R packages and will contribute a MEMA package for public use. The cell-level dataset and accompanying analysis will be available for public download through the LINCS consortium.
Dr. Yuriy Gusev, Georgetown University. Chromosome instability index CIN package for Bioconductor.
The CIN package calculates the chromosome instability (CIN) index, which quantitatively characterizes genome-wide copy number alterations as a measure of chromosome instability. The algorithm for this method will be described in a paper (in preparation).
Genomic instability is known to be a fundamental trait in the development of tumors, and most human tumors exhibit this instability in structural and numerical alterations: deletions, amplifications, inversions or even losses and gains of whole chromosomes or chromosome arms. The chromosome instability indicated by these copy number alterations is associated with various events in the development or the severity of tumors in terms of clinical outcome.
To mathematically and quantitatively describe these alterations we first locate their genomic positions and measure their ranges. Such algorithms are referred to as “segmentation algorithms”.
The CIN module accepts these segmentation results and calculates the genomic instability across a chromosome for a global view (referred to as “Chromosome CIN”, or “Standard or Regular CIN”), and the genomic instability across cytoband regions for higher resolution (referred to as “Cytobands CIN”). This makes it possible to assess the impact of copy number alterations on various biological events or clinical outcomes by studying the association of CIN indices with those events.
The CIN Bioconductor package allows the automated processing of experimental copy number data generated by Affymetrix SNP 6.0 arrays or similar high throughput technologies. An older version of this algorithm that shows overall instability has been integrated into the G-DOC web portal and made available for users as part of the G-DOC Plus analysis tools here. The CIN Bioconductor package calculates not only overall instability, but also gains and losses at the chromosome and cytoband level.
Allison Thompson, Pacific Northwest National Laboratory. Signatures of Climate Change: Implementation of High-Throughput Sequence Data Analysis Methods.
Abstract: Analyzing high-throughput sequencing data is particularly challenging due to the nature of the data collected. Before analysis can be done across samples, normalization must be performed to account for differences in sequencing depths and consequently make samples comparable. Additionally, traditional statistical tests for differential expression are not appropriate with the count data that is produced by sequencing. While numerous normalization and differential abundance algorithms for sequencing data exist, there is no clear optimal method in either case. This research focuses on comparing and contrasting normalization and differential testing methods and the resulting biological inferences. Here we focus on the methods of DESeq and edgeR for pairwise tests and use big data tools developed by the Tessera project to do exploratory data analysis and examine cases where the two tests differ.
Andrew J. Bass, Princeton University. superSeq: Assessing the limits of sequencing depth through read subsampling.
RNA-Seq is a standard gene expression profiling technology for differential expression analysis. In RNA-Seq studies, the read depth strongly affects the power of the test statistics, such that larger read depths induce higher statistical power. After a certain read depth, the power of the test statistics begins to asymptote, at which point there are only marginal improvements in power. Although existing methods, such as subSeq, can help determine if the read depth of an experiment is saturated, they are limited in that they do not provide a way of estimating the appropriate read depth for under-saturated experiments. We provide a new method called superSeq that models and estimates the increase in statistical power that would result from increasing the read depth for a given experiment. We then apply the superSeq framework to 38 RNA-Seq experiments in the Expression Atlas. In the majority of the studies, the method accurately predicts the relationship between the power of the test statistics and the read depth. Researchers can thus use this method, implemented in the forthcoming R package superSeq, to determine the appropriate read depth for a completed experiment in order to maximize the statistical power.
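Read subsampling, the idea this abstract builds on, can be illustrated with a short Python sketch. It is not the superSeq model or the subSeq API: it only shows one simple way to thin a count vector, keeping each read independently with a chosen probability, and the gene counts and proportions are made up for this example.

```python
import random

def subsample_counts(counts, proportion, rng):
    """Binomial thinning: keep each read independently with probability `proportion`."""
    return [sum(1 for _ in range(c) if rng.random() < proportion) for c in counts]

rng = random.Random(2015)            # fixed seed so a rerun gives the same draws
gene_counts = [120, 45, 0, 300, 8]   # hypothetical per-gene read counts

# Simulated sequencing depth at several subsampling proportions.
depth_by_proportion = {}
for proportion in (0.1, 0.5, 1.0):
    thinned = subsample_counts(gene_counts, proportion, rng)
    depth_by_proportion[proportion] = sum(thinned)

print(depth_by_proportion)
```

Rerunning a differential expression analysis on each thinned data set, and recording how many genes are called significant, gives an empirical picture of how discoveries grow with depth.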
Jianhong Ou, Scot A. Wolfe, Michael H. Brodsky and Lihua J. Zhu, University of Massachusetts. motifStack: A Tool to Visualize Sequence Logo Alignments.
Sequence motifs represent conserved characteristics within sets of related aligned sequences such as DNA binding sites for transcription factors (TFs) or amino acid patterns within related protein domains. To explore the functional and evolutionary relationships between proteins with related functions, tools are required to describe patterns and relationships within large collections of sequence motifs. We developed a flexible open-source R/Bioconductor package, motifStack, to display and annotate aligned sequence motifs. We use motifStack to visualize and compare three collections of motifs describing the DNA binding specificities of homeodomain transcription factors. We find that differences in the experimental and computational methods used to generate motifs can have a greater influence on motif alignments than differences in protein sequence. A comparison of motifs determined by different methods reveals an expected substantial overlap in motifs from mammalian and fly homeodomains. However, we also use motifStack to identify examples of species-specific homeodomain binding specificities.
Sarah Sheppard, Jianhong Ou, Nathan Lawson and Lihua Julie Zhu, University of Massachusetts. cleanUpdTSeq: Application of a naïve Bayes classifier to accurately assign polyadenylation sites from 3′ end deep sequencing data.
3′ end processing is important for transcription termination, mRNA stability, and regulation of gene expression. To identify 3′ ends, most techniques utilize an oligo-dT primer to construct deep sequencing libraries. However, this approach can lead to identification of artifactual polyadenylation sites due to internal priming in homopolymeric stretches of adenines. By analyzing sequence features flanking 3′ ends derived from oligo-dT based sequencing, we developed a naïve Bayes classifier, implemented as the cleanUpdTSeq package, to classify them as true or false/internally primed. The resulting algorithm is highly accurate and facilitates identification of novel polyadenylation sites.
Cunye Qiao, Health Canada. Identify potential critical windows of fetal development during the 8-19 week time frame.
Bioconductor is built on the free and open exchange of scientific ideas, and the contributions of our diverse user community. In this spirit, BioC 2015 is dedicated to providing a harassment-free conference experience for everyone. Harassment of any form (verbal, physical, sexual, or other) will not be tolerated in talks, workshops, poster sessions, social activities, or online.
Space is designated in front of the Thomas, Fairview and Yale buildings and in the Arnold building garage for a limited number of visitors on a first-come, first-served basis. Free visitor parking is limited to two hours. Street parking and paid lots are nearby. See the Fred Hutch visitor’s website for more details.
Hotels near the conference site (group rates no longer available).