posted by Valerie Obenchain, October 2014
The GRCh38 assembly includes both a primary assembly (non-redundant haploid assembly) and alternate sequences (alt loci). Alt loci are provided for regions of the genome where variation prevents representation by a single sequence. These regions are not new but have become more prominent as tools for variant detection have matured.
The previous GRCh37 assembly included patch releases tagged as ‘fix’ or ‘novel’. The ‘fix’ patches were incorporated in the primary assembly of GRCh38 while the ‘novel’ patches were moved into the alt loci units. The ‘multi-sequence’ nature of GRCh38 raises questions about how to best work with these alternate sequences with respect to alignment and downstream analysis.
The samtools library and associated sub-tools play an integral role in the analysis of high-throughput sequencing (HTS) data. htslib is the successor to libbam, which is currently provided by samtools. Specifically, htslib is a C library for handling high-throughput sequencing data; it provides APIs for manipulating SAM, BAM, and CRAM sequence files (similar to but more flexible than the old samtools API) and for manipulating VCF or BCF variant files.
An implementation of htslib is in the works for
Bioconductor and will likely
be released as a stand-alone package. Follow the development on Martin’s
In September Hervé completed the move of non-range based code from
S4Vectors. The virtual
List classes moved as
Hits. Developers using or building on these
classes should now import from
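For package developers, one way to check where a class is now defined is to ask the methods package directly. The sketch below assumes S4Vectors is installed. `List` and `Hits` are named above, while `Rle` is included only as a further illustration and is not mentioned in the text.

```r
library(methods)
library(S4Vectors)

# Classes to look up: List and Hits come from the text; Rle is an extra example.
classes <- c("List", "Hits", "Rle")

# getClass() returns the class definition; its 'package' slot names the
# package in which the class is defined.
defined_in <- vapply(classes,
                     function(cl) getClass(cl)@package,
                     character(1))
print(defined_in)
```

A package that extends these classes would typically declare the import in its NAMESPACE file, for example with an `importClassesFrom()` directive naming the defining package.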
In September the
Bioconductor mailing list was replaced with a fork of
Biostars and renamed the
Bioconductor Support Site. This affects
the bioconductor list only; bioc-devel remains unchanged (email@example.com).
The move was motivated by the volume of list traffic which highlighted the need for advanced searching, tagging and real-time editing of posts. Ideally the new interface will encourage participation from first-time users and simplify topic management.
Marc has imported the last 11+ years of posts to create continuity in the new environment. A FAQ is available to help with site navigation and common tasks such as posting, merging, or tracking topics.
Thanks to Marc and Dan for their work on this.
The BiocStyle package provides a fast and easy approach to styling markdown
documents in Bioconductor fashion. It includes all standard formatting
styles for creating PDF and HTML documents of vignettes, workflows
or other project documents.
version of the package vignette is a demo of the styling and color theme.
The package offers formatting advantages over standard markdown such as automatic centering of figures, improved table display and LaTeX-compatible math symbols. Custom style sheets can be included by passing them to `BiocStyle::markdown`:
BiocStyle::markdown(css.files = c('my.css'))
Bioconductor infrastructure contains a wealth of tools for HTS analysis.
Because these methods and containers exist across a number of packages they can
be difficult for new users to discover and for developers to remember when
adding new functionality.
The import generic in rtracklayer is one such tool.
import reads and parses
large files in formats such as BED, BAM, BigWig, GFF, FASTA, and chain files. The
methods operate on *File objects (e.g.,
BamFile) and param (e.g.,
ScanBamParam) objects, which allow flexible control over the parsing and
subsetting of data. Data returned from
import are parsed into useful
downstream containers such as
import should be the tool of choice when interacting with large files, such as
those available in
AnnotationHub, or when developing new reading/parsing
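As a rough illustration of the workflow described above, the sketch below writes a small toy BED file to a temporary path and reads it back with `import`. The file contents, variable names and region of interest are made up for the example. It assumes rtracklayer is installed and that `import` accepts a `which` range for BED input, as its documentation describes.

```r
library(rtracklayer)  # attaches GenomicRanges as a dependency

# Toy BED records (hypothetical data): chrom, start, end, name, score, strand.
bed_lines <- c(
  "chr1\t100\t200\tfeatA\t0\t+",
  "chr1\t500\t650\tfeatB\t0\t-",
  "chr2\t50\t80\tfeatC\t0\t+"
)
bed_path <- tempfile(fileext = ".bed")
writeLines(bed_lines, bed_path)

# import() picks the parser from the ".bed" extension and returns a GRanges.
# BED starts are 0-based, so they come back shifted to 1-based coordinates.
all_ranges <- import(bed_path)

# Keep only the records that overlap a region of interest.
region <- GRanges("chr1", IRanges(start = 1, end = 300))
chr1_subset <- import(bed_path, which = region)

print(all_ranges)
print(chr1_subset)
```

For BAM input, a ScanBamParam object plays a similar role in controlling which records and fields are read.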
During Developer Day at BioC 2014, Levi Waldron’s discussion of his
biocMultiAssay project generated a good deal of interest in the community. This
effort, led by Levi, Vince, Kasper and Martin, aims to create
tools for the efficient manipulation and analysis of multi-assay omics
The primary motivation is to combine data across multiple experiments for a common group of samples or patients. Goals are to develop classes and methods for the extraction of data subsets defined by indices such as genomic position or gene ID, and to streamline analyses that span multiple genomic data types. The data are high-dimensional assays such as gene and protein expression, copy number, methylation, somatic mutation, or microRNA.
This section of the newsletter highlights the work of an individual or group in the
Bioconductor community. This month we spoke to Janet Young from the Fred
Hutchinson Cancer Research Center. Janet is originally from the UK with an
undergraduate degree in Natural Sciences from the University of Cambridge, and a
PhD in Genetics from University College London. She is currently a Staff
Scientist in the Malik lab in the Basic Sciences Division.
Q: To begin would you tell us a bit about yourself?
I joined Fred Hutch in 2000 and worked in the Trask lab, first as a post-doc and then as a staff scientist. My own research focused on the evolution and transcriptional regulation of mammalian olfactory receptor gene families, but I also helped others with projects to measure genomic copy-number gain and loss in prostate cancer and to measure methylation levels in healthy human tissues. When Barb (Trask) retired I spent time in the Tapscott lab, where we studied how transposable elements might be involved in a form of muscular dystrophy. Currently I provide bioinformatics support to a variety of projects in the Malik lab. The group studies evolutionary biology and genetic conflict, primarily in Drosophila, primates and yeast.
Q: How did you get started with Bioconductor?
I started working with
Bioconductor when helping others in the Trask and
Tapscott labs with various microarray projects. Initially I used
Bioconductor simply for creating diagnostic plots of microarray data, but soon
started using limma and lumi for the analysis steps.
Q: How does Bioconductor fit into your current workflows?
I’m largely using it for analysis of deep sequencing data these days. We use a
variety of upstream software such as TopHat, BWA, and GATK. I use Bioconductor
for things like differential expression analysis, comparing coverage to look for
genomic copy number changes, filtering SNPs, or retrieving and analyzing gene/
annotations. Often I use rtracklayer to export the data for viewing in IGV or
the UCSC genome browser. As well as being a great analysis tool itself,
Bioconductor acts as the glue to help me integrate results from other tools.
Q: Are there any Bioconductor resources you find particularly useful?
The local classes offered at the Hutch were very helpful. I also like the responsive Q and A on the mailing list. All software has bugs; knowing that the bugs get fixed in a timely manner makes you keep using it. The package vignettes are a valuable ‘stand alone’ resource that help get you going with a specific package or task right away.
Thanks for talking with us and sharing your insights.
The Bioconductor project continues to expand globally. Over the next quarter there are course offerings in Japan, Germany, the UK and the US. In August 2014, the Latin American Bioconductor (LAB) foundation held its official inauguration in Ribeirão Preto, Brazil. LAB is a non-profit scientific initiative created to represent Bioconductor and expand its reach in the Latin American research community. It is headed by Benilton Carvalho and Houtan Noushmehr.
Google Analytics reports the following breakdown of returning and new visitors to the website for the period of July 1 to September 28, 2014:
| Visitor type | Count (share of total) |
|---|---|
| Returning Visitor | 179,242 (63.81%) |
| New Visitor | 101,668 (36.19%) |
Overall website traffic by country:
The numbers of distinct IP downloads of
Bioconductor software packages for
July, August and September were 36,900, 36,749, and 36,618, respectively, for a
monthly average of 36,756. A full summary of package download stats is available
Bioconductor materials by topic
Materials from past courses and conferences have long been available on the
Bioconductor web site categorized by conference name and date. At BioC 2014
this year we had several requests for a more refined search of these
materials by topic area or key word.
In response, Sonali and Dan have categorized all 2014 materials and implemented a new keyword search table. The plan is to index all future materials, while materials from years prior to 2014 will remain available in the old format (see ‘Courses by year’ below the search table).
If you are looking for resources to enhance your knowledge of working with
genomic ranges and sequences in
Bioconductor, the following publications may be helpful:
Software for Computing and Annotating Genomic Ranges
This manuscript describes data structures available in the
infrastructure for representing and annotating ranges on the genome. Focus is
GenomicFeatures packages which provide
support for transcript structures, read alignments and coverage vectors.
Scalable Genomics with R and Bioconductor
Strategies for analyzing large genomic data are described and implemented in
Bioconductor. Topics include scalable processing, summarization and
The release of
Bioconductor 3.0 is scheduled for October 14. This version
will continue to use the current version of
R (3.1.1). Visit the website for
help updating packages and for a
look at the
Practical Course on Analysis of High-Throughput Sequencing Data
EBI, Hinxton, UK
October 20-25, 2014
Learning R / Bioconductor for Sequence Analysis
FHCRC, Seattle WA, USA
October 27-29, 2014
BioC Europe 2015
EMBL, Heidelberg, Germany
January 12-15, 2015
Please send comments or questions to Valerie at firstname.lastname@example.org.