Package: AnnotationFilter
Authors: Martin Morgan [aut], Johannes Rainer [aut], Bioconductor Maintainer [cre]
Last modified: 2017-04-24 16:35:20
Compiled: Mon Apr 24 21:05:12 2017

1 Introduction

A large variety of annotation resources is available in Bioconductor. Accessing the full content of these databases, or even of single tables, is computationally expensive and often unnecessary, as users may want to extract only subsets of the data, e.g. the genomic coordinates of a single gene. Filtering annotation resources before data extraction therefore has a major impact on performance and increases the usability of such genome-scale databases.

The AnnotationFilter package was thus developed to provide basic filter classes that enable a common filtering framework for Bioconductor annotation resources. AnnotationFilter defines filter classes for some of the most commonly used features in annotation databases, such as symbol or genename. Each filter class is intended to work on a single database table column and to facilitate filtering on the provided values. Such filter classes enable the user to build complex queries to retrieve specific annotations without needing to know column or table names or the layout of the underlying databases. Although initially developed for use in the Organism.dplyr and ensembldb packages, the filter classes and the related filtering concept can easily be added to other annotation packages too.

2 Filter classes

All filter classes extend the basic AnnotationFilter class and take one or more values and a condition to allow filtering on a single database table column. Based on the type of the input value, filter classes are divided into:

  • CharacterFilter: takes a character value of length >= 1 and supports the conditions ==, !=, startsWith and endsWith. An example would be a GeneIdFilter, which allows filtering on gene IDs.

  • IntegerFilter: takes a single integer as input and supports the conditions ==, !=, >, <, >= and <=. An example would be a GeneStartFilter that filters results on the (chromosomal) start coordinates of genes.

  • GRangesFilter: is a special filter, as it takes a GRanges as value and performs the filtering on a combination of columns (i.e. start and end coordinates as well as sequence name and strand). To be consistent with the findOverlaps method from the IRanges package, the GRangesFilter constructor takes a type argument to define its condition. Supported values are "any" (the default), which retrieves all entries overlapping the GRanges; "start" and "end", which match all features with the same start or end coordinate, respectively; "within", which matches all features that lie within the range defined by the GRanges; and "equal", which returns features that are equal to the GRanges.
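
As a minimal sketch of the three filter types listed above, the code below creates one filter of each kind. The gene IDs and genomic coordinates are illustrative values only and are not taken from this document; the GenomicRanges package is assumed to be installed to build the GRanges.

```r
library(AnnotationFilter)
library(GenomicRanges)

# Character filter: one or more values, here with the default "==" condition.
# The two IDs are illustrative example values.
gid <- GeneIdFilter(c("ENSG00000171791", "ENSG00000153094"))

# Integer filter: a single integer value combined with a comparison condition.
gst <- GeneStartFilter(1000000L, condition = ">")

# GRanges filter: the condition is set through the `type` argument.
# Sequence name and coordinates are illustrative.
rng <- GRanges("18", IRanges(start = 63100000, end = 63400000))
grf <- GRangesFilter(rng, type = "within")

gid
gst
grf
```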

The names of the filter classes are intuitive: the database column name is converted to camel case (the first character and each character following a _ are capitalized, and the _ is dropped) and followed by the keyword Filter. A filter for a database table column gene_id is thus called GeneIdFilter. The default database column for a filter is stored in its field slot (accessible via the field method).
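
The naming convention can be illustrated with a few lines of base R. The helper filterClassName below is hypothetical and not part of AnnotationFilter; it only mimics the rule described above and does not cover GRangesFilter, which filters on a combination of columns.

```r
# Hypothetical helper (not part of AnnotationFilter): derive the filter class
# name from a database column name, following the naming convention.
filterClassName <- function(column) {
    # Capitalize the first letter and every letter that follows "_",
    # dropping the "_" itself.
    camel <- gsub("(^|_)([a-z])", "\\U\\2", column, perl = TRUE)
    paste0(camel, "Filter")
}

filterClassName(c("gene_id", "tx_biotype", "symbol"))
```

For these three column names the helper returns names that also appear in the supportedFilters listing (GeneIdFilter, TxBiotypeFilter and SymbolFilter).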

The supportedFilters method can be used to get an overview of all available filter objects defined in AnnotationFilter.

library(AnnotationFilter)
supportedFilters()
##  [1] "CdsEndFilter"      "CdsStartFilter"    "EntrezFilter"     
##  [4] "ExonEndFilter"     "ExonIdFilter"      "ExonNameFilter"   
##  [7] "ExonRankFilter"    "ExonStartFilter"   "GRangesFilter"    
## [10] "GeneBiotypeFilter" "GeneEndFilter"     "GeneIdFilter"     
## [13] "GeneStartFilter"   "GenenameFilter"    "ProteinIdFilter"  
## [16] "SeqNameFilter"     "SeqStrandFilter"   "SymbolFilter"     
## [19] "TxBiotypeFilter"   "TxEndFilter"       "TxIdFilter"       
## [22] "TxNameFilter"      "TxStartFilter"     "UniprotFilter"

Note that the AnnotationFilter package provides only the filter classes, not the functionality to apply the filtering. Such functionality depends on the annotation resource and its database layout and thus needs to be implemented in the packages that provide access to the annotation resources.
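
To give an idea of what such an implementation could look like, the sketch below applies a single character or integer filter to a plain data.frame. The function applyFilter and the toy table are assumptions made for illustration; neither is part of AnnotationFilter, and real annotation packages typically translate filters into database queries instead.

```r
library(AnnotationFilter)

# Toy annotation table with made-up rows. The column names match the default
# fields of SymbolFilter ("symbol") and GeneStartFilter ("gene_start").
genes <- data.frame(
    symbol = c("BCL2", "BCL2L11", "TP53"),
    gene_start = c(100L, 200L, 300L),
    stringsAsFactors = FALSE
)

# Hypothetical helper, NOT part of AnnotationFilter: keep the rows of `df`
# that match a single character or integer filter.
applyFilter <- function(df, flt) {
    col <- df[[field(flt)]]   # column the filter works on
    val <- value(flt)         # value(s) to compare against
    keep <- switch(
        condition(flt),
        "==" = col %in% val,
        "!=" = !(col %in% val),
        "startsWith" = startsWith(col, val),
        "endsWith" = endsWith(col, val),
        ">" = col > val,
        "<" = col < val,
        ">=" = col >= val,
        "<=" = col <= val,
        stop("unsupported condition: ", condition(flt))
    )
    df[keep, , drop = FALSE]
}

applyFilter(genes, SymbolFilter("BCL2", condition = "startsWith"))
applyFilter(genes, GeneStartFilter(150L, condition = ">"))
```

With the toy table above, the first call keeps the BCL2 and BCL2L11 rows and the second keeps the rows whose gene_start is larger than 150.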

3 Usage

Filters are created via their dedicated constructor functions, such as the GeneIdFilter function for the GeneIdFilter class. Because creating a filter is simple and cheap, filter objects are intended to be read-only and thus don’t provide setter methods to change their slot values. In addition to the constructor functions, AnnotationFilter provides functionality to translate query expressions into filter objects (see further below for an example).

Below we create a SymbolFilter that could be used to filter an annotation resource to retrieve all entries associated with the specified symbol value(s).

library(AnnotationFilter)

smbl <- SymbolFilter("BCL2")
smbl
## class: SymbolFilter 
## condition: == 
## value: BCL2

Such a filter is intended to retrieve all entries associated with features whose value in a database table column called symbol matches the filter’s value "BCL2".

Using the "startsWith" condition, we could define a filter to retrieve all entries for genes with a gene name/symbol starting with the specified value (e.g. "BCL2" and "BCL2L11" for the example below).

smbl <- SymbolFilter("BCL2", condition = "startsWith")
smbl
## class: SymbolFilter 
## condition: startsWith 
## value: BCL2
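
Because filters have no setter methods, changing a condition means constructing a new object. The short sketch below reads the content of a filter with the field, condition and value accessors and reuses the value to build a second filter with a different condition.

```r
library(AnnotationFilter)

smbl <- SymbolFilter("BCL2")

# Read-only access to the filter's content
field(smbl)
condition(smbl)
value(smbl)

# There is no setter for the condition: build a new filter that reuses
# the value of the existing one.
smbl_start <- SymbolFilter(value(smbl), condition = "startsWith")
condition(smbl_start)
```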

In addition to the constructor functions, AnnotationFilter provides functionality to create filter instances in a more natural and intuitive way by translating filter expressions (written as a formula, i.e. starting with a ~).

smbl <- AnnotationFilter(~ symbol == "BCL2")
smbl
## class: SymbolFilter 
## condition: == 
## value: BCL2

Individual AnnotationFilter objects can be combined in an AnnotationFilterList. This class extends list and provides an additional logOp slot that defines how its individual filters are to be combined. The length of logOp has to be 1 less than the number of filter objects, and each element in logOp defines how two consecutive filters should be combined. Below we create an AnnotationFilterList containing two filter objects to be combined with a logical AND.

flt <- AnnotationFilter(~ symbol == "BCL2" &
                            tx_biotype == "protein_coding")
flt
## class: AnnotationFilterList 
## length: 2
## filters:
## 
## class: SymbolFilter 
## condition: == 
## value: BCL2 
## 
##  & 
## 
## class: TxBiotypeFilter 
## condition: == 
## value: protein_coding
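
Since AnnotationFilterList extends list, its elements can be inspected with the usual list operations. The sketch below assumes only that length and [[ behave as they do for a regular list; the way the logical operators are accessed is not shown here.

```r
library(AnnotationFilter)

flt <- AnnotationFilter(~ symbol == "BCL2" &
                            tx_biotype == "protein_coding")

# Number of filters in the list
length(flt)

# Individual filters are extracted like regular list elements
first <- flt[[1]]
second <- flt[[2]]

field(first)
value(second)
```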

Note that the AnnotationFilter function does not (yet) support translation of nested