1 Motivation

1.1 Background

VCF objects of the VariantAnnotation package contain a plethora of information imported from specific fields of source VCF files and stored in dedicated slots (e.g. fixed, info, geno), as well as optional Ensembl VEP predictions (McLaren et al. 2010) stored under a given key of their INFO slot.

This information may be used to identify and filter variants of interest for further analysis. However, the size of genetic data sets and the variety of filter rules—and their combinatorial explosion—create considerable challenges in terms of workspace memory and entropy (i.e. size and number of objects in the workspace, respectively).

The FilterRules class implemented in the S4Vectors package provides a powerful tool to create flexible and lightweight filter rules defined in the form of expression and function objects that can be evaluated within given environments. The TVTB package extends this FilterRules class into novel classes of VCF filter rules, applicable to information stored in the distinct slots of VCF objects (i.e. CollapsedVCF and ExpandedVCF classes), as described below:

Motivation for each of the new classes extending FilterRules, to define VCF filter rules
Class           Motivation
VcfFixedRules   Filter rules applied to the fixed slot of a VCF object.
VcfInfoRules    Filter rules applied to the info slot of a VCF object.
VcfVepRules     Filter rules applied to the Ensembl VEP predictions stored in a given INFO key of a VCF object.
VcfFilterRules  Combination of VcfFixedRules, VcfInfoRules, and VcfVepRules applicable to a VCF object.

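To illustrate the parent class on its own, the following minimal sketch defines one rule as an expression and one as a function, then evaluates both against a small table. The table `toy` and its values are hypothetical stand-ins for variant annotations and are not part of the demonstration data; only `FilterRules`, `eval`, and `evalSeparately` from S4Vectors are used.

```r
library(S4Vectors)

# Hypothetical toy table standing in for variant annotations
toy <- DataFrame(
    FILTER = c("PASS", "q10", "PASS", "PASS"),
    QUAL = c(50, 30, 10, 25)
)

# One rule given as an expression, the other as a function of the table
rules <- FilterRules(list(
    pass = expression(FILTER == "PASS"),
    qual20 = function(x) x$QUAL >= 20
))

# Combined result: one logical value per row (AND of all active rules)
keep <- eval(rules, toy)
keep

# Per-rule results: one logical column per rule
perRule <- evalSeparately(rules, toy)
perRule
```

The rules themselves hold no data, which is what keeps them lightweight: the same `rules` object can be evaluated against any table that provides the `FILTER` and `QUAL` columns.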

Note that FilterRules objects themselves are applicable to VCF objects, with two important differences from the above specialised classes:

  • Expressions must explicitly refer to the different VCF slots
  • As a consequence, a single expression can refer to fields from different VCF slots, for instance:
fr <- S4Vectors::FilterRules(list(
    mixed = function(x){
        VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" &
            VariantAnnotation::info(x)[,"MAF"] >= 0.05
    }
))
fr
## FilterRules of length 1
## names(1): mixed

1.2 Features

As they inherit from the FilterRules class, these new classes benefit from accessors and methods defined for their parent class, including:

  • VCF filter rules can be toggled individually between active and inactive states
  • VCF filter rules can be subsetted, edited, replaced, and deleted
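As a minimal sketch of these inherited features, the code below toggles and subsets rules in a plain FilterRules object. The table `toy` is a hypothetical stand-in for variant annotations; the specialised VCF classes are assumed to behave in the same way through inheritance.

```r
library(S4Vectors)

# Hypothetical toy table standing in for variant annotations
toy <- DataFrame(
    FILTER = c("PASS", "q10", "PASS", "PASS"),
    QUAL = c(50, 30, 10, 25)
)

rules <- FilterRules(list(
    pass = expression(FILTER == "PASS"),
    qual20 = expression(QUAL >= 20)
))

# All rules are active by default
active(rules)

# Deactivate a single rule by name; it is then skipped at evaluation
active(rules)["qual20"] <- FALSE
keepPassOnly <- eval(rules, toy)

# Reactivate it
active(rules)["qual20"] <- TRUE
keepBoth <- eval(rules, toy)

# Subset by name to obtain a smaller set of rules
passOnly <- rules["pass"]
names(passOnly)
```

Deactivating a rule leaves it stored in the object, so it can be switched back on later without being redefined.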

To account for the more complex structure of VCF objects, some of the new VCF filter rules classes implemented in the TVTB package require additional information stored in new dedicated slots, associated with the appropriate accessors and setters. For instance:

  • VcfVepRules require the INFO key where predictions of the Ensembl Variant Effect Predictor are stored in a VCF object. The vep accessor method may be used to access this slot.
  • VcfFilterRules, which may combine any number of filter rules stored in FilterRules, VcfFixedRules, VcfInfoRules, VcfVepRules, and other VcfFilterRules objects, mark each filter rule with its type in the combined object. This information is stored in the type slot, which may be accessed using the read-only accessor method type.
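The two accessors described above can be sketched as follows. The rule names and expressions are illustrative only, and the constructor arguments shown (`exprs`, `vep`) are assumed from the TVTB documentation; the package manual pages remain the reference for the exact signatures.

```r
library(TVTB)

# A rule on the fixed slot (illustrative expression)
fixedR <- VcfFixedRules(exprs = list(
    pass = expression(FILTER == "PASS")
))

# A rule on the VEP predictions; 'vep' names the INFO key that holds them
# (assumed here to be "CSQ", as in the demonstration data)
vepR <- VcfVepRules(
    exprs = list(
        missense = expression(Consequence %in% c("missense_variant"))
    ),
    vep = "CSQ"
)

# INFO key recorded in the VcfVepRules object
vep(vepR)

# Combine both sets of rules; each rule is marked with its type
vcfRules <- VcfFilterRules(fixedR, vepR)
type(vcfRules)
```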

2 Demonstration data

For the purpose of demonstrating the utility and usage of VCF filter rules, a set of variants and associated phenotype information was obtained from the 1000 Genomes Project Phase 3 release. It can be imported as a CollapsedVCF object using the following code:

library(TVTB)
extdata <- system.file("extdata", package = "TVTB")
vcfFile <- file.path(extdata, "chr15.phase3_integrated.vcf.gz")
tabixVcf <- Rsamtools::TabixFile(file = vcfFile)
vcf <- VariantAnnotation::readVcf(file = tabixVcf)

VCF filter rules may be applied to ExpandedVCF objects equally:

evcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)

2.1 CollapsedVCF and ExpandedVCF

As described in the documentation of the VariantAnnotation package, the key difference between CollapsedVCF and ExpandedVCF objects (both of which extend the VCF class) is that multi-allelic records are kept as single records in the former and expanded into bi-allelic records in the latter. In other words (quoting the VariantAnnotation documentation):

“CollapsedVCF objects contains the ALT data as a DNAStringSetList allowing for multiple alleles per variant. In contrast, the ExpandedVCF stores the ALT data as a DNAStringSet where the ALT column has been expanded to create a flat form of the data with one row per variant-allele combination.”

This difference has implications for filter rules using the "ALT" field of the fixed slot, as demonstrated in a later section.
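The difference in how ALT is stored can be checked directly. The sketch below uses the small example file shipped with the VariantAnnotation package rather than the demonstration data, so that it runs on its own; the object names `exVcf` and `exEvcf` are hypothetical.

```r
library(VariantAnnotation)

# Example VCF file distributed with VariantAnnotation
exFile <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
exVcf <- readVcf(exFile, genome = "hg19")   # CollapsedVCF
exEvcf <- expand(exVcf)                     # ExpandedVCF

# ALT is a list-like object in the collapsed form...
class(alt(exVcf))

# ...and a flat object, one allele per row, in the expanded form
class(alt(exEvcf))

# Expansion can only add rows, never remove them
c(collapsed = nrow(exVcf), expanded = nrow(exEvcf))
```

A rule that compares ALT to a single value therefore needs to account for the list-like structure when applied to a CollapsedVCF object.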

3 Fields available for the definition of filter rules

First, let us examine which fields (i.e. column names) are available in the VCF objects to create VCF filter rules:

fixedVcf <- colnames(VariantAnnotation::fixed(vcf))
fixedVcf
## [1] "REF"    "ALT"    "QUAL"   "FILTER"
infoVcf <- colnames(VariantAnnotation::info(vcf))
infoVcf
##  [1] "CIEND"         "CIPOS"         "CS"            "END"          
##  [5] "IMPRECISE"     "MC"            "MEINFO"        "MEND"         
##  [9] "MLEN"          "MSTART"        "SVLEN"         "SVTYPE"       
## [13] "TSD"           "AC"            "AF"            "NS"           
## [17] "AN"            "EAS_AF"        "EUR_AF"        "AFR_AF"       
## [21] "AMR_AF"        "SAS_AF"        "DP"            "AA"           
## [25] "VT"            "EX_TARGET"     "MULTI_ALLELIC" "CSQ"
csq <- ensemblVEP::parseCSQToGRanges(x = evcf)
vepVcf <- colnames(S4Vectors::mcols(csq))
vepVcf
##  [1] "Allele"             "Consequence"        "IMPACT"            
##  [4] "SYMBOL"             "Gene"               "Feature_type"      
##  [7] "Feature"            "BIOTYPE"            "EXON"              
## [10] "INTRON"             "HGVSc"              "HGVSp"             
## [13] "cDNA_position"      "CDS_position"       "Protein_position"  
## [16] "Amino_acids"        "Codons"             "Existing_variation"
## [19] "DISTANCE"           "STRAND"             "FLAGS"             
## [22] "VARIANT_CLASS"      "SYMBOL_SOURCE"      "HGNC_ID"           
## [25] "CANONICAL"          "TSL"                "APPRIS"            
## [28] "CCDS"               "ENSP"               "SWISSPROT"         
## [31] "TREMBL"             "UNIPARC"            "GENE_PHENO"        
## [34] "SIFT"               "PolyPhen"           "DOMAINS"           
## [37] "HGVS_OFFSET"        "GMAF"               "AFR_MAF"           
## [40] "AMR_MAF"            "EAS_MAF"            "EUR_MAF"           
## [43] "SAS_MAF"            "AA_MAF"             "EA_MAF"            
## [46] "ExAC_MAF"           "ExAC_Adj_MAF"       "ExAC_AFR_MAF"      
## [49] "ExAC_AMR_MAF"       "ExAC_EAS_MAF"       "ExAC_FIN_MAF"      
## [52] "ExAC_NFE_MAF"       "ExAC_OTH_MAF"       "ExAC_SAS_MAF"      
## [55] "CLIN_SIG"           "SOMATIC"            "PHENO"             
## [58] "PUBMED"             "MOTIF_NAME"         "MOTIF_POS"         
## [61] "HIGH_INF_POS"       "MOTIF_SCORE_CHANGE" "CADD_PHRED"        
## [64] "CADD_RAW"

4 Usage of VCF filter rules

4.1 Filter rules using a single field

The value of a particular field can be used to define expressions that represent simple filter rules based on that value alone. Multiple rules may be stored in any one FilterRules object. Ideally, VCF filter rules should be named, both to facilitate their use and as a reminder of the purpose of each rule. For instance, in the chunk of code below, two filter rules are defined using fields of the fixed slot:

  • A rule named "pass" identifies variants for which the value in the FILTER field is "PASS"
  • A rule named "qual20" identifies variants where the value in the QUAL field is greater than or equal to 20