1 Introduction

The VCF Tool Box (TVTB) offers S4 classes and methods to filter, summarise and visualise genetic variation data stored in VCF files pre-processed by the Ensembl Variant Effect Predictor (VEP) (McLaren et al. 2010). An RStudio/Shiny web-application, the Shiny Variant Explorer (tSVE), provides a convenient interface to demonstrate those functionalities integrated in a programming-free environment.

Currently, major functionalities in the TVTB package include:

A class to store recurrent parameters of genetic analyses

  • List of reference homozygote, heterozygote, and alternate homozygote genotype encodings
  • Key of the INFO field where Ensembl VEP predictions are stored in the VCF file
  • Suffix of the INFO fields where calculated data will be stored
  • List of genomic ranges to analyse and visualise
  • Parameters for parallel calculations (using BiocParallel)

Genotype counts and allele frequencies

  • Calculated from the data of an ExpandedVCF objects (i.e. bi-allelic records)
  • Stored in INFO fields defined by the above suffixes
    • overall counts and frequencies (i.e. across all samples)
    • counts and frequencies within level(s) of phenotype(s)

Classes of VCF filter rules

  • Filter rules applicable to the fixed slot of an VCF object
  • Filter rules applicable to the info slot of an VCF object
  • Filter rules applicable to Ensembl VEP predictions stored in a given INFO field
  • A container for combinations of filter rules listed above
  • Subset VCF objects using the above filter rules

2 Installation

The VCF Tool Box can be installed using the following code:

source("http://bioconductor.org/biocLite.R")
biocLite("TVTB")

Once installed, the package can be loaded and attached as follows:

library(TVTB)

3 Recurrent settings: TVTBparam

Most functionalities in TVTB require recurrent information such as:

  • Genotype encoding (homozygote reference, heterozygote, homozygote alternate),
  • INFO key that contains the Ensembl VEP predictions in the VCF file,
  • List of genomic ranges within which data must be summarised or visualised
    • affecting the genomic ranges to import from the VCF file
  • Suffixes of INFO keys where counts and frequencies of genotypes must be stored:
    • Counts and frequencies may be calculated for individual levels of selected phenotypes, in which case the data will be stored under INFO keys formed as <phenotype>_<level>_<suffix>,
    • Counts and frequencies across all samples are stored in INFO keys simply named <suffix>.
  • Settings for parallel calculations.

To reduce the burden of repetition during programming, and to facilitate analyses using consistent sets of parameters, TVTB implements the TVTBparam class. The TVTBparam class offer a container for parameters recurrently used across the package. A TVTBparam object may be initialised as follows:

tparam <- TVTBparam(Genotypes(
    ref = "0|0",
    het = c("0|1", "1|0", "0|2", "2|0", "1|2", "2|1"),
    alt = c("1|1", "2|2")),
    ranges = GenomicRanges::GRangesList(
        SLC24A5 = GenomicRanges::GRanges(
            seqnames = "15",
            IRanges::IRanges(
                start = 48413170, end = 48434757)
            )
        )
    )

TVTBparam objects have a convenient summary view and accessor methods:

tparam
## class: TVTBparam
## @genos: class: Genotypes
##     @ref (hom. ref.): "REF" {0|0}
##     @het (heter.): "HET" {0|1, 1|0, 0|2, 2|0, 1|2, 2|1}
##     @alt (hom. alt.): "ALT" {1|1, 2|2}
##   @ranges: 1 GRanges on 1 sequence(s)
##   @aaf (alt. allele freq.): "AAF"
##   @maf (minor allele freq.): "MAF"
##   @vep (Ensembl VEP key): "CSQ"
##   @svp: <ScanVcfParam object>
##   @bp: <SerialParam object>

In this example:

  • Genotypes
    • Accessor: genos(x)
    • Class: Genotypes
    • Homozygote reference genotype is encoded "0|0".
    • Counts of reference genotypes are stored in INFO keys suffixed with "REF".
    • Heterozygote genotypes are encoded as "0|1", "1|0", "0|2", "2|0", "1|2", and "2|1".
    • Counts of heterozygote genotypes are stored in INFO keys suffixed with "HET".
    • Homozygote alternate genotype is encoded "1|1".
    • Counts of alternate genotypes are stored in INFO keys suffixed with "ALT".
  • Genomic ranges
    • Accessor: ranges(x)
    • Class: GRangesList
    • A gene-coding region on chromosome "15".
  • Alternate allele frequency
    • Accessor: aaf(x)
    • Calculated values will be stored under INFO keys suffixed with "AAF".
  • Minor allele frequency
    • Accessor: maf(x)
    • Calculated values will be stored under INFO keys suffixed with "MAF".
  • Ensembl VEP predictions
    • Accessor: vep(x)
    • Information will be imported from the INFO field "CSQ".
  • Parallel calculations
    • Accessor: bp(x)
    • Class: BiocParallelParam
    • Serial evaluation (i.e. do not run parallel code)
  • VCF scan parameters
    • Accessor: svp(x)
    • Class: ScanVcfParam
    • which slot automatically populated with reduce(unlist(ranges(x)))

Default values are provided for all slots except genotypes, as these may vary more frequently from one data set to another (e.g. phased, unphased, imputed).

4 Data import

4.1 Genetic variants

Functionalities in TVTB support CollapsedVCF and ExpandedVCF objects (both extending the virtual class VCF) of the VariantAnnotation package.

Typically, CollapsedVCF objects are produced by the VariantAnnotation readVcf method after parsing a VCF file, and ExpandedVCF objects result of the VariantAnnotation expand method applied to a CollapsedVCF object.

Any information that users deem relevant for the analysis may be imported from VCF files and stored in VCF objects passed to TVTB methods. However, to enable the key functionalities of the package, the slots of a VCF object should include at least the following information:

  • fixed(x)
    • fields "REF" and "ALT".
  • info(x)
    • field <vep>: where <vep> stands for the INFO key where the Ensembl VEP predictions are stored in the VCF object.
  • geno(x)
    • GT: genotypes.
  • colData(x): phenotypes.

4.2 List of genomic ranges

In the near future, TVTB functionalities are expected to produce summary statistics and plots faceted by meta-features, each potentially composed of multiple genomic ranges.

For instance, burden tests may be performed on a set of transcripts, considering only variants in their respective sets of exons. The GenomicRanges GRangesList class is an ideal container in example, as each GRanges in the GRangesList would represent a transcript, and each element in the GRanges would represent an exon.

Furthermore, TVTBparam objects may be supplied as the param argument of the VariantAnnotation readVcf. In this case, the TVTBparam object is used to import only variants overlapping the relevant genomic regions. Moreover, the readVcf method also ensured that the vep slot of the TVTBparam object is present in the header of the VCF file.

svp <- as(tparam, "ScanVcfParam")
svp
## class: ScanVcfParam 
## vcfWhich: 1 elements
## vcfFixed: character() [All] 
## vcfInfo:  
## vcfGeno:  
## vcfSamples:

4.3 Phenotypes

Although VCF objects may be constructed without attached phenotype data, phenotype information is critical to interpret and compare genetic variants between groups of samples (e.g. burden of damaging variants in different phenotype levels).

VCF objects accept phenotype information (as S4Vectors DataFrame) in the colData slot. This practice has the key advantage of keeping phenotype and genetic information synchronised through operation such as subsetting and re-ordering, limiting workspace entropy and confusion.

4.4 Example

An ExpandedVCF object that contains the minimal data necessary for the rest of the vignette can be created as follows:

Step 1: Import phenotypes

phenoFile <- system.file(
    "extdata", "integrated_samples.txt", package = "