# Scalable nucleotide tallies with HDF5

## Abstract

With the increased affordability of genomic sequencing technologies we have at our disposal a flood of whole genome/exome datasets from many patients. When using the .bam file format large scale comparative analyses become cumbersome to implement and a lot of time is spend parsing files and collating data. Given that the tally (the table of counts of nucleotides x sample x strand x genomic position) is at the center of many genomics analyses (e.g. variant calling, copy-number estimation, mutation spectrum analysis, etc.) it is crucial to collate this information in an easily accessible format. Here we propose the use of HDF5 as this format and give concrete examples of the internal layout of a tally-HDF5 file and possible applications. We introduce tools for the creation of a tally file from a set of .bam files and tools for interacting with the tally file and performing genomics analyses.

## Motivation

Currently there is a large interest in analyses of cancer genome data across large cohorts, but the current standard file formats are ill-suited to the task. The BAM-file format is too low-level and does not scale well (large file size; slicing by genomic position is usually inefficient) while the VCF format is too high-level with an emphasis on (positive) calls with low FP rate and no control over negative calls and FN rate (just considering every position not mentioned in the VCF as 'no variant' would have have a high FN rate, esp. in the face of subclonality and coverage fluctuations). There is a need for an intermediate format that is scalable, compact and cross-platform accessible.

## The 'tally' data structure

The tally data structure proposed here consists of 4 datasets that are stored for each chromosome (or contig). Those datasets are:

• Counts: A table that contains the number of observed mismatches at any combination of base, sample, strand and genomic position,
• Coverages: A table that contains the number of reads overlapping at any combination of sample, strand and genomic position
• Deletions: A Table that contains the number of observed deletions of bases at any combination of sample, strand and genomic position
• Reference: A Table that contains the reference base against which the calls in the 'Deletions' and 'Counts' table were made.

We outline the basic layout of this set of tables here:

Name Dimensions Datatype
Counts [ #bases, #samples, #strands, #positions ] int
Coverages [ #samples, #strands, #positions ] int
Deletions [ #samples, #strands, #positions ] int
Reference [ #positions ] int

An HDF5 file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:

A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.

Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5 attributes. Convenience functions for extracting the metadata are provided, see examples below.

## Interacting with tally files

In the first step we load the h5vc package and have a look at the example tally file provided with the h5vcData package.

library(h5vc)

## Loading required package: grid
## Loading required package: gridExtra
## Loading required package: ggplot2

tallyFile <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")


Now we can have a look into the tally file. We use the h5ls function from the rhdf5 package.

library(rhdf5)
h5ls(tallyFile)

##               group         name       otype  dclass                   dim
## 0                 / ExampleStudy   H5I_GROUP
## 1     /ExampleStudy           16   H5I_GROUP
## 2  /ExampleStudy/16       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3  /ExampleStudy/16    Coverages H5I_DATASET INTEGER      6 x 2 x 90354753
## 4  /ExampleStudy/16    Deletions H5I_DATASET INTEGER      6 x 2 x 90354753
## 5  /ExampleStudy/16    Reference H5I_DATASET INTEGER              90354753
## 6     /ExampleStudy           22   H5I_GROUP
## 7  /ExampleStudy/22       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8  /ExampleStudy/22    Coverages H5I_DATASET INTEGER      6 x 2 x 51304566
## 9  /ExampleStudy/22    Deletions H5I_DATASET INTEGER      6 x 2 x 51304566
## 10 /ExampleStudy/22    Reference H5I_DATASET INTEGER              51304566


This returns a table of all the objects present within the tally file: /home/biocbuild/bbs-2.14-bioc/R/library/h5vcData/extdata/example.tally.hfs5. This table encodes the tree-like structure shown in Figure 1.

We can extract the separately stored sample data from the tally file by using the getSampleData function:

sampleData <- getSampleData(tallyFile, "/ExampleStudy/16")
print(sampleData)

##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8


## Extracting data from a tally file

We use the h5dapply function as our general purpose accessor to data stored in HDF5 based tally files and provide the following parameters:

• file: The name of the tally file to interact with
• group: The name of the group we want to access, this will always be a string of the form when dealing with tally files
• names: A character vector listing the datasets to be extracted
• blocksize: The size of the blocks in which we want to apply our function
• range: The range of values within the dimension we apply over that we want to process
• FUN: The function to apply to the blocks, always will be passed the parameter data which will contain the data for the current block. Defaults to returning the data as is

### Using h5dapply blank to get the data directly

When not specifying the function to be applied (argument FUN) we get the data returned that would otherwise be supplied to the function as it's first argument (i.e. FUN = function(x) x).

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"),
blocksize = 5000, range = c(2.9e+07, 29010000))
str(data)

## List of 2
##  $2.9e+07:29004999 :List of 2 ## ..$ Coverages   : int [1:6, 1:2, 1:5000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$h5dapplyInfo:List of 4 ## .. ..$ Blockstart: num 2.9e+07
##   .. ..$Blockend : num 2.9e+07 ## .. ..$ Datasets  :'data.frame':    1 obs. of  7 variables:
##   .. .. ..$group : chr "/ExampleStudy/16" ## .. .. ..$ name    : chr "Coverages"
##   .. .. ..$otype : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 ## .. .. ..$ dclass  : chr "INTEGER"
##   .. .. ..$dim : chr "6 x 2 x 90354753" ## .. .. ..$ DimCount: int 3
##   .. .. ..$PosDim : int 3 ## .. ..$ Group     : chr "/ExampleStudy/16"
##  $29005000:29009999:List of 2 ## ..$ Coverages   : int [1:6, 1:2, 1:5000] 24 26 17 12 30 19 25 33 9 9 ...
##   ..$h5dapplyInfo:List of 4 ## .. ..$ Blockstart: num 2.9e+07
##   .. ..$Blockend : num 2.9e+07 ## .. ..$ Datasets  :'data.frame':    1 obs. of  7 variables:
##   .. .. ..$group : chr "/ExampleStudy/16" ## .. .. ..$ name    : chr "Coverages"
##   .. .. ..$otype : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 ## .. .. ..$ dclass  : chr "INTEGER"
##   .. .. ..$dim : chr "6 x 2 x 90354753" ## .. .. ..$ DimCount: int 3
##   .. .. ..$PosDim : int 3 ## .. ..$ Group     : chr "/ExampleStudy/16"


As you can see the data object is a list with one element per block and each of those is a list with one slot for each dataset (only “Coverages” in our example; note the dimensions of the “Coverage” array are [1:6, 1:2, 1:10000], i.e. 6 samples, 2 strands and 10000 position in the current block). Furthermore the slot “h5dapplyInfo” is being attached to the list data and is itself a list containing information about the current block (“Blockstart”,“Blockend”), the Datasets represented here (slot “Datasets”) and the group we are processing right now (“/ExampleStudy/16”). This information can be used within the call to FUN to, e.g. calculate the correct offset for genomic positions by adding the block start etc.

### Example of using h5dapply to extract coverage data

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"),
blocksize = 1000, range = c(2.9e+07, 29010000), FUN = function(x) rowSums(x$Coverages), verbose = TRUE)  ## Processing Block #1: 2.9e+07:29000999 ## Processing Block #2: 29001000:29001999 ## Processing Block #3: 29002000:29002999 ## Processing Block #4: 29003000:29003999 ## Processing Block #5: 29004000:29004999 ## Processing Block #6: 29005000:29005999 ## Processing Block #7: 29006000:29006999 ## Processing Block #8: 29007000:29007999 ## Processing Block #9: 29008000:29008999 ## Processing Block #10: 29009000:29009999  coverages <- do.call(rbind, data) colnames(coverages) <- sampleData$Sample[order(sampleData$Column)] coverages  ## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA ## 2.9e+07:29000999 32315 40117 18158 18998 ## 29001000:29001999 38044 48938 17083 19910 ## 29002000:29002999 43580 58418 22193 21437 ## 29003000:29003999 40858 55464 22724 21332 ## 29004000:29004999 42560 50930 23465 21304 ## 29005000:29005999 44662 60327 22169 21701 ## 29006000:29006999 43503 60565 20120 17943 ## 29007000:29007999 41379 49689 23940 19213 ## 29008000:29008999 41048 52724 21879 19402 ## 29009000:29009999 37825 47701 19167 18208 ## PT8EarlyStageDNA PT8PrimaryDNA ## 2.9e+07:29000999 36377 36113 ## 29001000:29001999 38014 36222 ## 29002000:29002999 49731 43917 ## 29003000:29003999 48404 41585 ## 29004000:29004999 40319 36121 ## 29005000:29005999 44560 43149 ## 29006000:29006999 45475 40463 ## 29007000:29007999 47064 44508 ## 29008000:29008999 48728 39106 ## 29009000:29009999 35987 34718  Here we want to apply the function function(x) rowSums(x$Coverages) to the genomic interval 16:29000000-29010000 in blocks of size 1000 and for each of the blocks we want to only extract the dataset “Coverages” since we only access this dataset in the applied function and extracting the others would be unneccessary additional work.

The return value of h5dapply is a list with one element per block containing the output of the function FUN that will be applied to each block. In the example above we compute the sum of coverages per sample within each block, then use do.call(rbind, data) to merge the results into a matrix and set the column names to the respective sample ids that are specified in the sampleData object we loaded from the tally file.

#### Example of more involved coverage analysis

Let's extract the coverage in 500bp bins from a small region on chromosome 22. We use the binnedCoverage function suplied by h5vc, which is almost identical to the function(x) rowSums(x$Coverages) used above plus some checks and reformatting of the output. data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/22", names = c("Coverages"), blocksize = 500, range = c(38750000, 39250000), FUN = binnedCoverage, sampledata = sampleData) data <- do.call(rbind, data) rownames(data) = NULL head(data)  ## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA PT8EarlyStageDNA ## 1 17276 25603 13203 25869 24315 ## 2 19973 25611 11227 26917 22674 ## 3 22130 25876 11735 23568 22064 ## 4 26193 22309 11056 21408 19888 ## 5 20538 21349 9439 22409 18602 ## 6 23880 26834 14427 25372 23910 ## PT8PrimaryDNA Start End ## 1 7596 38750000 38750499 ## 2 10032 38750500 38750999 ## 3 10086 38751000 38751499 ## 4 13041 38751500 38751999 ## 5 10443 38752000 38752499 ## 6 8864 38752500 38752999  The library size can be estimated from the colSums of the data object. librarySizes <- colSums(data[, 1:6]) sizeFactors <- librarySizes/mean(librarySizes) sizeFactors  ## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA ## 1.1424 1.3641 0.5777 1.2359 ## PT8EarlyStageDNA PT8PrimaryDNA ## 1.1440 0.5359  for (n in names(sizeFactors)) { data[[n]] <- data[[n]]/sizeFactors[n] }  Next we can plot the pairwise log2 ratios of the size factor adjusted coverage in this region. suppressPackageStartupMessages(library(reshape)) suppressPackageStartupMessages(library(ggplot2)) pairWiseRatio <- function(data, sampledata, trans = log2) { sampledata <- subset(sampledata, Sample %in% colnames(data)) ret <- data.frame(Start = data$Start)
for (patient in sampledata$Patient) { control <- subset(sampledata, Patient == patient & Type == "Control")$Sample
if (length(control) != 1) {
stop(paste("Exactly one control sample per patient is expected. Found",
length(control), "in patient", patient))
}
for (case in subset(sampledata, Patient == patient & Type == "Case")\$Sample) {
ret[[paste(case, control, sep = "/")]] <- trans(data[[case]]/data[[control]])
}
}
return(ret)
}
plot_data <- melt(pairWiseRatio(data, sampleData), id.vars = c("Start"))
colnames(plot_data) = c("Position", "Samples", "Ratio")
ggplot(plot_data, aes(x = Position, y = Ratio, color = Samples)) + geom_line() +
facet_wrap(~Samples, ncol = 1)


## Tools for variant calling and visualization

In this part of the vignette we will perform some basic variant calling on a tally file and explore the visualization tools provided by h5vc. We will use the example tally file from the previous section to work on.