1 Introduction

This R package is the reference implementation of the BSmooth algorithm for analyzing whole-genome bisulfite sequencing (WGBS) data. This package does not contain alignment software, which is available from . This package is not intended for analysis of single-loci bisulfite sequencing (typically Sanger bisulfite sequencing or pyro bisulfite sequencing).

The package has been used to analyze capture bisulfite sequencing data. For this type of data, the most useful parts of the package is its data-handling functions. The BSmooth algorithm itself may not be useful for a standard capture bisulfite sequencing experiment, since it relies heavily on smoothing, which again requires that we have measured methylation in bigger regions of the genome.

The BSmooth algorithm is described in detail in (Hansen, Langmead, and Irizarry 2012). It was applied to human cancer data in (Hansen, Timp, et al. 2011) and we have also used it to analyze data from Rhesus Macaque (Tung et al. 2012). Specifically, the algorithm uses smoothing to get reliable semi-local methylation estimates from low-coverage bisulfite sequencing. After smoothing it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs). The smoothing portion could be used even on a single sample, but we believe that variation between individuals is an important aspect of DNA methylation and needs to be accounted for, see also (Hansen, Wu, et al. 2011) for a relevant discussion.

The main focus for our development has been the analysis of CpG methylation in humans, and we have successfully used it in other primates (Tung et al. 2012). It is highly likely that the approach will work in non-human organisms, but care must be taken: the smoothing parameters (which we have selected based on human data) should be evaluated carefully. Furthermore, it may not be suitable at all for organisms with vastly different (from humans) methylation structures.

With respect to non-CpG methylation, the situation is mixed. We have never used the algorithm to analyze non-CpG methylation, but it should be straightforward to do so. However, the data structures used in the current code might not conveniently scale from the 28.2M CpGs in the human genome to the roughly 2x585M Cs (it may well be possible to do an separate analysis for each chromosome). This should be less of an issue for organisms with smaller genomes. We are considering changing these underlying data structures to allow for easier analysis of non-CpG methylation in humans.

1.1 System Requirements

The package requires that all the data is loaded into system memory. By ``data’’ we do not mean the individual reads (which is big for a whole-genome experiment). Instead, what we need are summarized data given us the number of reads supporting methylation as well as the total number of reads at each loci. Focusing on human data, we need to work with objects with a maximum of 28.2 million entries, per sample (since there are roughly 28.2 millions CpGs in the human genome). This provides us with an upper limit on the data object.

Based on this, the 8 samples from (Hansen, Timp, et al. 2011) including the smoothed values, take up roughly 1.2GB of RAM, meaning an analysis can easily be done with 8GB of RAM. In order to improve speed, the package allows for easy parallel processing of samples/chromosomes. This will require multiple copies of the data object for each core used, increasing the memory usage substantially to perhaps even 32GB. This can be avoided by processing the samples sequentially at the cost of speed.

On a 64GB node, the 8 samples from (Hansen, Timp, et al. 2011) takes roughly one hour to process in parallel using 8 cores (using much less than 64GB of RAM in total). This does not including parsing the data from the alignment output files.

1.2 Some terminology

Because not all methylation happens at CpG sites, we have tried to use the term ``methylation loci’’ instead of CpG. We use this term to refer to a single-base methylation site.

Some standard terms from the DNA methylation field: differentially methylated region (DMR), hyper methylation (methylation that is higher in one condition/sample than in another), hypo methylation (methylation that is lower in one condition/sample than in another), and finally differentially methylated position (DMP) referring to a single loci.

1.3 Citation

If you use this package, please cite our BSmooth paper (Hansen, Langmead, and Irizarry 2012).

1.4 Dependencies

library(bsseq)

2 Overview

The package assumes that the following data has been extracted from alignments:

  • genomic positions, including chromosome and location, for methylation loci.
  • a (matrix) of M (Methylation) values, describing the number of read supporting methylation covering a single loci. Each row in this matrix is a methylation loci and each column is a sample.
  • a (matrix) of Cov (Coverage) values, describing the total number of reads covering a single loci. Each row in this matrix is a methylation loci and each column is a sample.

We can have a look at some data from (Lister et al. 2009), specifically chromosome 22 from the IMR90 cell line.

data(BS.chr22)
BS.chr22
## An object of type 'BSseq' with
##   494728 methylation loci
##   2 samples
## has not been smoothed
## All assays are in-memory

The genomic positions are stored as a GRanges object GRanges are general genomic regions; we represent a single base methylation loci as an interval of width 1 (which may seem a bit strange, but there are good reasons for this). For example, the first 4 loci in the Lister data are

head(granges(BS.chr22), n = 4)
## GRanges object with 4 ranges and 0 metadata columns:
##       seqnames               ranges strand
##          <Rle>            <IRanges>  <Rle>
##   [1]    chr22 [14430632, 14430632]      *
##   [2]    chr22 [14430677, 14430677]      *
##   [3]    chr22 [14430687, 14430687]      *
##   [4]    chr22 [14430702, 14430702]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We also have the M and Cov matrices

head(getCoverage(BS.chr22, type = "M"), n = 4)
## DelayedMatrix object of 4 x 2 doubles:
##      r1 r2
## [1,] 17 20
## [2,]  4 20
## [3,]  6 19
## [4,]  2  4
head(getCoverage(BS.chr22), n = 4)
## DelayedMatrix object of 4 x 2 doubles:
##      r1 r2
## [1,] 18 23
## [2,] 11 28
## [3,] 10 25
## [4,]  8 21

Since CpG methylation is symmetric on the two strands of a chromosome, we aggregated reads on the forward and reverse strand to form a single value, and we assume the genomic position points to the C of the CpG. It is not crucial in any way to do this, one may easily analyze each strand separately, but CpG methylation is symmetric and this halves the number of loci.

How to input your methylation data into this data structure (called a BSseq object) is described in a section below. We also have a section on how to operate on these types of objects.

An analysis typically consists of the following steps.

  1. Smoothing, using the function BSmooth().
  2. Compute t-statistics, using the function BSmooth.tstat(). This converts the BSseq object into a BSseqTstat object.
  3. Threshold these t-statistics to identify DMRs, using the function dmrFinder(), returning a simple data.frame.

It is usually a good idea to look at the smoothed data either before or after identifying DMRs. This can be done using the functions plotRegion() and plotManyRegions().

We also have functions for assessing goodness of fit for binomial and poison models; this is like to be of marginal interest to most users. See the man page for binomialGoodnessOfFit().

We also allow for easy computation of Fisher’s exact test for each loci, using the function fisherTests().

3 Using objects of class BSseq

3.1 Basic operations

Objects of class BSseq contains a GRanges object with the genomic locations. This GRanges object can be obtained by granges(). A number of standard GRanges methods works directly on the BSseq object, such as start(), end(), seqnames() (chromosome names) etc.

These objects also contains a phenoData object for sample pheno data. Useful methods are sampleNames(), pData().

Finally, we have methods such as dim(), ncol() (number of columns; number of samples), nrow() (number of rows; number of methylation loci).

Objects can be subsetted using two indicies BSseq[i,j] with the first index being methylation loci and the second index being samples. Another very useful way of subsetting the object is by using the method subsetByOverlaps(). This selects all methylation loci inside a set of genomic intervals (there is a difference between first and second argument and either can be BSseq or GRanges).

Examples:

head(start(BS.chr22), n = 4)
## [1] 14430632 14430677 14430687 14430702
head(seqnames(BS.chr22), n = 4)
## factor-Rle of length 4 with 1 run
##   Lengths:     4
##   Values : chr22
## Levels(1): chr22
sampleNames(BS.chr22)
## [1] "r1" "r2"
pData(BS.chr22)
## DataFrame with 2 rows and 1 column
##            Rep
##    <character>
## r1  replicate1
## r2  replicate2
dim(BS.chr22)
## [1] 494728      2
BS.chr22[1:6,1]
## An object of type 'BSseq' with
##   6 methylation loci
##   1 samples
## has not been smoothed
## All assays are in-memory
subsetByOverlaps(BS.chr22, 
                 GRanges(seqnames = "chr22", 
                         ranges = IRanges(start = 1, end = 2*10^7)))
## An object of type 'BSseq' with
##   67082 methylation loci
##   2 samples
## has not been smoothed
## All assays are in-memory

3.2 Data handling

We have a number of functions for manipulating one or more BSseq objects.

BSseq() instantiates an object of class BSseq. Genomic locations are passed in, either as a GRanges object (argument gr) or as chromosome and location vectors (arguments chr and pos). The arguments M and Cov accepts matrices, and it is possible to directly give it a phenoData object.

M <- matrix(0:8, 3, 3)
Cov <- matrix(1:9, 3, 3)
BStmp <- BSseq(chr = c("chr1", "chrX", "chr1"), pos = 1:3, 
               M = M, Cov = Cov, sampleNames = c("A1","A2", "B"))

A BSseq object may be ordered by orderBSseq(). This ensures that da