Contents

1 TL;DR (Too long; didn’t read)

e <- enrichR(treatment = "ChIP.bam",
             control   = "Control.bam",
             genome    = "hg19")
de <- diffR(treatment = "ChIP1.bam",
            control   = "ChIP2.bam",
            genome    = "hg19")
re <- regimeR(treatment = "ChIP.bam",
              control   = "Control.bam",
              genome    = "hg19",
              models    = k)
#export enriched regions with FDR<=10% for downstream analysis
exportR(obj      = e,
        filename = "enriched.bed",
        type     = "bed",
        fdr      = 0.1)
#or
#write normalized differential enrichment to bigWig for genome browser display
exportR(obj      = de,
        filename = "diffEnrichment.bw",
        type     = "bigWig")
citation("normr")
## 
## To cite package 'normr' in publications use:
## 
##   Johannes Helmuth and Ho-Ryun Chung (2017). normr: Normalization
##   and difference calling in ChIP-seq data. R package version 1.6.0.
##   https://github.com/your-highness/normR
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {normr: Normalization and difference calling in ChIP-seq data},
##     author = {Johannes Helmuth and Ho-Ryun Chung},
##     year = {2017},
##     note = {R package version 1.6.0},
##     url = {https://github.com/your-highness/normR},
##   }

2 Introduction to normR

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) provides genome-wide localization data for DNA-associated proteins. To infer the regions bound by such proteins the read densities obtained by such a ChIP-seq experiment are compared to the corresponding read profile obtained by a control experiment. A meaningful comparison requires normalization to mitigate the effects of technical biases, e.g. different sequencing depth. But more importantly the effect of the enrichment of certain regions on the overall read statistics. Normalization requires knowledge of the regions that remained unchanged, such that normalization and difference calling are inseparable.

This package, normR (normR obeys regime mixture Rules), follows this logic and performs normalization and difference calling simultaneously to identify genomic regions enriched by the ChIP-procedure (enrichR()). In addition, normR enables the comparison between ChIP-seq data obtained from different conditions allowing for unraveling genomic regions that change their association with the ChIP-target (diffR()). Lastly, normR is capable to differentiate multiple regimes of enrichment, i.e. broad domains and sharp peaks (regimeR()). In brief, all these routines encompass three steps:

  1. Count reads in fixed-size windows along the genome;
  2. Fit a binomial mixture model by Expectation Maximization;
  3. Assign each window a significance based on the fitted background component.

This vignette explains a common workflow of normR analysis on NGS data for calling enrichment, identification of differential ChIP-seq enrichment and stratification of enrichment regimes. Herein, we provide examples for the export of results to common data formats like bigWig and bed. We show how analysis statistics and diagnostic plots are helpful for studying results. In a latter section, we cover more advanced topics including the alteration of read counting strategies, the application of normR on user-defined regions (e.g. promoters) and the integration of Copy Number Variation information in differential ChIP-seq enrichment calls.

3 Toy Examples

3.1 enrichR(): Calling Enrichment with an Input Control

In this first section, we would like to call regions significantly enriched for reads in the ChIP-seq experiment given a Control alignment. Here, we analyze ChIP-seq data for both H3K4me3 (pointy enrichment) and H3K36me3 (broad enrichment) given an Input-seq control alignment originating from a human immortalized myelogenous leukemia line (K562). Using normR, we show that our representative region on human chromsome 1 (chr1:22500000-25000000) is enriched for H3K4me3 mostly at promoters and precludes H3K36me3 enrichment which is overrepresented in gene bodies.

IGV browser shot of Input (grey), H3K4me3 (green) and H3K36me3 (purple) alignment data on chr1 22.5Mb to 25Mb with genes (black) drawn.

As part of the normR package, we provide 3 alignment files in bam format (Input, H3K4me3 and H3K36me3 ChIP-seq) containing reads for human chr1 22.5Mb to 25Mb. Note, bam files used in normR need to be sorted by read coordinates (samtools sort x.bam x_sorted) and indexed (samtools index x_sorted.bam). Our example files already fullfil this criteria.

Firstly, we retrieve filepaths for toy data:

#Loading required package
library("normr")

inputBamfile <- system.file("extdata", "K562_Input.bam", package="normr")
k4me3Bamfile <- system.file("extdata", "K562_H3K4me3.bam", package="normr")
k36me3Bamfile <- system.file("extdata", "K562_H3K36me3.bam", package="normr")

Secondly and lastly, we need to specify the genome of the alignment. The genome argument can be a character specifying a UCSC genome identifier, e.g. “hg19”, or we can provide a 2-dimensional data.frame with columns seqnames and length. We will follow the later option: You can generate a genome data.frame yourself or can use GenomeInfoDb for retrieving the data.frame from UCSC for given genome identifier:

#Fetch chromosome information
genome <- GenomeInfoDb::fetchExtendedChromInfoFromUCSC("hg19")

#Filter out irregular chromosomes and delete unnecessary columns
idx <- which(!genome$circular & genome$SequenceRole=="assembled-molecule")
genome <- genome[idx,1:2]
genome
##    UCSC_seqlevel UCSC_seqlength
## 1           chr1      249250621
## 2           chr2      243199373
## 3           chr3      198022430
## 4           chr4      191154276
## 5           chr5      180915260
## 6           chr6      171115067
## 7           chr7      159138663
## 8           chr8      146364022
## 9           chr9      141213431
## 10         chr10      135534747
## 11         chr11      135006516
## 12         chr12      133851895
## 13         chr13      115169878
## 14         chr14      107349540
## 15         chr15      102531392
## 16         chr16       90354753
## 17         chr17       81195210
## 18         chr18       78077248
## 19         chr19       59128983
## 20         chr20       63025520
## 21         chr21       48129895
## 22         chr22       51304566
## 23          chrX      155270560
## 24          chrY       59373566
#Toy data has only "chr1"
genome <- genome[genome[,1] == "chr1",]
genome
##   UCSC_seqlevel UCSC_seqlength
## 1          chr1      249250621

Now, we are all set to do a enrichment call with default parameters:

#Enrichment Calling for H3K4me3 and H3K36me3
k4me3Fit <- enrichR(treatment = k4me3Bamfile, control = inputBamfile,
                    genome = genome, verbose = FALSE)
k36me3Fit <- enrichR(treatment = k36me3Bamfile, control = inputBamfile,
                     genome = genome, verbose = FALSE)

That was easy and fast! You must know that all results are stored as NormRFit-class objects. They provide convenient access to count data and fitting results. For example, let’s have a look at some simple fitting statistics for H3K4me3:

k4me3Fit
## NormRFit-class object
## 
##  Type:                    enrichR 
##  Number of Regions:       997003 
##  Theta* (naive bg):       0.3928 
##  Background component B:  1 
## 
## +++ Results of fit +++ 
## Mixture Proportions:
## Background     Class 1  
##    94.997%      5.003%   
## Theta:
## Background     Class 1  
##    0.09148     0.92761

The “Type” of the NormRFit object is defined by the function generating it, i.e. enrichR(), diffR() or regimeR(). The “Number of Regions” is the number of 250bp bins along the specified genome (default binsize). The “Number of Components” is 2 (background and enriched) in the case of enrichR(). The parameter \(\theta^*\) (“Theta* (naive bg)”) describes a naive background parametrization if the effect of enrichment is not taken into account. The actual \(\theta_B\) is with ~0.09 much smaller than \(\theta^*\) which allows for more sensitive enrichment calling. Furthermore, by looking at the “Mixture Proportions” we find H3K4me3 is enriched in ~5% of the windows. For H3K36me3, we find ~23% of the regions enriched.

k36me3Fit
## NormRFit-class object
## 
##  Type:                    enrichR 
##  Number of Regions:       997003 
##  Theta* (naive bg):       0.5143 
##  Background component B:  1 
## 
## +++ Results of fit +++ 
## Mixture Proportions:
## Background     Class 1  
##     76.75%      23.25%   
## Theta:
## Background     Class 1  
##     0.1131      0.8383

3.1.1 Analyzing Results

We can use some methods provided by the NormRFit-class to get a grasp on the quality of our normR call, e.g. print a more concise summary that gives the number of significant regions under different False Discovery Rates (\(FDR\)).

summary(k4me3Fit)
## NormRFit-class object
## 
## Type:                  'enrichR'
## Number of Regions:     997003
## Number of Components:  2
## Theta* (naive bg):     0.393
## Background component B: 1
## 
## +++ Results of fit +++ 
## Mixture Proportions:
## Background       Class 1    
##        95%            5%    
## Theta:
## Background       Class 1    
##     0.0915        0.9276    
## 
## Bayesian Information Criterion:  73401
## 
## +++ Results of binomial test +++ 
## T-Filter threshold: 4
## Number of Regions filtered out: 988560
## Significantly different from background B based on q-values:
## TOTAL:
##            ***        **         *         .                n.s.
## Bins        24       433        67        63        71      7785
## %        0.239     4.554     5.222     5.850     6.557    77.578
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '  ' 1 'n.s.'

Note, summary() prints an additional section containing information on the statistical testing. The “T-Filter threshold” filters out regions that are not considered for multiple testing correction due to low power. The sum of counts in treatment and control is less than this quantity (Dialsingh et al., Bioinformatics, 2015, 1–7). The “Number of Regions filtered out” is very large in our example because the toy alignment files are filtered for reads within chr1 22.5Mb to 25Mb. However, the specified genome covers chr1:0-249250621 which results in alot of zero counts. This does not influence the fit but for testing the regions are filtered. Based on computed q-vlaues, H3K4me3 shows 587 (24+433+67+63) regions significantly enriched for \(FDR \le 0.05\). The summary for H3K36me3 enrichment calls confirms 2,378 (0+1951+212+215) regions significantly enriched for \(FDR \le 0.05\):

summary(k36me3Fit)
## NormRFit-class object
## 
## Type:                  'enrichR'
## Number of Regions:     997003
## Number of Components:  2
## Theta* (naive bg):     0.514
## Background component B: 1
## 
## +++ Results of fit +++ 
## Mixture Proportions:
## Background       Class 1    
##      76.7%         23.3%    
## Theta:
## Background       Class 1    
##      0.113         0.838    
## 
## Bayesian Information Criterion:  131166
## 
## +++ Results of binomial test +++ 
## T-Filter threshold: 4
## Number of Regions filtered out: 988119
## Significantly different from background B based on q-values:
## TOTAL:
##          ***      **       *       .            n.s.
## Bins       0    1951     212     215     121    6385
## %        0.0    12.7    14.1    15.5    16.3    41.5
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '  ' 1 'n.s.'

Based on the fitted background, normR can calculate a standardized, regularized enrichment for further processing:

#background normalized enrichment
k4me3Enr <- getEnrichment(k4me3Fit)

#restrict to regions with non-zero counts
idx <- which(rowSums(do.call(cbind, getCounts(k4me3Fit))) != 0)
summary(k4me3Enr[idx])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2590 -0.9680 -0.7688 -0.4050  0.1409  2.0151

If we compare H3K4me3 and H3K36me3 enrichment in non-zero regions, we can see some mutual exclusivity of both marks represented by off-diagonal elements):

x <- k4me3Enr[idx]
y <- getEnrichment(k36me3Fit)[idx]
d.x <- density(x); d.y <- density(y)
limits <- range(x,y)
layout( matrix( c(0,2,2,1,3,3,1,3,3),ncol=3) )
plot(d.x$x, d.x$y, xlim=limits, type='l',
     main="H3K36me3 normalized Enrichment", xlab="", ylab="Density")
abline(v=0, lty=3, lwd=2, col=4)
plot(d.y$y, d.y$x, ylim=limits, xlim=rev(range(d.y$y)), type='l',
     main="H3K4me3 normalized Enrichment", xlab="Density", ylab="")
abline(h=0, lty=3, lwd=2, col=4)
color <- rep("grey10", length(idx))
plot(x, y, xlim=limits, ylim=limits, pch=20, xlab="", ylab="",
     col=adjustcolor(color, alpha.f=.2))
abline(0, 1, lty=2, lwd=3, col=2)
abline(v=0, lty=3, lwd=2, col=4)
abline(h=0, lty=3, lwd=2, col=4)