Contents

1 Introduction

Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) and ChIP followed by genome tiling array analysis (ChIP-chip) have become prevalent high throughput technologies for identifying the binding sites of DNA-binding proteins genome-wise. A number of algorithms have been published to facilitate the identification of the binding sites of the DNA-binding proteins of interest. The identified binding sites as the list of peaks are usually converted to BED or bigwig files to be loaded to the UCSC genome browser as custom tracks for investigators to view the proximity to various genomic features such as genes, exons or conserved elements. However, clicking through the genome browser is a daunting task when the number of peaks gets large or the peaks spread widely across the genome.

Here we developed ChIPpeakAnno, a Bioconductor1 package, to facilitate the batch annotation of the peaks identified from ChIP-seq or ChIP-chip experiments. We implemented functionality to find the nearest gene, exon, miRNA or other custom features supplied by users such as the most conserved elements and other transcription factor binding sites leveraging GRanges. Since the genome annotation gets updated frequently, we have leveraged the biomaRt package to retrieve the annotation data on the fly. The users also have the flexibility to pass their own annotation data or annotation from GenomicFeatures as GRanges. We have also leveraged BSgenome and biomaRt to retrieve the sequences around the identified peak for peak validation or motif discovery2. To understand whether the identified peaks are enriched around genes with certain GO terms, we have implemented the Gene Ontology (GO) enrichment test in the ChIPpeakAnno package leveraging the hypergeometric test phyper in the stats package and integrated with the GO annotation from the GO.db package and multiplicity adjustment functions from the multtest package3–8. The pathway analysis using reactome or KEGG is also supported. Starting 3.4, we also implement the functions for permutation test to determine whether there is a significant overlap between two sets of peaks. In addition, binding patterns of multiple transcription factors (TFs) or distributions of multiple epigenetic markers around genomic features could be visualized and compared easily using a side-by-side heatmap and density plot.

2 Quick start

library(ChIPpeakAnno)
## import the MACS output
macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")
macsOutput <- toGRanges(macs, format="MACS")
## annotate the peaks with precompiled ensembl annotation
data(TSS.human.GRCh38)
macs.anno <- annotatePeakInBatch(macsOutput, AnnotationData=TSS.human.GRCh38)
## add gene symbols
library(org.Hs.eg.db)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                        orgAnn="org.Hs.eg.db", 
                        IDs2Add="symbol")

if(interactive()){## annotate the peaks with UCSC annotation
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
    macs.anno <- annotatePeakInBatch(macsOutput, 
                                     AnnotationData=ucsc.hg38.knownGene)
    macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                            orgAnn="org.Hs.eg.db", 
                            feature_id_type="entrez_id",
                            IDs2Add="symbol")
}

3 An example of ChIP-seq analysis workflow using ChIPpeakAnno

We illustrate here a common downstream analysis workflow for ChIP-seq experiments. The input of ChIPpeakAnno is a list of called peaks identified from ChIP-seq experiments. The peaks are represented by GRanges in ChIPpeakAnno. We implemented a conversion functions toGRanges to convert commonly used peak file formats, such as BED, GFF, or other user defined formats such as MACS (a popular peak calling program) output file to GRanges. Please type ?toGRanges for more information.

The workflow here exemplifies converting the BED and GFF files to GRanges, finding the overlapping peaks between the two peak sets, and visualizing the number of common and specific peaks with Venn diagram.

bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
## [1] TRUE
gr1[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames         ranges strand |     score
##                  <Rle>      <IRanges>  <Rle> | <numeric>
##   MACS_peak_1     chr1 [28341, 29610]      * |    160.81
##   MACS_peak_2     chr1 [90821, 91234]      * |    133.12
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr1.import[1:2] #note the name slot is different from gr1
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames         ranges strand |        name     score
##          <Rle>      <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 [28341, 29610]      * | MACS_peak_1    160.81
##   [2]     chr1 [90821, 91234]      * | MACS_peak_2    133.12
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)
venn diagram of overlaps for duplicated experiments

Figure 1: venn diagram of overlaps for duplicated experiments

## $p.value
##      gr1 gr2 pval
## [1,]   1   1    0
## 
## $vennCounts
##      gr1 gr2 Counts
## [1,]   0   0      0
## [2,]   0   1     61
## [3,]   1   0     62
## [4,]   1   1    166
## attr(,"class")
## [1] "VennCounts"

A pie chart is used to demonstrate the overlap features of the common peaks.

pie1(table(ol$overlappingPeaks[["gr1///gr2"]]$overlapFeature))
Pie chart of common peaks among features

Figure 2: Pie chart of common peaks among features

After finding the overlapping peaks, you can use annotatePeakInBatch to annotate the overlapping peaks with the genomic features in the AnnotationData within certain distance away specified by maxgap, which is 5kb in the following example.

overlaps <- ol$peaklist[["gr1///gr2"]]
## ============== old style ===========
## data(TSS.human.GRCh37) 
## overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
##                                      output="overlapping", maxgap=5000L)
## overlaps.anno <- addGeneIDs(overlaps.anno, "org.Hs.eg.db", "symbol")
## ============== new style ===========
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##                   seqnames         ranges strand |   gene_name
##                      <Rle>      <IRanges>  <Rle> | <character>
##   ENSG00000223972     chr1 [11869, 14412]      + |     DDX11L1
##   ENSG00000227232     chr1 [14363, 29806]      - |      WASH7P
##   -------
##   seqinfo: 273 sequences from GRCh37 genome
overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
                                    output="overlapping", maxgap=5000L)
overlaps.anno$gene_name <- 
    annoData$gene_name[match(overlaps.anno$feature,
                             names(annoData))]
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
##                        seqnames           ranges strand |
##                           <Rle>        <IRanges>  <Rle> |
##   X001.ENSG00000228327     chr1 [713791, 715578]      * |
##   X001.ENSG00000237491     chr1 [713791, 715578]      * |
##   X002.ENSG00000237491     chr1 [724851, 727191]      * |
##   X003.ENSG00000272438     chr1 [839467, 840090]      * |
##   X004.ENSG00000223764     chr1 [856361, 856999]      * |
##   X004.ENSG00000187634     chr1 [856361, 856999]      * |
##                                                            peakNames
##                                                      <CharacterList>
##   X001.ENSG00000228327 gr1__MACS_peak_13,gr2__region_0,gr2__region_1
##   X001.ENSG00000237491 gr1__MACS_peak_13,gr2__region_0,gr2__region_1
##   X002.ENSG00000237491               gr2__region_2,gr1__MACS_peak_14
##   X003.ENSG00000272438               gr1__MACS_peak_16,gr2__region_3
##   X004.ENSG00000223764               gr1__MACS_peak_17,gr2__region_4
##   X004.ENSG00000187634               gr1__MACS_peak_17,gr2__region_4
##                               peak         feature start_position
##                        <character>     <character>      <integer>
##   X001.ENSG00000228327         001 ENSG00000228327         700237
##   X001.ENSG00000237491         001 ENSG00000237491         714150
##   X002.ENSG00000237491         002 ENSG00000237491         714150
##   X003.ENSG00000272438         003 ENSG00000272438         840214
##   X004.ENSG00000223764         004 ENSG00000223764         852245
##   X004.ENSG00000187634         004 ENSG00000187634         860260
##                        end_position feature_strand insideFeature
##                           <integer>    <character>      <factor>
##   X001.ENSG00000228327       714006              -  overlapStart
##   X001.ENSG00000237491       745440              +  overlapStart
##   X002.ENSG00000237491       745440              +        inside
##   X003.ENSG00000272438       851356              +      upstream
##   X004.ENSG00000223764       856396              -  overlapStart
##   X004.ENSG00000187634       879955              +      upstream
##                        distancetoFeature shortestDistance
##                                <numeric>        <integer>
##   X001.ENSG00000228327               215              215
##   X001.ENSG00000237491              -359              359
##   X002.ENSG00000237491             10701            10701
##   X003.ENSG00000272438              -747              124
##   X004.ENSG00000223764                35               35
##   X004.ENSG00000187634             -3899             3261
##                        fromOverlappingOrNearest     gene_name
##                                     <character>   <character>
##   X001.ENSG00000228327              Overlapping RP11-206L10.2
##   X001.ENSG00000237491              Overlapping RP11-206L10.9
##   X002.ENSG00000237491              Overlapping RP11-206L10.9
##   X003.ENSG00000272438              Overlapping  RP11-54O7.16
##   X004.ENSG00000223764              Overlapping   RP11-54O7.3
##   X004.ENSG00000187634              Overlapping        SAMD11
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Once the peaks are annotated, the distribution of the distance to the nearest feature such as the transcription start sites (TSS) can be plotted. The sample code here plots the distribution of the aggregated peak scores and the number of peaks around the TSS.

gr1.copy <- gr1
gr1.copy$score <- 1
binOverFeature(gr1, gr1.copy, annotationData=annoData,
               radius=5000, nbins=10, FUN=c(sum, length),
               ylab=c("score", "count"), 
               main=c("Distribution of aggregated peak scores around TSS", 
                      "Distribution of aggregated peak numbers around TSS"))
Distribution of aggregated peak scores or peak numbers around transcript start sites.

Figure 3: Distribution of aggregated peak scores or peak numbers around transcript start sites

The distribution of the peaks over exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR can be summarized in peak centric or nucleotide centric view using the function assignChromosomeRegion. Please note that setting nucleotideLevel = TRUE will give a nucleotide level distribution over different features.

if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
    aCR<-assignChromosomeRegion(gr1, nucleotideLevel=FALSE, 
                           precedence=c("Promoters", "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                           TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
    barplot(aCR$percentage)
}
Peak distribution over different genomic features.

Figure 4: Peak distribution over different genomic features

4 Detailed Use Cases and Scenarios

Here we describe some details in using different functions in ChIPpeakAnno for different tasks. As shown in the last section, the common workflow includes: loading called peaks from BED, GFF, or other formats; evaluating and visualizing the concordance among the biological replicates; combining peaks from replicates; preparing genomic annotation(s) as GRanges; associating/annotating peaks with the annotation(s); summarizing peak distributions over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR regions; retrieving the sequences around the peaks; and enrichment analysis of GO and biological pathway. We also implemented the functions to plot the heatmap of given peak ranges, and perform permutation test to determine if there is a significant overlap between two sets of peaks.

4.1 Determine the overlapping peaks and visualize the overlaps with Venn diagram

Prior to associating features of interest with the peaks, it is a common practice to evaluate the concordance among the peaks from biological replicates and combine the peaks from biological replicates. Also, it is biologically interesting to obtain overlapping peaks from different ChIP-seq experiments to imply the potential formation of transcription factor complexes. ChIPpeakAnno implemented functions to achieve those goals and quantitatively determine the significance of peak overlaps and generate a Venn diagram for visualization.

Here is the sample code to obtain the overlapping peaks with maximum gap of 1kb for two peak ranges.

peaks1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", 
                              "2", "6", "6", "6", "6", "5"),
                   ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869, 
                                          3123260, 3857501, 201089, 1543200, 
                                          1557200, 1563000, 1569800, 167889600),
                                  end= c(967754, 2010997, 2496804, 3075969, 
                                         3123360, 3857601, 201089, 1555199,
                                         1560599, 1565199, 1573799, 167893599),
                                  names=paste("Site", 1:12, sep="")),
                  strand="+")

peaks2 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", "3", 
                                     "4", "5", "6", "6", "6", "6", "6", "5"),
                          ranges=IRanges(start=c(967659, 2010898, 2496700, 
                                                 3075866, 3123260, 3857500, 
                                                 96765, 201089, 249670, 307586, 
                                                 312326, 385750, 1549800, 
                                                 1554400, 1565000, 1569400,
                                                 167888600), 
                                         end=c(967869, 2011108, 2496920, 
                                               3076166,3123470, 3857780, 
                                               96985, 201299, 249890, 307796, 
                                               312586, 385960, 1550599, 1560799,
                                               1565399, 1571199, 167888999), 
                                         names=paste("t", 1:17, sep="")),
                          strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-", 
                                   "-", "-", "-", "+", "+", "+", "+", "+"))

ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap=1000)
peaklist <- ol$peaklist

The function findOverlapsOfPeaks returns an object of overlappingPeaks, which contains there elements: venn_cnt, peaklist (a list of
overlapping peaks or unique peaks), and overlappingPeaks (a list of data frame consists of the annotation of all the overlapping peaks).

Within the overlappingPeaks element of the overlappingPeaks object ol (which is also a list), the element “peaks1///peaks2” is a data frame representing the overlapping peaks with maximum gap of 1kb between the two peak lists. Using the overlapFeature column in this data frame, a pie graph can be generated to describe the distribution of the features of the relative positions of peaks1 to peaks2 for the overlapping peaks.

overlappingPeaks <- ol$overlappingPeaks
names(overlappingPeaks)
## [1] "peaks1///peaks2"
dim(overlappingPeaks[["peaks1///peaks2"]])
## [1] 13 14
overlappingPeaks[["peaks1///peaks2"]][1:2, ]
##          peaks1 seqnames   start     end width strand peaks2 seqnames
## Site1_t1  Site1        1  967654  967754   101      +     t1        1
## Site2_t2  Site2        2 2010897 2010997   101      +     t2        2
##            start     end width strand overlapFeature shortestDistance
## Site1_t1  967659  967869   211      +   overlapStart                5
## Site2_t2 2010898 2011108   211      +   overlapStart                1
pie1(table(overlappingPeaks[["peaks1///peaks2"]]$overlapFeature))
Pie chart of common peaks among features.

Figure 5: Pie chart of common peaks among features

The following code returns the merged overlapping peaks from the peaklist object.

peaklist[["peaks1///peaks2"]]
## GRanges object with 11 ranges and 1 metadata column:
##        seqnames                 ranges strand |
##           <Rle>              <IRanges>  <Rle> |
##    [1]        1 [   967654,    967869]      + |
##    [2]        2 [   201089,    201299]      * |
##    [3]        2 [  2010897,   2011108]      + |
##    [4]        3 [  2496700,   2496920]      + |
##    [5]        4 [  3075866,   3076166]      + |
##    [6]        5 [  3123260,   3123470]      + |
##    [7]        5 [167888600, 167893599]      + |
##    [8]        6 [  1543200,   1560799]      + |
##    [9]        6 [  1563000,   1565399]      + |
##   [10]        6 [  1569400,   1573799]      + |
##   [11]        6 [  3857500,   3857780]      + |
##                                        peakNames
##                                  <CharacterList>
##    [1]                  peaks1__Site1,peaks2__t1
##    [2]                  peaks1__Site7,peaks2__t8
##    [3]                  peaks1__Site2,peaks2__t2
##    [4]                  peaks2__t3,peaks1__Site3
##    [5]                  peaks2__t4,peaks1__Site4
##    [6]                  peaks1__Site5,peaks2__t5
##    [7]                peaks2__t17,peaks1__Site12
##    [8] peaks1__Site8,peaks2__t13,peaks2__t14,...
##    [9]                peaks1__Site10,peaks2__t15
##   [10]                peaks2__t16,peaks1__Site11
##   [11]                  peaks2__t6,peaks1__Site6
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

The peaks in peaks1 but not overlap with the peaks in peaks2 can be obtained with:

peaklist[["peaks1"]]
## NULL

The peaks in peaks2 but not overlap with the peaks in peaks1 can be obtained with:

peaklist[["peaks2"]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames           ranges strand |       peakNames
##          <Rle>        <IRanges>  <Rle> | <CharacterList>
##   [1]        1 [ 96765,  96985]      - |      peaks2__t7
##   [2]        3 [249670, 249890]      - |      peaks2__t9
##   [3]        4 [307586, 307796]      - |     peaks2__t10
##   [4]        5 [312326, 312586]      - |     peaks2__t11
##   [5]        6 [385750, 385960]      - |     peaks2__t12
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

Venn diagram can be generated by the function makeVennDiagram using the output of findOverlapsOfPeaks as an input.

The makeVennDiagram also outputs p-values indicating whether the overlapping is significant.

makeVennDiagram(ol, totalTest=1e+2)