Contents

1 Abstract

ChIPseeker is an R package for annotating ChIP-seq data analysis. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.

2 Citation

If you use ChIPseeker1 in published research, please cite:

G Yu, LG Wang, QY He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383. doi:[10.1093/bioinformatics/btv145](http://dx.doi.org/10.1093/bioinformatics/btv145)

3 Introduction

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has become standard technologies for genome wide identification of DNA-binding protein target sites. After read mappings and peak callings, the peak should be annotated to answer the biological questions. Annotation also create the possibility of integrating expression profile data to predict gene expression regulation. ChIPseeker1 was developed for annotating nearest genes and genomic features to peaks.

ChIP peak data set comparison is also very important. We can use it as an index to estimate how well biological replications are. Even more important is applying to infer cooperative regulation. If two ChIP seq data, obtained by two different binding proteins, overlap significantly, these two proteins may form a complex or have interaction in regulation chromosome remodelling or gene expression. ChIPseeker1 support statistical testing of significant overlap among ChIP seq data sets, and incorporate open access database GEO for users to compare their own dataset to those deposited in database. Protein interaction hypothesis can be generated by mining data deposited in database. Converting genome coordinations from one genome version to another is also supported, making this comparison available for different genome version and different species.

Several visualization functions are implemented to visualize the coverage of the ChIP seq data, peak annotation, average profile and heatmap of peaks binding to TSS region.

Functional enrichment analysis of the peaks can be performed by my Bioconductor packages DOSE2, ReactomePA3, clusterProfiler4.

## loading packages
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)

4 ChIP profiling

The datasets CBX6 and CBX7 in this vignettes were downloaded from GEO (GSE40740)5 while ARmo_0M, ARmo_1nM and ARmo_100nM were downloaded from GEO (GSE48308)6 . ChIPseeker provides readPeakFile to load the peak and store in GRanges object.

files <- getSampleFiles()
print(files)
## $ARmo_0M
## [1] "/tmp/RtmpLlsNCj/Rinstd6a7c3c5422/ChIPseeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz"
## 
## $ARmo_1nM
## [1] "/tmp/RtmpLlsNCj/Rinstd6a7c3c5422/ChIPseeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz"
## 
## $ARmo_100nM
## [1] "/tmp/RtmpLlsNCj/Rinstd6a7c3c5422/ChIPseeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz"
## 
## $CBX6_BF
## [1] "/tmp/RtmpLlsNCj/Rinstd6a7c3c5422/ChIPseeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
## 
## $CBX7_BF
## [1] "/tmp/RtmpLlsNCj/Rinstd6a7c3c5422/ChIPseeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"
peak <- readPeakFile(files[[4]])
peak
## GRanges object with 1331 ranges and 2 metadata columns:
##          seqnames                 ranges strand |             V4        V5
##             <Rle>              <IRanges>  <Rle> |       <factor> <numeric>
##      [1]     chr1     [ 815093,  817883]      * |    MACS_peak_1    295.76
##      [2]     chr1     [1243288, 1244338]      * |    MACS_peak_2     63.19
##      [3]     chr1     [2979977, 2981228]      * |    MACS_peak_3    100.16
##      [4]     chr1     [3566182, 3567876]      * |    MACS_peak_4    558.89
##      [5]     chr1     [3816546, 3818111]      * |    MACS_peak_5     57.57
##      ...      ...                    ...    ... .            ...       ...
##   [1327]     chrX [135244783, 135245821]      * | MACS_peak_1327     55.54
##   [1328]     chrX [139171964, 139173506]      * | MACS_peak_1328    270.19
##   [1329]     chrX [139583954, 139586126]      * | MACS_peak_1329    918.73
##   [1330]     chrX [139592002, 139593238]      * | MACS_peak_1330    210.88
##   [1331]     chrY [ 13845134,  13845777]      * | MACS_peak_1331     58.39
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

4.1 ChIP peaks coverage plot

After peak calling, we would like to know the peak locations over the whole genome, covplot function calculates the coverage of peak regions over chromosomes and generate a figure to visualize. GRangesList is also supported and can be used to compare coverage of multiple bed files.

covplot(peak, weightCol="V5")

covplot(peak, weightCol="V5", chrs=c("chr17", "chr18"), xlim=c(4.5e7, 5e7))

4.2 Profile of ChIP peaks binding to TSS regions

First of all, for calculating the profile of ChIP peaks binding to TSS regions, we should prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then align the peaks that are mapping to these regions, and generate the tagMatrix.

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrix <- getTagMatrix(peak, windows=promoter)
##
## to speed up the compilation of this vignettes, we use a precalculated tagMatrix
data("tagMatrixList")
tagMatrix <- tagMatrixList[[4]]

In the above code, you should notice that tagMatrix is not restricted to TSS regions. The regions can be other types that defined by the user.

4.2.1 Heatmap of ChIP binding to TSS regions

tagHeatmap(tagMatrix, xlim=c(-3000, 3000), color="red")
Heatmap of ChIP peaks binding to TSS regions

Figure 1: Heatmap of ChIP peaks binding to TSS regions

ChIPseeker provide a one step function to generate this figure from bed file. The following function will generate the same figure as above.

peakHeatmap(files[[4]], TxDb=txdb, upstream=3000, downstream=3000, color="red")

4.2.2 Average Profile of ChIP peaks binding to TSS region

plotAvgProf(tagMatrix, xlim=c(-3000, 3000),
            xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
Average Profile of ChIP peaks binding to TSS region

Figure 2: Average Profile of ChIP peaks binding to TSS region

The function plotAvgProf2 provide a one step from bed file to average profile plot. The following command will generate the same figure as shown above.

plotAvgProf2(files[[4]], TxDb=txdb, upstream=3000, downstream=3000,
             xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Confidence interval estimated by bootstrap method is also supported for characterizing ChIP binding profiles.

plotAvgProf(tagMatrix, xlim=c(-3000, 3000), conf = 0.95, resample = 1000)