1 ABSTRACT

Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) has become prevalent high throughput technologies for identifying the binding sites of DNA-binding proteins genome-wise. A number of tools have been devlelopped to facilitate the identification and annotation of the binding sites of the DNA-binding proteins of interest. However, too much selection make it difficult for novel hand. The workflow is based primarily on R packages from the open-source Bioconductor project and covers all steps of the analysis pipline. Analyses will be demonstrated on a published ChIP-seq data set. This will provide readers with practical usage examples that can be applied in their own studies.

Keywords: ChIP-seq, annotation

Availability: The packages are freely available at http://www.bioconductor.org.

Contact: julie.zhu@umassmed.edu

2 1 INTRODUCTION

Chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) is a popular technique for identifying the genomic binding sites of a target protein(Park, 2009). Conventional analysis of ChIP-seq data including detecting absolute binding based duplicated test and the relative changes in the binding profile between conditions or transcription factors (TFs). In this kind of analysis, the analysis could be divided into 4 parts:

mapping reads
call peaks for each samples
compare the peaks for different conditions or TFs.
annotations

This article describes a computational workflow for performing a ChIP-seq analysis. The aim is to facilitate the practical implementation of ChIPpeakAnno(Zhu et al., 2010) by providing detailed code and expected output. The workflow described here applies to any ChIP-seq experiment with multiple experimental conditions and with multiple biological samples within one or more of the conditions.

workflow

Figure 1. flow chart of ChIP-seq analysis.

The workflow is based primarily on software packages from the open-source Bioconductor project(Gentleman et al., 2004). It contains all steps that are necessary for detecting peaks, starting from the raw read sequences. Reads are first aligned to the genome using the Rsubread(Liao et al., 2013) package. Peaks are called by MACS2(Zhang et al., 2008). The IDR (Irreproducible Discovery Rate) framework(Li et al., 2011) was used to assess the consistency of replicate experiments and to obtain a high-confidence single set of peak calls for each transcription factor. The peaks were annotated and virualized by ChIPpeakAnno and trackViewer packages.

The application of the methods in this article will be demonstrated on a publicly available ChIP-seq data sets. The data set studies genome-wide association between YAP/TAZ/TEAD peaks at enhancers drives oncogenic growth(Zanconato et al., 2015). The intention is to provide readers with a full usage example from which they can construct analysis of their own data.

3 2 Downloading data from Gene Expression Omnibus (GEO)

The first task is to download the relevant ChIP-seq libraries from the NCBI GEO. These are obtained from the data series GSE66081 and GSE49651. GSE66081 saves the data of ChIP-seq raw reads for the study of association between YAP/TAZ/TEAD and AP-1 at enhancers drivers oncogenic trowth. And GSE66081 is saved as SRA project SRP055170(Zanconato et al., 2015). GSE49651 saves the H3K27Ac, H3K4me1, H3K4me3 ChIP-seq data in breast cancer cell line(Rhie et al., 2014). The experiment contains two biological replicates in total for each of the TFs. Here we use SRAdb(Zhu et al., 2013) package to download data.

## load library SRAdb to retrieve SRA files for GSE66081 dataset
library(SRAdb)
## SRAdb need to download a light sqlite file 
## to extract sra file information
## and then download by getSRAfile function.
sqlfile <- "SRAmetadb.sqlite"
if(!file.exists(sqlfile)) 
    sqlfile <- getSRAdbFile()
sra_con <- dbConnect(SQLite(), sqlfile)
conversion <- sraConvert(c("SRP055170"), sra_con=sra_con)
out <- getSRAfile(conversion$sample, 
                  sra_con=sra_con, 
                  fileType="sra")

## ectract file annotations
rs <- listSRAfile(conversion$sample, 
                  sra_con=sra_con, 
                  fileType="sra")
experiment <- dbGetQuery(sra_con, 
                         paste0("select * from experiment where",
                                " experiment_accession in ('",
                                paste(conversion$experiment, 
                                      collapse="','"), "')"))
info <- merge(experiment[, c("experiment_accession", "title", "library_layout")], 
              rs[, c("experiment", "run")], by=1)

info[, 2:3]

##                                                     title library_layout
## 1    GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 2    GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 3    GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 4    GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 5    GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 6    GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 7    GSM1614030: TAZ_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 8    GSM1614030: TAZ_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 9    GSM1614030: TAZ_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 10   GSM1614031: TAZ_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 11   GSM1614031: TAZ_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 12   GSM1614031: TAZ_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 13   GSM1614032: IgG_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 14   GSM1614032: IgG_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 15   GSM1614032: IgG_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 16   GSM1614033: IgG_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 17   GSM1614033: IgG_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 18   GSM1614033: IgG_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 19 GSM1614034: TEAD4_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 20 GSM1614034: TEAD4_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 21 GSM1614035: TEAD4_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 22 GSM1614035: TEAD4_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 23   GSM1614036: JUN_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 24   GSM1614036: JUN_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 25   GSM1614037: JUN_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 26   GSM1614038: IgG_rep3_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 27   GSM1614038: IgG_rep3_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 28   GSM1614039: IgG_rep4_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 29   GSM1614039: IgG_rep4_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 30   GSM1614040: IgG_rep5_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE -

These files are downloaded in the SRA format, and need to be unpacked to the FASTQ format prior to alignment. This can be done using the fastq-dump utility from SRA Toolkit(Leinonen et al., 2010).

runs <- info[, "run"]
## extract fastq files from sra by sratoolkit
sapply(paste0(runs, ".sra"), 
       function(.ele) system(paste("fastq-dump", .ele)))

Technical replicates are merged together prior to further processing. This reflects the fact that they originate from a single library of DNA fragments.

grp <- do.call(rbind, strsplit(info$title, "(;|:) "))
info <- cbind(info, grp)
runs <- split(runs, grp[, 2])
group <- gsub("_ChIPSeq", "", names(runs))
mapply(function(.ele, n) 
    system(paste("cat", paste0(.ele, ".fastq", collapse=" "), 
                 ">>", paste0(n, ".fastq"), collapse=" ")), 
       runs, group)
rm(paste0(info$run, ".fastq"))

4 3 Aligning reads to hg19 build of human genome

Reads in each library are aligned to the hg19 build of human genome, using the Rsubread package. An index is constructed with the prefix index/hg19 and the index can be re-used. Here, the type of sequencing data is set to 1 for genomic DNA-seq data. The uniquely mapped reads should be reported only by setting the uniqe to TRUE.

library(Rsubread)
fastq.files <- paste0(group, ".fastq")
bam.files <- paste0(group, ".bam")

##getPhredOffset, if the offset is not correct, it will report error.
x <- qualityScores(filename=fastq.files[1], 
                   input_format="FASTQ", offset=33)

x[1:5, 1:10]

##       1  2  3  4  5  6  7  8  9 10
## [1,] 34 34 34 37 37 37 35 35 39 39
## [2,] 34 34 34 37 37 37 37 37 39 39
## [3,] 34 34 34 37 37 37 37 37 35 37
## [4,] 33 34 34 37 35 35 37 37 39 39
## [5,] 34 34 34 37 37 37 37 37 39 39

## build index and this can be re-used.
hg19.fasta <- 
    "Genomes/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
dir.create("index")
buildindex(basename="index/hg19",
           reference=hg19.fasta)

## alignment
align(index="./index/hg19", readfile1=fastq.files, type=1, 
      output_file=bam.files, maxMismatches=2, nthreads=2, 
      phredOffset=33, unique=TRUE)

Potential PCR duplicates are removed by removeDupReads function in Rsubread package. This step can be also done by MarkDuplicates tool from the Picard software suite. By asBam function from Rsamtools package, the alignment is sorted by their mapping location, and an index created on the destination (with extension ‘.bai’) when indexDestination=TRUE.

## remove reads which are mapped to identical locations, 
## using mapping location of the first base of each read.
## and resort by postion
library(Rsamtools)
null <- sapply(group, function(gp){
    out <- asSam(paste0(gp, ".bam"), "tmp")
    removeDupReads(out, threshold=2, outputFile="tmp.rmdup.sam")
    asBam("tmp.rmdup.sam", paste0(gp, "rmdup.bam"), indexDestination=TRUE)
    unlink(out)
    unlink("tmp.rmdup.sam")
})

5 4 Calling peaks

The control datasets are pooled before calling peaks according the description of paper.

## pool IgG depend on experiment
IgG.bg <- list(rep1.2=c("IgG_rep1.rmdup.bam", "IgG_rep2.rmdup.bam", 
                        "IgG_rep1.2.rmdup.bam"),
                rep3.4=c("IgG_rep3.rmdup.bam", "IgG_rep4.rmdup.bam", 
                         "IgG_rep3.4.rmdup.bam"),
                rep3.5=c("IgG_rep3.rmdup.bam", "IgG_rep5.rmdup.bam", 
                         "IgG_rep3.5.rmdup.bam"))
null <- sapply(IgG.bg, function(.ele){
    out <- mergeBam(files=.ele[-3], destination = .ele[3], 
                    indexDestination=TRUE)
})

Before we call peaks, we may want to check the difference between ChIP signal and background signal. By ChIPpeakAnno::cumulativePercentage function, the difference between the cumulative percentage tag allocation in Input and IP could be clearly checked(Diaz et al., 2012,Ramírez et al. (2016)).

library(ChIPpeakAnno)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
pair <- data.frame(treat=c("YAP", "TAZ", "TEAD4", "JUN"), 
                   control=c("1.2", "1.2", "3.4", "3.5"))
# use chromosome 1 to save the time.
chr1 <- as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)["chr1"], "GRanges")
cPout <- list()
for(i in seq.int(nrow(pair))){
    cPout[[i]] <- cumulativePercentage(
        bamfiles = c(paste0(pair[i, "treat"], "_rep1.rmdup.bam"),
                     paste0(pair[i, "treat"], "_rep2.rmdup.bam"),
                     paste0("IgG_rep", pair[i, "control"], ".rmdup.bam")), 
        gr=chr1)
    }

The MACS2 is used for peak calling. In this step, multiple tools can be used, for example BroadPeak(Wang et al., 2013), CCAT(Xu et al., 2010), CisGenome(Ji et al., 2008), Dfilter(Kumar et al., 2013), dpeak(Chung et al., 2013), EDD(Lund et al., 2014), ERANGE(Mortazavi et al., 2008), FindPeaks(Fejes et al., 2008), F-Seq(Boyle et al., 2008), GEM(Guo et al., 2012), Homer(Heinz et al., 2010), Hotspot(John et al., 2011), Hpeak(Qin et al., 2010), PeakRanger(Feng et al., 2011), PeakSeq(Rozowsky et al., 2009), QuEST(Valouev et al., 2008), RSEG(Song and Smith, 2011), SICER(Zang et al., 2009), SISSRs(Narlikar and Jothi, 2012), ZINBA(Rashid et al., 2011), or R/Bioconductor tools BayesPeak(Spyrou et al., 2009), CSAR(Muiño et al., 2011), csaw(Lun and Smyth, 2014), DiffBind(Ross-Innes et al., 2012), SPP(Kharchenko et al., 2008). However, too much choices make it difficult for users to determine which one should be used. According google scholar, MACS2 and Homer have the highest citation rate for sharp peaks calling and SICER and CCAT have the highest citation rate for broad peaks calling and ZINBA has the highest citation rate for mixing peaks calling.

for(i in seq.int(nrow(pair))){
    system(paste("macs2 callpeak --to-large -t", 
                 paste0(pair[i, "treat"], "_rep1.rmdup.bam"), 
                 "-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"), 
                 " -f BAM -g hs -q 0.1 -n", 
                 paste0(pair[i, "treat"], "_rep1.q1")))
    system(paste("macs2 callpeak --to-large -t", 
                 paste0(pair[i, "treat"], "_rep2.rmdup.bam"), 
                 "-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"), 
                 " -f BAM -g hs -q 0.1 -n", 
                 paste0(pair[i, "treat"], "_rep2.q1")))
    ## we will use idr to filter the results later
}

We set loose filter condition for MACS2 to get more peaks for irreproducibility discovery rate (IDR) analysis. After peak calling, the output of MACS2 will be import into R by ChIPpeakAnno::toGRanges function. The top 10000 peaks sorted by pValue (FDR) will be used for IDR analysis.

library(ChIPpeakAnno)
macs2.files <- dir(".", pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele) 
    toGRanges(.ele, format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele) 
    .ele[order(.ele$pValue, decreasing=TRUE)])

There are lots of noise peaks when we set loose filter condition for MACS2. We will use IDR framework to filter the low reproducible peaks. To filter the peak only overlapping peaks will be used for IDR calculation.

peaks <- lapply(peaks, function(.ele) .ele[1:min(1e5, length(.ele))])

lengths(peaks)

##   JUN_rep1.q1_peaks.narrowPeak   JUN_rep2.q1_peaks.narrowPeak 
##                           8877                          23544 
##   TAZ_rep1.q1_peaks.narrowPeak   TAZ_rep2.q1_peaks.narrowPeak 
##                           3647                           4200 
## TEAD4_rep1.q1_peaks.narrowPeak TEAD4_rep2.q1_peaks.narrowPeak 
##                           6850                           1520 
##   YAP_rep1.q1_peaks.narrowPeak   YAP_rep2.q1_peaks.narrowPeak 
##                           4187                           2593

group <- gsub("^(.*?)_.*$", "\\1", names(peaks))
peaks.grp <- split(peaks, group)
## find overlapping peaks
GSE66081 <- sapply(peaks.grp, findOverlapsOfPeaks, simplify = FALSE)
GSE66081.bk <- GSE66081
GSE66081 <- sapply(GSE66081, function(.ele)
    .ele$peaklist[[names(.ele$peaklist)[grepl("\\/\\/\\/", 
                                        names(.ele$peaklist))]]],
                   simplify=FALSE)

lengths(GSE66081)

##   JUN   TAZ TEAD4   YAP 
##  3226  1136   693   923

The IDR are calculated by the average coverages of each overlapping peak. The reads counts for peaks are done by summarizeOverlaps function in GenomicAlignments(Lawrence et al., 2013) package.

library(ChIPpeakAnno)
new.group.names <- 
    lapply(GSE66081.bk, 
           function(.ele) 
               sub(".q1_peaks.narrowPeak", ".bam", 
                   names(.ele$peaklist)[!grepl("\\/\\/\\/", 
                                               names(.ele$peaklist))]))
GSE66081 <- mapply(function(.peaks, .bamfiles) 
    IDRfilter(peaksA=.peaks[[1]], peaksB=.peaks[[2]],
              bamfileA=.bamfiles[1], bamfileB=.bamfiles[2],
              IDRcutoff=0.01),
    peaks.grp, new.group.names)
## The original peaks are filtered to decrease the memory usage.
peaks.keep <- sapply(GSE66081, function(.ele) 
    as.character(unlist(.ele$peakNames)))
peaks.keep <- lapply(peaks.keep, function(.ele) 
    gsub("^.*__", "", .ele))
lengths(peaks.keep)
peaks.keep <- unlist(peaks.keep)
peaks.s <- lapply(peaks, function(.ele) 
    .ele[names(.ele) %in% peaks.keep])
lengths(peaks.s)
peaks.unlist <- unlist(GRangesList(peaks.s), use.names = FALSE)

Export the filtered peak by export function in rtracklayer(Lawrence et al., 2009) package.

library(rtracklayer)
null <- mapply(function(.dat, .name) 
    export(.dat, gsub("^(.*?).q1.*$", "\\1.fil.bed", .name),
           format="BED"), 
    peaks.s, names(peaks.s))
## filter the peaks by qValue < 10^-10 to get similar number peaks 
## as filtered by idr to confirm the effect of IDR filter.
peaks.bk <- lapply(peaks.bk, function(.ele) .ele[.ele$qValue>10])
null <- mapply(function(.dat, .name) 
    export(.dat, gsub("^(.*?).q1.*$", "\\1.fil.q01.bed", .name),
           format="BED"), 
    peaks.bk, names(peaks.bk))

6 5 Quality control

After that the quality of the ChiP-seq experiment could be checked by ChIPQC package. Usally, before mapping, FastQC could be used for sequence quality accessment. And after mapping, SAMStat(Lassmann et al., 2011) could be used for mapping quality analysis, and strand cross-correlation could be sued for checking the experiment quality via csaw package. Here, we focus on the peak calling quality analysis.

## QC
library(csaw)
scc <- lapply(gsub("^(.*?).q1.*$", "\\1.rmdup.bam", names(peaks)),
              correlateReads, cross=TRUE,
              param=readParam(minq=30, 
                              restrict=paste0("chr", c(1:21, "X", "Y")),
                              dedup=TRUE))
names(scc) <- gsub(".q1.*$", "", names(peaks))
scc.param <- lapply(scc, function(.ele){
    readsLength <- 50
    cutline <- 2 * readsLength
    read_length <- which.max(.ele[1:cutline])
    fragment_length <- which.max(.ele[cutline:length(.ele)]) + cutline - 1
    baseline <- which.min(.ele[cutline:length(.ele)]) + cutline - 1
    #normalized.strand.coefficient
    NSC <- .ele[fragment_length] / .ele[baseline] 
    # relative strand correlation
    RSC <- (.ele[fragment_length] - .ele[baseline])/(.ele[read_length] - .ele[baseline]) 
    c(read_length=read_length-1,
         fragment_length=fragment_length-1,
         baseline=baseline-1,
         NSC=NSC, 
         RSC=RSC)
})

The normalized strand coefficient (NSC) and relative strand correlation (RSC) are strong metrics for assessing signal-to-noise ratios in a ChIP-seq experiment. High-quality ChIP-seq data sets should have NSC values >= 1.05 and RSC values >= 0.8(Landt et al., 2012).

do.call(rbind, scc.param)[, c("NSC", "RSC")]

##                 NSC       RSC
## JUN_rep1   3.410316 1.4603898
## JUN_rep2   2.280298 1.1741713
## TAZ_rep1   2.185869 0.8979216
## TAZ_rep2   2.038841 0.8035722
## TEAD4_rep1 2.409461 0.9944640
## TEAD4_rep2 1.889183 0.6928578
## YAP_rep1   2.295337 0.8845790
## YAP_rep2   2.049041 0.8366900

op <- par(mfrow=c(4, 2))
for(i in 1:length(scc)){
    plot(1:length(scc[[i]])-1, scc[[i]], 
         xlab="Delay (bp)", ylab="cross-correlation", 
         main=names(scc)[i], type="l", 
         ylim=c(scc[[i]][scc.param[[i]]["baseline"]]*.8, max(scc[[i]])*1.1))
    abline(v=scc.param[[i]]["fragment_length"], col="red", lty=3)
    abline(h=scc[[i]][scc.param[[i]][c("fragment_length", "read_length", "baseline")]+1], 
           col=c("blue", "green", "gray30"), lty=2)
}

Figure 2. Strand cross correlation against delay distance for the ChIP-seq datasets. The absolute and relative height of ‘phantom’ peak and ChIP peak are useful determinants of the success of a ChIP-seq experiment.

Now we have estimated fragment length and can generate track files for UCSC genome browser.

library(GenomicAlignments)
library(rtracklayer)
bwPath <- "bw"
dir.create(bwPath)
cvgs <- mapply(function(.ele, .name){
    gal <- readGAlignments(paste0(.name, ".rmdup.bam"))
    cvg <- coverage(gal, width = as.numeric(.ele["fragment_length"]))
    seqinfo(cvg) <- seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)
    export.bw(cvg, 
              con=file.path(bwPath, 
                            paste0(.name, ".rmdup.bigWig")))
    cvg
}, scc.param, names(scc.param))

par(op)
library(ChIPQC)
samples <- 
    data.frame(SampleID=gsub("^(.*?).q1.*$", "\\1", names(peaks)), 
               Factor=gsub("^(.*?)_.*$", "\\1", names(peaks)),
               Replicate=gsub("^.*?_rep(.*?).q1.*$", "\\1", 
                              names(peaks)),
               bamReads=gsub("^(.*?).q1.*$", "\\1.rmdup.bam", 
                             names(peaks)),
               Peaks=names(peaks),
               PeakCaller="narrow")
samples

##     SampleID Factor Replicate             bamReads
## 1   JUN_rep1    JUN         1   JUN_rep1.rmdup.bam
## 2   JUN_rep2    JUN         2   JUN_rep2.rmdup.bam
## 3   TAZ_rep1    TAZ         1   TAZ_rep1.rmdup.bam
## 4   TAZ_rep2    TAZ         2   TAZ_rep2.rmdup.bam
## 5 TEAD4_rep1  TEAD4         1 TEAD4_rep1.rmdup.bam
## 6 TEAD4_rep2  TEAD4         2 TEAD4_rep2.rmdup.bam
## 7   YAP_rep1    YAP         1   YAP_rep1.rmdup.bam
## 8   YAP_rep2    YAP         2   YAP_rep2.rmdup.bam
##                            Peaks PeakCaller
## 1   JUN_rep1.q1_peaks.narrowPeak     narrow
## 2   JUN_rep2.q1_peaks.narrowPeak     narrow
## 3   TAZ_rep1.q1_peaks.narrowPeak     narrow
## 4   TAZ_rep2.q1_peaks.narrowPeak     narrow
## 5 TEAD4_rep1.q1_peaks.narrowPeak     narrow
## 6 TEAD4_rep2.q1_peaks.narrowPeak     narrow
## 7   YAP_rep1.q1_peaks.narrowPeak     narrow
## 8   YAP_rep2.q1_peaks.narrowPeak     narrow

## before filter
exampleExp <- ChIPQC(experiment=samples, annotaiton="hg19")
ChIPQCreport(exampleExp, reportFolder="ChIPQC", facetBy="Factor")
samples.fil <- samples
samples.fil$Peaks <- gsub("^(.*?).q1.*$", "\\1.fil.bed", names(peaks))
samples.fil$PeakCaller="bed"
## after filter
exampleExp.fil <- ChIPQC(experiment=samples.fil, annotaiton="hg19")
ChIPQCreport(exampleExp.fil, 
             reportFolder="ChIPQC.fil", 
             facetBy="Factor")
samples.fil.q01 <- samples.fil
samples.fil.q01$Peaks <- 
    gsub("^(.*?).q1.*$", "\\1.fil.q01.bed", names(peaks))
## filtered by qValue from the MACS2 result
exampleExp.fil.q01 <- ChIPQC(experiment=samples.fil.q01, 
                             annotaiton="hg19")
ChIPQCreport(exampleExp.fil.q01, 
             reportFolder="ChIPQC.fil.q01", 
             facetBy="Factor")

Here we can see the effect of IDR. The average profile across within peak regions of duplicates, and the correlation between samples as heatmap and by principal component analysis show higher quality of IDR filtered peaks for replicate samples because that they are clustered together in the heatmap and spatially grouped within the PCA plot.

ChIPQC

Figure 3. Peak profile of duplicates. A, B, C) Plot of the average signal profile across peaks before filter, filtered by IDR and filtered by FDR. D, E, F) Plot of correlation between peaksets before filter, filtered by IDR and filtered by FDR. G, H, I) PCA of peaksets before filter, filtered by IDR and filtered by FDR.

7 6 Interpreting the results

7.1 Correlation test to determin if there is an association among different sets of peaks

Test the overlaps of all the peaks. From the testing result, we can confirm the widespread association between YAP, TAZ, TEAD and JUN. The vennDiagram shows the numbers of overlapping peak for these TFs.

ol <- findOverlapsOfPeaks(GSE66081, connectedPeaks="keepAll")

makeVennDiagram(ol)

## $p.value
##      JUN TAZ TEAD4 YAP pval
## [1,]   0   0     1   1    0
## [2,]   0   1     0   1    0
## [3,]   0   1     1   0    0
## [4,]   1   0     0   1    0
## [5,]   1   0     1   0    0
## [6,]   1   1     0   0    0
## 
## $vennCounts
##       JUN TAZ TEAD4 YAP Counts count.JUN count.TAZ count.TEAD4 count.YAP
##  [1,]   0   0     0   0      0         0         0           0         0
##  [2,]   0   0     0   1     26         0         0           0        26
##  [3,]   0   0     1   0     48         0         0          48         0
##  [4,]   0   0     1   1     17         0         0          17        17
##  [5,]   0   1     0   0     80         0        80           0         0
##  [6,]   0   1     0   1     29         0        29           0        29
##  [7,]   0   1     1   0     28         0        28          28         0
##  [8,]   0   1     1   1     63         0        63          63        63
##  [9,]   1   0     0   0   2172      2172         0           0         0
## [10,]   1   0     0   1     73        73         0           0        75
## [11,]   1   0     1   0     44        44         0          44         0
## [12,]   1   0     1   1     23        23         0          23        24
## [13,]   1   1     0   0    187       187       188           0         0
## [14,]   1   1     0   1    255       256       260           0       258
## [15,]   1   1     1   0     51        53        56          53         0
## [16,]   1   1     1   1    412       418       432         417       431
## attr(,"class")
## [1] "VennCounts"

Figure 4. VennDiagram of overlapping peaks for YAP/TAZ/TEAD/JUN.

We want to confirm that not only the peaks are overlapped but also their summits are close to each other.

peaks.summit <- lapply(peaks, function(.ele) {
    .ele <- shift(.ele, .ele$peak)
    width(.ele) <- 1
    .ele
})

TEAD.summit <- unlist(GRangesList(peaks.summit[group=="TEAD4"]))
TAZ.summit <- unlist(GRangesList(peaks.summit[group=="TAZ"]))
bof <- binOverFeature(TEAD.summit, annotationData=TAZ.summit, 
                      radius=300, nbins=300, FUN=length)

plot(as.numeric(rownames(bof)), bof, 
     ylab="Peak density", 
     xlab="Distance to the summit of TAZ peaks", 
     main="Position of TEAD4 peak summit", 
     type="l", xlim=c(-300, 300))

Figure 5. Position of TEAD4 peak summits relative to the summits of the overlapping YAP/TAZ peaks, in a 600bp window surrounding the summit of YAP/TAZ peaks.

7.2 Annotation the peaks by ChIPpeakAnno

When we get the peaks called by MACS2, we can analyse the distribution of peaks relative to genes annotated in the human genome. The distribution of peaks over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR could give you some clues how to annotate the peaks. The distribution can be summarized in the peak centric or nucleotide centric view using the function assignChromosomeRegion in ChIPpeakAnno package.

overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
seqlevelsStyle(overlaps) ## == UCSC
aCR<-assignChromosomeRegion(overlaps, nucleotideLevel=FALSE, 
                            precedence=c("Promoters", 
                                         "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                            TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)

par(mar=c(10.1, 4.1, 4.1, 2.1))
barplot(aCR$percentage, las=3)

Figure 6. Distribution of YAP/TAZ/TEAD peaks over promoter (upstream 1K from TSS), immediatedDownstream (downstream 1K from TES), 5’UTR, 3’UTR, exon, intron, and intergenic region.

The chromosome distribtuion of the overlapping peaks shows that most of the peaks are located in intergeic regions. Then we use annotedPeakInBatch or annoPeaks function in ChIPpeakAnno package to annotate the peaks by most close features in both sides of the peaks. To facilitate the creation and documentation of the annotation data, ChIPpeakAnno implement an annoGR class, which is an extension of GRanges class, to represent the annotation data. An annoGR object can be constructed from EnsDb, TxDb, or the user defined GRanges object by calling the annoGR function. The advantage of this class is that it contains the meta data such as the source and the timestamp (date) of the data source. Use ?annoGR for more information. Because most of the peaks are far from TSS, we set the bindingRegion a very big region (100K).

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- annoGR(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")

annoData

## GRanges object with 23056 ranges and 0 metadata columns:
##         seqnames                 ranges strand
##            <Rle>              <IRanges>  <Rle>
##       1    chr19 [ 58858172,  58874214]      -
##      10     chr8 [ 18248755,  18258723]      +
##     100    chr20 [ 43248163,  43280376]      -
##    1000    chr18 [ 25530930,  25757445]      -
##   10000     chr1 [243651535, 244006886]      -
##     ...      ...                    ...    ...
##    9991     chr9 [114979995, 115095944]      -
##    9992    chr21 [ 35736323,  35743440]      +
##    9993    chr22 [ 19023795,  19109967]      -
##    9994     chr6 [ 90539619,  90584155]      +
##    9997    chr22 [ 50961997,  50964905]      -
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

## annotate the peaks by nearest annotations.
YAP.TAZ.TEAD4.nearest <- 
    annotatePeakInBatch(overlaps, 
                        AnnotationData=annoData)
## annotate the peaks by both side in a given range.
YAP.TAZ.TEAD4.2side <- 
    annotatePeakInBatch(overlaps, 
                        AnnotationData=annoData, 
                        output="bindingRegion_bothSidesNSS", 
                        bindingRegion=c(-100000, 100000))

pie1(table(YAP.TAZ.TEAD4.2side$insideFeature))

Figure 7. Pie chart of the features of the overlapping peaks.

The annotations could be saved in XLS file by WriteXLS package to avoid the gene name errors that be introduced inadvertently when using Excel(Zeeberg et al., 2004).

## save annotations
YAP.TAZ.TEAD4.2side.m <- as.data.frame(YAP.TAZ.TEAD4.2side)
library(WriteXLS)
WriteXLS("YAP.TAZ.TEAD4.2side.m", 
         "YAP.TAZ.TEAD4.overlapping.peaks.anno.xls")

With the annotated peak data, you can call the function getEnrichedGO to obtain a list of enriched GO terms. For pathway analysis, you can call function getEnrichedPATH using reactome or KEGG database. Please note that the default setting of feature_id_type is “ensembl_gene_id”. If you are using TxDb as annotation data, please try to change it to “entrez_id”.

## enrichment analysis
library(org.Hs.eg.db)
over <- getEnrichedGO(YAP.TAZ.TEAD4.2side, 
                      orgAnn="org.Hs.eg.db", 
                      feature_id_type="entrez_id", 
                      maxP=.01, condense=TRUE)

head(over[["bp"]][, -c(3, 10)])

##        go.id                        go.term Ontology count.InDataset
## 1 GO:0002252        immune effector process       BP               8
## 2 GO:0006879  cellular iron ion homeostasis       BP               2
## 3 GO:0006882  cellular zinc ion homeostasis       BP               2
## 4 GO:0007033           vacuole organization       BP               4
## 5 GO:0010043           response to zinc ion       BP               2
## 6 GO:0018195 peptidyl-arginine modification       BP               2
##   count.InGenome      pvalue totaltermInDataset totaltermInGenome
## 1           1169 0.008433204               3535           1462775
## 2             48 0.006117350               3535           1462775
## 3             25 0.001687957               3535           1462775
## 4            161 0.000679114               3535           1462775
## 5             52 0.007145379               3535           1462775
## 6             25 0.001687957               3535           1462775

dim(over[["bp"]])

## [1] 15 10

library(reactome.db)
path <- getEnrichedPATH(YAP.TAZ.TEAD4.2side, 
                        orgAnn="org.Hs.eg.db", 
                        pathAnn="reactome.db", 
                        feature_id_type="entrez_id", 
                        maxP=.05)

head(path)

##        path.id EntrezID count.InDataset count.InGenome     pvalue
## 1 R-HSA-165054     1104               1             33 0.04744971
## 2 R-HSA-166663     1401               1             23 0.03331229
## 3 R-HSA-166786     1401               1             12 0.01752021
## 4 R-HSA-168249    10623               5           1291 0.04294018
## 5 R-HSA-168249   388698               5           1291 0.04294018
## 6 R-HSA-168249     8915               5           1291 0.04294018
##   totaltermInDataset totaltermInGenome PATH
## 1                159            108031   NA
## 2                159            108031   NA
## 3                159            108031   NA
## 4                159            108031   NA
## 5                159            108031   NA
## 6                159            108031   NA

Bidirectional promoters are the DNA regions located between the 5’ ends of two adjacent genes coded on opposite strands. The two adjacent genes are transcribed to the opposite directions, and often co-regulated by this shared promoter region(Robertson et al., 2007). Here is an example to find peaks with bi-directional promoters and output the percentage of the peaks near bi-directional promoters.

bdp <- peaksNearBDP(overlaps, annoData, MaxDistance=100000)

bdp

## $peaksWithBDP
## GRangesList object of length 9:
## $2 
## GRanges object with 2 ranges and 9 metadata columns:
##      seqnames               ranges strand |
##         <Rle>            <IRanges>  <Rle> |
##   X2     chr1 [16249621, 16250159]      * |
##   X2     chr1 [16249621, 16250159]      * |
##                                     peakNames   bdp_idx        peak
##                               <CharacterList> <integer> <character>
##   X2 TEAD4__olp0089,YAP__olp0120,TAZ__olp0155         2          X2
##   X2 TEAD4__olp0089,YAP__olp0120,TAZ__olp0155         2          X2
##          feature       feature.ranges feature.strand  distance
##      <character>            <IRanges>          <Rle> <integer>
##   X2      149563 [16330731, 16333184]              +     80571
##   X2      729614 [16160710, 16174642]              -     74978
##      insideFeature distanceToSite
##           <factor>      <integer>
##   X2      upstream          80571
##   X2      upstream          74978
## 
## $4 
## GRanges object with 2 ranges and 9 metadata columns:
##      seqnames               ranges strand |
##   X4     chr1 [17516131, 17517419]      * |
##   X4     chr1 [17516131, 17517419]      * |
##                                     peakNames bdp_idx peak feature
##   X4 TEAD4__olp0109,TAZ__olp0187,YAP__olp0140       4   X4   11240
##   X4 TEAD4__olp0109,TAZ__olp0187,YAP__olp0140       4   X4   29943
##            feature.ranges feature.strand distance insideFeature
##   X4 [17393256, 17445948]              -    70182      upstream
##   X4 [17531621, 17572501]              +    14201      upstream
##      distanceToSite
##   X4          70182
##   X4          14201
## 
## $22 
## GRanges object with 2 ranges and 9 metadata columns:
##       seqnames               ranges strand |
##   X22     chr1 [85666766, 85667899]      * |
##   X22     chr1 [85666766, 85667899]      * |
##                                      peakNames bdp_idx peak feature
##   X22 YAP__olp0713,TEAD4__olp0529,TAZ__olp0876      22  X22  646626
##   X22 YAP__olp0713,TEAD4__olp0529,TAZ__olp0876      22  X22   84144
##             feature.ranges feature.strand distance insideFeature
##   X22 [85742041, 85865646]              +    74141      upstream
##   X22 [85623356, 85666728]              -       37      upstream
##       distanceToSite
##   X22          74141
##   X22             37
## 
## ...
## <6 more elements>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $percentPeaksWithBDP
## [1] 0.1428571
## 
## $n.peaks
## [1] 63
## 
## $n.peaksWithBDP
## [1] 9

The peaks could be visualized by trackViewer package.

library(trackViewer)
gene <- "SYDE2"
eID <- mget(gene, org.Hs.egSYMBOL2EG)
gr <- as(annoData[eID[[1]]], "GRanges")
gr.promoter <- promoters(gr, upstream=5000, downstream=2000)
seqlevels(gr.promoter) <- "chr1"
seqinfo(gr.promoter) <- seqinfo(gr.promoter)["chr1"]
bams.rep1 <- sub(".bam", ".rmdup.bam", 
                 bam.files[grepl("rep1", bam.files)])
syde <- lapply(bams.rep1, importBam, ranges=gr.promoter)
names(syde) <- gsub("_rep1.rmdup.bam", "", bams.rep1)
trs <- geneModelFromTxdb(TxDb.Hsapiens.UCSC.hg19.knownGene, 
                         orgDb = org.Hs.eg.db, gr = gr.promoter)
trs.names <- sapply(trs, function(.ele) .ele@name)
trs <- trs[match(gene, trs.names)]
names(trs) <- gene
optSty <- optimizeStyle(trackList(syde, trs, heightDist=c(.9, .1)), 
                        theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
for(i in 1:length(trackList)) 
    setTrackStyleParam(trackList[[i]], "ylabgp", list(cex=.8))

vp <- viewTracks(trackList, gr=gr.promoter, viewerStyle=viewerStyle)

Figure 8. tracks of YAP/TAZ/TEAD-bound CDC6 upstream co-ocupied by JUN.

browseTracks(trackList, gr=gr.promoter)

7.3 Discover the binding pattern

Owing to most of the peaks are in the intergenic regions, we questioned whether most YAP/TAZ/TEAD common peaks are located in enhancers. Enhancers can be distinguished from promoters by their epigenetic features, that is, relative enrichment of histone H3 monomethylation (H3K4me1) on Lys4 at enhancers, and trimethylation (H3K4me3) at promoters. We first determine the distribution of the distance between the peaks and known genes. And then check the correlation of peaks and enhancers.

dist2TSS <- 
    lapply(list(YAP=GSE66081[["YAP"]],
                TAZ=GSE66081[["TAZ"]],
                TEAD4=GSE66081[["TEAD4"]],
                YAP.TAZ.TEAD4=ol.b$peaklist[["YAP.TAZ///TEAD"]]),
           function(.ele) {
               .ele <- annotatePeakInBatch(.ele, 
                                           AnnotationData=annoData,
                                           output="nearestLocation")
               .ele$shortestDistance})

dist2TSS.cut <- lapply(dist2TSS, cut, breaks=c(0, 1e3, 1e4, 1e5, 1e10))
dist2TSS.table <- sapply(dist2TSS.cut, table)
dist2TSS.percentage <- apply(dist2TSS.table, 2, 
                             function(.ele) .ele/sum(.ele))
library(ggplot2)
library(reshape2)
dist2TSS.percentage <- melt(dist2TSS.percentage)
dist2TSS.percentage$value <- dist2TSS.percentage$value * 100

ggplot(dist2TSS.percentage, aes(x=Var2, y=value, fill=Var1)) +
    geom_bar(stat="identity") +
    xlab("") + ylab("%") + labs(title="Distance to TSS") + theme_bw() + 
    scale_fill_manual(name="",
                      values=c("#FF0000FF", "#FF0000BB", 
                               "#FF000077", "#FF000033"),
                      labels=c("<1kb", "1-10kb", "10-100kb", ">100kb"))

Figure 9. Absolute distance of YAP, TAZ, TEAD4 or overlapping YAP/TAZ/TEAD peaks to nearest TSS.

Because over 75% of the peaks are far from known TSS (more than 10Kb), we want to confirm the peaks are bind to active enhancer region. The ChIP-seq data of epigenetic marks of same cell line are availble in GSE49651. We use GEOquery(Davis and Meltzer, 2007) package to download data from GEO. We can also download the data by SRAdb package and call peaks from raw reads. Here, we use the peaks saved in the GSE49651.

library(GEOquery)
getGEOSuppFiles("GSE49651")
epipeaks.files <- dir("GSE49651", full.names=TRUE)
library(ChIPpeakAnno)
epipeaks <- lapply(epipeaks.files[3:5], 
                   function(.ele) toGRanges(gzfile(.ele), format="BED"))
names(epipeaks) <- 
    gsub("GSE49651_MDAMB231.(.*?).hg19.*._peaks.txt.gz", "\\1", 
         basename(epipeaks.files)[3:5])

The presence of H3K4me1 and H3K4me3 peaks, their genomic locations and their overlap can be used as the criteria to define promoters and enhancers. H3K4me3 peaks not overlapping with H3K4me1 peaks and cloase to a TSS (5kb) are defined as promoters, as NA otherwise; H3K4me1 peaks not overlapping with H3K4me3 peaks are defined as enhancers. Promoters or enhancers are defined as active if overlapping with H3K27ac peaks.

promoter.UCSC <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene, 5e3, 5e3)
promoter.UCSC <- unique(promoter.UCSC)
attach(epipeaks)
ol.promoter <- findOverlapsOfPeaks(H3K4me1, H3K4me3, H3K27Ac, 
                                   promoter.UCSC, 
                                   ignore.strand = FALSE)
promoter <- c(ol.promoter$peaklist[["H3K4me3///H3K27Ac///promoter.UCSC"]],
              ol.promoter$peaklist[["H3K4me1///H3K4me3///H3K27Ac///promoter.UCSC"]])
enhancer.active <- ol.promoter$peaklist[["H3K4me1///H3K27Ac"]]
enhancer.inactive <- ol.promoter$peaklist[["H3K4me1"]]

mcols(enhancer.active) <- DataFrame(type="enhancer.active")
mcols(enhancer.inactive) <- DataFrame(type="enhancer.inactive")
mcols(promoter) <- DataFrame(type="promoter")
types <- c(promoter, enhancer.active, enhancer.inactive)

YAP.TAZ.TEAD <- ol$peaklist[["TAZ///TEAD4///YAP"]]
seqlevelsStyle(YAP.TAZ.TEAD) <- "UCSC"
YAP.TAZ.TEAD <- YAP.TAZ.TEAD[seqnames(YAP.TAZ.TEAD) %in% 
                                 seqlevels(types)]
YAP.TAZ.TEAD4.type <- findOverlaps(YAP.TAZ.TEAD, types)
tbl <- table(types[subjectHits(YAP.TAZ.TEAD4.type)]$type)
tbl["not.classified"] <- length(YAP.TAZ.TEAD) - 
    length(unique(queryHits(YAP.TAZ.TEAD4.type)))

pie1(tbl)

Figure 10. Fraction of YAP/TAZ/TEAD peaks associated with each category.

The binding pattern could be visualized and compared by heatmap and distribution curve from the binding ranges of overlapping peaks of target TFs. For big bedgraph files, bedtools(Quinlan and Hall, 2010) are used to decrease the file size before import into R.

##heatmap
YAP.TAZ.TEAD.assigned <- YAP.TAZ.TEAD[queryHits(YAP.TAZ.TEAD4.type)]
YAP.TAZ.TEAD.assigned$type <- types[subjectHits(YAP.TAZ.TEAD4.type)]$type
YAP.TAZ.TEAD.assigned <- c(YAP.TAZ.TEAD.assigned[YAP.TAZ.TEAD.assigned$type=="enhancer.active"], YAP.TAZ.TEAD.assigned[YAP.TAZ.TEAD.assigned$type=="promoter"])
YAP.TAZ.TEAD.assigned <- unique(YAP.TAZ.TEAD.assigned)

library(rtracklayer)
YAP.TAZ.TEAD.assigned.center <- 
    reCenterPeaks(YAP.TAZ.TEAD.assigned, width=1)
YAP.TAZ.TEAD.assigned.reCenter <- 
    reCenterPeaks(YAP.TAZ.TEAD.assigned, width=2000)
untar("GSE49651/GSE49651_RAW.tar")

bams <- c("YAP", "TAZ", "TEAD4", "JUN")
library(BSgenome.Hsapiens.UCSC.hg19)
len <- seqlengths(Hsapiens)
write.table(len, file="hg19.genome.txt", 
            quote=FALSE, col.names=FALSE, 
            sep="\t")
sapply(bams, function(.ele){
    system(paste0("genomeCoverageBed -bga -split -ibam ", 
                  .ele, "_rep1.rmdup.bam -g hg19.genome.txt > ", 
                  .ele, "_rep1.bedGraph"))
})

sapply(c("GSM1204470_MDAMB231.H3K4me1_1.hg19.tags.bedGraph.gz",
         "GSM1204472_MDAMB231.H3K4me3_1.hg19.tags.bedGraph.gz",
         "GSM1204474_MDAMB231.H3K27Ac_1.hg19.tags.bedGraph.gz"), gunzip)
files <- c(YAP="YAP_rep1.bedGraph",
           TAZ="TAZ_rep1.bedGraph",
           TEAD4="TEAD4_rep1.bedGraph",
           JUN="JUN_rep1.bedGraph",
           H3K4me1="GSM1204470_MDAMB231.H3K4me1_1.hg19.tags.bedGraph",
           H3K4me3="GSM1204472_MDAMB231.H3K4me3_1.hg19.tags.bedGraph",
           H3K27Ac="GSM1204474_MDAMB231.H3K27Ac_1.hg19.tags.bedGraph")
export(YAP.TAZ.TEAD.assigned.reCenter, "tmp.bed", format="BED")
sapply(files, function(.ele) 
    system(paste("intersectBed -a", .ele, "-b tmp.bed >", 
                 gsub("bedGraph", "sub.bedGraph", .ele))))
unlink("tmp.bed")
library(trackViewer)
cvglists <- sapply(gsub("bedGraph", "sub.bedGraph", files), 
                   importData, format="bedGraph")
names(cvglists) <- names(files)
sig <- featureAlignedSignal(cvglists, YAP.TAZ.TEAD.assigned.center, 
                            upstream=1000, downstream=1000)

names(sig)

## [1] "YAP"     "TAZ"     "TEAD4"   "JUN"     "H3K4me1" "H3K4me3" "H3K27Ac"

heatmap <- featureAlignedHeatmap(sig, YAP.TAZ.TEAD.assigned.center, 
                                 upstream=1000, downstream=1000,
                                 annoMcols=c("type"), 
                                 margin=c(.1, .01, .15, .25))

Figure 11. Heatmap representing YAP/TAZ/TEAD/JUN binding sites located on promoters and enhancers. the peaks are ranked from the stongest to weakest signal in YAP ChIP, in a window of 1kb centred on the summit of YAP/TAZ/TEAD peaks. H3K27ac, H3K4me1 and H3K4me3 signal in the corresponding genomic regions is shown on the right.

featureAlignedDistribution(sig, YAP.TAZ.TEAD.assigned.center, 
                           upstream=1000, downstream=1000,
                           type="l")

Figure 12. Bimodal distribution of H3K4me1 signal around the summit of YAP/TAZ, TEAD4 and JUN peaks.

7.4 Output a summary of consensus in the peaks

We first get the sequences of the peaks and then summarize the enriched oligos.

library(BSgenome.Hsapiens.UCSC.hg19)
YAP.TAZ.TEAD4.uniq <- unique(YAP.TAZ.TEAD4)
YAP.TAZ.TEAD4.uniq <- YAP.TAZ.TEAD4.uniq[!is.na(YAP.TAZ.TEAD4.uniq$feature)]
strand(YAP.TAZ.TEAD4.uniq) <- as.character(YAP.TAZ.TEAD4.uniq$feature_strand)
seq <- getAllPeakSequence(YAP.TAZ.TEAD4.uniq, 
                          upstream=0, downstream=0, 
                          genome=Hsapiens)
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3, 
                   quickMotif=TRUE, freqs=freqs, revcomp=TRUE) 
zscore <- sort(os$zscore)

h <- hist(zscore, breaks=100)
text(zscore[length(zscore)], max(h$counts)/10, 
     labels=names(zscore[length(zscore)]), adj=1)

Figure 13. The Histogram of zscores of hexamer for YAP/TAZ/TEAD peaks.

library(motifStack)
pfms <- mapply(function(.ele, id)
    new("pfm", mat=.ele, name=paste("SAMPLE motif", id)), 
    os$motifs, 1:length(os$motifs))

motifStack(pfms)

Figure 14. The motif logo of enriched oligos.

ap1 <- GSE66081[["JUN"]]
seq.ap1 <- getAllPeakSequence(ap1, upstream=0, downstream=0, genome=Hsapiens)
os.ap1 <- oligoSummary(seq.ap1, oligoLength=7, MarkovOrder=3, 
                       quickMotif=FALSE, freqs=freqs, revcomp=TRUE)
zscore.ap1 <- sort(os.ap1$zscore)

h.ap1 <- hist(zscore.ap1, breaks=100)
text(zscore.ap1[length(zscore.ap1)], max(h.ap1$counts)/10, 
     labels=names(zscore.ap1[length(zscore.ap1)]), adj=1)

Figure 15. The Histogram of zscores of hexamer for JUN peaks.

By function oligoSummary, we get exactly same motifs as it shown in the model of complex formed by YAP/TAZ/TEAD and AP-1 reported by Francesca(Zanconato et al., 2015).

The oligoSummary is a quick tool to calculate the enriched k-mers in the peaks. To get accurate motifs, multiple softwares could be used, such as the MEME Suite(Bailey et al., 1994), Consensus(Hertz et al., 1990), rGADEM, Homer, and et. al.

7.5 Build Regulatory Network from ChIP-seq and Expression Data

Intersecting genes with promoter regions bound by YAP/TAZ from the ChIP experiment and genes differential expressed in knockdown of YAP/TAZ allow us to identify genes directly/indirectly regulated by YAP/TAZ by GeneNetworkBuilder package.

Gene expression data are downloaded from GSE59232 by GEOquery package.

library(GEOquery)
library(limma)
gset <- getGEO("GSE59232", GSEMatrix =TRUE, AnnotGPL=TRUE)[[1]]
pD <- pData(gset)[, c("title", "geo_accession")]
gset <- gset[, grepl("MDA-MB-231 cells", pD$title, fixed = TRUE)]
pD <- pData(gset)[, c("title", "geo_accession")]
ex <- exprs(gset) ## log2 transformed expression profile
fl <- do.call(rbind, 
              strsplit(sub("MDA-MB-231 cells ", 
                           "", pD$title), ", "))
sml <- sub("#.$", "", fl[, 1])
design.table <- data.frame(group=sml, 
                           batch=fl[, 1],
                           replicate=fl[, 2])
design <- model.matrix(~ -1 + group + batch + replicate, 
                       data = design.table)
colnames(design) <- sub("group", "", 
                        make.names(colnames(design)))
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(contrasts = "siYAP.TAZ-siControl", 
                             levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
tT <- topTable(fit2, adjust="fdr", number = nrow(fit2))


library(simpIntLists)
interactionmap <- findInteractionList("human", "Official")
interactionmap <- lapply(interactionmap, function(.ele) 
    cbind(from=as.character(.ele$name), 
          to=as.character(.ele$interactors)))
interactionmap <- do.call(rbind, interactionmap)
library(GeneNetworkBuilder)
rootgene <- "TAZ"
TFbindingTable <- 
    cbind(from=rootgene,
          to=unique(YAP.TAZ.TEAD4.nearest.anno$symbol))
sifNetwork <-
    buildNetwork(TFbindingTable=TFbindingTable, 
                 interactionmap=interactionmap, level=2)
tT$symbols <- tT$Gene.symbol
tT <- tT[, c("ID", "logFC", "P.Value", "symbols")]
expressionData <- tT[!duplicated(tT[, "symbols"]), ]
cifNetwork <-
    filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork, 
                  exprsData=expressionData, mergeBy="symbols",
                  miRNAlist=character(0), 
                  tolerance=1, cutoffPVal=0.001, 
                  cutoffLFC=1.5)
## polish network
gR<-polishNetwork(cifNetwork)

## plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
    if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
    if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
        stop("layouttype must be dot, neato, twopi, circo or fdp")
    }
    g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
    nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
    nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
    renderGraph(g1)
}
plotNetwork(gR)

browseNetwork(gR)

8 Summary

This workflow describes the steps of comparing two or more TFs with duplicates, from data downloading, to generating the motif logos. All steps are performed within the R environment and mostly use functions from Bioconductor packages. In particular, the core of the downstream ananlysis is based on ChIPpeakAnno package. This workflow get same results as the original report.

9 Software availability

This workflow depends on various packages from version 3.3 of the Bioconductor project, running on R version 3.3.0 or higher. Version numbers for all packages used are shown below.

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] BiocStyle_2.5.8                        
##  [2] Rgraphviz_2.21.0                       
##  [3] plotly_4.7.1                           
##  [4] simpIntLists_1.13.0                    
##  [5] limma_3.33.7                           
##  [6] png_0.1-7                              
##  [7] trackViewer_1.13.8                     
##  [8] knitr_1.16                             
##  [9] motifStack_1.21.1                      
## [10] ade4_1.7-6                             
## [11] MotIV_1.33.0                           
## [12] grImport_0.9-0                         
## [13] XML_3.98-1.9                           
## [14] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
## [15] BSgenome_1.45.1                        
## [16] reactome.db_1.59.1                     
## [17] org.Hs.eg.db_3.4.1                     
## [18] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [19] GenomicFeatures_1.29.8                 
## [20] AnnotationDbi_1.39.2                   
## [21] csaw_1.11.1                            
## [22] BiocParallel_1.11.4                    
## [23] ChIPQC_1.13.1                          
## [24] DiffBind_2.5.8                         
## [25] ggplot2_2.2.1                          
## [26] idr_1.2                                
## [27] GenomicAlignments_1.13.4               
## [28] SummarizedExperiment_1.7.5             
## [29] DelayedArray_0.3.17                    
## [30] matrixStats_0.52.2                     
## [31] Biobase_2.37.2                         
## [32] rtracklayer_1.37.3                     
## [33] GeneNetworkBuilder_1.19.1              
## [34] Rcpp_0.12.12                           
## [35] ChIPpeakAnno_3.11.4                    
## [36] VennDiagram_1.6.17                     
## [37] futile.logger_1.4.3                    
## [38] Rsamtools_1.29.0                       
## [39] Biostrings_2.45.3                      
## [40] XVector_0.17.0                         
## [41] GenomicRanges_1.29.12                  
## [42] GenomeInfoDb_1.13.4                    
## [43] IRanges_2.11.12                        
## [44] S4Vectors_0.15.5                       
## [45] Rsubread_1.27.4                        
## [46] reshape2_1.4.2                         
## [47] SRAdb_1.35.0                           
## [48] RCurl_1.95-4.8                         
## [49] bitops_1.0-6                           
## [50] graph_1.55.0                           
## [51] BiocGenerics_0.23.0                    
## [52] RSQLite_2.0                            
## 
## loaded via a namespace (and not attached):
##   [1] htmlwidgets_0.9                          
##   [2] BatchJobs_1.6                            
##   [3] munsell_0.4.3                            
##   [4] systemPipeR_1.11.0                       
##   [5] colorspace_1.3-2                         
##   [6] BiocInstaller_1.27.2                     
##   [7] Category_2.43.1                          
##   [8] labeling_0.3                             
##   [9] GenomeInfoDbData_0.99.1                  
##  [10] hwriter_1.3.2                            
##  [11] bit64_0.9-7                              
##  [12] pheatmap_1.0.8                           
##  [13] fail_1.3                                 
##  [14] rprojroot_1.2                            
##  [15] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2  
##  [16] lambda.r_1.1.9                           
##  [17] biovizBase_1.25.1                        
##  [18] regioneR_1.9.1                           
##  [19] R6_2.2.2                                 
##  [20] locfit_1.5-9.1                           
##  [21] AnnotationFilter_1.1.3                   
##  [22] assertthat_0.2.0                         
##  [23] scales_0.4.1                             
##  [24] nnet_7.3-12                              
##  [25] gtable_0.2.0                             
##  [26] ensembldb_2.1.10                         
##  [27] seqLogo_1.43.0                           
##  [28] rlang_0.1.1                              
##  [29] genefilter_1.59.0                        
##  [30] BBmisc_1.11                              
##  [31] splines_3.4.1                            
##  [32] lazyeval_0.2.0                           
##  [33] acepack_1.4.1                            
##  [34] GEOquery_2.43.0                          
##  [35] dichromat_2.0-0                          
##  [36] brew_1.0-6                               
##  [37] checkmate_1.8.3                          
##  [38] yaml_2.1.14                              
##  [39] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
##  [40] backports_1.1.0                          
##  [41] httpuv_1.3.5                             
##  [42] Hmisc_4.0-3                              
##  [43] RBGL_1.53.0                              
##  [44] tools_3.4.1                              
##  [45] gplots_3.0.1                             
##  [46] RColorBrewer_1.1-2                       
##  [47] plyr_1.8.4                               
##  [48] base64enc_0.1-3                          
##  [49] progress_1.1.2                           
##  [50] zlibbioc_1.23.0                          
##  [51] purrr_0.2.2.2                            
##  [52] prettyunits_1.0.2                        
##  [53] rpart_4.1-11                             
##  [54] pbapply_1.3-3                            
##  [55] chipseq_1.27.0                           
##  [56] ggrepel_0.6.5                            
##  [57] cluster_2.0.6                            
##  [58] magrittr_1.5                             
##  [59] data.table_1.10.4                        
##  [60] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2  
##  [61] futile.options_1.0.0                     
##  [62] amap_0.8-14                              
##  [63] ProtGenerics_1.9.0                       
##  [64] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  
##  [65] mime_0.5                                 
##  [66] evaluate_0.10.1                          
##  [67] xtable_1.8-2                             
##  [68] gridExtra_2.2.1                          
##  [69] compiler_3.4.1                           
##  [70] biomaRt_2.33.3                           
##  [71] tibble_1.3.3                             
##  [72] KernSmooth_2.23-15                       
##  [73] htmltools_0.3.6                          
##  [74] GOstats_2.43.0                           
##  [75] Formula_1.2-2                            
##  [76] tidyr_0.6.3                              
##  [77] sendmailR_1.2-1                          
##  [78] DBI_0.7                                  
##  [79] MASS_7.3-47                              
##  [80] rGADEM_2.25.0                            
##  [81] ShortRead_1.35.1                         
##  [82] Matrix_1.2-10                            
##  [83] gdata_2.18.0                             
##  [84] Gviz_1.21.1                              
##  [85] bindr_0.1                                
##  [86] pkgconfig_2.0.1                          
##  [87] revealjs_0.9                             
##  [88] foreign_0.8-69                           
##  [89] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2     
##  [90] annotate_1.55.0                          
##  [91] multtest_2.33.0                          
##  [92] AnnotationForge_1.19.4                   
##  [93] stringr_1.2.0                            
##  [94] VariantAnnotation_1.23.6                 
##  [95] digest_0.6.12                            
##  [96] rmarkdown_1.6                            
##  [97] htmlTable_1.9                            
##  [98] edgeR_3.19.3                             
##  [99] GSEABase_1.39.0                          
## [100] curl_2.8.1                               
## [101] shiny_1.0.3                              
## [102] gtools_3.5.0                             
## [103] rjson_0.2.15                             
## [104] jsonlite_1.5                             
## [105] bindrcpp_0.2                             
## [106] seqinr_3.3-6                             
## [107] viridisLite_0.2.0                        
## [108] lattice_0.20-35                          
## [109] Nozzle.R1_1.1-1                          
## [110] Rhtslib_1.9.1                            
## [111] httr_1.2.1                               
## [112] survival_2.41-3                          
## [113] GO.db_3.4.1                              
## [114] interactiveDisplayBase_1.15.0            
## [115] glue_1.1.1                               
## [116] bit_1.1-12                               
## [117] stringi_1.1.5                            
## [118] blob_1.1.0                               
## [119] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 
## [120] AnnotationHub_2.9.5                      
## [121] latticeExtra_0.6-28                      
## [122] caTools_1.17.1                           
## [123] memoise_1.1.0                            
## [124] dplyr_0.7.2

The command-line tools used in the workflow including SRA Toolkit (version 2.3.4), MACS2 (version 2.1.0) and Bedtools (version 2.25.0) must be installed on the system.

10 Author contributions

J.O. developed and tested the workflow. L.J.Z. provided direction on the design of the workflow. J.O., J.Y. and L.J.Z. wrote the article.

11 ACKNOWLEDGEMENTS

We would like to thank the support from the Department of Molecular, Cell and Cancer Biology (MCCB) at UMass Medical School (UMMS). We thank the early adopters who provided great ideas and feedbacks to enhance the features of the software.

REFERENCES

Bailey, T.L., Elkan, C., and others (1994). Fitting a mixture model by expectation maximization to discover motifs in bipolymers.

Boyle, A.P., Guinney, J., Crawford, G.E., and Furey, T.S. (2008). F-seq: A feature density estimator for high-throughput sequence tags. Bioinformatics 24, 2537–2538.

Chung, D., Park, D., Myers, K., Grass, J., Kiley, P., Landick, R., and Keleş, S. (2013). DPeak: High resolution identification of transcription factor binding sites from pet and set chip-seq data. PLoS Comput Biol 9, e1003246.

Davis, S., and Meltzer, P.S. (2007). GEOquery: A bridge between the gene expression omnibus (geo) and bioconductor. Bioinformatics 23, 1846–1847.

Diaz, A., Park, K., Lim, D.A., Song, J.S., and others (2012). Normalization, bias correction, and peak calling for chip-seq. Stat Appl Genet Mol Biol 11, 9.

Fejes, A.P., Robertson, G., Bilenky, M., Varhol, R., Bainbridge, M., and Jones, S.J. (2008). FindPeaks 3.1: A tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 24, 1729–1730.

Feng, X., Grossman, R., and Stein, L. (2011). PeakRanger: A cloud-enabled peak caller for chip-seq data. BMC Bioinformatics 12, 139.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5, R80.

Guo, Y., Mahony, S., and Gifford, D.K. (2012). High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol 8, e1002638–e1002638.

Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y.C., Laslo, P., Cheng, J.X., Murre, C., Singh, H., and Glass, C.K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. Molecular Cell 38, 576–589.

Hertz, G.Z., Hartzell, G.W., and Stormo, G.D. (1990). Identification of consensus patterns in unaligned dna sequences known to be functionally related. Computer Applications in the Biosciences: CABIOS 6, 81–92.

Ji, H., Jiang, H., Ma, W., Johnson, D.S., Myers, R.M., and Wong, W.H. (2008). An integrated software system for analyzing chip-chip and chip-seq data. Nature Biotechnology 26, 1293–1300.

John, S., Sabo, P.J., Thurman, R.E., Sung, M.-H., Biddie, S.C., Johnson, T.A., Hager, G.L., and Stamatoyannopoulos, J.A. (2011). Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nature Genetics 43, 264–268.

Kharchenko, P.V., Tolstorukov, M.Y., and Park, P.J. (2008). Design and analysis of chip-seq experiments for dna-binding proteins. Nature Biotechnology 26, 1351–1359.

Kumar, V., Muratani, M., Rayan, N.A., Kraus, P., Lufkin, T., Ng, H.H., and Prabhakar, S. (2013). Uniform, optimal signal processing of mapped deep-sequencing data. Nature Biotechnology 31, 615–622.

Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E., Bickel, P., Brown, J.B., Cayting, P., et al. (2012). ChIP-seq guidelines and practices of the encode and modENCODE consortia. Genome Research 22, 1813–1831.

Lassmann, T., Hayashizaki, Y., and Daub, C.O. (2011). SAMStat: Monitoring biases in next generation sequencing data. Bioinformatics 27, 130–131.

Lawrence, M., Gentleman, R., and Carey, V. (2009). Rtracklayer: An r package for interfacing with genome browsers. Bioinformatics 25, 1841–1842.

Lawrence, M., Huber, W., Pages, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M.T., and Carey, V.J. (2013). Software for computing and annotating genomic ranges. PLoS Comput Biol 9, e1003118.

Leinonen, R., Sugawara, H., and Shumway, M. (2010). The sequence read archive. Nucleic Acids Research gkq1019.

Li, Q., Brown, J.B., Huang, H., and Bickel, P.J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics 1752–1779.

Liao, Y., Smyth, G.K., and Shi, W. (2013). The subread aligner: Fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research 41, e108–e108.

Lun, A.T., and Smyth, G.K. (2014). De novo detection of differentially bound regions for chip-seq data using peaks and windows: Controlling error rates correctly. Nucleic Acids Research 42, e95–e95.

Lund, E., Oldenburg, A.R., and Collas, P. (2014). Enriched domain detector: A program for detection of wide genomic enrichment domains robust against local variations. Nucleic Acids Research 42, e92–e92.

Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Methods 5, 621–628.

Muiño, J.M., Kaufmann, K., Van Ham, R., Angenent, G.C., Krajewski, P., and others (2011). ChIP-seq analysis in r (csar): An r package for the statistical detection of protein-bound genomic regions. Plant Methods 7, 11.

Narlikar, L., and Jothi, R. (2012). ChIP-seq data analysis: Identification of protein–DNA binding sites with sissrs peak-finder. In Next Generation Microarray Bioinformatics, (Springer), pp. 305–322.

Park, P.J. (2009). ChIP–seq: Advantages and challenges of a maturing technology. Nature Reviews Genetics 10, 669–680.

Qin, Z.S., Yu, J., Shen, J., Maher, C.A., Hu, M., Kalyana-Sundaram, S., Yu, J., and Chinnaiyan, A.M. (2010). HPeak: An hmm-based algorithm for defining read-enriched regions in chip-seq data. BMC Bioinformatics 11, 369.

Quinlan, A.R., and Hall, I.M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842.

Ramírez, F., Ryan, D.P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A.S., Heyne, S., Dündar, F., and Manke, T. (2016). DeepTools2: A next generation web server for deep-sequencing data analysis. Nucleic Acids Research gkw257.

Rashid, N.U., Giresi, P.G., Ibrahim, J.G., Sun, W., and Lieb, J.D. (2011). ZINBA integrates local covariates with dna-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome Biol 12, R67.

Rhie, S.K., Hazelett, D.J., Coetzee, S.G., Yan, C., Noushmehr, H., and Coetzee, G.A. (2014). Nucleosome positioning and histone modifications define relationships between regulatory elements and nearby gene expression in breast epithelial cells. BMC Genomics 15, 331.

Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., Bernier, B., Varhol, R., Delaney, A., et al. (2007). Genome-wide profiles of stat1 dna association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods 4, 651–657.

Ross-Innes, C.S., Stark, R., Teschendorff, A.E., Holmes, K.A., Ali, H.R., Dunning, M.J., Brown, G.D., Gojis, O., Ellis, I.O., Green, A.R., et al. (2012). Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481, 389–393.

Rozowsky, J., Euskirchen, G., Auerbach, R.K., Zhang, Z.D., Gibson, T., Bjornson, R., Carriero, N., Snyder, M., and Gerstein, M.B. (2009). PeakSeq enables systematic scoring of chip-seq experiments relative to controls. Nature Biotechnology 27, 66–75.

Song, Q., and Smith, A.D. (2011). Identifying dispersed epigenomic domains from chip-seq data. Bioinformatics 27, 870–871.

Spyrou, C., Stark, R., Lynch, A.G., and Tavaré, S. (2009). BayesPeak: Bayesian analysis of chip-seq data. BMC Bioinformatics 10, 299.

Valouev, A., Johnson, D.S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers, R.M., and Sidow, A. (2008). Genome-wide analysis of transcription factor binding sites based on chip-seq data. Nature Methods 5, 829–834.

Wang, J., Lunyak, V.V., and Jordan, I.K. (2013). BroadPeak: A novel algorithm for identifying broad peaks in diffuse chip-seq datasets. Bioinformatics bts722.

Xu, H., Handoko, L., Wei, X., Ye, C., Sheng, J., Wei, C.-L., Lin, F., and Sung, W.-K. (2010). A signal–noise model for significance analysis of chip-seq with negative control. Bioinformatics 26, 1199–1204.

Zanconato, F., Forcato, M., Battilana, G., Azzolin, L., Quaranta, E., Bodega, B., Rosato, A., Bicciato, S., Cordenonsi, M., and Piccolo, S. (2015). Genome-wide association between yap/taz/tead and ap-1 at enhancers drives oncogenic growth. Nature Cell Biology 17, 1218–1227.

Zang, C., Schones, D.E., Zeng, C., Cui, K., Zhao, K., and Peng, W. (2009). A clustering approach for identification of enriched domains from histone modification chip-seq data. Bioinformatics 25, 1952–1958.

Zeeberg, B.R., Riss, J., Kane, D.W., Bussey, K.J., Uchio, E., Linehan, W.M., Barrett, J.C., and Weinstein, J.N. (2004). Mistaken identifiers: Gene name errors can be introduced inadvertently when using excel in bioinformatics. BMC Bioinformatics 5, 80.

Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008). Model-based analysis of chip-seq (macs). Genome Biology 9, R137.

Zhu, L.J., Gazin, C., Lawson, N.D., Pagès, H., Lin, S.M., Lapointe, D.S., and Green, M.R. (2010). ChIPpeakAnno: A bioconductor package to annotate chip-seq and chip-chip data. BMC Bioinformatics 11, 237.

Zhu, Y., Stephens, R.M., Meltzer, P.S., and Davis, S.R. (2013). SRAdb: Query and use public next-generation sequencing data from within r. BMC Bioinformatics 14, 19.

Integrated analysis and visualization of ChIP-seq data using ChIPpeakAnno, GeneNetworkBuilder and TrackViewer

Jianhong Ou, Jun Yu, Lihua Julie Zhu

Contents