Integrated analysis and visualization of ChIP-seq data using ChIPpeakAnno, GeneNetworkBuilder and TrackViewer

Jianhong Ou, Jun Yu, Lihua Julie Zhu

2017-08-01

ChIPseq technology

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. – Peter J. Park

$ChIP-seq workflow$

(Modified from Nature Reviews Genetics doi:10.1038/nrg2641, P.Park)

Call peaks

call peaks

(Modified from Nature Methods doi:10.1038/nmeth.1371, S.Pepke)

Workflow of ChIPseq Analysis

Quality Assessment of sequencing: FastQC
Mapping: bowtie1/2, Rsubread
Peak calling: CCAT, SICER, MACS, ZINBA, BayesPeak, chipseq, ChIPseqR, CSAR, csaw, GenoGAM, iSeq, PICS
Quality assessment: ChIPQC

Peak annotation and pathway analysis: ChIPpeakAnno, Homer, PAVIS, GREAT, ChIPseeker, clusterProfiler, GOstats
Gene Network Building: GeneNetworkBuilder, ChIPXpress
Visualization: trackViewer, IGV, UCSC genome browser, rtracklayer
Motif enrichment analysis: The MEME Suite, homer

Summary of the workflow

The workflow is based primarily on software packages from the open-source Bioconductor project. It contains all steps that are necessary for detecting peaks, starting from the raw read sequences.

Reads are first aligned to the genome using the Rsubread package.

Peaks are called by MACS2.

The peaks are annotated and virualized by ChIPpeakAnno , trackViewer, and GeneNetworkBuilder packages.

We will demonstrate the workflow on a publicly available ChIP-seq dataset to profile YAP/TAZ/TEAD binding sites (Zanconato et al., 2015).

ChIPpeakAnno

Main functions:

Compare and combine biological replicates (IDRfilter, findOverlapsOfPeaks, peakPermTest)
Annotate to nearest gene (annotatePeakInBatch)
Pathway Analysis (getEnrichedGO, getEnrichedPath)
Separate binding sites into active enhancer regions and promoter regions using histone modification data
Visualize the signal pattern for genomic loci bound by multiple transcription factors

GeneNetworkBuilder

Build regulatory network (buildNetwork)
Filter regulatory network (filterNetwork)
Polish regulatory network (polishNetwork)

trackViewer

Import data (importScore)
Build gene model (geneModelFromTxdb)
View the tracks (viewTracks, browseTracks)

Download data from Gene Expression Omnibus

NCBI GEO ID	SRA project	Description	package
GSE66081	SRP055170	the data of ChIP-seq raw reads for the study of association between YAP/TAZ/TEAD and AP-1 at enhancers drivers oncogenic trowth	SRAdb
GSE49651		the H3K27Ac, H3K4me1, H3K4me3 ChIP-seq data in breast cancer cell line	GEOquery
GSE59232		transcriptomic data of Affymetrix GeneChips Human Genome U133 Plus 1.0	GEOquery

Here we use SRAdb package to download the sequence data.

## load library SRAdb to retrieve SRA files for GSE66081 dataset
library(SRAdb)
## SRAdb need to download a light sqlite file 
## to extract sra file information
## and then download by getSRAfile function.
sqlfile <- "SRAmetadb.sqlite"
if(!file.exists(sqlfile)) 
    sqlfile <- getSRAdbFile()
sra_con <- dbConnect(SQLite(), sqlfile)
conversion <- sraConvert(c("SRP055170"), 
                         sra_con=sra_con)
out <- getSRAfile(conversion$sample, 
                  sra_con=sra_con, 
                  fileType="sra")

## extract file annotations
rs <- listSRAfile(conversion$sample, 
                  sra_con=sra_con, 
                  fileType="sra")
experiment <- 
    dbGetQuery(sra_con, 
               paste0("select * from experiment where",
                      " experiment_accession in ('",
                      paste(conversion$experiment, 
                            collapse="','"), "')"))
info <- merge(experiment[, c("experiment_accession", 
                             "title", "library_layout")], 
              rs[, c("experiment", "run")], by=1)

info[1:5, 2:3]

##                                                  title library_layout
## 1 GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 2 GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 3 GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 4 GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE - 
## 5 GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq      SINGLE -

These files are downloaded in the SRA format, and need to be unpacked to the FASTQ format prior to alignment. This can be done using the fastq-dump utility from SRA Toolkit.

runs <- info[, "run"]
## extract fastq files from sra by sratoolkit
sapply(paste0(runs, ".sra"), 
       function(.ele) system(paste("fastq-dump", 
                                   .ele)))

Technical replicates are merged together prior to further processing. This reflects the fact that they originate from a single library of DNA fragments.

grp <- do.call(rbind, 
               strsplit(info$title, "(;|:) "))
info <- cbind(info, grp)
runs <- split(runs, grp[, 2])
group <- gsub("_ChIPSeq", "", names(runs))

runs[1:2]

## $IgG_rep1_ChIPSeq
## [1] "SRR1810913" "SRR1810912" "SRR1810914"
## 
## $IgG_rep2_ChIPSeq
## [1] "SRR1810915" "SRR1810917" "SRR1810916"

mapply(function(.ele, n) 
    system(paste("cat", paste0(.ele, ".fastq", collapse=" "), 
                 ">>",  paste0(n, ".fastq"), collapse=" ")), 
       runs, group)
## remove unused files to save storage.
unlink(paste0(info$run, ".fastq"))
unlink(paste0(info$run, ".sra"))

Aligning reads to hg19 build of human genome

Reads in each library are aligned to the hg19 build of human genome, using the Rsubread package.

library(Rsubread)
fastq.files <- paste0(group, ".fastq")
bam.files <- paste0(group, ".bam")

##getPhredOffset, if the offset is not correct, it will report error.
x <- qualityScores(filename=fastq.files[1], 
                   input_format="FASTQ", offset=33, 
                   nreads=1000)

x[1:3, 1:10]

##       1  2  3  4  5  6  7  8  9 10
## [1,] 34 34 34 37 37 37 35 35 39 39
## [2,] 34 34 34 37 37 37 37 37 39 39
## [3,] 34 34 34 37 37 37 37 37 35 37

Exercise: Check the offset of quality scores

library(Rsubread)
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
head(qualityScores(filename = reads,
                   input_format = "gzFASTQ", 
                   offset = 33,
                   nreads = 100))  ## please check the errors
head(qualityScores(filename = reads,
                   input_format = "gzFASTQ", 
                   offset = 64,
                   nreads = 100))

An index is constructed with the prefix index/hg19 and the index can be re-used. Here, the type of sequencing data is set to 1 for genomic DNA-seq data. The uniquely mapped reads should be reported only by setting the uniqe to TRUE.

## build index and this can be re-used.
hg19.fasta <- 
    "Genomes/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
dir.create("index")
buildindex(basename="index/hg19",
           reference=hg19.fasta)

## alignment
align(index="./index/hg19", readfile1=fastq.files, 
      type=1, 
      output_file=bam.files, maxMismatches=2, 
      nthreads=2, 
      phredOffset=33, unique=TRUE)

Subsample the data to save time for this tutorial.

library(Rsamtools)
filter <- FilterRules(list(seqn = function(x) x$rname == "chr1"))
## sort and index
null <- sapply(group, function(gp) {
    sortBam(paste0(gp, ".bam"), paste0(gp, ".srt"))
    file.copy(paste0(gp, ".srt.bam"), 
              paste0(gp, ".bam"), overwrite = TRUE)
    indexBam(paste0(gp, ".bam"))
})
## subset (optional)
null <- sapply(group, function(gp) {
    dest <- paste0(gp, ".chr1.bam")
    filterBam(paste0(gp, ".bam"), paste0(gp, ".chr1.bam"), 
              filter=filter)
    file.rename(paste0(gp, ".chr1.bam"), paste0(gp, ".bam"))
    indexBam(paste0(gp, ".bam"))
})

Potential PCR duplicates are removed by removeDupReads function in Rsubread package. This step can also be done by MarkDuplicates tool from the Picard software suite.

By asBam function from Rsamtools package, the alignment is sorted by their mapping location, and an index created on the destination (with extension ‘.bai’) when indexDestination=TRUE.

## remove reads which are mapped to identical locations, 
## using mapping location of the first base of each read.
## and resort by position
library(Rsamtools)
library(Rsubread)
null <- sapply(group[c(1, 11:13)], function(gp){
    out <- asSam(paste0(gp, ".bam"), gp)
    removeDupReads(out, threshold=2, 
                   outputFile=paste0(gp, ".rmdup.sam"))
    asBam(paste0(gp, ".rmdup.sam"), paste0(gp, ".rmdup"), 
          indexDestination=TRUE)
    unlink(out)
    unlink(paste0(gp, ".rmdup.sam"))
})

Calling peaks

The control datasets are pooled before calling peaks according the description of the paper.

## pool IgG depend on experiment
IgG.bg <- list(rep1.2=c("IgG_rep1.rmdup.bam", 
                        "IgG_rep2.rmdup.bam", 
                        "IgG_rep1.2.rmdup.bam"),
                rep3.4=c("IgG_rep3.rmdup.bam", 
                         "IgG_rep4.rmdup.bam", 
                         "IgG_rep3.4.rmdup.bam"),
                rep3.5=c("IgG_rep3.rmdup.bam", 
                         "IgG_rep5.rmdup.bam", 
                         "IgG_rep3.5.rmdup.bam"))
null <- sapply(IgG.bg[3], function(.ele){
    out <- mergeBam(files=.ele[-3],
                    destination = .ele[3], 
                    indexDestination=TRUE)
})

Before we call peaks, we may want to check the difference between ChIP signal and background signal.

By ChIPpeakAnno::cumulativePercentage function, the difference between the cumulative percentage tag allocation in Input and IP could be clearly checked (Diaz et al., 2012, Ramírez et al., 2016).

library(ChIPpeakAnno)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
pair <- data.frame(treat=c("YAP", "TAZ", "TEAD4", "JUN"),
                   control=c("1.2", "1.2", "3.4", "3.5"),
                   stringsAsFactors = FALSE)
## use chromosome 1 to save the time.
chr1 <- as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)["chr1"], "GRanges")
## check the difference between TEAD4 ChIP and background signal
i <- 3 
cumulativePercentage(
        bamfiles = c(paste0(pair[i, "treat"], 
                            "_rep1.rmdup.bam"),
                     paste0("IgG_rep", pair[i, "control"], 
                            ".rmdup.bam")), 
        gr=chr1)

See documentation from deeptools

deeptools

We set loose filter condition for MACS2 to get more peaks for irreproducibility discovery rate (IDR) analysis.

for(i in seq.int(nrow(pair))){
    system(paste("macs2 callpeak --to-large -t", 
                 paste0(pair[i, "treat"], "_rep1.rmdup.bam"), 
                 "-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"), 
                 " -f BAM -g hs -q 0.1 -n", 
                 paste0(pair[i, "treat"], "_rep1.q1")))
    system(paste("macs2 callpeak --to-large -t", 
                 paste0(pair[i, "treat"], "_rep2.rmdup.bam"), 
                 "-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"), 
                 " -f BAM -g hs -q 0.1 -n", 
                 paste0(pair[i, "treat"], "_rep2.q1")))
    ## we will use idr to filter the results later
}

After peak calling, the output of MACS2 will be imported into R by ChIPpeakAnno::toGRanges function. The top 10000 peaks sorted by pValue (FDR) will be used for IDR analysis.

library(ChIPpeakAnno)
macs2.files <- dir(".", pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele) 
    toGRanges(.ele, format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele) 
    .ele[order(.ele$pValue, decreasing=TRUE)])

Exercise: Import data by `toGRanges`

library(ChIPpeakAnno)
packagePath <- system.file("extdata", package = "ChIPpeakAnno")
dir(packagePath, "gff|bed|narrowPeak|broadPeak|gz")
toGRanges(file.path(packagePath, "GFF_peaks.gff"), format = "GFF")
toGRanges(file.path(packagePath, "WS220.bed"), format = "BED")
toGRanges(file.path(packagePath, "peaks.narrowPeak"), format = "narrowPeak")
toGRanges(file.path(packagePath, "TAF.broadPeak"), format = "broadPeak")
toGRanges(file.path(packagePath, "wgEncodeUmassDekker5CGm12878PkV2.bed.gz"), format = "BED")
packagePath <- system.file("extdata", "macs", package = "ChIPseqStepByStep")
macs2.files <- dir(packagePath, pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele) 
    toGRanges(file.path(packagePath, .ele), format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele) .ele[order(.ele$pValue, decreasing=TRUE)])

There are lots of noise peaks when we set loose filter condition for MACS2. We will use IDR framework to filter the low reproducible peaks (Landt et al., 2012). Only overlapping peaks will be used for IDR calculation.

peaks <- lapply(peaks, function(.ele) .ele[1:min(1e5, length(.ele))])
lengths(peaks)

##   JUN_rep1.q1_peaks.narrowPeak   JUN_rep2.q1_peaks.narrowPeak 
##                           8877                          23544 
##   TAZ_rep1.q1_peaks.narrowPeak   TAZ_rep2.q1_peaks.narrowPeak 
##                           3647                           4200 
## TEAD4_rep1.q1_peaks.narrowPeak TEAD4_rep2.q1_peaks.narrowPeak 
##                           6850                           1520 
##   YAP_rep1.q1_peaks.narrowPeak   YAP_rep2.q1_peaks.narrowPeak 
##                           4187                           2593

new.group <- gsub("^(.*?)_.*$", "\\1", names(peaks))
peaks.grp <- split(peaks, new.group)
names(peaks.grp)
## find overlapping peaks
GSE66081 <- sapply(peaks.grp, findOverlapsOfPeaks, simplify = FALSE)
GSE66081.bk <- GSE66081
GSE66081 <- sapply(GSE66081, function(.ele)
    .ele$peaklist[[names(.ele$peaklist)[grepl("\\/\\/\\/", 
                                        names(.ele$peaklist))]]],
                   simplify=FALSE)

lengths(GSE66081)

##   JUN   TAZ TEAD4   YAP 
##  3226  1136   693   923

The IDR are calculated by the average coverages of each overlapping peak. The reads counts for peaks are done by summarizeOverlaps function in GenomicAlignments package.

library(ChIPpeakAnno)
new.group.names <- lapply(GSE66081.bk, function(.ele) 
    sub(".q1_peaks.narrowPeak", ".bam", 
        names(.ele$peaklist)[!grepl("\\/\\/\\/", names(.ele$peaklist))]))
GSE66081 <- mapply(function(.peaks, .bamfiles) 
    IDRfilter(peaksA=.peaks[[1]], peaksB=.peaks[[2]],
              bamfileA=.bamfiles[1], bamfileB=.bamfiles[2],
              IDRcutoff=0.01),
    peaks.grp, new.group.names)
## The original peaks are filtered to decrease the memory usage.
peaks.keep <- sapply(GSE66081, function(.ele) 
    as.character(unlist(.ele$peakNames)))
peaks.keep <- lapply(peaks.keep, function(.ele) gsub("^.*__", "", .ele))
peaks.keep <- unlist(peaks.keep)
peaks.s <- lapply(peaks, function(.ele) .ele[names(.ele) %in% peaks.keep])
peaks.unlist <- unlist(GRangesList(peaks.s), use.names = FALSE)

Export filtered peaks by `rtracklayer`

Export the filtered peak by export function in rtracklayer package.

library(rtracklayer)
null <- mapply(function(.dat, .name) 
    export(.dat, 
           sub("^(.*?).q1.*$", "\\1.fil.bed", .name),
           format="BED"), 
    peaks.s, names(peaks.s))
## filter the peaks by qValue < 10^-10 to get similar number peaks 
## as filtered by idr to confirm the effect of IDR filter.
peaks.bk <- lapply(peaks.bk, function(.ele) 
    .ele[.ele$qValue>10])
null <- mapply(function(.dat, .name) 
    export(.dat, 
           sub("^(.*?).q1.*$", "\\1.fil.q01.bed", .name),
           format="BED"), 
    peaks.bk, names(peaks.bk))

Quality control

The ChIP-seq experiment could be checked by ChIPQC package.

Usally, before mapping, FastQC could be used for sequence quality accessment. And after mapping, SAMStat could be used for mapping quality analysis, and strand cross-correlation could be used for checking the experiment quality via csaw package.

## strand cross-correlation
library(csaw)
scc <- lapply(sub("^(.*?).q1.*$", "\\1.rmdup.bam", names(peaks)),
              correlateReads, cross=TRUE,
              param=readParam(minq=30, 
                              restrict="chr1",
                              dedup=TRUE))
names(scc) <- gsub(".q1.*$", "", names(peaks))
scc.param <- lapply(scc, function(.ele){
    readsLength <- 50
    cutline <- 2 * readsLength
    read_length <- which.max(.ele[1:cutline])
    fragment_length <- which.max(.ele[cutline:length(.ele)]) + cutline - 1
    baseline <- which.min(.ele[cutline:length(.ele)]) + cutline - 1
    #normalized.strand.coefficient
    NSC <- .ele[fragment_length] / .ele[baseline] 
    # relative strand correlation
    RSC <- (.ele[fragment_length] - .ele[baseline])/(.ele[read_length] - .ele[baseline]) 
    c(read_length=read_length-1,
      fragment_length=fragment_length-1,
      baseline=baseline-1,
      NSC=NSC, 
      RSC=RSC)
})

The normalized strand coefficient (NSC) and relative strand correlation (RSC) are strong metrics for assessing signal-to-noise ratios in a ChIP-seq experiment. High-quality ChIP-seq data sets should have NSC values >= 1.05 and RSC values >= 0.8 (Landt et al., 2012).

do.call(rbind, scc.param)[, c("NSC", "RSC")]

##                 NSC       RSC
## JUN_rep1   3.410316 1.4603898
## JUN_rep2   2.280298 1.1741713
## TAZ_rep1   2.185869 0.8979216
## TAZ_rep2   2.038841 0.8035722
## TEAD4_rep1 2.409461 0.9944640
## TEAD4_rep2 1.889183 0.6928578
## YAP_rep1   2.295337 0.8845790
## YAP_rep2   2.049041 0.8366900

Strand cross correlation against delay distance

The absolute and relative height of ‘phantom’ peak and ChIP peak are useful determinants of the success of a ChIP-seq experiment.

op <- par(mfrow=c(4, 2))
for(i in 1:length(scc)){
    plot(1:length(scc[[i]])-1, scc[[i]], 
         xlab="Delay (bp)", ylab="cross-correlation", 
         main=names(scc)[i], type="l", 
         ylim=c(scc[[i]][scc.param[[i]]["baseline"]]*.8, max(scc[[i]])*1.1))
    abline(v=scc.param[[i]]["fragment_length"], col="red", lty=3)
    abline(h=scc[[i]][scc.param[[i]][c("fragment_length", "read_length", "baseline")]+1], 
           col=c("blue", "green", "gray30"), lty=2)
}

Generate bigWig files for UCSC genome browser

Now we have estimated fragment length and can generate track files for UCSC genome browser.

library(GenomicAlignments)
library(rtracklayer)
bwPath <- "bw"
dir.create(bwPath)
cvgs <- mapply(function(.ele, .name){
    gal <- readGAlignments(paste0(.name, ".rmdup.bam"))
    cvg <- coverage(gal, width = as.numeric(.ele["fragment_length"]))
    seqinfo(cvg) <- seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)
    export.bw(cvg, 
              con=file.path(bwPath, 
                            paste0(.name, ".rmdup.bigWig")))
    cvg
}, scc.param, names(scc.param))

ChIPQC is used for quality control.

library(ChIPQC)
samples <- 
    data.frame(SampleID=gsub("^(.*?).q1.*$", "\\1", names(peaks)), 
               Factor=gsub("^(.*?)_.*$", "\\1", names(peaks)),
               Replicate=gsub("^.*?_rep(.*?).q1.*$", "\\1", 
                              names(peaks)),
               bamReads=gsub("^(.*?).q1.*$", "\\1.rmdup.bam", 
                             names(peaks)),
               Peaks=names(peaks),
               PeakCaller="narrow")
samples[1:3, ]

##   SampleID Factor Replicate           bamReads
## 1 JUN_rep1    JUN         1 JUN_rep1.rmdup.bam
## 2 JUN_rep2    JUN         2 JUN_rep2.rmdup.bam
## 3 TAZ_rep1    TAZ         1 TAZ_rep1.rmdup.bam
##                          Peaks PeakCaller
## 1 JUN_rep1.q1_peaks.narrowPeak     narrow
## 2 JUN_rep2.q1_peaks.narrowPeak     narrow
## 3 TAZ_rep1.q1_peaks.narrowPeak     narrow

We run ChIPQC for peaks before filter, filtered by IDR and qValue.

## before filter
exampleExp <- ChIPQC(experiment=samples, annotaiton="hg19")
ChIPQCreport(exampleExp, reportFolder="ChIPQC", facetBy="Factor")
## after IDR filter
samples.fil <- samples
samples.fil$Peaks <- gsub("^(.*?).q1.*$", "\\1.fil.bed", names(peaks))
samples.fil$PeakCaller="bed"
exampleExp.fil <- ChIPQC(experiment=samples.fil, annotaiton="hg19")
ChIPQCreport(exampleExp.fil, 
             reportFolder="ChIPQC.fil", 
             facetBy="Factor")
## filtered by qValue from the MACS2 result
samples.fil.q01 <- samples.fil
samples.fil.q01$Peaks <- 
    gsub("^(.*?).q1.*$", "\\1.fil.q01.bed", names(peaks))
exampleExp.fil.q01 <- ChIPQC(experiment=samples.fil.q01, 
                             annotaiton="hg19")
ChIPQCreport(exampleExp.fil.q01, 
             reportFolder="ChIPQC.fil.q01", 
             facetBy="Factor")

ChIPQC

The effect of IDR with full dataset. The average profile across within peak regions of duplicates, and the correlation between samples as heatmap and by principal component analysis show higher quality of IDR filtered peaks for replicate samples because they are clustered together in the heatmap and spatially grouped within the PCA plot.

Association among different sets of peaks

Test the overlaps of all the peaks. From the testing result, we can confirm the widespread association between YAP, TAZ, TEAD and JUN. The vennDiagram shows the numbers of overlapping peak for these TFs.

ol <- findOverlapsOfPeaks(GSE66081, connectedPeaks="keepAll")

Exercise: find overlaps among peak lists

library(ChIPpeakAnno)
packagePath <- system.file("extdata", "filtered", package = "ChIPseqStepByStep")
filtedFiles <- dir(packagePath, ".bed")
GSE66081 <- sapply(filtedFiles, function(.ele){
    toGRanges(file.path(packagePath, .ele), format="BED")
})
names(GSE66081) <- sub(".bed", "", names(GSE66081))
lengths(GSE66081)
ol <- findOverlapsOfPeaks(GSE66081)
names(ol)
names(ol$peaklist)
makeVennDiagram(ol)

makeVennDiagram(ol)

## $p.value
##      JUN TAZ TEAD4 YAP pval
## [1,]   0   0     1   1    0
## [2,]   0   1     0   1    0
## [3,]   0   1     1   0    0
## [4,]   1   0     0   1    0
## [5,]   1   0     1   0    0
## [6,]   1   1     0   0    0
## 
## $vennCounts
##       JUN TAZ TEAD4 YAP Counts count.JUN count.TAZ count.TEAD4 count.YAP
##  [1,]   0   0     0   0      0         0         0           0         0
##  [2,]   0   0     0   1     26         0         0           0        26
##  [3,]   0   0     1   0     48         0         0          48         0
##  [4,]   0   0     1   1     17         0         0          17        17
##  [5,]   0   1     0   0     80         0        80           0         0
##  [6,]   0   1     0   1     29         0        29           0        29
##  [7,]   0   1     1   0     28         0        28          28         0
##  [8,]   0   1     1   1     63         0        63          63        63
##  [9,]   1   0     0   0   2172      2172         0           0         0
## [10,]   1   0     0   1     73        73         0           0        75
## [11,]   1   0     1   0     44        44         0          44         0
## [12,]   1   0     1   1     23        23         0          23        24
## [13,]   1   1     0   0    187       187       188           0         0
## [14,]   1   1     0   1    255       256       260           0       258
## [15,]   1   1     1   0     51        53        56          53         0
## [16,]   1   1     1   1    412       418       432         417       431
## attr(,"class")
## [1] "VennCounts"

We want to confirm that not only the peaks are overlapped but also their summits are close to each other.

peaks.summit <- lapply(peaks, function(.ele) {
    .ele <- shift(.ele, .ele$peak)
    width(.ele) <- 1
    .ele
})

TEAD.summit <- unlist(GRangesList(peaks.summit[new.group=="TEAD4"]))
TAZ.summit <- unlist(GRangesList(peaks.summit[new.group=="TAZ"]))
bof <- binOverFeature(TEAD.summit, 
                      annotationData=TAZ.summit, 
                      radius=300, nbins=300, FUN=length)

plot(as.numeric(rownames(bof)), bof, 
     ylab="Peak density", 
     xlab="Distance to the summit of TAZ peaks", 
     main="Position of TEAD4 peak summit", 
     type="l", xlim=c(-300, 300))

Steps of annotation by `ChIPpeakAnno`

The functions, toRanges, annotatePeakInBatch, and addGeneIDs in the ChIPpeakAnno, the annotation of ChIP-Seq peaks becomes streamlined into four major steps:

Read peak data with toGRanges
Generate annotation data with toGRanges
Annotate peaks with annotatePeakInBatch
Add additional informations with addGeneIDs

Prepare annotation data

The method toGRanges can be used to construct GRanges object from various annotation format such as BED, GFF, user defined readable text files, EnsDb or TxDb object.
Please type ?toGRanges for more information.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")

annoData[1]

## GRanges object with 1 range and 0 metadata columns:
##     seqnames               ranges strand
##        <Rle>            <IRanges>  <Rle>
##   1    chr19 [58858172, 58874214]      -
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Exercise: Prepare annotation data from Ensembl/UCSC based annotation package

Ensembl Archives: http://www.ensembl.org/info/website/archives/index.html
UCSC release numbers: https://genome.ucsc.edu/FAQ/FAQreleases.html

library(EnsDb.Hsapiens.v75)
toGRanges(EnsDb.Hsapiens.v75)
toGRanges(EnsDb.Hsapiens.v75, feature = "transcript")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene)
toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature = "geneModel")
toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature = "threeUTR")

Annotate by `annotatePeakInBatch`

Then we use annotatePeakInBatch or annoPeaks function in ChIPpeakAnno package to annotate the peaks. Depend on the experiment, we can annotate the peaks in multiple ways.

annotatePeakInBatch

overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
## annotate the peaks by nearest annotations.
YAP.TAZ.TEAD4.nearest <- 
    annotatePeakInBatch(overlaps, 
                        AnnotationData=annoData)

## annotate the peaks by both side in a given range.
YAP.TAZ.TEAD4.2side <- 
    annotatePeakInBatch(overlaps, 
                        AnnotationData=annoData, 
                        output="nearestBiDirectionalPromoters", 
                        bindingRegion=c(-100000, 100000))

Exercise: annotatePeakInBatch

overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
YAP.TAZ.TEAD4.nearest <- 
    annotatePeakInBatch(overlaps, 
                        AnnotationData=annoData)
YAP.TAZ.TEAD4.promoter <- 
    annotatePeakInBatch(overlaps, 
                        AnnotationData=annoData, 
                        output="nearestBiDirectionalPromoters", 
                        bindingRegion=c(-10000, 2000))

pie1(table(YAP.TAZ.TEAD4.2side$insideFeature))

Add gene symbols by `addGeneIDs`

library(org.Hs.eg.db)
YAP.TAZ.TEAD4.nearest.anno <-
    addGeneIDs(annotatedPeak = YAP.TAZ.TEAD4.nearest,
           orgAnn = org.Hs.eg.db,
           feature_id_type = "entrez_id")
YAP.TAZ.TEAD4.nearest.anno[1:3]

## GRanges object with 3 ranges and 11 metadata columns:
##             seqnames               ranges strand |
##                <Rle>            <IRanges>  <Rle> |
##   X01.23254     chr1 [14812601, 14813726]      * |
##    X02.7709     chr1 [16249621, 16250159]      * |
##   X03.23400     chr1 [17339107, 17339960]      * |
##                                            peakNames        peak
##                                      <CharacterList> <character>
##   X01.23254 TAZ__olp0130,TEAD4__olp0074,YAP__olp0097          01
##    X02.7709 TEAD4__olp0089,YAP__olp0120,TAZ__olp0155          02
##   X03.23400 YAP__olp0137,TAZ__olp0184,TEAD4__olp0105          03
##                 feature start_position end_position feature_strand
##             <character>      <integer>    <integer>    <character>
##   X01.23254       23254       14925213     15444544              +
##    X02.7709        7709       16268364     16302627              -
##   X03.23400       23400       17312453     17338423              -
##             insideFeature distancetoFeature shortestDistance
##                  <factor>         <numeric>        <integer>
##   X01.23254      upstream           -112612           111487
##    X02.7709    downstream             53006            18205
##   X03.23400      upstream              -684              684
##             fromOverlappingOrNearest      symbol
##                          <character> <character>
##   X01.23254          NearestLocation        KAZN
##    X02.7709          NearestLocation      ZBTB17
##   X03.23400          NearestLocation     ATP13A2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Exercise: addGeneIDs

library(org.Hs.eg.db)
YAP.TAZ.TEAD4.promoter.anno <- 
    addGeneIDs(annotatedPeak = YAP.TAZ.TEAD4.promoter, 
               orgAnn = org.Hs.eg.db, 
               feature_id_type = "entrez_id")

Save annotation results

The annotation results can be saved in XLS file format using WriteXLS package to avoid the gene name errors that can be inadvertently introduced when opened by Excel (Zeeberg et al., 2004).

## save annotations
YAP.TAZ.TEAD4.2side.m <- as.data.frame(unname(YAP.TAZ.TEAD4.2side))
YAP.TAZ.TEAD4.2side.m$peakNames <- NULL
library(WriteXLS)
WriteXLS("YAP.TAZ.TEAD4.2side.m", 
         "YAP.TAZ.TEAD4.overlapping.peaks.anno.xls")

ErrInExcel

Enrichment analysis

With the annotated peak data, you can call the function getEnrichedGO to obtain a list of enriched GO terms. Please note that the default setting of feature_id_type is “ensembl_gene_id”. If you are using TxDb as annotation data, you need to change it to “entrez_id”.

## enrichment analysis
library(org.Hs.eg.db)
over <- getEnrichedGO(YAP.TAZ.TEAD4.2side, 
                      orgAnn="org.Hs.eg.db", 
                      feature_id_type="entrez_id", 
                      maxP=.01, condense=TRUE)

over[["bp"]][1:3, -c(3, 10)]

##        go.id                       go.term Ontology count.InDataset
## 1 GO:0002252       immune effector process       BP               8
## 2 GO:0006879 cellular iron ion homeostasis       BP               2
## 3 GO:0006882 cellular zinc ion homeostasis       BP               2
##   count.InGenome      pvalue totaltermInDataset totaltermInGenome
## 1           1169 0.008433204               3535           1462775
## 2             48 0.006117350               3535           1462775
## 3             25 0.001687957               3535           1462775

dim(over[["bp"]])

## [1] 15 10

For pathway analysis, you can call function getEnrichedPATH using reactome or KEGG database.

library(reactome.db)
path <- getEnrichedPATH(YAP.TAZ.TEAD4.2side, 
                        orgAnn="org.Hs.eg.db", 
                        pathAnn="reactome.db", 
                        feature_id_type="entrez_id", 
                        maxP=.05)

path[1:2, ]

##        path.id EntrezID count.InDataset count.InGenome     pvalue
## 1 R-HSA-165054     1104               1             33 0.04744971
## 2 R-HSA-166663     1401               1             23 0.03331229
##   totaltermInDataset totaltermInGenome PATH
## 1                159            108031   NA
## 2                159            108031   NA

Bidirectional promoters

Bidirectional promoters are the DNA regions located between the 5’ ends of two adjacent genes coded on opposite strands. The two adjacent genes are transcribed to the opposite directions, and often co-regulated by this shared promoter region(Robertson et al., 2007). Here is an example to find peaks with bi-directional promoters and output the percentage of the peaks near bi-directional promoters.

bdp <- peaksNearBDP(overlaps, annoData, MaxDistance=100000)

bdp

## $peaksWithBDP
## GRangesList object of length 9:
## $2 
## GRanges object with 2 ranges and 9 metadata columns:
##      seqnames               ranges strand |
##         <Rle>            <IRanges>  <Rle> |
##   X2     chr1 [16249621, 16250159]      * |
##   X2     chr1 [16249621, 16250159]      * |
##                                     peakNames   bdp_idx        peak
##                               <CharacterList> <integer> <character>
##   X2 TEAD4__olp0089,YAP__olp0120,TAZ__olp0155         2          X2
##   X2 TEAD4__olp0089,YAP__olp0120,TAZ__olp0155         2          X2
##          feature       feature.ranges feature.strand  distance
##      <character>            <IRanges>          <Rle> <integer>
##   X2      149563 [16330731, 16333184]              +     80571
##   X2      729614 [16160710, 16174642]              -     74978
##      insideFeature distanceToSite
##           <factor>      <integer>
##   X2      upstream          80571
##   X2      upstream          74978
## 
## $4 
## GRanges object with 2 ranges and 9 metadata columns:
##      seqnames               ranges strand |
##   X4     chr1 [17516131, 17517419]      * |
##   X4     chr1 [17516131, 17517419]      * |
##                                     peakNames bdp_idx peak feature
##   X4 TEAD4__olp0109,TAZ__olp0187,YAP__olp0140       4   X4   11240
##   X4 TEAD4__olp0109,TAZ__olp0187,YAP__olp0140       4   X4   29943
##            feature.ranges feature.strand distance insideFeature
##   X4 [17393256, 17445948]              -    70182      upstream
##   X4 [17531621, 17572501]              +    14201      upstream
##      distanceToSite
##   X4          70182
##   X4          14201
## 
## $22 
## GRanges object with 2 ranges and 9 metadata columns:
##       seqnames               ranges strand |
##   X22     chr1 [85666766, 85667899]      * |
##   X22     chr1 [85666766, 85667899]      * |
##                                      peakNames bdp_idx peak feature
##   X22 YAP__olp0713,TEAD4__olp0529,TAZ__olp0876      22  X22  646626
##   X22 YAP__olp0713,TEAD4__olp0529,TAZ__olp0876      22  X22   84144
##             feature.ranges feature.strand distance insideFeature
##   X22 [85742041, 85865646]              +    74141      upstream
##   X22 [85623356, 85666728]              -       37      upstream
##       distanceToSite
##   X22          74141
##   X22             37
## 
## ...
## <6 more elements>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $percentPeaksWithBDP
## [1] 0.1428571
## 
## $n.peaks
## [1] 63
## 
## $n.peaksWithBDP
## [1] 9

Check distribution of peaks

The distribution of peaks over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR could give you some clues of how to annotate the peaks. The distribution can be summarized in the peak centric or nucleotide centric view using the function assignChromosomeRegion in ChIPpeakAnno package.

seqlevelsStyle(overlaps) ## == UCSC
aCR<-assignChromosomeRegion(overlaps, nucleotideLevel=FALSE,
                            TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)

par(mar=c(10.1, 4.1, 4.1, 2.1))
barplot(aCR$percentage, las=3)

Discover the binding pattern

Owing to most of the peaks are in the intergenic regions, we questioned whether most YAP/TAZ/TEAD common peaks are located in enhancers. Enhancers can be distinguished from promoters by their epigenetic features, that is, relative enrichment of histone H3 monomethylation (H3K4me1) on Lys4 at enhancers, and trimethylation (H3K4me3) at promoters. We first determine the distribution of the distance between the peaks and known genes. And then check the correlation of peaks and enhancers.

dist2TSS <- 
    lapply(list(YAP=GSE66081[["YAP"]],
                TAZ=GSE66081[["TAZ"]],
                TEAD4=GSE66081[["TEAD4"]],
                YAP.TAZ.TEAD4=ol$peaklist[["TAZ///TEAD4///YAP"]]),
           function(.ele) {
               .ele <- annotatePeakInBatch(.ele, 
                                           AnnotationData=annoData,
                                           output="nearestLocation")
               .ele$shortestDistance})
dist2TSS.cut <- lapply(dist2TSS, cut, breaks=c(0, 1e3, 1e4, 1e5, 1e10))
dist2TSS.table <- sapply(dist2TSS.cut, table)
dist2TSS.percentage <- apply(dist2TSS.table, 2, 
                             function(.ele) .ele/sum(.ele))
library(ggplot2)
library(reshape2)
dist2TSS.percentage <- melt(dist2TSS.percentage)
dist2TSS.percentage$value <- dist2TSS.percentage$value * 100

ggplot(dist2TSS.percentage, aes(x=Var2, y=value, fill=Var1)) +
    geom_bar(stat="identity") +
    xlab("") + ylab("%") + labs(title="Distance to TSS") + theme_bw() + 
    scale_fill_manual(name="",
                      values=c("#FF0000FF", "#FF0000BB", 
                               "#FF000077", "#FF000033"),
                      labels=c("<1kb", "1-10kb", "10-100kb", ">100kb"))

Because over 70% of the peaks are far from known TSS (more than 10Kb), we want to confirm the peaks could be linked to active enhancer region. The ChIP-seq data of epigenetic marks of same cell line are availble in GSE49651.

We use GEOquery package to download data from GEO. We can also download the data by SRAdb package and call peaks from raw reads. Here, we use the peaks saved in the GSE49651.

library(GEOquery)
getGEOSuppFiles("GSE49651")
epipeaks.files <- dir("GSE49651", full.names=TRUE)
library(ChIPpeakAnno)
epipeaks <- lapply(epipeaks.files[3:5], 
                   function(.ele) toGRanges(gzfile(.ele), format="BED"))
names(epipeaks) <- 
    gsub("GSE49651_MDAMB231.(.*?).hg19.*._peaks.txt.gz", "\\1", 
         basename(epipeaks.files)[3:5])

The presence of H3K4me1 and H3K4me3 peaks, their genomic locations and their overlap can be used as the criteria to define promoters and enhancers. H3K4me3 peaks not overlapping with H3K4me1 peaks and close to a TSS (5kb) are defined as promoters, as NA otherwise; H3K4me1 peaks not overlapping with H3K4me3 peaks are defined as enhancers. Promoters or enhancers are defined as active if overlapping with H3K27ac peaks.

(Modified from Nature Cell Biology, doi:10.1038/ncb3216, F. Zanconato)

promoter.UCSC <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene, 5e3, 5e3)
promoter.UCSC <- unique(promoter.UCSC)
attach(epipeaks)
ol.promoter <- findOverlapsOfPeaks(H3K4me1, H3K4me3, H3K27Ac, 
                                   promoter.UCSC, 
                                   ignore.strand = FALSE)
promoter <- c(ol.promoter$peaklist[["H3K4me3///H3K27Ac///promoter.UCSC"]],
              ol.promoter$peaklist[["H3K4me1///H3K4me3///H3K27Ac///promoter.UCSC"]])
enhancer.active <- ol.promoter$peaklist[["H3K4me1///H3K27Ac"]]
enhancer.inactive <- ol.promoter$peaklist[["H3K4me1"]]

mcols(enhancer.active) <- DataFrame(type="enhancer.active")
mcols(enhancer.inactive) <- DataFrame(type="enhancer.inactive")
mcols(promoter) <- DataFrame(type="promoter")
types <- c(promoter, enhancer.active, enhancer.inactive)

YAP.TAZ.TEAD <- ol$peaklist[["TAZ///TEAD4///YAP"]]
seqlevelsStyle(YAP.TAZ.TEAD) <- "UCSC"
YAP.TAZ.TEAD <- YAP.TAZ.TEAD[seqnames(YAP.TAZ.TEAD) %in% 
                                 seqlevels(types)]
YAP.TAZ.TEAD4.type <- findOverlaps(YAP.TAZ.TEAD, types)
tbl <- table(types[subjectHits(YAP.TAZ.TEAD4.type)]$type)
tbl["not.classified"] <- length(YAP.TAZ.TEAD) - 
    length(unique(queryHits(YAP.TAZ.TEAD4.type)))

pie1(tbl)

The binding pattern could be visualized and compared by heatmap and distribution curve from the binding ranges of overlapping peaks of target TFs. For big bedgraph files, bedtools are used to decrease the file size before importing into R.

##heatmap
YAP.TAZ.TEAD.assigned <- YAP.TAZ.TEAD[queryHits(YAP.TAZ.TEAD4.type)]
YAP.TAZ.TEAD.assigned$type <- types[subjectHits(YAP.TAZ.TEAD4.type)]$type
YAP.TAZ.TEAD.assigned <- c(YAP.TAZ.TEAD.assigned[YAP.TAZ.TEAD.assigned$type=="enhancer.active"], YAP.TAZ.TEAD.assigned[YAP.TAZ.TEAD.assigned$type=="promoter"])
YAP.TAZ.TEAD.assigned <- unique(YAP.TAZ.TEAD.assigned)

library(rtracklayer)
YAP.TAZ.TEAD.assigned.center <- 
    reCenterPeaks(YAP.TAZ.TEAD.assigned, width=1)
YAP.TAZ.TEAD.assigned.reCenter <- 
    reCenterPeaks(YAP.TAZ.TEAD.assigned, width=2000)
untar("GSE49651/GSE49651_RAW.tar")
sapply(c("GSM1204470_MDAMB231.H3K4me1_1.hg19.tags.bedGraph.gz",
         "GSM1204472_MDAMB231.H3K4me3_1.hg19.tags.bedGraph.gz",
         "GSM1204474_MDAMB231.H3K27Ac_1.hg19.tags.bedGraph.gz"), gunzip)

bams <- c("YAP", "TAZ", "TEAD4", "JUN")
library(BSgenome.Hsapiens.UCSC.hg19)
len <- seqlengths(Hsapiens)
write.table(len, file="hg19.genome.txt", 
            quote=FALSE, col.names=FALSE, 
            sep="\t")
sapply(bams, function(.ele){
    system(paste0("genomeCoverageBed -bga -split -ibam ", 
                  .ele, "_rep1.rmdup.bam -g hg19.genome.txt > ", 
                  .ele, "_rep1.bedGraph"))
})


files <- c(YAP="YAP_rep1.bedGraph",
           TAZ="TAZ_rep1.bedGraph",
           TEAD4="TEAD4_rep1.bedGraph",
           JUN="JUN_rep1.bedGraph",
           H3K4me1="GSM1204470_MDAMB231.H3K4me1_1.hg19.tags.bedGraph",
           H3K4me3="GSM1204472_MDAMB231.H3K4me3_1.hg19.tags.bedGraph",
           H3K27Ac="GSM1204474_MDAMB231.H3K27Ac_1.hg19.tags.bedGraph")
export(YAP.TAZ.TEAD.assigned.reCenter, "tmp.bed", format="BED")
sapply(files, function(.ele) 
    system(paste("intersectBed -a", .ele, "-b tmp.bed >", 
                 gsub("bedGraph", "sub.bedGraph", .ele))))
unlink("tmp.bed")
library(trackViewer)
cvglists <- sapply(gsub("bedGraph", "sub.bedGraph", files), 
                   importData, format="bedGraph")
names(cvglists) <- names(files)
sig <- featureAlignedSignal(cvglists, YAP.TAZ.TEAD.assigned.center, 
                            upstream=1000, downstream=1000)

names(sig)

## [1] "YAP"     "TAZ"     "TEAD4"   "JUN"     "H3K4me1" "H3K4me3" "H3K27Ac"

heatmap <- featureAlignedHeatmap(sig, YAP.TAZ.TEAD.assigned.center, 
                                 upstream=1000, downstream=1000,
                                 annoMcols=c("type"), 
                                 margin=c(.1, .01, .15, .25))

Summarize the consensus

We first get the sequences of the peaks and then summarize the enriched oligos.

library(BSgenome.Hsapiens.UCSC.hg19)
YAP.TAZ.TEAD4.uniq <- unique(YAP.TAZ.TEAD4.nearest)
YAP.TAZ.TEAD4.uniq <- YAP.TAZ.TEAD4.uniq[!is.na(YAP.TAZ.TEAD4.uniq$feature)]
strand(YAP.TAZ.TEAD4.uniq) <- as.character(YAP.TAZ.TEAD4.uniq$feature_strand)
seq <- getAllPeakSequence(YAP.TAZ.TEAD4.uniq, upstream=0, downstream=0, 
                          genome=Hsapiens)
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3, 
                   quickMotif=TRUE, freqs=freqs, revcomp=TRUE) 
zscore <- sort(os$zscore)

h <- hist(zscore, breaks=100)
text(zscore[length(zscore)], max(h$counts)/10, 
     labels=names(zscore[length(zscore)]), adj=1)

library(motifStack)
pfms <- mapply(function(.ele, id)
    new("pfm", mat=.ele, name=paste("SAMPLE motif", id)), 
    os$motifs, 1:length(os$motifs))

motifStack(pfms)

Exercise: Try to run oligoSummary by full dataset

library(BSgenome.Hsapiens.UCSC.hg19)
YAP.TAZ.TEAD4.uniq <- unique(YAP.TAZ.TEAD4.nearest)
YAP.TAZ.TEAD4.uniq <- YAP.TAZ.TEAD4.uniq[!is.na(YAP.TAZ.TEAD4.uniq$feature)]
strand(YAP.TAZ.TEAD4.uniq) <- as.character(YAP.TAZ.TEAD4.uniq$feature_strand)
seq <- getAllPeakSequence(YAP.TAZ.TEAD4.uniq, upstream=0, downstream=0, genome=Hsapiens)
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3, 
                   quickMotif=TRUE, freqs=freqs, revcomp=TRUE) 
zscore <- sort(os$zscore)
h <- hist(zscore, breaks=100)
text(zscore[length(zscore)], max(h$counts)/10, 
     labels=names(zscore[length(zscore)]), adj=1)
library(motifStack)
pfms <- mapply(function(.ele, id)
    new("pfm", mat=.ele, name=paste("SAMPLE motif", id)), 
    os$motifs, 1:length(os$motifs))
motifStack(pfms)

The oligoSummary is a quick tool to calculate the enriched k-mers in the peaks. To get accurate motifs, multiple softwares could be used, such as the MEME Suite(Bailey et al., 1994), Consensus(Hertz et al., 1990), rGADEM, Homer, and et. al.

ap1 <- GSE66081[["JUN"]]
seq.ap1 <- getAllPeakSequence(ap1, upstream=0, downstream=0, genome=Hsapiens)
os.ap1 <- oligoSummary(seq.ap1, oligoLength=7, MarkovOrder=3, 
                       quickMotif=FALSE, freqs=freqs, revcomp=TRUE)
zscore.ap1 <- sort(os.ap1$zscore)

h.ap1 <- hist(zscore.ap1, breaks=100)
text(zscore.ap1[length(zscore.ap1)], max(h.ap1$counts)/10, 
     labels=names(zscore.ap1[length(zscore.ap1)]), adj=1)

Build Regulatory Network from ChIP-seq and Expression Data

Intersecting genes with promoter regions bound by YAP/TAZ from the ChIP experiment and genes differential expressed in knockdown of YAP/TAZ allow us to identify genes directly/indirectly regulated by YAP/TAZ by GeneNetworkBuilder package.

Gene expression data are downloaded from GSE59232 by GEOquery package.

library(GEOquery)
library(limma)
gset <- getGEO("GSE59232", GSEMatrix =TRUE, AnnotGPL=TRUE)[[1]]
pD <- pData(gset)[, c("title", "geo_accession")]
gset <- gset[, grepl("MDA-MB-231 cells", pD$title, fixed = TRUE)]
pD <- pData(gset)[, c("title", "geo_accession")]
ex <- exprs(gset) ## log2 transformed expression profile

Affymetrix microarray will be analyzed by limma package.

fl <- do.call(rbind, 
              strsplit(sub("MDA-MB-231 cells ", 
                           "", pD$title), ", "))
sml <- sub("#.$", "", fl[, 1])
design.table <- data.frame(group=sml, 
                           batch=fl[, 1],
                           replicate=fl[, 2])
design <- model.matrix(~ -1 + group + batch + replicate, 
                       data = design.table)
colnames(design) <- sub("group", "", 
                        make.names(colnames(design)))
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(contrasts = "siYAP.TAZ-siControl", 
                             levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
tT <- topTable(fit2, adjust="fdr", number = nrow(fit2))

Using interaction data from BioGRID saved in simpIntLists package, combined with the direct targets of TAZ, buildNetwork function will build a regulation network. And the network will be filtered by filterNetwork function by the expression profile.

library(simpIntLists)
interactionmap <- findInteractionList("human", "Official")
interactionmap <- lapply(interactionmap, function(.ele) 
    cbind(from=as.character(.ele$name), 
          to=as.character(.ele$interactors)))
interactionmap <- do.call(rbind, interactionmap)
library(GeneNetworkBuilder)
rootgene <- "TAZ"
TFbindingTable <- 
    cbind(from=rootgene,
          to=unique(YAP.TAZ.TEAD4.nearest.anno$symbol))
sifNetwork <-
    buildNetwork(TFbindingTable=TFbindingTable, 
                 interactionmap=interactionmap, level=2)
tT$symbols <- tT$Gene.symbol
tT <- tT[, c("ID", "logFC", "P.Value", "symbols")]
expressionData <- tT[!duplicated(tT[, "symbols"]), ]
cifNetwork <-
    filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork, 
                  exprsData=expressionData, mergeBy="symbols",
                  miRNAlist=character(0), 
                  tolerance=1, cutoffPVal=0.001, 
                  cutoffLFC=1.5)
## polish network
gR<-polishNetwork(cifNetwork)

## plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
    if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
    if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
        stop("layouttype must be dot, neato, twopi, circo or fdp")
    }
    g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
    nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
    nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
    renderGraph(g1)
}
plotNetwork(gR)

browseNetwork(gR)

View tracks by `trackViewer`

The peaks could be visualized by trackViewer package.

library(trackViewer)
## prepare ranges to view
gene <- "SYDE2"
eID <- mget(gene, org.Hs.egSYMBOL2EG)
gr <- as(annoData[eID[[1]]], "GRanges")
gr.promoter <- promoters(gr, upstream=5000, downstream=2000)
seqlevels(gr.promoter) <- "chr1"
seqinfo(gr.promoter) <- seqinfo(gr.promoter)["chr1"]
## import data
bams.rep1 <- sub(".bam", ".rmdup.bam", 
                 bam.files[grepl("rep1", bam.files)])
syde <- lapply(bams.rep1, importBam, ranges=gr.promoter)
names(syde) <- gsub("_rep1.rmdup.bam", "", bams.rep1)
## get gene model
trs <- geneModelFromTxdb(TxDb.Hsapiens.UCSC.hg19.knownGene, 
                         orgDb = org.Hs.eg.db, gr = gr.promoter)
trs.names <- sapply(trs, function(.ele) .ele@name)
trs <- trs[match(gene, trs.names)]
names(trs) <- gene
## modify the plot styles
optSty <- optimizeStyle(trackList(syde, trs, heightDist=c(.9, .1)), 
                        theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
for(i in 1:length(trackList)) 
    setTrackStyleParam(trackList[[i]], "ylabgp", list(cex=.8))

## plot
vp <- viewTracks(trackList, gr=gr.promoter, viewerStyle=viewerStyle)
addGuideLine(guideLine = c(85666950, 85667700), vp = vp)

Too complicated? Try browseTracks

Exercise: browseTracks

library(trackViewer)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ChIPseqStepByStep)
(ex.gr <- parse2GRanges("chr1:85664729-85671728:-"))
## get gene model
trs <- geneModelFromTxdb(TxDb.Hsapiens.UCSC.hg19.knownGene, 
                         orgDb = org.Hs.eg.db, gr = ex.gr)
## import signals
packagePath <- system.file("extdata", "bedGraph", package = "ChIPseqStepByStep")
(ex.bg <- dir(packagePath, "bedGraph"))
ex.scores <- sapply(ex.bg, function(.ele){
    importScore(file.path(packagePath, .ele), 
               format = "bedGraph", 
               ranges = ex.gr)
})
## browse the tracks
browseTracks(trackList(ex.scores, trs, heightDist = c(3/4, 1/4)), gr=ex.gr)

Software availability

This workflow depends on various packages from version 3.5 of the Bioconductor project, running on R version 3.4.1 or higher. Version numbers for all packages used are shown below.

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.21.0                       
##  [2] plotly_4.7.1                           
##  [3] simpIntLists_1.13.0                    
##  [4] limma_3.33.7                           
##  [5] png_0.1-7                              
##  [6] trackViewer_1.13.8                     
##  [7] knitr_1.16                             
##  [8] motifStack_1.21.1                      
##  [9] ade4_1.7-6                             
## [10] MotIV_1.33.0                           
## [11] grImport_0.9-0                         
## [12] XML_3.98-1.9                           
## [13] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
## [14] BSgenome_1.45.1                        
## [15] reactome.db_1.59.1                     
## [16] org.Hs.eg.db_3.4.1                     
## [17] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [18] GenomicFeatures_1.29.8                 
## [19] AnnotationDbi_1.39.2                   
## [20] csaw_1.11.1                            
## [21] BiocParallel_1.11.4                    
## [22] ChIPQC_1.13.1                          
## [23] DiffBind_2.5.8                         
## [24] ggplot2_2.2.1                          
## [25] idr_1.2                                
## [26] GenomicAlignments_1.13.4               
## [27] SummarizedExperiment_1.7.5             
## [28] DelayedArray_0.3.17                    
## [29] matrixStats_0.52.2                     
## [30] Biobase_2.37.2                         
## [31] rtracklayer_1.37.3                     
## [32] GeneNetworkBuilder_1.19.1              
## [33] Rcpp_0.12.12                           
## [34] ChIPpeakAnno_3.11.4                    
## [35] VennDiagram_1.6.17                     
## [36] futile.logger_1.4.3                    
## [37] Rsamtools_1.29.0                       
## [38] Biostrings_2.45.3                      
## [39] XVector_0.17.0                         
## [40] GenomicRanges_1.29.12                  
## [41] GenomeInfoDb_1.13.4                    
## [42] IRanges_2.11.12                        
## [43] S4Vectors_0.15.5                       
## [44] Rsubread_1.27.4                        
## [45] reshape2_1.4.2                         
## [46] SRAdb_1.35.0                           
## [47] RCurl_1.95-4.8                         
## [48] bitops_1.0-6                           
## [49] graph_1.55.0                           
## [50] BiocGenerics_0.23.0                    
## [51] RSQLite_2.0                            
## 
## loaded via a namespace (and not attached):
##   [1] htmlwidgets_0.9                          
##   [2] BatchJobs_1.6                            
##   [3] munsell_0.4.3                            
##   [4] systemPipeR_1.11.0                       
##   [5] colorspace_1.3-2                         
##   [6] BiocInstaller_1.27.2                     
##   [7] Category_2.43.1                          
##   [8] labeling_0.3                             
##   [9] GenomeInfoDbData_0.99.1                  
##  [10] hwriter_1.3.2                            
##  [11] bit64_0.9-7                              
##  [12] pheatmap_1.0.8                           
##  [13] fail_1.3                                 
##  [14] rprojroot_1.2                            
##  [15] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2  
##  [16] lambda.r_1.1.9                           
##  [17] biovizBase_1.25.1                        
##  [18] regioneR_1.9.1                           
##  [19] R6_2.2.2                                 
##  [20] locfit_1.5-9.1                           
##  [21] AnnotationFilter_1.1.3                   
##  [22] assertthat_0.2.0                         
##  [23] scales_0.4.1                             
##  [24] nnet_7.3-12                              
##  [25] gtable_0.2.0                             
##  [26] ensembldb_2.1.10                         
##  [27] seqLogo_1.43.0                           
##  [28] rlang_0.1.1                              
##  [29] genefilter_1.59.0                        
##  [30] BBmisc_1.11                              
##  [31] splines_3.4.1                            
##  [32] lazyeval_0.2.0                           
##  [33] acepack_1.4.1                            
##  [34] GEOquery_2.43.0                          
##  [35] dichromat_2.0-0                          
##  [36] brew_1.0-6                               
##  [37] checkmate_1.8.3                          
##  [38] yaml_2.1.14                              
##  [39] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
##  [40] backports_1.1.0                          
##  [41] httpuv_1.3.5                             
##  [42] Hmisc_4.0-3                              
##  [43] RBGL_1.53.0                              
##  [44] tools_3.4.1                              
##  [45] gplots_3.0.1                             
##  [46] RColorBrewer_1.1-2                       
##  [47] plyr_1.8.4                               
##  [48] base64enc_0.1-3                          
##  [49] progress_1.1.2                           
##  [50] zlibbioc_1.23.0                          
##  [51] purrr_0.2.2.2                            
##  [52] prettyunits_1.0.2                        
##  [53] rpart_4.1-11                             
##  [54] pbapply_1.3-3                            
##  [55] chipseq_1.27.0                           
##  [56] ggrepel_0.6.5                            
##  [57] cluster_2.0.6                            
##  [58] magrittr_1.5                             
##  [59] data.table_1.10.4                        
##  [60] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2  
##  [61] futile.options_1.0.0                     
##  [62] amap_0.8-14                              
##  [63] ProtGenerics_1.9.0                       
##  [64] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  
##  [65] mime_0.5                                 
##  [66] evaluate_0.10.1                          
##  [67] xtable_1.8-2                             
##  [68] gridExtra_2.2.1                          
##  [69] compiler_3.4.1                           
##  [70] biomaRt_2.33.3                           
##  [71] tibble_1.3.3                             
##  [72] KernSmooth_2.23-15                       
##  [73] htmltools_0.3.6                          
##  [74] GOstats_2.43.0                           
##  [75] Formula_1.2-2                            
##  [76] tidyr_0.6.3                              
##  [77] sendmailR_1.2-1                          
##  [78] DBI_0.7                                  
##  [79] MASS_7.3-47                              
##  [80] rGADEM_2.25.0                            
##  [81] ShortRead_1.35.1                         
##  [82] Matrix_1.2-10                            
##  [83] gdata_2.18.0                             
##  [84] Gviz_1.21.1                              
##  [85] bindr_0.1                                
##  [86] pkgconfig_2.0.1                          
##  [87] revealjs_0.9                             
##  [88] foreign_0.8-69                           
##  [89] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2     
##  [90] annotate_1.55.0                          
##  [91] multtest_2.33.0                          
##  [92] AnnotationForge_1.19.4                   
##  [93] stringr_1.2.0                            
##  [94] VariantAnnotation_1.23.6                 
##  [95] digest_0.6.12                            
##  [96] rmarkdown_1.6                            
##  [97] htmlTable_1.9                            
##  [98] edgeR_3.19.3                             
##  [99] GSEABase_1.39.0                          
## [100] curl_2.8.1                               
## [101] shiny_1.0.3                              
## [102] gtools_3.5.0                             
## [103] rjson_0.2.15                             
## [104] jsonlite_1.5                             
## [105] bindrcpp_0.2                             
## [106] seqinr_3.3-6                             
## [107] viridisLite_0.2.0                        
## [108] lattice_0.20-35                          
## [109] Nozzle.R1_1.1-1                          
## [110] Rhtslib_1.9.1                            
## [111] httr_1.2.1                               
## [112] survival_2.41-3                          
## [113] GO.db_3.4.1                              
## [114] interactiveDisplayBase_1.15.0            
## [115] glue_1.1.1                               
## [116] bit_1.1-12                               
## [117] stringi_1.1.5                            
## [118] blob_1.1.0                               
## [119] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 
## [120] AnnotationHub_2.9.5                      
## [121] latticeExtra_0.6-28                      
## [122] caTools_1.17.1                           
## [123] memoise_1.1.0                            
## [124] dplyr_0.7.2

The command-line tools used in the workflow including SRA Toolkit (version 2.3.4), MACS2 (version 2.1.0) and Bedtools (version 2.25.0) must be installed on the system.

Acknowledgements

We would like to thank the support from the Department of Molecular, Cell and Cancer Biology (MCCB) at UMass Medical School (UMMS). We thank the early adopters who provided great ideas and feedbacks to enhance the features of the software.