Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. – Peter J. Park
Quality Assessment of sequencing: FastQC
Peak calling: CCAT, SICER, MACS, ZINBA, BayesPeak, chipseq, ChIPseqR, CSAR, csaw, GenoGAM, iSeq, PICS
Quality assessment: ChIPQC
Peak annotation and pathway analysis: ChIPpeakAnno, Homer, PAVIS, GREAT, ChIPseeker, clusterProfiler, GOstats
Gene Network Building: GeneNetworkBuilder, ChIPXpress
Visualization: trackViewer, IGV, UCSC genome browser, rtracklayer
Motif enrichment analysis: The MEME Suite, homer
The workflow is based primarily on software packages from the open-source Bioconductor project. It contains all steps that are necessary for detecting peaks, starting from the raw read sequences.
Reads are first aligned to the genome using the Rsubread package.
Peaks are called by MACS2.
The peaks are annotated and virualized by ChIPpeakAnno , trackViewer, and GeneNetworkBuilder packages.
We will demonstrate the workflow on a publicly available ChIP-seq dataset to profile YAP/TAZ/TEAD binding sites (Zanconato et al., 2015).
Main functions:
Compare and combine biological replicates (IDRfilter
, findOverlapsOfPeaks
, peakPermTest
)
Annotate to nearest gene (annotatePeakInBatch
)
Pathway Analysis (getEnrichedGO
, getEnrichedPath
)
Separate binding sites into active enhancer regions and promoter regions using histone modification data
Visualize the signal pattern for genomic loci bound by multiple transcription factors
Build regulatory network (buildNetwork
)
Filter regulatory network (filterNetwork
)
Polish regulatory network (polishNetwork
)
Import data (importScore
)
Build gene model (geneModelFromTxdb
)
View the tracks (viewTracks
, browseTracks
)
NCBI GEO ID | SRA project | Description | package |
---|---|---|---|
GSE66081 | SRP055170 | the data of ChIP-seq raw reads for the study of association between YAP/TAZ/TEAD and AP-1 at enhancers drivers oncogenic trowth | SRAdb |
GSE49651 | the H3K27Ac, H3K4me1, H3K4me3 ChIP-seq data in breast cancer cell line | GEOquery | |
GSE59232 | transcriptomic data of Affymetrix GeneChips Human Genome U133 Plus 1.0 | GEOquery |
Here we use SRAdb package to download the sequence data.
## load library SRAdb to retrieve SRA files for GSE66081 dataset
library(SRAdb)
## SRAdb need to download a light sqlite file
## to extract sra file information
## and then download by getSRAfile function.
sqlfile <- "SRAmetadb.sqlite"
if(!file.exists(sqlfile))
sqlfile <- getSRAdbFile()
sra_con <- dbConnect(SQLite(), sqlfile)
conversion <- sraConvert(c("SRP055170"),
sra_con=sra_con)
out <- getSRAfile(conversion$sample,
sra_con=sra_con,
fileType="sra")
## extract file annotations
rs <- listSRAfile(conversion$sample,
sra_con=sra_con,
fileType="sra")
experiment <-
dbGetQuery(sra_con,
paste0("select * from experiment where",
" experiment_accession in ('",
paste(conversion$experiment,
collapse="','"), "')"))
info <- merge(experiment[, c("experiment_accession",
"title", "library_layout")],
rs[, c("experiment", "run")], by=1)
info[1:5, 2:3]
## title library_layout
## 1 GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq SINGLE -
## 2 GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq SINGLE -
## 3 GSM1614028: YAP_rep1_ChIPSeq; Homo sapiens; ChIP-Seq SINGLE -
## 4 GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq SINGLE -
## 5 GSM1614029: YAP_rep2_ChIPSeq; Homo sapiens; ChIP-Seq SINGLE -
These files are downloaded in the SRA format, and need to be unpacked to the FASTQ format prior to alignment. This can be done using the fastq-dump utility from SRA Toolkit.
runs <- info[, "run"]
## extract fastq files from sra by sratoolkit
sapply(paste0(runs, ".sra"),
function(.ele) system(paste("fastq-dump",
.ele)))
Technical replicates are merged together prior to further processing. This reflects the fact that they originate from a single library of DNA fragments.
grp <- do.call(rbind,
strsplit(info$title, "(;|:) "))
info <- cbind(info, grp)
runs <- split(runs, grp[, 2])
group <- gsub("_ChIPSeq", "", names(runs))
runs[1:2]
## $IgG_rep1_ChIPSeq
## [1] "SRR1810913" "SRR1810912" "SRR1810914"
##
## $IgG_rep2_ChIPSeq
## [1] "SRR1810915" "SRR1810917" "SRR1810916"
mapply(function(.ele, n)
system(paste("cat", paste0(.ele, ".fastq", collapse=" "),
">>", paste0(n, ".fastq"), collapse=" ")),
runs, group)
## remove unused files to save storage.
unlink(paste0(info$run, ".fastq"))
unlink(paste0(info$run, ".sra"))
Reads in each library are aligned to the hg19 build of human genome, using the Rsubread package.
library(Rsubread)
fastq.files <- paste0(group, ".fastq")
bam.files <- paste0(group, ".bam")
##getPhredOffset, if the offset is not correct, it will report error.
x <- qualityScores(filename=fastq.files[1],
input_format="FASTQ", offset=33,
nreads=1000)
x[1:3, 1:10]
## 1 2 3 4 5 6 7 8 9 10
## [1,] 34 34 34 37 37 37 35 35 39 39
## [2,] 34 34 34 37 37 37 37 37 39 39
## [3,] 34 34 34 37 37 37 37 37 35 37
library(Rsubread)
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
head(qualityScores(filename = reads,
input_format = "gzFASTQ",
offset = 33,
nreads = 100)) ## please check the errors
head(qualityScores(filename = reads,
input_format = "gzFASTQ",
offset = 64,
nreads = 100))
An index is constructed with the prefix index/hg19 and the index can be re-used. Here, the type
of sequencing data is set to 1 for genomic DNA-seq data. The uniquely mapped reads should be reported only by setting the uniqe
to TRUE.
## build index and this can be re-used.
hg19.fasta <-
"Genomes/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
dir.create("index")
buildindex(basename="index/hg19",
reference=hg19.fasta)
## alignment
align(index="./index/hg19", readfile1=fastq.files,
type=1,
output_file=bam.files, maxMismatches=2,
nthreads=2,
phredOffset=33, unique=TRUE)
Subsample the data to save time for this tutorial.
library(Rsamtools)
filter <- FilterRules(list(seqn = function(x) x$rname == "chr1"))
## sort and index
null <- sapply(group, function(gp) {
sortBam(paste0(gp, ".bam"), paste0(gp, ".srt"))
file.copy(paste0(gp, ".srt.bam"),
paste0(gp, ".bam"), overwrite = TRUE)
indexBam(paste0(gp, ".bam"))
})
## subset (optional)
null <- sapply(group, function(gp) {
dest <- paste0(gp, ".chr1.bam")
filterBam(paste0(gp, ".bam"), paste0(gp, ".chr1.bam"),
filter=filter)
file.rename(paste0(gp, ".chr1.bam"), paste0(gp, ".bam"))
indexBam(paste0(gp, ".bam"))
})
Potential PCR duplicates are removed by removeDupReads function in Rsubread package. This step can also be done by MarkDuplicates tool from the Picard software suite.
By asBam function from Rsamtools package, the alignment is sorted by their mapping location, and an index created on the destination (with extension ‘.bai’) when indexDestination=TRUE
.
## remove reads which are mapped to identical locations,
## using mapping location of the first base of each read.
## and resort by position
library(Rsamtools)
library(Rsubread)
null <- sapply(group[c(1, 11:13)], function(gp){
out <- asSam(paste0(gp, ".bam"), gp)
removeDupReads(out, threshold=2,
outputFile=paste0(gp, ".rmdup.sam"))
asBam(paste0(gp, ".rmdup.sam"), paste0(gp, ".rmdup"),
indexDestination=TRUE)
unlink(out)
unlink(paste0(gp, ".rmdup.sam"))
})
The control datasets are pooled before calling peaks according the description of the paper.
## pool IgG depend on experiment
IgG.bg <- list(rep1.2=c("IgG_rep1.rmdup.bam",
"IgG_rep2.rmdup.bam",
"IgG_rep1.2.rmdup.bam"),
rep3.4=c("IgG_rep3.rmdup.bam",
"IgG_rep4.rmdup.bam",
"IgG_rep3.4.rmdup.bam"),
rep3.5=c("IgG_rep3.rmdup.bam",
"IgG_rep5.rmdup.bam",
"IgG_rep3.5.rmdup.bam"))
null <- sapply(IgG.bg[3], function(.ele){
out <- mergeBam(files=.ele[-3],
destination = .ele[3],
indexDestination=TRUE)
})
Before we call peaks, we may want to check the difference between ChIP signal and background signal.
By ChIPpeakAnno::cumulativePercentage
function, the difference between the cumulative percentage tag allocation in Input and IP could be clearly checked (Diaz et al., 2012, RamÃrez et al., 2016).
library(ChIPpeakAnno)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
pair <- data.frame(treat=c("YAP", "TAZ", "TEAD4", "JUN"),
control=c("1.2", "1.2", "3.4", "3.5"),
stringsAsFactors = FALSE)
## use chromosome 1 to save the time.
chr1 <- as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)["chr1"], "GRanges")
## check the difference between TEAD4 ChIP and background signal
i <- 3
cumulativePercentage(
bamfiles = c(paste0(pair[i, "treat"],
"_rep1.rmdup.bam"),
paste0("IgG_rep", pair[i, "control"],
".rmdup.bam")),
gr=chr1)
See documentation from deeptools
We set loose filter condition for MACS2 to get more peaks for irreproducibility discovery rate (IDR) analysis.
for(i in seq.int(nrow(pair))){
system(paste("macs2 callpeak --to-large -t",
paste0(pair[i, "treat"], "_rep1.rmdup.bam"),
"-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"),
" -f BAM -g hs -q 0.1 -n",
paste0(pair[i, "treat"], "_rep1.q1")))
system(paste("macs2 callpeak --to-large -t",
paste0(pair[i, "treat"], "_rep2.rmdup.bam"),
"-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"),
" -f BAM -g hs -q 0.1 -n",
paste0(pair[i, "treat"], "_rep2.q1")))
## we will use idr to filter the results later
}
After peak calling, the output of MACS2 will be imported into R by ChIPpeakAnno::toGRanges
function. The top 10000 peaks sorted by pValue (FDR) will be used for IDR analysis.
library(ChIPpeakAnno)
macs2.files <- dir(".", pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele)
toGRanges(.ele, format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele)
.ele[order(.ele$pValue, decreasing=TRUE)])
toGRanges
library(ChIPpeakAnno)
packagePath <- system.file("extdata", package = "ChIPpeakAnno")
dir(packagePath, "gff|bed|narrowPeak|broadPeak|gz")
toGRanges(file.path(packagePath, "GFF_peaks.gff"), format = "GFF")
toGRanges(file.path(packagePath, "WS220.bed"), format = "BED")
toGRanges(file.path(packagePath, "peaks.narrowPeak"), format = "narrowPeak")
toGRanges(file.path(packagePath, "TAF.broadPeak"), format = "broadPeak")
toGRanges(file.path(packagePath, "wgEncodeUmassDekker5CGm12878PkV2.bed.gz"), format = "BED")
packagePath <- system.file("extdata", "macs", package = "ChIPseqStepByStep")
macs2.files <- dir(packagePath, pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele)
toGRanges(file.path(packagePath, .ele), format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele) .ele[order(.ele$pValue, decreasing=TRUE)])
There are lots of noise peaks when we set loose filter condition for MACS2. We will use IDR framework to filter the low reproducible peaks (Landt et al., 2012). Only overlapping peaks will be used for IDR calculation.
peaks <- lapply(peaks, function(.ele) .ele[1:min(1e5, length(.ele))])
lengths(peaks)
## JUN_rep1.q1_peaks.narrowPeak JUN_rep2.q1_peaks.narrowPeak
## 8877 23544
## TAZ_rep1.q1_peaks.narrowPeak TAZ_rep2.q1_peaks.narrowPeak
## 3647 4200
## TEAD4_rep1.q1_peaks.narrowPeak TEAD4_rep2.q1_peaks.narrowPeak
## 6850 1520
## YAP_rep1.q1_peaks.narrowPeak YAP_rep2.q1_peaks.narrowPeak
## 4187 2593
new.group <- gsub("^(.*?)_.*$", "\\1", names(peaks))
peaks.grp <- split(peaks, new.group)
names(peaks.grp)
## find overlapping peaks
GSE66081 <- sapply(peaks.grp, findOverlapsOfPeaks, simplify = FALSE)
GSE66081.bk <- GSE66081
GSE66081 <- sapply(GSE66081, function(.ele)
.ele$peaklist[[names(.ele$peaklist)[grepl("\\/\\/\\/",
names(.ele$peaklist))]]],
simplify=FALSE)
lengths(GSE66081)
## JUN TAZ TEAD4 YAP
## 3226 1136 693 923
The IDR are calculated by the average coverages of each overlapping peak. The reads counts for peaks are done by summarizeOverlaps function in GenomicAlignments package.
library(ChIPpeakAnno)
new.group.names <- lapply(GSE66081.bk, function(.ele)
sub(".q1_peaks.narrowPeak", ".bam",
names(.ele$peaklist)[!grepl("\\/\\/\\/", names(.ele$peaklist))]))
GSE66081 <- mapply(function(.peaks, .bamfiles)
IDRfilter(peaksA=.peaks[[1]], peaksB=.peaks[[2]],
bamfileA=.bamfiles[1], bamfileB=.bamfiles[2],
IDRcutoff=0.01),
peaks.grp, new.group.names)
## The original peaks are filtered to decrease the memory usage.
peaks.keep <- sapply(GSE66081, function(.ele)
as.character(unlist(.ele$peakNames)))
peaks.keep <- lapply(peaks.keep, function(.ele) gsub("^.*__", "", .ele))
peaks.keep <- unlist(peaks.keep)
peaks.s <- lapply(peaks, function(.ele) .ele[names(.ele) %in% peaks.keep])
peaks.unlist <- unlist(GRangesList(peaks.s), use.names = FALSE)
rtracklayer
Export the filtered peak by export function in rtracklayer package.
library(rtracklayer)
null <- mapply(function(.dat, .name)
export(.dat,
sub("^(.*?).q1.*$", "\\1.fil.bed", .name),
format="BED"),
peaks.s, names(peaks.s))
## filter the peaks by qValue < 10^-10 to get similar number peaks
## as filtered by idr to confirm the effect of IDR filter.
peaks.bk <- lapply(peaks.bk, function(.ele)
.ele[.ele$qValue>10])
null <- mapply(function(.dat, .name)
export(.dat,
sub("^(.*?).q1.*$", "\\1.fil.q01.bed", .name),
format="BED"),
peaks.bk, names(peaks.bk))
The ChIP-seq experiment could be checked by ChIPQC package.
Usally, before mapping, FastQC could be used for sequence quality accessment. And after mapping, SAMStat could be used for mapping quality analysis, and strand cross-correlation could be used for checking the experiment quality via csaw package.
## strand cross-correlation
library(csaw)
scc <- lapply(sub("^(.*?).q1.*$", "\\1.rmdup.bam", names(peaks)),
correlateReads, cross=TRUE,
param=readParam(minq=30,
restrict="chr1",
dedup=TRUE))
names(scc) <- gsub(".q1.*$", "", names(peaks))
scc.param <- lapply(scc, function(.ele){
readsLength <- 50
cutline <- 2 * readsLength
read_length <- which.max(.ele[1:cutline])
fragment_length <- which.max(.ele[cutline:length(.ele)]) + cutline - 1
baseline <- which.min(.ele[cutline:length(.ele)]) + cutline - 1
#normalized.strand.coefficient
NSC <- .ele[fragment_length] / .ele[baseline]
# relative strand correlation
RSC <- (.ele[fragment_length] - .ele[baseline])/(.ele[read_length] - .ele[baseline])
c(read_length=read_length-1,
fragment_length=fragment_length-1,
baseline=baseline-1,
NSC=NSC,
RSC=RSC)
})
The normalized strand coefficient (NSC) and relative strand correlation (RSC) are strong metrics for assessing signal-to-noise ratios in a ChIP-seq experiment. High-quality ChIP-seq data sets should have NSC values >= 1.05 and RSC values >= 0.8 (Landt et al., 2012).
do.call(rbind, scc.param)[, c("NSC", "RSC")]
## NSC RSC
## JUN_rep1 3.410316 1.4603898
## JUN_rep2 2.280298 1.1741713
## TAZ_rep1 2.185869 0.8979216
## TAZ_rep2 2.038841 0.8035722
## TEAD4_rep1 2.409461 0.9944640
## TEAD4_rep2 1.889183 0.6928578
## YAP_rep1 2.295337 0.8845790
## YAP_rep2 2.049041 0.8366900
The absolute and relative height of ‘phantom’ peak and ChIP peak are useful determinants of the success of a ChIP-seq experiment.
op <- par(mfrow=c(4, 2))
for(i in 1:length(scc)){
plot(1:length(scc[[i]])-1, scc[[i]],
xlab="Delay (bp)", ylab="cross-correlation",
main=names(scc)[i], type="l",
ylim=c(scc[[i]][scc.param[[i]]["baseline"]]*.8, max(scc[[i]])*1.1))
abline(v=scc.param[[i]]["fragment_length"], col="red", lty=3)
abline(h=scc[[i]][scc.param[[i]][c("fragment_length", "read_length", "baseline")]+1],
col=c("blue", "green", "gray30"), lty=2)
}
Now we have estimated fragment length and can generate track files for UCSC genome browser.
library(GenomicAlignments)
library(rtracklayer)
bwPath <- "bw"
dir.create(bwPath)
cvgs <- mapply(function(.ele, .name){
gal <- readGAlignments(paste0(.name, ".rmdup.bam"))
cvg <- coverage(gal, width = as.numeric(.ele["fragment_length"]))
seqinfo(cvg) <- seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)
export.bw(cvg,
con=file.path(bwPath,
paste0(.name, ".rmdup.bigWig")))
cvg
}, scc.param, names(scc.param))
ChIPQC is used for quality control.
library(ChIPQC)
samples <-
data.frame(SampleID=gsub("^(.*?).q1.*$", "\\1", names(peaks)),
Factor=gsub("^(.*?)_.*$", "\\1", names(peaks)),
Replicate=gsub("^.*?_rep(.*?).q1.*$", "\\1",
names(peaks)),
bamReads=gsub("^(.*?).q1.*$", "\\1.rmdup.bam",
names(peaks)),
Peaks=names(peaks),
PeakCaller="narrow")
samples[1:3, ]
## SampleID Factor Replicate bamReads
## 1 JUN_rep1 JUN 1 JUN_rep1.rmdup.bam
## 2 JUN_rep2 JUN 2 JUN_rep2.rmdup.bam
## 3 TAZ_rep1 TAZ 1 TAZ_rep1.rmdup.bam
## Peaks PeakCaller
## 1 JUN_rep1.q1_peaks.narrowPeak narrow
## 2 JUN_rep2.q1_peaks.narrowPeak narrow
## 3 TAZ_rep1.q1_peaks.narrowPeak narrow
We run ChIPQC for peaks before filter, filtered by IDR and qValue.
## before filter
exampleExp <- ChIPQC(experiment=samples, annotaiton="hg19")
ChIPQCreport(exampleExp, reportFolder="ChIPQC", facetBy="Factor")
## after IDR filter
samples.fil <- samples
samples.fil$Peaks <- gsub("^(.*?).q1.*$", "\\1.fil.bed", names(peaks))
samples.fil$PeakCaller="bed"
exampleExp.fil <- ChIPQC(experiment=samples.fil, annotaiton="hg19")
ChIPQCreport(exampleExp.fil,
reportFolder="ChIPQC.fil",
facetBy="Factor")
## filtered by qValue from the MACS2 result
samples.fil.q01 <- samples.fil
samples.fil.q01$Peaks <-
gsub("^(.*?).q1.*$", "\\1.fil.q01.bed", names(peaks))
exampleExp.fil.q01 <- ChIPQC(experiment=samples.fil.q01,
annotaiton="hg19")
ChIPQCreport(exampleExp.fil.q01,
reportFolder="ChIPQC.fil.q01",
facetBy="Factor")
Test the overlaps of all the peaks. From the testing result, we can confirm the widespread association between YAP, TAZ, TEAD and JUN. The vennDiagram shows the numbers of overlapping peak for these TFs.
ol <- findOverlapsOfPeaks(GSE66081, connectedPeaks="keepAll")
library(ChIPpeakAnno)
packagePath <- system.file("extdata", "filtered", package = "ChIPseqStepByStep")
filtedFiles <- dir(packagePath, ".bed")
GSE66081 <- sapply(filtedFiles, function(.ele){
toGRanges(file.path(packagePath, .ele), format="BED")
})
names(GSE66081) <- sub(".bed", "", names(GSE66081))
lengths(GSE66081)
ol <- findOverlapsOfPeaks(GSE66081)
names(ol)
names(ol$peaklist)
makeVennDiagram(ol)
makeVennDiagram(ol)
## $p.value
## JUN TAZ TEAD4 YAP pval
## [1,] 0 0 1 1 0
## [2,] 0 1 0 1 0
## [3,] 0 1 1 0 0
## [4,] 1 0 0 1 0
## [5,] 1 0 1 0 0
## [6,] 1 1 0 0 0
##
## $vennCounts
## JUN TAZ TEAD4 YAP Counts count.JUN count.TAZ count.TEAD4 count.YAP
## [1,] 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 1 26 0 0 0 26
## [3,] 0 0 1 0 48 0 0 48 0
## [4,] 0 0 1 1 17 0 0 17 17
## [5,] 0 1 0 0 80 0 80 0 0
## [6,] 0 1 0 1 29 0 29 0 29
## [7,] 0 1 1 0 28 0 28 28 0
## [8,] 0 1 1 1 63 0 63 63 63
## [9,] 1 0 0 0 2172 2172 0 0 0
## [10,] 1 0 0 1 73 73 0 0 75
## [11,] 1 0 1 0 44 44 0 44 0
## [12,] 1 0 1 1 23 23 0 23 24
## [13,] 1 1 0 0 187 187 188 0 0
## [14,] 1 1 0 1 255 256 260 0 258
## [15,] 1 1 1 0 51 53 56 53 0
## [16,] 1 1 1 1 412 418 432 417 431
## attr(,"class")
## [1] "VennCounts"
We want to confirm that not only the peaks are overlapped but also their summits are close to each other.
peaks.summit <- lapply(peaks, function(.ele) {
.ele <- shift(.ele, .ele$peak)
width(.ele) <- 1
.ele
})
TEAD.summit <- unlist(GRangesList(peaks.summit[new.group=="TEAD4"]))
TAZ.summit <- unlist(GRangesList(peaks.summit[new.group=="TAZ"]))
bof <- binOverFeature(TEAD.summit,
annotationData=TAZ.summit,
radius=300, nbins=300, FUN=length)
plot(as.numeric(rownames(bof)), bof,
ylab="Peak density",
xlab="Distance to the summit of TAZ peaks",
main="Position of TEAD4 peak summit",
type="l", xlim=c(-300, 300))
ChIPpeakAnno
The functions, toRanges
, annotatePeakInBatch
, and addGeneIDs
in the ChIPpeakAnno, the annotation of ChIP-Seq peaks becomes streamlined into four major steps:
Read peak data with toGRanges
Generate annotation data with toGRanges
Annotate peaks with annotatePeakInBatch
Add additional informations with addGeneIDs
The method toGRanges
can be used to construct GRanges object from various annotation format such as BED, GFF, user defined readable text files, EnsDb or TxDb object.
Please type ?toGRanges
for more information.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")
annoData[1]
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr19 [58858172, 58874214] -
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
library(EnsDb.Hsapiens.v75)
toGRanges(EnsDb.Hsapiens.v75)
toGRanges(EnsDb.Hsapiens.v75, feature = "transcript")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene)
toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature = "geneModel")
toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature = "threeUTR")
annotatePeakInBatch
Then we use annotatePeakInBatch or annoPeaks function in ChIPpeakAnno package to annotate the peaks. Depend on the experiment, we can annotate the peaks in multiple ways.
overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
## annotate the peaks by nearest annotations.
YAP.TAZ.TEAD4.nearest <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData)
## annotate the peaks by both side in a given range.
YAP.TAZ.TEAD4.2side <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData,
output="nearestBiDirectionalPromoters",
bindingRegion=c(-100000, 100000))
overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
YAP.TAZ.TEAD4.nearest <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData)
YAP.TAZ.TEAD4.promoter <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData,
output="nearestBiDirectionalPromoters",
bindingRegion=c(-10000, 2000))
pie1(table(YAP.TAZ.TEAD4.2side$insideFeature))
addGeneIDs
library(org.Hs.eg.db)
YAP.TAZ.TEAD4.nearest.anno <-
addGeneIDs(annotatedPeak = YAP.TAZ.TEAD4.nearest,
orgAnn = org.Hs.eg.db,
feature_id_type = "entrez_id")
YAP.TAZ.TEAD4.nearest.anno[1:3]
## GRanges object with 3 ranges and 11 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## X01.23254 chr1 [14812601, 14813726] * |
## X02.7709 chr1 [16249621, 16250159] * |
## X03.23400 chr1 [17339107, 17339960] * |
## peakNames peak
## <CharacterList> <character>
## X01.23254 TAZ__olp0130,TEAD4__olp0074,YAP__olp0097 01
## X02.7709 TEAD4__olp0089,YAP__olp0120,TAZ__olp0155 02
## X03.23400 YAP__olp0137,TAZ__olp0184,TEAD4__olp0105 03
## feature start_position end_position feature_strand
## <character> <integer> <integer> <character>
## X01.23254 23254 14925213 15444544 +
## X02.7709 7709 16268364 16302627 -
## X03.23400 23400 17312453 17338423 -
## insideFeature distancetoFeature shortestDistance
## <factor> <numeric> <integer>
## X01.23254 upstream -112612 111487
## X02.7709 downstream 53006 18205
## X03.23400 upstream -684 684
## fromOverlappingOrNearest symbol
## <character> <character>
## X01.23254 NearestLocation KAZN
## X02.7709 NearestLocation ZBTB17
## X03.23400 NearestLocation ATP13A2
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(org.Hs.eg.db)
YAP.TAZ.TEAD4.promoter.anno <-
addGeneIDs(annotatedPeak = YAP.TAZ.TEAD4.promoter,
orgAnn = org.Hs.eg.db,
feature_id_type = "entrez_id")
The annotation results can be saved in XLS file format using WriteXLS package to avoid the gene name errors that can be inadvertently introduced when opened by Excel (Zeeberg et al., 2004).
## save annotations
YAP.TAZ.TEAD4.2side.m <- as.data.frame(unname(YAP.TAZ.TEAD4.2side))
YAP.TAZ.TEAD4.2side.m$peakNames <- NULL
library(WriteXLS)
WriteXLS("YAP.TAZ.TEAD4.2side.m",
"YAP.TAZ.TEAD4.overlapping.peaks.anno.xls")