Contents

1 Motivation & work flow

1.1 ChIP-seq

ChIP-seq Overview

ChIP-seq Overview

  • Tags versus sequenced reads; single-end read extension in 3’ direction
  • Strand shift / cross-correlation
  • Defined (narrow, e.g., transcription factor binding sites) versus diffuse (e.g., histone marks) peaks

1.1.1 ChIP-seq for differential binding

  • Designed experiment with replicate samples per treatment
  • Analysis using insights from microarrays / RNA-seq

1.1.2 Novel statistical issues

  • Inferring peaks without ‘data snooping’ (using the same data twice, once to infer peaks, once to estimate differential binding)
  • Retaining power
  • Minimizing false discovery rate

1.2 Work flow

  • Following Bailey et al., 2013

1.2.1 Experimental design and execution

  • Single sample

    • ChIPed transcription factor and
    • Input (fragmented genomic DNA) or control (e.g., IP with non-specific antibody such as immunoglobulin G, IgG)
  • Designed experiments

    • Replication of TF / control pairs

1.2.2 Sequencing & alignment

  • Sequencing depth rules of thumb: \(>10M\) reads for narrow peaks, \(>20M\) for broad peaks
  • Long & paired end useful but not essential – alignment in ambiguous regions
  • Basic aligners generally adequate, e.g., no need to align splice junctions
  • Sims et al., 2014

1.2.3 Peak calling

  • Very large number of peak calling programs; some specialized for e.g., narrow vs. broad peaks.
  • Commmonly used: MACS, PeakSeq, CisGenome, …
  • MACS: Model-based Analysis for ChIP-Seq, Liu et al., 2008

    • Scale control tag counts to match ChIP counts
    • Center peaks by shifting \(d/2\)
    • Model occurrence of a tag as a Poisson process
    • Look for fixed width sliding windows with exceess number of tag enrichment
    • Empirical FDR: Swap ChIP and control samples; FDR is # control peaks / # ChIP peaks
    • Output: BED file of called peaks

1.2.4 Down-stream analysis

  • Annotation: what genes are my peaks near?
  • Differential representation: which peaks are over- or under-represented in treatment 1, compared to treatment 2?
  • Motif identification (peaks over known motifs?) and discovery
  • Integrative analysis, e.g., assoication of regulatory elements and expression

1.3 Peak calling

1.3.1 ‘Known’ ranges

  • Count tags in pre-defined ranges, e.g., promoter regions of known genes
  • Obvious limitations, e.g., regulatory elements not in specified ranges; specified range contains multiple regulatory elements with complementary behavior

1.3.2 de novo windows

  • Width: narrow peaks, 1bp; broad peaks, 150bp
  • Offset: 25-100bp; influencing computational burden

1.3.3 de novo peak calling

  • Third-party software (many available; MACS commonly used)
  • Various strategies for calling peaks – Lun & Smyth, Table 1

    • Call each sample independently; intersection or union of peaks across samples, …
    • Call peaks from a pooled library
  • Relevant slides pdf

1.4 Peak calling across libraries

  • Table 1: Description of peak calling strategies. Each strategy is given an identifier and is described by the mode in which MACS is run, the libraries on which it is run and the consolidation operation (if any) performed to combine peaks between libraries or groups. For method 6, the union of the peaks in each direction of enrichment is taken.
ID Mode Library Operation
1 Single-sample Individual Union
2 Single-sample Individual Intersection
3 Single-sample Individual At least 2
4 Single-sample Pooled over group Union
5 Single-sample Pooled over group Intersection
6 Two-sample Pooled over group Union
7 Single-sample Pooled over all -
  • How to choose? – Lun & Smyth,

    • Under the null hypothesis, type I error rate is uniform
    • Table 2: consequences for type I error
    • Best strategy: call peaks from a pooled library
    • Table 2: The observed type I error rate when testing for differential enrichment using counts from each peak calling strategy. Error rates for a range of specified error thresholds are shown. All values represent the mean of 10 simulation iterations with the standard error shown in brackets. RA: reference analysis using 10 000 randomly chosen true peaks.
ID Error rate
  0.01 0.05 0.1
RA 0.010 (0.000) 0.051 (0.001) 0.100 (0.002)
1 0.002 (0.000) 0.019 (0.001) 0.053 (0.001)
2 0.003 (0.000) 0.030 (0.000) 0.073 (0.001)
3 0.006 (0.000) 0.042 (0.001) 0.092 (0.001)
4 0.033 (0.001) 0.145 (0.001) 0.261 (0.002)
5 0.000 (0.000) 0.001 (0.000) 0.005 (0.000)
6 0.088 (0.006) 0.528 (0.013) 0.893 (0.006)
7 0.010 (0.000) 0.049 (0.001) 0.098 (0.001)
## 100,000 t-tests under the null, n = 6
n <- 6; m <- matrix(rnorm(n * 100000), ncol=n)
P <- genefilter::rowttests(m, factor(rep(1:2, each=3)))$p.value
quantile(P, c(.001, .01, .05))
##     0.1%       1%       5% 
## 0.000954 0.010309 0.049512
hist(P, breaks=20)

1.4.1 de novo hybrid strategies

2 Practical: Differential binding (csaw)

This exercise is based on the csaw vignette, where more detail can be found.

2.1 Steps 1 - 4: Experimental Design … Alignment

The experiment involves changes in binding profiles of the NFYA protein between embryonic stem cells and terminal neurons. It is a subset of the data provided by Tiwari et al. 2012 available as GSE25532. There are two es (embryonic stem cell) and two tn (terminal neuron) replicates. Single-end FASTQ files were extracted from GEO, aligned using Rsubread, and post-processed (sorted and indexed) using Rsamtools with the script available at

file.path("ChIPSeq", "NFYA", "scripts", "preprocess.R")

Create a data frame summarizing the files used.

acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417",
    tn_2="SRR074418")
files <- data.frame(Treatment=sub("_.*", "", names(acc)),
    Replicate=sub(".*_", "", names(acc)),
    sra=sprintf("%s.sra", acc),
    fastq=sprintf("%s.fastq.gz", acc),
    bam=sprintf("%s.fastq.gz.subread.BAM", acc),
    row.names=acc, stringsAsFactors=FALSE)

2.2 Step 5: Reduction

Change to the directory where the BAM files are located

setwd("ChIPSeq/NFYA/bam")

Load the csaw library and count reads in overlapping windows. This returns a SummarizedExperiment, so explore it a bit…

library(csaw)
library(GenomicRanges)
frag.len <- 110
system.time({
    data <- windowCounts(files$bam, width=10, ext=frag.len)
})                                      # 156 seconds
acc <- sub(".fastq.*", "", data$bam.files)
colData(data) <- cbind(files[acc,], colData(data))

For further discussion about the reduction step, see Chapter 2 of the csaw vignette.

2.3 Step 6: Analysis

Filtering (vignette Chapter 3) Start by filtering low-count windows. There are likely to be many of these (how many?). Is there a rational way to choose the filtering threshold?

library(edgeR)     # for aveLogCPM()
keep <- aveLogCPM(assay(data)) >= -1
data <- data[keep,]

Normalization (composition bias) (vignette Chapter 4) csaw uses binned counts in normalization. The bins are large relative to the ChIP peaks, on the assumption that the bins primarily represent non-differentially bound regions. The sample bin counts are normalized using the edgeR TMM (trimmed median of M values) method seen in the RNASeq differential expression lab. Explore vignette chapter 4 for more on normalization (this is a useful resource when seeking to develop normalization methods for other protocols!).

system.time({
    binned <- windowCounts(files$bam, bin=TRUE, width=10000)
})                                      #139 second
normfacs <- normalize(binned)

Experimental design and Differential binding (vignette Chapter 5) Differential binding will be assessed using edgeR, where we need to specify the experimental design

design <- model.matrix(~Treatment, colData(data))

Apply a standard edgeR work flow to identify differentially bound regions. Creatively explore the results.

y <- asDGEList(data, norm.factors=normfacs)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit, contrast=c(0, 1))
head(results$table)
##    logFC logCPM       F  PValue
## 1  1.029  -2.38 0.39466 0.52986
## 2  1.052  -2.39 0.41573 0.51907
## 3  0.118  -2.16 0.00698 0.93341
## 4 -0.847  -1.79 0.50517 0.47724
## 5 -5.885  -2.15 7.98493 0.00472
## 6 -5.629  -2.31 6.58683 0.01027

Multiple testing (vignette Chapter 6) The challenge is that FDR across all detected differentially bound regions is what one is interested in, but what is immediately available is the FDR across differentially bound windows; region will often consist of multiple overlapping windows. As a first step, we’ll take a ‘quick and dirty’ approach to identifying regions by merging ‘high-abundance’ windows that are within, e.g., 1kb of one another

merged <- mergeWindows(rowRanges(data), tol=1000L)

Combine test results across windows within regions. Several strategies are explored in section 6.5 of the vignette.

tabcom <- combineTests(merged$id, results$table)
head(tabcom)
##   nWindows logFC.up logFC.down PValue   FDR direction
## 1        2        2          0 0.5299 0.999        up
## 2        6        0          5 0.0106 0.999      down
## 3       10        1          5 0.7301 0.999      down
## 4        7        5          2 0.0689 0.999        up
## 5        3        1          0 0.9728 0.999     mixed
## 6        1        0          1 0.3816 0.999      down

Section 6.6 of the vignette discusses approaches to identifying the ‘best’ windows within regions.

Finally, create a GRanges object that summarized the merged windows and combined test statistics.

final <- merged$region
mcols(final) <- as(tabcom, "DataFrame")

2.4 Step 7: Comprehension

2.4.1 Annotation

As an example of how results can be anntoted

library(org.Mm.eg.db)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
anno <- detailRanges(
    final, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
    orgdb=org.Mm.eg.db, promoter=c(3000, 1000), dist=5000
)
mcols(final) <- cbind(mcols(final), DataFrame(anno))

The ‘top table’ of most differentially expressed regions can be obtained by ordering final by the PValue column, perhaps first filtering to remove regions that do not overlap known promoters.

annotated <- final[nzchar(final$overlap)]
annotated[order(annotated$PValue)]
## GRanges object with 279384 ranges and 9 metadata columns:
##                  seqnames                 ranges strand |  nWindows
##                     <Rle>              <IRanges>  <Rle> | <integer>
##        [1]           chr4 [ 70373201,  70379210]      * |        37
##        [2]          chr19 [ 23357351,  23359160]      * |        15
##        [3]           chr6 [103649001, 103650160]      * |        10
##        [4] chrUn_JH584304 [       51,     70160]      * |      1245
##        [5]           chr9 [  3034201,   3038460]      * |        49
##        ...            ...                    ...    ... .       ...
##   [279380]          chr13 [ 38165651,  38166460]      * |         4
##   [279381]           chr5 [118141051, 118144410]      * |        16
##   [279382]           chr6 [108497701, 108502260]      * |        17
##   [279383]           chr8 [ 76776951,  76777110]      * |         4
##   [279384]           chr5 [123945901, 123950210]      * |        27
##             logFC.up logFC.down    PValue       FDR   direction
##            <integer>  <integer> <numeric> <numeric> <character>
##        [1]        12         12  1.07e-52  9.95e-48        down
##        [2]         4          8  1.91e-31  6.90e-27        down
##        [3]         0          8  4.14e-26  9.96e-22        down
##        [4]         5       1179  1.13e-25  2.53e-21        down
##        [5]         0         49  2.43e-24  4.77e-20        down
##        ...       ...        ...       ...       ...         ...
##   [279380]         1          0         1         1       mixed
##   [279381]         4          0         1         1       mixed
##   [279382]         0          5         1         1       mixed
##   [279383]         0          0         1         1       mixed
##   [279384]         9          7         1         1       mixed
##                    overlap             left             right
##                <character>      <character>       <character>
##        [1]    Cdk5rap2|5|-                  Cdk5rap2|4|-[981]
##        [2]      Mamdc2|7|- Mamdc2|8|-[3904]  Mamdc2|6|-[4550]
##        [3]        Chl1|7|+   Chl1|6|+[1672]                  
##        [4] Pisd-ps3|0-11|-                                   
##        [5]     Mir101c|0|-                   Mir101c|1|-[209]
##        ...             ...              ...               ...
##   [279380]         Dsp|I|+     Dsp|2|+[785]   Dsp|3-4|+[1056]
##   [279381]       Fbxw8|2|-                                   
##   [279382]      Itpr1|58|+ Itpr1|57|+[3808]  Itpr1|59|+[3617]
##   [279383]     Gm10649|I|-                                   
##   [279384]      Ccdc62|6|+ Ccdc62|5|+[1596] Ccdc62|7-9|+[989]
##   -------
##   seqinfo: 66 sequences from an unspecified genome