In this vignette, we provide a brief overview of the *ChIPexoQual* package. This package provides a statistical quality control (QC) pipeline that enables the exploration and analysis of ChIP-exo/nexus experiments. In this vignette we used the reads aligned to **chr1** in the mouse liver ChIP-exo experiment (Serandour et al. 2013) to illustrate the use of the pipeline. To load the packages we use:

```
library(ChIPexoQual)
library(ChIPexoQualExample)
```

*ChIPexoQual* takes a set of aligned reads from a ChIP-exo (or ChIP-nexus) experiment as input and performs the following steps:

- Identify read islands, i.e., overlapping clusters of reads separated by gaps, from read coverage.
- Compute \(D_i\), number of reads in island \(i\), and \(U_i\), number of island \(i\) positions with at least one aligning read, \(i=1, \cdots, I\).
- For each island \(i\), \(i=1, \cdots, I\) compute island statistics: \[ \begin{align*} \mbox{ARC}_i &= \frac{D_i}{W_i}, \quad \mbox{URC}_i = \frac{U_i}{D_i}, \\ %\mbox{URC}_i &= \frac{U_i}{D_i}, \\ \mbox{FSR}_i &= \frac{(\text{Number of forward strand reads aligning to island $i$})}{D_i}, \end{align*} \]

- Generate diagnostic plots (i) URC vs.Â ARC plot; (ii) Region Composition plot; (iii) FSR distribution plot.
- Randomly sample \(M\) (at least 1000) islands and fit, \[ \begin{align*} D_i = \beta_1 U_i + \beta_2 W_i + \varepsilon_i, \end{align*} \] where \(\varepsilon_i\) denotes the independent error term. Repeat this process \(B\) times and generate box plots of estimated \(\beta_1\) and \(\beta_2\).

We analyzed a larger collection of ChIP-exo/nexus experiments in (Welch et al. 2016) including complete versions of this samples.

The minimum input to use *ChIPexoQual* are the aligned reads of a ChIP-exo/nexus experiment. *ChIPexoQual* accepts either the name of the bam file or the reads in a *GAlignments* object:

```
files = list.files(system.file("extdata",
package = "ChIPexoQualExample"),full.names = TRUE)
basename(files[1])
```

`## [1] "ChIPexo_carroll_FoxA1_mouse_rep1_chr1.bam"`

```
ex1 = ExoData(file = files[1],mc.cores = 2L,verbose = FALSE)
ex1
```

```
## ExoData object with 655785 ranges and 11 metadata columns:
## seqnames ranges strand | fwdReads revReads
## <Rle> <IRanges> <Rle> | <integer> <integer>
## [1] chr1 [3000941, 3000976] * | 2 0
## [2] chr1 [3001457, 3001492] * | 0 1
## [3] chr1 [3001583, 3001618] * | 0 2
## [4] chr1 [3001647, 3001682] * | 1 0
## [5] chr1 [3001852, 3001887] * | 1 0
## ... ... ... ... . ... ...
## [655781] chr1 [197192012, 197192047] * | 0 1
## [655782] chr1 [197192421, 197192456] * | 0 1
## [655783] chr1 [197193059, 197193094] * | 1 0
## [655784] chr1 [197193694, 197193729] * | 0 3
## [655785] chr1 [197194986, 197195021] * | 0 2
## fwdPos revPos depth uniquePos ARC
## <integer> <integer> <integer> <integer> <numeric>
## [1] 1 0 2 1 0.0555555555555556
## [2] 0 1 1 1 0.0277777777777778
## [3] 0 1 2 1 0.0555555555555556
## [4] 1 0 1 1 0.0277777777777778
## [5] 1 0 1 1 0.0277777777777778
## ... ... ... ... ... ...
## [655781] 0 1 1 1 0.0277777777777778
## [655782] 0 1 1 1 0.0277777777777778
## [655783] 1 0 1 1 0.0277777777777778
## [655784] 0 1 3 1 0.0833333333333333
## [655785] 0 1 2 1 0.0555555555555556
## URC FSR M A
## <numeric> <numeric> <numeric> <numeric>
## [1] 0.5 1 -Inf Inf
## [2] 1 0 -Inf -Inf
## [3] 0.5 0 -Inf -Inf
## [4] 1 1 -Inf Inf
## [5] 1 1 -Inf Inf
## ... ... ... ... ...
## [655781] 1 0 -Inf -Inf
## [655782] 1 0 -Inf -Inf
## [655783] 1 1 -Inf Inf
## [655784] 0.333333333333333 0 -Inf -Inf
## [655785] 0.5 0 -Inf -Inf
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

```
reads = readGAlignments(files[1],param = NULL)
ex2 = ExoData(reads = reads,mc.cores = 2L,verbose = FALSE)
identical(GRanges(ex1),GRanges(ex2))
```

`## [1] TRUE`

For the rest of the vignette, we generate an **ExoData** object for each replicate:

```
files = files[grep("bai",files,invert = TRUE)] ## ignore index files
exampleExoData = lapply(files,ExoData,mc.cores = 2L,verbose = FALSE)
```

Finally, we can recover the number of reads that compose a **ExoData** object by using the **nreads** function:

` sapply(exampleExoData,nreads)`

`## [1] 1654985 1766665 1670117`

To create the *ARC vs URC plot* proposed in (Welch et al. 2016), we use the **ARC_URC_plot** function. This function allows to visually compare different samples:

` ARCvURCplot(exampleExoData,names.input = paste("Rep",1:3,sep = "-"))`

This plot typically exhibits one of the following three patterns for any given sample. In all three panels we can observe two arms: the first with low **Average Read Coefficient (ARC)** and varying **Unique Read Coefficient (URC)**; and the second where the **URC** decreases as the **ARC** increases. The first and third replicates exhibit a defined decreasing trend in URC as the ARC increases. This indicates that these samples exhibit a higher ChIP enrichment than the second replicate. On the other hand, the overall URC level from the first two replicates is higher than that of the third replicate, elucidating that the libraries for the first two replicates are more complex than that of the third replicate.

To create the *FSR distribution* and *Region Composition* plots suggested in Welch et. al 2016 (submitted), we use the **FSR_dist_plot** and **region_comp_plot**, respectively.

```
p1 = regionCompplot(exampleExoData,names.input = paste("Rep",1:3,
sep = "-"),depth.values = seq_len(50))
p2 = FSRDistplot(exampleExoData,names.input = paste("Rep",1:3,sep = "-"),
quantiles = c(.25,.5,.75),depth.values = seq_len(100))
gridExtra::grid.arrange(p1,p2,nrow = 1)
```

The left panel displays the *Region Composition plot* and the right panel shows the *Forward Strand Ratio (FSR) distribution plot*, both of which highlight specific problems with replicates 2 and 3. The *Region Composition plot* exhibits apparent decreasing trends in the proportions of regions formed by fragments in one exclusive strand. High quality experiments tend to show exponential decay in the proportion of single stranded regions, while for the lower quality experiments, the trend may be linear or even constant. The FSR distributions of both of replicates 2 and 3 are more spread around their respective medians. The rate at which the FSR distribution becomes more centralized around the median indicates the aforementioned lower enrichment in the second replicate and the low complexity in the third one. The asymmetric behavior of the second replicate is characteristic of low enrichment, while the constant values of replicate three for low minimum number of reads indicate that this replicate has islands composed of reads aligned to very few unique positions.

All the plot functions in *ChIPexoQual* allow a list or several separate **ExoData** objects. This allows to explore island subsets for each replicate. For example, to show that the first arm is composed of regions formed by reads aligned to few positions, we can generate the following plot:

```
ARCvURCplot(exampleExoData[[1]],
subset(exampleExoData[[1]],uniquePos > 10),
subset(exampleExoData[[1]],uniquePos > 20),
names.input = c("All", "uniquePos > 10", "uniquePos > 20"))
```

For this figure, we used the *ARC vs URC* plot to show how several of the regions with low **ARC** values are composed by reads that align to a small number of unique positions. This technique highlights a strategy that can be followed to further explore the data, as with all the previously listed plotting functions we may compare different subsets of the islands in the partition.

The last step of the quality control pipeline is to evaluate the linear model:

\[ \begin{align*} D_i = \beta_1 U_i + \beta_2 U_2 + \epsilon_i, \end{align*} \]

The distribution of the parameters of this model is built by sampling **nregions** regions (the default value is 1,000), fitting the model and repeating the process **ntimes** (the default value is 100). We visualize the distributions of the parameters with box-plots:

```
p1 = paramDistBoxplot(exampleExoData,which.param = "beta1", names.input = paste("Rep",1:3,sep = "-"))
p2 = paramDistBoxplot(exampleExoData,which.param = "beta2", names.input = paste("Rep",1:3,sep = "-"))
gridExtra::grid.arrange(p1,p2,nrow = 1)
```