Contents

0.1 Installation

DegCre can be installed from Bioconductor as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DegCre")

Alternatively, DegCre can be installed from GitHub. With the devtools pacakge installed, run:

devtools::install_github("brianSroberts/DegCre")

0.2 Introduction

DegCre associates differentially expressed genes (DEGs) with cis-regulatory elements (CREs) in a probabilistic, non-parametric approach. DegCre is intended to be applied on differential expression and regulatory signal data derived from responses to perturbations such as drug or natural ligand treatmnents. As an example used here, we have obtained data from McDowell et al. (McDowell et al. 2018) which was generated by treating A549 cells with dexamethasone and measuring RNA-seq and ChIP-seq data at several time points. Data from RNA-seq and NR3C1 ChIP-seq at four hours versus control is stored in the list DexNR3C1.

DegCre uses the GenomicRanges framework for handling genomic regions and some calculations. As one input, DegGR, users generate differential expression statistics for genes with methods such as DESeq2 or edgeR. These values should then be associated with gene TSSs such as those available from EPDNew (Dreos et al. 2015; Meylan et al. 2020) in a GRanges. The second input, CreGR, is differential regulatory signal data (in the example, NR3C1 ChIP-seq data) associated with genomic regions in a GRanges such as those generated by csaw.

A complete description of the mathematical basis of the DegCre core algorithms is provided in (Roberts, Cooper, and Myers 2023). DegCre generates a Hits object of all associations between DegGR and CreGR within a specified maximum distance. Associations are then binned by TSS-to-CRE distance according to an algorithm that balances resolution (many bins with few members) versus minimization of the deviance of each bin’s CRE p-value distribution from the global distribution, selecting an optimal bin size.

Next, DegCre applies a non-parametric algorithm to find concordance between DEG and CRE differential effects within bins and derives an association probability. For all association probabilities involving one given CRE, the probabilities are adjusted to favor associations across shorter distances. An FDR of the association probability is then estimated. Results are returned in list containing a Hits object and both input GRanges.

0.3 Generating DegCre associations

To begin, we load the necessary packages:

library(GenomicRanges)
library(InteractionSet)
library(plotgardener)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(DegCre)
library(magick)

DegCre is run on the two input GRanges, DegGR and CreGR, and differential p-values associated with each. To require the effect directions be concordant between the changes in gene expression and CRE signal, users must also supply log fold-changes for each and set reqEffectDirConcord=TRUE (default value). As example data, we use DexNR3C1 derived from McDowell et al. (McDowell et al. 2018) using DESeq2 and csaw:

data(DexNR3C1)
lapply(DexNR3C1,head)
#> $DegGR
#> GRanges object with 6 ranges and 7 metadata columns:
#>       seqnames          ranges strand | promGeneName    GeneSymb
#>          <Rle>       <IRanges>  <Rle> |  <character> <character>
#>   [1]     chr1   959255-959256      - |      NOC2L_1       NOC2L
#>   [2]     chr1   960632-960633      + |     KLHL17_1      KLHL17
#>   [3]     chr1   966481-966482      + |    PLEKHN1_1     PLEKHN1
#>   [4]     chr1   976680-976681      - |      PERM1_1       PERM1
#>   [5]     chr1 1000096-1000097      - |       HES4_1        HES4
#>   [6]     chr1 1000510-1000511      + |      ISG15_2       ISG15
#>                   GeneID   baseMean      logFC       pVal      pAdj
#>              <character>  <numeric>  <numeric>  <numeric> <numeric>
#>   [1]  ENSG00000188976.9 1404.01811 -0.0148658 0.89202335 0.9804131
#>   [2] ENSG00000187961.12  182.81254 -0.5189710 0.00183145 0.0235433
#>   [3]  ENSG00000187583.9   56.95208 -0.1598313 0.50313987 0.8747718
#>   [4]  ENSG00000187642.8    1.41588 -1.0589853 0.49900530        NA
#>   [5]  ENSG00000188290.9   76.33487 -0.7547663 0.00184199 0.0236628
#>   [6]  ENSG00000187608.7   97.50041 -0.0390943 0.88895013 0.9803767
#>   -------
#>   seqinfo: 23 sequences from an unspecified genome; no seqlengths
#> 
#> $CreGR
#> GRanges object with 6 ranges and 3 metadata columns:
#>       seqnames          ranges strand |     logFC      pVal      pAdj
#>          <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
#>   [1]     chr1   941905-941960      * |  1.875569 0.3827979  0.947530
#>   [2]     chr1   943129-943526      * |  3.713827 0.0154015  0.120363
#>   [3]     chr1 1686385-1686476      * |  2.030911 0.4654492  1.000000
#>   [4]     chr1 1741105-1741250      * |  2.061714 0.2426531  0.732045
#>   [5]     chr1 1777591-1777610      * | -0.366914 0.6429142  1.000000
#>   [6]     chr1 1777645-1777700      * |  0.241421 1.0000000  1.000000
#>   -------
#>   seqinfo: 23 sequences from an unspecified genome

We now run DegCre on this data with default values (subsetting input to “chr1” for expediency):

subDegGR <- DexNR3C1$DegGR[which(seqnames(DexNR3C1$DegGR)=="chr1")]
subCreGR <- DexNR3C1$CreGR[which(seqnames(DexNR3C1$CreGR)=="chr1")]

myDegCreResList <- runDegCre(DegGR=subDegGR,
                             DegP=subDegGR$pVal,
                             DegLfc=subDegGR$logFC,
                             CreGR=subCreGR,
                             CreP=subCreGR$pVal,
                             CreLfc=subCreGR$logFC,
                             reqEffectDirConcord=TRUE,
                             verbose=FALSE)

DegCre produces a list object with several items. These include the inputs DegGR and CreGR. It also contains a Hits object degCreHits. The queryHits of degCreHits index DegGR and the subjectHits index CreGR. degCreHits also has several metadata columns that include the key resuls of the algorithm. These include assocProb which is the probabilty of the association being true, and assocProbFDR which is the FDR of assocProb being greater than a null model based on TSS to CRE distance alone. Also, the results of the distance bin optimization heuristic, binHeurOutputs, and the input DEG significance threshold alphaVal are returned.

names(myDegCreResList)
#> [1] "degCreHits"     "binHeurOutputs" "alphaVal"       "DegGR"         
#> [5] "CreGR"
head(myDegCreResList$degCreHits)
#> Hits object with 6 hits and 10 metadata columns:
#>       queryHits subjectHits | assocDist assocProb assocProbFDR rawAssocProb
#>       <integer>   <integer> | <integer> <numeric>    <numeric>    <numeric>
#>   [1]         1           5 |    818334         0            1    0.0691544
#>   [2]         2           5 |    816957         0            1    0.0691544
#>   [3]         3           5 |    811108         0            1    0.0691544
#>   [4]         4           5 |    800909         0            1    0.0691544
#>   [5]         5           5 |    777493         0            1    0.0708054
#>   [6]         6           5 |    777079         0            1    0.0708054
#>            CreP       DegP   DegPadj binAssocDist    numObs distBinId
#>       <numeric>  <numeric> <numeric>    <numeric> <numeric> <integer>
#>   [1]  0.642914 0.89202335         1       860869      2277        13
#>   [2]  0.642914 0.00183145         1       860869      2277        13
#>   [3]  0.642914 0.50313987         1       860869      2277        13
#>   [4]  0.642914 0.49900530         1       860869      2277        13
#>   [5]  0.642914 0.00184199         1       788305      2277        12
#>   [6]  0.642914 0.88895013         1       788305      2277        12
#>   -------
#>   queryLength: 2791 / subjectLength: 2515

To find associations driven by non-random changes in CRE signal, we subset the degCreHits by assocProbFDR. We can also further subset by assocProb to find those also more likely to confirm in a suitable validation experiment:

passFDR <- which(mcols(myDegCreResList$degCreHits)$assocProbFDR <= 0.05)
passProb <- which(mcols(myDegCreResList$degCreHits)$assocProb >= 0.25)

keepDegCreHits <- myDegCreResList$degCreHits[intersect(passFDR,passProb)]

keepDegCreHits
#> Hits object with 280 hits and 10 metadata columns:
#>         queryHits subjectHits | assocDist assocProb assocProbFDR rawAssocProb
#>         <integer>   <integer> | <integer> <numeric>    <numeric>    <numeric>
#>     [1]       106          12 |     21906  0.352554  2.59376e-10     0.352554
#>     [2]       110          13 |       133  0.275902  4.51173e-09     0.275902
#>     [3]       111          13 |      1110  0.275902  4.51173e-09     0.275902
#>     [4]       117          16 |      6785  0.275902  6.91694e-09     0.275902
#>     [5]       117          14 |     33803  0.382105  3.00959e-07     0.382105
#>     ...       ...         ... .       ...       ...          ...          ...
#>   [276]      2586        2139 |     25987  0.275902  6.64008e-09     0.275902
#>   [277]      2586        2138 |     29695  0.252744  8.94344e-07     0.252744
#>   [278]      2606        2175 |      4143  0.275902  5.90251e-09     0.275902
#>   [279]      2607        2175 |      3379  0.275902  5.90251e-09     0.275902
#>   [280]      2608        2175 |      2106  0.275902  5.90251e-09     0.275902
#>                CreP        DegP     DegPadj binAssocDist    numObs distBinId
#>           <numeric>   <numeric>   <numeric>    <numeric> <numeric> <integer>
#>     [1] 0.000925164 3.15116e-07 8.03231e-04        48582      2277         1
#>     [2] 0.026711376 9.18081e-10 2.34019e-06        48582      2277         1
#>     [3] 0.026711376 9.18081e-10 2.34019e-06        48582      2277         1
#>     [4] 0.025518445 1.34403e-09 3.42593e-06        48582      2277         1
#>     [5] 0.000177546 1.34403e-09 3.42593e-06        48582      2277         1
#>     ...         ...         ...         ...          ...       ...       ...
#>   [276]   0.0252348 2.35586e-12 6.00508e-09        48582      2277         1
#>   [277]   0.0666493 2.35586e-12 6.00508e-09        48582      2277         1
#>   [278]   0.0252299 2.56393e-07 6.53546e-04        48582      2277         1
#>   [279]   0.0252299 2.56393e-07 6.53546e-04        48582      2277         1
#>   [280]   0.0252299 2.56393e-07 6.53546e-04        48582      2277         1
#>   -------
#>   queryLength: 2791 / subjectLength: 2515

We can then view the subsets of DegGR and CreGR that comprise the passing associations:


myDegCreResList$DegGR[queryHits(keepDegCreHits)]
#> GRanges object with 280 ranges and 7 metadata columns:
#>         seqnames              ranges strand | promGeneName    GeneSymb
#>            <Rle>           <IRanges>  <Rle> |  <character> <character>
#>     [1]     chr1     6180123-6180124      - |       CHD5_1        CHD5
#>     [2]     chr1     6259438-6259439      - |     GPR153_2      GPR153
#>     [3]     chr1     6261063-6261064      - |     GPR153_1      GPR153
#>     [4]     chr1     6419918-6419919      - |       HES2_1        HES2
#>     [5]     chr1     6419918-6419919      - |       HES2_1        HES2
#>     ...      ...                 ...    ... .          ...         ...
#>   [276]     chr1 223845946-223845947      - |    TP53BP2_1     TP53BP2
#>   [277]     chr1 223845946-223845947      - |    TP53BP2_1     TP53BP2
#>   [278]     chr1 225887108-225887109      - |     LEFTY1_2      LEFTY1
#>   [279]     chr1 225887872-225887873      - |     LEFTY1_3      LEFTY1
#>   [280]     chr1 225889145-225889146      - |     LEFTY1_1      LEFTY1
#>                     GeneID  baseMean     logFC        pVal        pAdj
#>                <character> <numeric> <numeric>   <numeric>   <numeric>
#>     [1] ENSG00000116254.16   52.5996   1.50353 3.15116e-07 8.03231e-04
#>     [2]  ENSG00000158292.6   98.8953   1.86950 9.18081e-10 2.34019e-06
#>     [3]  ENSG00000158292.6   98.8953   1.86950 9.18081e-10 2.34019e-06
#>     [4] ENSG00000069812.10   34.9662   2.05545 1.34403e-09 3.42593e-06
#>     [5] ENSG00000069812.10   34.9662   2.05545 1.34403e-09 3.42593e-06
#>     ...                ...       ...       ...         ...         ...
#>   [276] ENSG00000143514.15  1322.437  0.678904 2.35586e-12 6.00508e-09
#>   [277] ENSG00000143514.15  1322.437  0.678904 2.35586e-12 6.00508e-09
#>   [278]  ENSG00000243709.1    17.491  2.218767 2.56393e-07 6.53546e-04
#>   [279]  ENSG00000243709.1    17.491  2.218767 2.56393e-07 6.53546e-04
#>   [280]  ENSG00000243709.1    17.491  2.218767 2.56393e-07 6.53546e-04
#>   -------
#>   seqinfo: 23 sequences from an unspecified genome; no seqlengths

myDegCreResList$CreGR[subjectHits(keepDegCreHits)]
#> GRanges object with 280 ranges and 3 metadata columns:
#>         seqnames              ranges strand |     logFC        pVal       pAdj
#>            <Rle>           <IRanges>  <Rle> | <numeric>   <numeric>  <numeric>
#>     [1]     chr1     6157801-6158216      * |   6.06528 0.000925164 0.02104324
#>     [2]     chr1     6259573-6259952      * |   5.14967 0.026711376 0.17136537
#>     [3]     chr1     6259573-6259952      * |   5.14967 0.026711376 0.17136537
#>     [4]     chr1     6412897-6413132      * |   3.27381 0.025518445 0.16628814
#>     [5]     chr1     6385663-6386114      * |   6.95045 0.000177546 0.00954392
#>     ...      ...                 ...    ... .       ...         ...        ...
#>   [276]     chr1 223819867-223819958      * |   3.02428   0.0252348   0.165105
#>   [277]     chr1 223815871-223816250      * |   2.93106   0.0666493   0.321799
#>   [278]     chr1 225891253-225891272      * |   2.55296   0.0252299   0.165094
#>   [279]     chr1 225891253-225891272      * |   2.55296   0.0252299   0.165094
#>   [280]     chr1 225891253-225891272      * |   2.55296   0.0252299   0.165094
#>   -------
#>   seqinfo: 23 sequences from an unspecified genome

The choice of the threshold for DEG significance (DEG \(\alpha\)) is important in DegCre calculations. Accordingly, DegCre can be run over a set of thresholds to determine an optimal value. Here we select DEG \(\alpha\) by finding the value that maximizes a normalized Precision-Recall Area Under the Curve (AUC):

alphaOptList <- optimizeAlphaDegCre(DegGR=subDegGR,
                                    DegP=subDegGR$pVal,
                                    DegLfc=subDegGR$logFC,
                                    CreGR=subCreGR,
                                    CreP=subCreGR$pVal,
                                    CreLfc=subCreGR$logFC,
                                    reqEffectDirConcord=TRUE,
                                    verbose=FALSE)

alphaOptList$alphaPRMat
#>             alphaVal        AUC   deltaAUC normDeltaAUC
#> alpha_0.005    0.005 0.02226499 0.01479536   0.01490670
#> alpha_0.01     0.010 0.02358739 0.01562898   0.01575436
#> alpha_0.02     0.020 0.02805818 0.01910732   0.01927989
#> alpha_0.03     0.030 0.02897679 0.01949782   0.01968441
#> alpha_0.05     0.050 0.02764595 0.01863172   0.01880120
#> alpha_0.1      0.100 0.02628054 0.01780109   0.01795333

bestAlphaId <- which.max(alphaOptList$alphaPRMat[,4])

bestDegCreResList <- alphaOptList$degCreResListsByAlpha[[bestAlphaId]]

0.4 Visualizing DegCre results

With an optimal DegCre result obtained, we can plot the association probabilities versus binned association distance to see the relationship between the two:

par(mai=c(0.8,1.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=2)

binStats <- plotDegCreAssocProbVsDist(degCreResList=bestDegCreResList,
                                      assocProbFDRThresh=0.05)

The total number of associations passing a reasonable FDR (0.05 in the figure) drops fairly quickly as bin TSS to CRE distance increases (top panel). Similarly, the median association probability (black line in lower panel) drops with distance as well. The blue range is the interquartile range of the association probabilities. The red line in the above plot represents the null association probability, calculated as the probability if all CRE p-values in a given bin were identical.

The binning of the associations by distance is also important to the operation of the DegCre algorithm. The choice of the bin size (number of associations per bin) is done by an optimization heuristic that tries a wide range of bin sizes and calculates the KS test statistic for each bin’s CRE p-value distribution compared against the global (unbinned) CRE p-value distribution. The algorithm finds the smallest bin size for which the median KS-statistic is within a specified fraction, fracMinKsMedianThresh, of the range of that for all tested bin sizes. We visualize the results of this process:

par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=1.5)

plotDegCreBinHeuristic(degCreResList=bestDegCreResList)

The red dot indicates the chosen bin size run with the default value of ‘fracMinKsMedianThresh = 0.2’.

Another way to evaluate the performance of the DegCre on a data set is to make a Precision-Recall (PR) curve. To calculate these values, we define as a perfect set of associations, i.e. one that would generate a PR Area Under the Curve (AUC) of 1, as containing associations to all significant DEGs with uniform association probabilities of 1. Such a set will not result from real data, but it serves as a useful benchmark. Comparison of relative values of PR AUCs for DegCre results will be most informative. We plot the PR curve for the example data:

par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=1.5)

degCrePRAUC(degCreResList=bestDegCreResList)

The black line is the performance of the DegCre associations and the red region is values obtained from random shuffles.

0.5 Obtaining the number of associations per DEG

Since DegCre calculates association probabilities, one can estimate the number of expected associations per significant DEG by summing the total associations that pass a chosen FDR. DEGs with many expected associations are more likely to have true associations. This can be done with the getExpectAssocPerDEG() function:

expectAssocPerDegDf <- getExpectAssocPerDEG(degCreResList=bestDegCreResList,
                                            geneNameColName="GeneSymb",
                                            assocAlpha=0.05)

The result is a DataFrame with one entry per gene, sorted by the expected number of DEGs, `expectAssocs’.

head(expectAssocPerDegDf)
#> DataFrame with 6 rows and 11 columns
#>   promGeneName    GeneSymb             GeneID  baseMean     logFC        pVal
#>    <character> <character>        <character> <numeric> <numeric>   <numeric>
#> 1     ZBTB7B_2      ZBTB7B ENSG00000160685.12   442.688  0.956975 9.65209e-10
#> 2       NFIA_4        NFIA ENSG00000162599.14  2914.410  0.882264 8.26113e-12
#> 3     TGFBR3_3      TGFBR3  ENSG00000069702.9  1235.361  0.931604 1.59219e-17
#> 4       COA7_1        COA7  ENSG00000162377.5  1191.281  0.442608 7.74760e-06
#> 5    ZC3H12A_2     ZC3H12A  ENSG00000163874.8   567.067  1.528620 1.71215e-34
#> 6       SHC1_2        SHC1 ENSG00000160691.17  7353.804  0.623721 1.76309e-08
#>          pAdj expectAssocs   nAssocs assocAlpha  degAlpha
#>     <numeric>    <numeric> <numeric>  <numeric> <numeric>
#> 1 2.46032e-06     10.31544        33       0.05      0.03
#> 2 2.10576e-08      8.92411        36       0.05      0.03
#> 3 4.05850e-14      6.89431        24       0.05      0.03
#> 4 1.97486e-02      6.83219        25       0.05      0.03
#> 5 4.36426e-31      6.72306        18       0.05      0.03
#> 6 4.49412e-05      6.55873        21       0.05      0.03

This result can be plotted as a histogram by the function plotExpectedAssocsPerDeg():

par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
plotExpectedAssocsPerDeg(expectAssocPerDegDf=expectAssocPerDegDf)

The dashed line indicates the median expected associations per DEG.

0.6 Browser views of DegCre results for genes or regions

One of the genes with a high number of expected associations and also a known target of NR3C1 (Glucocorticoid receptor) is ERRFI1. Our test data DexNR3C1 is derived from ChIP-seq of NR3C1 in cells treated with the powerful glucocorticoid, dexamethasone, so it is not surprising to find it with many expected associations. To visualize these associations, we can make a browser plot of the associations and underlying data with plotBrowserDegCre() which relies on the plotgardener package.

browserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
                                 geneName="ERRFI1",
                                 geneNameColName="GeneSymb",
                                 CreSignalName="NR3C1",
                                 panelTitleFontSize=8,
                                 geneLabelFontsize=10,
                                 plotXbegin=0.9,
                                 plotWidth=5)

We now want to zoom in closer to the TSS of ERRFI1. To do this we specify the region we want to plot as a GRanges of length 1:

zoomGR <- GRanges(seqnames="chr1",
                  ranges=IRanges(start=7900e3,end=8400e3))

We then supply the GRanges to the plotRegionGR argument of plotBrowserDegCre():

zoomBrowserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
                                                   plotRegionGR=zoomGR,
                                                   geneNameColName="GeneSymb",
                                                   CreSignalName="NR3C1",
                                                   panelTitleFontSize=8,
                                                   geneLabelFontsize=10,
                                                   plotXbegin=0.9,
                                                   plotWidth=5)