DegCre 1.0.0
DegCre can be installed from Bioconductor as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DegCre")
Alternatively, DegCre can be installed from GitHub.
With the devtools
pacakge installed, run:
devtools::install_github("brianSroberts/DegCre")
DegCre associates differentially expressed genes (DEGs) with cis-regulatory
elements (CREs) in a probabilistic, non-parametric approach. DegCre is
intended to be applied on differential expression and regulatory signal data
derived from responses to perturbations such as drug or natural ligand
treatmnents. As an example used here, we have obtained data from
McDowell et al. (McDowell et al. 2018) which was generated by treating
A549 cells with dexamethasone and measuring RNA-seq and ChIP-seq data at
several time points. Data from RNA-seq and NR3C1 ChIP-seq at four hours
versus control is stored in the list DexNR3C1
.
DegCre uses the
GenomicRanges
framework for handling genomic regions and some calculations. As one input,
DegGR
, users generate differential expression statistics for genes with
methods such as
DESeq2 or
edgeR. These
values should then be associated with gene TSSs such as those available from
EPDNew (Dreos et al. 2015; Meylan et al. 2020) in a GRanges
. The second input,
CreGR
, is differential regulatory signal data (in the example,
NR3C1 ChIP-seq data) associated with genomic regions in a GRanges
such as
those generated by
csaw.
A complete description of the mathematical basis of the DegCre core algorithms
is provided in (Roberts, Cooper, and Myers 2023). DegCre generates a Hits
object of all
associations between DegGR
and CreGR
within a specified maximum distance.
Associations are then binned by TSS-to-CRE distance according to an algorithm
that balances resolution (many bins with few members) versus minimization of
the deviance of each bin’s CRE p-value distribution from the global
distribution, selecting an optimal bin size.
Next, DegCre applies a non-parametric algorithm to find concordance between
DEG and CRE differential effects within bins and derives an association
probability. For all association probabilities involving one given CRE, the
probabilities are adjusted to favor associations across shorter distances.
An FDR of the association probability is then estimated. Results are returned
in list containing a Hits
object and both input GRanges
.
To begin, we load the necessary packages:
library(GenomicRanges)
library(InteractionSet)
library(plotgardener)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(DegCre)
library(magick)
DegCre is run on the two input GRanges
, DegGR
and CreGR
, and
differential p-values associated with each. To require the effect directions
be concordant between the changes in gene expression and CRE signal, users must
also supply log fold-changes for each and set reqEffectDirConcord=TRUE
(default value). As example data, we use DexNR3C1
derived from McDowell et
al. (McDowell et al. 2018) using DESeq2
and csaw:
data(DexNR3C1)
lapply(DexNR3C1,head)
#> $DegGR
#> GRanges object with 6 ranges and 7 metadata columns:
#> seqnames ranges strand | promGeneName GeneSymb
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] chr1 959255-959256 - | NOC2L_1 NOC2L
#> [2] chr1 960632-960633 + | KLHL17_1 KLHL17
#> [3] chr1 966481-966482 + | PLEKHN1_1 PLEKHN1
#> [4] chr1 976680-976681 - | PERM1_1 PERM1
#> [5] chr1 1000096-1000097 - | HES4_1 HES4
#> [6] chr1 1000510-1000511 + | ISG15_2 ISG15
#> GeneID baseMean logFC pVal pAdj
#> <character> <numeric> <numeric> <numeric> <numeric>
#> [1] ENSG00000188976.9 1404.01811 -0.0148658 0.89202335 0.9804131
#> [2] ENSG00000187961.12 182.81254 -0.5189710 0.00183145 0.0235433
#> [3] ENSG00000187583.9 56.95208 -0.1598313 0.50313987 0.8747718
#> [4] ENSG00000187642.8 1.41588 -1.0589853 0.49900530 NA
#> [5] ENSG00000188290.9 76.33487 -0.7547663 0.00184199 0.0236628
#> [6] ENSG00000187608.7 97.50041 -0.0390943 0.88895013 0.9803767
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
#>
#> $CreGR
#> GRanges object with 6 ranges and 3 metadata columns:
#> seqnames ranges strand | logFC pVal pAdj
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
#> [1] chr1 941905-941960 * | 1.875569 0.3827979 0.947530
#> [2] chr1 943129-943526 * | 3.713827 0.0154015 0.120363
#> [3] chr1 1686385-1686476 * | 2.030911 0.4654492 1.000000
#> [4] chr1 1741105-1741250 * | 2.061714 0.2426531 0.732045
#> [5] chr1 1777591-1777610 * | -0.366914 0.6429142 1.000000
#> [6] chr1 1777645-1777700 * | 0.241421 1.0000000 1.000000
#> -------
#> seqinfo: 23 sequences from an unspecified genome
We now run DegCre on this data with default values (subsetting input to “chr1” for expediency):
subDegGR <- DexNR3C1$DegGR[which(seqnames(DexNR3C1$DegGR)=="chr1")]
subCreGR <- DexNR3C1$CreGR[which(seqnames(DexNR3C1$CreGR)=="chr1")]
myDegCreResList <- runDegCre(DegGR=subDegGR,
DegP=subDegGR$pVal,
DegLfc=subDegGR$logFC,
CreGR=subCreGR,
CreP=subCreGR$pVal,
CreLfc=subCreGR$logFC,
reqEffectDirConcord=TRUE,
verbose=FALSE)
DegCre produces a list object with several items. These include the inputs
DegGR
and CreGR
. It also contains a Hits
object degCreHits
. The
queryHits
of degCreHits
index DegGR
and the subjectHits
index CreGR
.
degCreHits
also has several metadata columns that include the key resuls of
the algorithm. These include assocProb
which is the probabilty of the
association being true, and assocProbFDR
which is the FDR of assocProb
being greater than a null model based on TSS to CRE distance alone. Also,
the results of the distance bin optimization heuristic, binHeurOutputs
,
and the input DEG significance threshold alphaVal
are returned.
names(myDegCreResList)
#> [1] "degCreHits" "binHeurOutputs" "alphaVal" "DegGR"
#> [5] "CreGR"
head(myDegCreResList$degCreHits)
#> Hits object with 6 hits and 10 metadata columns:
#> queryHits subjectHits | assocDist assocProb assocProbFDR rawAssocProb
#> <integer> <integer> | <integer> <numeric> <numeric> <numeric>
#> [1] 1 5 | 818334 0 1 0.0691544
#> [2] 2 5 | 816957 0 1 0.0691544
#> [3] 3 5 | 811108 0 1 0.0691544
#> [4] 4 5 | 800909 0 1 0.0691544
#> [5] 5 5 | 777493 0 1 0.0708054
#> [6] 6 5 | 777079 0 1 0.0708054
#> CreP DegP DegPadj binAssocDist numObs distBinId
#> <numeric> <numeric> <numeric> <numeric> <numeric> <integer>
#> [1] 0.642914 0.89202335 1 860869 2277 13
#> [2] 0.642914 0.00183145 1 860869 2277 13
#> [3] 0.642914 0.50313987 1 860869 2277 13
#> [4] 0.642914 0.49900530 1 860869 2277 13
#> [5] 0.642914 0.00184199 1 788305 2277 12
#> [6] 0.642914 0.88895013 1 788305 2277 12
#> -------
#> queryLength: 2791 / subjectLength: 2515
To find associations driven by non-random changes in CRE signal, we subset
the degCreHits
by assocProbFDR
. We can also further subset by assocProb
to find those also more likely to confirm in a suitable validation experiment:
passFDR <- which(mcols(myDegCreResList$degCreHits)$assocProbFDR <= 0.05)
passProb <- which(mcols(myDegCreResList$degCreHits)$assocProb >= 0.25)
keepDegCreHits <- myDegCreResList$degCreHits[intersect(passFDR,passProb)]
keepDegCreHits
#> Hits object with 280 hits and 10 metadata columns:
#> queryHits subjectHits | assocDist assocProb assocProbFDR rawAssocProb
#> <integer> <integer> | <integer> <numeric> <numeric> <numeric>
#> [1] 106 12 | 21906 0.352554 2.59376e-10 0.352554
#> [2] 110 13 | 133 0.275902 4.51173e-09 0.275902
#> [3] 111 13 | 1110 0.275902 4.51173e-09 0.275902
#> [4] 117 16 | 6785 0.275902 6.91694e-09 0.275902
#> [5] 117 14 | 33803 0.382105 3.00959e-07 0.382105
#> ... ... ... . ... ... ... ...
#> [276] 2586 2139 | 25987 0.275902 6.64008e-09 0.275902
#> [277] 2586 2138 | 29695 0.252744 8.94344e-07 0.252744
#> [278] 2606 2175 | 4143 0.275902 5.90251e-09 0.275902
#> [279] 2607 2175 | 3379 0.275902 5.90251e-09 0.275902
#> [280] 2608 2175 | 2106 0.275902 5.90251e-09 0.275902
#> CreP DegP DegPadj binAssocDist numObs distBinId
#> <numeric> <numeric> <numeric> <numeric> <numeric> <integer>
#> [1] 0.000925164 3.15116e-07 8.03231e-04 48582 2277 1
#> [2] 0.026711376 9.18081e-10 2.34019e-06 48582 2277 1
#> [3] 0.026711376 9.18081e-10 2.34019e-06 48582 2277 1
#> [4] 0.025518445 1.34403e-09 3.42593e-06 48582 2277 1
#> [5] 0.000177546 1.34403e-09 3.42593e-06 48582 2277 1
#> ... ... ... ... ... ... ...
#> [276] 0.0252348 2.35586e-12 6.00508e-09 48582 2277 1
#> [277] 0.0666493 2.35586e-12 6.00508e-09 48582 2277 1
#> [278] 0.0252299 2.56393e-07 6.53546e-04 48582 2277 1
#> [279] 0.0252299 2.56393e-07 6.53546e-04 48582 2277 1
#> [280] 0.0252299 2.56393e-07 6.53546e-04 48582 2277 1
#> -------
#> queryLength: 2791 / subjectLength: 2515
We can then view the subsets of DegGR
and CreGR
that comprise the passing
associations:
myDegCreResList$DegGR[queryHits(keepDegCreHits)]
#> GRanges object with 280 ranges and 7 metadata columns:
#> seqnames ranges strand | promGeneName GeneSymb
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] chr1 6180123-6180124 - | CHD5_1 CHD5
#> [2] chr1 6259438-6259439 - | GPR153_2 GPR153
#> [3] chr1 6261063-6261064 - | GPR153_1 GPR153
#> [4] chr1 6419918-6419919 - | HES2_1 HES2
#> [5] chr1 6419918-6419919 - | HES2_1 HES2
#> ... ... ... ... . ... ...
#> [276] chr1 223845946-223845947 - | TP53BP2_1 TP53BP2
#> [277] chr1 223845946-223845947 - | TP53BP2_1 TP53BP2
#> [278] chr1 225887108-225887109 - | LEFTY1_2 LEFTY1
#> [279] chr1 225887872-225887873 - | LEFTY1_3 LEFTY1
#> [280] chr1 225889145-225889146 - | LEFTY1_1 LEFTY1
#> GeneID baseMean logFC pVal pAdj
#> <character> <numeric> <numeric> <numeric> <numeric>
#> [1] ENSG00000116254.16 52.5996 1.50353 3.15116e-07 8.03231e-04
#> [2] ENSG00000158292.6 98.8953 1.86950 9.18081e-10 2.34019e-06
#> [3] ENSG00000158292.6 98.8953 1.86950 9.18081e-10 2.34019e-06
#> [4] ENSG00000069812.10 34.9662 2.05545 1.34403e-09 3.42593e-06
#> [5] ENSG00000069812.10 34.9662 2.05545 1.34403e-09 3.42593e-06
#> ... ... ... ... ... ...
#> [276] ENSG00000143514.15 1322.437 0.678904 2.35586e-12 6.00508e-09
#> [277] ENSG00000143514.15 1322.437 0.678904 2.35586e-12 6.00508e-09
#> [278] ENSG00000243709.1 17.491 2.218767 2.56393e-07 6.53546e-04
#> [279] ENSG00000243709.1 17.491 2.218767 2.56393e-07 6.53546e-04
#> [280] ENSG00000243709.1 17.491 2.218767 2.56393e-07 6.53546e-04
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
myDegCreResList$CreGR[subjectHits(keepDegCreHits)]
#> GRanges object with 280 ranges and 3 metadata columns:
#> seqnames ranges strand | logFC pVal pAdj
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
#> [1] chr1 6157801-6158216 * | 6.06528 0.000925164 0.02104324
#> [2] chr1 6259573-6259952 * | 5.14967 0.026711376 0.17136537
#> [3] chr1 6259573-6259952 * | 5.14967 0.026711376 0.17136537
#> [4] chr1 6412897-6413132 * | 3.27381 0.025518445 0.16628814
#> [5] chr1 6385663-6386114 * | 6.95045 0.000177546 0.00954392
#> ... ... ... ... . ... ... ...
#> [276] chr1 223819867-223819958 * | 3.02428 0.0252348 0.165105
#> [277] chr1 223815871-223816250 * | 2.93106 0.0666493 0.321799
#> [278] chr1 225891253-225891272 * | 2.55296 0.0252299 0.165094
#> [279] chr1 225891253-225891272 * | 2.55296 0.0252299 0.165094
#> [280] chr1 225891253-225891272 * | 2.55296 0.0252299 0.165094
#> -------
#> seqinfo: 23 sequences from an unspecified genome
The choice of the threshold for DEG significance (DEG \(\alpha\)) is important in DegCre calculations. Accordingly, DegCre can be run over a set of thresholds to determine an optimal value. Here we select DEG \(\alpha\) by finding the value that maximizes a normalized Precision-Recall Area Under the Curve (AUC):
alphaOptList <- optimizeAlphaDegCre(DegGR=subDegGR,
DegP=subDegGR$pVal,
DegLfc=subDegGR$logFC,
CreGR=subCreGR,
CreP=subCreGR$pVal,
CreLfc=subCreGR$logFC,
reqEffectDirConcord=TRUE,
verbose=FALSE)
alphaOptList$alphaPRMat
#> alphaVal AUC deltaAUC normDeltaAUC
#> alpha_0.005 0.005 0.02226499 0.01479536 0.01490670
#> alpha_0.01 0.010 0.02358739 0.01562898 0.01575436
#> alpha_0.02 0.020 0.02805818 0.01910732 0.01927989
#> alpha_0.03 0.030 0.02897679 0.01949782 0.01968441
#> alpha_0.05 0.050 0.02764595 0.01863172 0.01880120
#> alpha_0.1 0.100 0.02628054 0.01780109 0.01795333
bestAlphaId <- which.max(alphaOptList$alphaPRMat[,4])
bestDegCreResList <- alphaOptList$degCreResListsByAlpha[[bestAlphaId]]
With an optimal DegCre result obtained, we can plot the association probabilities versus binned association distance to see the relationship between the two:
par(mai=c(0.8,1.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=2)
binStats <- plotDegCreAssocProbVsDist(degCreResList=bestDegCreResList,
assocProbFDRThresh=0.05)
The total number of associations passing a reasonable FDR (0.05 in the figure) drops fairly quickly as bin TSS to CRE distance increases (top panel). Similarly, the median association probability (black line in lower panel) drops with distance as well. The blue range is the interquartile range of the association probabilities. The red line in the above plot represents the null association probability, calculated as the probability if all CRE p-values in a given bin were identical.
The binning of the associations by distance is also important to the operation
of the DegCre algorithm. The choice of the bin size (number of associations
per bin) is done by an optimization heuristic that tries a wide range of bin
sizes and calculates the KS test statistic for each bin’s CRE p-value
distribution compared against the global (unbinned) CRE p-value distribution.
The algorithm finds the smallest bin size for which the median KS-statistic
is within a specified fraction, fracMinKsMedianThresh
, of the range of that
for all tested bin sizes. We visualize the results of this process:
par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=1.5)
plotDegCreBinHeuristic(degCreResList=bestDegCreResList)
The red dot indicates the chosen bin size run with the default value of ‘fracMinKsMedianThresh = 0.2’.
Another way to evaluate the performance of the DegCre on a data set is to make a Precision-Recall (PR) curve. To calculate these values, we define as a perfect set of associations, i.e. one that would generate a PR Area Under the Curve (AUC) of 1, as containing associations to all significant DEGs with uniform association probabilities of 1. Such a set will not result from real data, but it serves as a useful benchmark. Comparison of relative values of PR AUCs for DegCre results will be most informative. We plot the PR curve for the example data:
par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=1.5)
degCrePRAUC(degCreResList=bestDegCreResList)
The black line is the performance of the DegCre associations and the red region is values obtained from random shuffles.
Since DegCre calculates association probabilities, one can estimate the
number of expected associations per significant DEG by summing the total
associations that pass a chosen FDR. DEGs with many expected associations
are more likely to have true associations. This can be done with the
getExpectAssocPerDEG()
function:
expectAssocPerDegDf <- getExpectAssocPerDEG(degCreResList=bestDegCreResList,
geneNameColName="GeneSymb",
assocAlpha=0.05)
The result is a DataFrame
with one entry per gene, sorted by the expected
number of DEGs, `expectAssocs’.
head(expectAssocPerDegDf)
#> DataFrame with 6 rows and 11 columns
#> promGeneName GeneSymb GeneID baseMean logFC pVal
#> <character> <character> <character> <numeric> <numeric> <numeric>
#> 1 ZBTB7B_2 ZBTB7B ENSG00000160685.12 442.688 0.956975 9.65209e-10
#> 2 NFIA_4 NFIA ENSG00000162599.14 2914.410 0.882264 8.26113e-12
#> 3 TGFBR3_3 TGFBR3 ENSG00000069702.9 1235.361 0.931604 1.59219e-17
#> 4 COA7_1 COA7 ENSG00000162377.5 1191.281 0.442608 7.74760e-06
#> 5 ZC3H12A_2 ZC3H12A ENSG00000163874.8 567.067 1.528620 1.71215e-34
#> 6 SHC1_2 SHC1 ENSG00000160691.17 7353.804 0.623721 1.76309e-08
#> pAdj expectAssocs nAssocs assocAlpha degAlpha
#> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 2.46032e-06 10.31544 33 0.05 0.03
#> 2 2.10576e-08 8.92411 36 0.05 0.03
#> 3 4.05850e-14 6.89431 24 0.05 0.03
#> 4 1.97486e-02 6.83219 25 0.05 0.03
#> 5 4.36426e-31 6.72306 18 0.05 0.03
#> 6 4.49412e-05 6.55873 21 0.05 0.03
This result can be plotted as a histogram by the function
plotExpectedAssocsPerDeg()
:
par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
plotExpectedAssocsPerDeg(expectAssocPerDegDf=expectAssocPerDegDf)
The dashed line indicates the median expected associations per DEG.
One of the genes with a high number of expected associations and also a known
target of NR3C1 (Glucocorticoid receptor) is ERRFI1. Our test data
DexNR3C1
is derived from ChIP-seq of NR3C1 in cells treated with the
powerful glucocorticoid, dexamethasone, so it is not surprising to find it
with many expected associations. To visualize these associations, we can make
a browser plot of the associations and underlying data with
plotBrowserDegCre()
which relies on the
plotgardener package.
browserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
geneName="ERRFI1",
geneNameColName="GeneSymb",
CreSignalName="NR3C1",
panelTitleFontSize=8,
geneLabelFontsize=10,
plotXbegin=0.9,
plotWidth=5)
We now want to zoom in closer to the TSS of ERRFI1. To do this we specify
the region we want to plot as a GRanges
of length 1:
zoomGR <- GRanges(seqnames="chr1",
ranges=IRanges(start=7900e3,end=8400e3))
We then supply the GRanges
to the plotRegionGR
argument of
plotBrowserDegCre()
:
zoomBrowserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
plotRegionGR=zoomGR,
geneNameColName="GeneSymb",
CreSignalName="NR3C1",
panelTitleFontSize=8,
geneLabelFontsize=10,
plotXbegin=0.9,
plotWidth=5)