# 1 Introduction

RIVER is an R package of a probabilistic modeling framework, called RIVER (RNA-Informed Variant Effect on Regulation), that jointly analyzes personal genome (WGS) and transcriptome data (RNA-seq) to estimate the probability that a variant has regulatory impact in that individual. It is based on a generative model that assumes that genomic annotations, such as the location of a variant with respect to regulatory elements, determine the prior probability that variant is a functional regulatory variant, which is an unobserved variable. The functional regulatory variant status then influences whether nearby genes are likely to display outlier levels of gene expression in that person.

RIVER is a hierarchical Bayesian model that predicts the regulatory effects of rare variants by integrating gene expression with genomic annotations. The RIVER consists of three layers: a set of nodes $$G = G_{1}, ..., G_{P}$$ in the topmost layer representing $$P$$ observed genomic annotations over all rare variants near a particular gene, a latent binary variable $$FR$$ in the middle layer representing the unobserved funcitonal regulatory status of the rare variants, and one binary node $$E$$ in the final layer representing expression outlier status of the nearby gene. We model each conditional probability distribution as follows: $FR | G \sim Bernoulli(\psi), \psi = logit^{-1}(\beta^T, G)$ $E | FR \sim Categorical(\theta_{FR})$ $\beta_i \sim N(0, \frac{1}{\lambda})$ $\theta_{FR} \sim Beta(C,C)$ with parameters $$\beta$$ and $$\theta$$ and hyper-parameters $$\lambda$$ and $$C$$.

Because $$FR$$ is unobserved, log-likelihood objective of RIVER over instances $$n = 1, ..., N$$, $\sum_{n=1}^{N} log\sum_{FR_n= 0}^{1} P(E_n, G_n, FR_n | \beta, \theta),$ is non-convex. We therefore optimize model parameters via Expectation-Maximization (EM) as follows:

In the E-step, we compute the posterior probabilities ($$\omega_n^{(i)}$$) of the latent variables $$FR_n$$ given current parameters and observed data. For example, at the $$i$$-th iteration, the posterior probability of $$FR_n = 1$$ for the $$n$$-th instance is $\omega_{1n}^{(i)} = P(FR_n = 1 | G_n, \beta^{(i)}, E_n, \theta^{(i)}) =\frac{P(FR_n = 1 | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n = 1, \theta^{(i)})}{\sum_{FR_n = 0}^1 P(FR_n | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n, \theta^{(i)})},$ $\omega_{0n}^{(i)} = 1 - \omega_{1n}^{(i)}.$

In the M-step, at the $$i$$-th iteration, given the current estimates $$\omega^{(i)}$$, the parameters ($$\beta^{(i+1)*}$$) are estimated as $\max_{\beta^{(i+1)}} \sum_{n = 1}^N \sum_{FR_n = 0}^1 log(P(FR_n | G_n, \beta^{(i+1)})) \cdotp \omega_{FR, n}^{(i)} - \frac{\lambda}{2}||\beta^{(i+1)}||_2,$ where $$\lambda$$ is an L2 penalty hyper-parameter derived from the Gaussian prior on $$\beta$$. The parameters $$\theta$$ get updated as: $\theta_{s,t}^{(i+1)} = \sum_{n = 1}^{N} I(E_n = t) \cdotp \omega_{s,n}^{(i)} + C.$ where $$I$$ is an indicator operator, $$t$$ is the binary value of expression $$E_n$$, $$s$$ is the possible binary values of $$FR_n$$ and $$C$$ is a pseudo count derived from the Beta prior on . The E- and M-steps are applied iteratively until convergence.

# 2 Quick Start

The purpose of this section is to provide users a general sense of our package, RIVER, including components and their behaviors and applications. We will briefly go over main functions, observe basic operations and corresponding outcomes. Throughout this section, users may have better ideas about which functions are available, which values to be chosen for necessary parameters, and where to seek help. More detailed descriptions are given in later sections.

First, we load RIVER:

library("RIVER")

RIVER consists of several functions mainly supporting two main functions including evaRIVER and appRIVER, which we are about to show how to use them here. We first load simulated data created beforehand for illustration.

filename <- system.file("extdata", "simulation_RIVER.gz",
package="RIVER")
dataInput <- getData(filename) # import experimental data

getData combines different resources including genomic features, outlier status of gene expression, and N2 pairs having same rare variants into standardized data structures, called ExpressionSet class.

print(dataInput)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 18 features, 6122 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: indiv58:gene1614 indiv5:gene1331 ... indiv18:gene1111
##     (6122 total)
##   varLabels: Outlier N2pair
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
Feat <- t(Biobase::exprs(dataInput)) # genomic features (G)
Out <- as.numeric(unlist(dataInput$Outlier))-1 # outlier status (E) In the simulated data, an input object dataInput consists of 18 genomic features and expression outlier status of 6122 samples and which samples belong to N2 pairs. head(Feat) ## Feature1 Feature2 Feature3 Feature4 Feature5 ## indiv58:gene1614 0.4890338 -1.2143183 -0.5015241 -0.8093881 -0.1916957 ## indiv5:gene1331 -4.1665487 -0.6355490 -1.0914989 -2.8069442 -0.3921658 ## indiv76:gene447 -0.4469724 -0.9440314 0.7386942 0.6786997 0.6043647 ## indiv16:gene126 1.6252083 -2.2686506 1.4156013 -0.5662072 -0.1405224 ## indiv45:gene1044 0.3086791 1.0044798 0.8493856 0.3515569 -0.3763434 ## indiv6:gene258 0.9046627 1.7618399 -1.6258895 0.3961724 -0.2203089 ## Feature6 Feature7 Feature8 Feature9 Feature10 ## indiv58:gene1614 -0.06614698 0.1993153 -0.7544253 -1.3505036 -0.03477715 ## indiv5:gene1331 -0.43831006 1.8175888 -0.8411557 2.1495237 -0.48691686 ## indiv76:gene447 1.47627930 0.6521233 -0.6416004 1.0309900 -0.38262446 ## indiv16:gene126 2.11676989 1.0670951 0.3404799 -1.5970916 -0.49910751 ## indiv45:gene1044 1.45269690 -0.8368466 -1.0016646 0.1908291 -0.26598159 ## indiv6:gene258 -0.76821018 -1.3436283 -0.7201516 0.5440035 0.31006097 ## Feature11 Feature12 Feature13 Feature14 Feature15 ## indiv58:gene1614 2.08675352 0.7283183 0.15074710 1.5183579 0.2226134 ## indiv5:gene1331 2.57339374 0.4840111 0.58897093 -0.2069596 0.1000502 ## indiv76:gene447 0.30621172 0.6690835 -1.39701213 1.4853201 1.6552545 ## indiv16:gene126 0.58599041 -0.7052013 -0.07715282 -1.3326831 0.1719152 ## indiv45:gene1044 1.45029299 -0.4530850 -0.09610983 1.4731617 -0.9372256 ## indiv6:gene258 0.05174989 0.4026417 -0.50911462 -0.3623563 -2.5279770 ## Feature16 Feature17 Feature18 ## indiv58:gene1614 1.88846321 -0.4408236 -0.4558111 ## indiv5:gene1331 0.22640429 1.2535793 -1.0163254 ## indiv76:gene447 0.71720775 0.2758493 1.1378452 ## indiv16:gene126 1.12919669 0.6166061 -1.6069772 ## indiv45:gene1044 0.14794114 1.2241711 1.4916670 ## indiv6:gene258 -0.01574568 -1.5835940 -1.2366762 head(Out) ## [1] 1 1 1 0 1 1 Feat contains continuous values of genomic features (defined as $$G$$ in the objective function) while Out contains binary values representing outlier status of same samples as Feat (defined as $$E$$ in the objective function). For evaluation, we hold out pairs of individuals at genes where only those two individuals shared the same rare variants. Except for the list of instances, we train RIVER with the rest of instances, compute the RIVER score (the posterior probability of having a functional regulatory variant given both WGS and RNA-seq data) from one individual, and assess the accuracy with respect to the second individual’s held-out expression levels. Since there is currently quite few gold standard set of functional rare variants, using this labeled test data allow us to evaluate predictive accuracy of RIVER compared with genomic annotation model, GAM, that uses genomic annotations alone. We can observe a significant improvement by incorporating expression data. To do so, we simply use evaRIVER: evaROC <- evaRIVER(dataInput) ## ::: EM iteration is terminated since it converges within a ## predefined tolerance (0.001) ::: evaROC is an S4 object of class evaRIVER which contains two AUC values from RIVER and GAM, specificity and sensitivity measures from the two models, and p-value of comparing the two AUC values. cat('AUC (GAM-genomic annotation model) = ', round(evaROC$GAM_auc,3), '\n')
## AUC (GAM-genomic annotation model) =  0.58
cat('AUC (RIVER) = ', round(evaROC$RIVER_auc,3), '\n') ## AUC (RIVER) = 0.8 cat('P-value ', format.pval(evaROC$pvalue, digits=2, eps=0.001), '***\n')
## P-value  <0.001 ***

We can visualize the ROC curves with AUC values:

par(mar=c(6.1, 6.1, 4.1, 4.1))
plot(NULL, xlim=c(0,1), ylim=c(0,1),
xlab="False positive rate", ylab="True positive rate",
cex.axis=1.3, cex.lab=1.6)
abline(0, 1, col="gray")
lines(1-evaROC$RIVER_spec, evaROC$RIVER_sens,
type="s", col='dodgerblue', lwd=2)
lines(1-evaROC$GAM_spec, evaROC$GAM_sens,
type="s", col='mediumpurple', lwd=2)
legend(0.7,0.2,c("RIVER","GAM"), lty=c(1,1), lwd=c(2,2),
col=c("dodgerblue","mediumpurple"), cex=1.2,
pt.cex=1.2, bty="n")
title(main=paste("AUC: RIVER = ", round(evaROC$RIVER_auc,3), ", GAM = ", round(evaROC$GAM_auc,3),
", P = ", format.pval(evaROC$pvalue, digits=2, eps=0.001),sep="")) Each ROC curve from either RIVER or GAM is computed by comparing the posterior probability given available data for the 1st individual with the outlier status of the 2nd individual in the list of held-out pairs and vice versa. To extract posterior probabilities for prioritizing functional rare variants in any downstream analysis such as finding pathogenic rare variants in disease, you simply run appRIVER to obtain the posterior probabilities: postprobs <- appRIVER(dataInput) ## ::: EM iteration is terminated since it converges within a ## predefined tolerance (0.001) ::: postprobs is an S4 object of class appRIVER which contains subject IDs, gene names, $$P(FR = 1 | G)$$, $$P(FR = 1 | G, E)$$, and fitRIVER, all the relevant information of the fitted RIVER including hyperparamters for further use. Probabilities of rare variants being functional from RIVER and GAM for a few samples are shown below: example_probs <- data.frame(Subject=postprobs$indiv_name,
Gene=postprobs$gene_name, RIVERpp=postprobs$RIVER_posterior,
GAMpp=postprobs$GAM_posterior) head(example_probs) ## Subject Gene RIVERpp GAMpp ## 1 indiv58 gene1614 0.4303673 0.2319994 ## 2 indiv5 gene1331 0.3597397 0.1939262 ## 3 indiv76 gene447 0.5249447 0.3061179 ## 4 indiv16 gene126 0.1316062 0.2447880 ## 5 indiv45 gene1044 0.4488103 0.2370526 ## 6 indiv6 gene258 0.3598831 0.1714101 From left to right, it shows subject ID, gene name, posterior probabilities from RIVER, posterior probabilities from GAM. To observe how much we can obtain additional information on functional rare variants by integrating the outlier status of gene expression into RIVER in the following figure. plotPosteriors(postprobs, outliers=as.numeric(unlist(dataInput$Outlier))-1)