Compositional data occur in many disciplines: geology, nutrition, economics, and ecology, to name a few. Data are compositional when each sample is sum-constrained. For example, mineral compositions describe a mineral in terms of the weight percentage coming from various elements; or taxonomic compositions break down a community by the fraction of community memebers that come from a particular species. In ecology in particular, the covariance between features is often of interest to determine which species possibly interact with each other. However, the sum constraint of compositional data makes naive measures inappropriate.

BAnOCC is a package for analyzing compositional covariance while
accounting for the compositional structure. Briefly, the model assumes
that the unobserved counts are log-normally distributed and then
infers the correlation matrix of the log-basis (see the The Model
section for a more detailed explanation). The inference is made using
No U-Turn Sampling for Hamiltonian Monte Carlo (Hoffman and Gelman 2014)
as implemented in the `rstan`

R package (Stan Development Team 2015a).

There are three options for installing BAnOCC:

- Within R
- Using compressed file from bitbucket
- Directly from bitbucket

**This is not yet available**

**This is not yet available**

Clone the repository using `git clone`

, which downloads the package as
its own directory called `banocc`

.

`git clone https://<your-user-name>@bitbucket.org/biobakery/banocc.git`

Then, install BAnOCCâ€™s dependencies. If these are already installed on your machine, this step can be skipped.

`Rscript -e "install.packages(c('rstan', 'mvtnorm', 'coda', 'stringr'))"`

Lastly, install BAnOCC using `R CMD INSTALL`

. Note that this *will
not* automatically install the dependencies, so they must be installed
first.

`R CMD INSTALL banocc`

We first need to load the package:

`library(banocc)`

`## Loading required package: rstan`

`## Loading required package: StanHeaders`

```
##
## rstan version 2.32.6 (Stan version 2.32.2)
```

```
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
```

The BAnOCC package contains four things:

`banocc_model`

, which is the BAnOCC model in the`rstan`

format`run_banocc`

, a wrapper function for`rstan::sampling`

that samples from the model and returns a list with various useful elements`get_banocc_output`

, which takes as input a`stanfit`

object outputted by`run_banocc`

, and outputs various statistics from the chains.Several test datasets which are included both as counts and as the corresponding compositions:

Dataset Description Counts Composition No correlations in the counts `counts_null`

`compositions_null`

No correlations in the counts `counts_hard_null`

`compositions_hard_null`

Positive corr. in the counts `counts_pos_spike`

`compositions_pos_spike`

Negative corr. in the counts `counts_neg_spike`

`compositions_neg_spike`

For a full and complete description of the possible parameters for
`run_banocc`

and `get_banocc_output`

, their default values, and the
output, see

```
?run_banocc
?get_banocc_output
```

There are only two required inputs to `run_banocc`

:

- The dataset
`C`

. This is assumed to be \(N \times P\), with \(N\) samples and \(P\) features. The row sums are therefore required to be less than one for all samples. - The compiled stan model
`compiled_banocc_model`

. The compiled model is required so that`run_banocc`

doesnâ€™t need to waste time compiling the model every time it is called. To compile, use`rstan::stan_model(model_code=banocc::banocc_model)`

.

The simplest way to run the model is to load a test dataset, compile the model, sample from it (this gives a warning because the default number of iterations is low), and get the output:

```
data(compositions_null)
compiled_banocc_model <- rstan::stan_model(model_code = banocc::banocc_model)
b_fit <- banocc::run_banocc(C = compositions_null, compiled_banocc_model=compiled_banocc_model)
```

```
## Warning: There were 8 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: The largest R-hat is 1.59, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

```
## Warning in evaluate_convergence(b_stanfit = Fit, verbose = verbose, num_level =
## num_level + : Fit has not converged as evaluated by the Rhat statistic. You
## might try a larger number of warmup iterations, different priors, or different
## initial values. See vignette for more on evaluating convergence.
```

`b_output <- banocc::get_banocc_output(banoccfit=b_fit)`

```
## Warning in evaluate_convergence(b_stanfit = b_stanfit, verbose = verbose, : Fit
## has not converged as evaluated by the Rhat statistic. You might try a larger
## number of warmup iterations, different priors, or different initial values. See
## vignette for more on evaluating convergence.
```

The hyperparameter values can be specified as input to
`run_banocc`

. Their names correspond to the parameters in the plate
diagram figure (see section The Model). For example,

```
p <- ncol(compositions_null)
b_fit_hp <- banocc::run_banocc(C = compositions_null,
compiled_banocc_model = compiled_banocc_model,
n = rep(0, p),
L = 10 * diag(p),
a = 0.5,
b = 0.01)
```

```
## Warning: There were 8 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 2.03, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

```
## Warning in evaluate_convergence(b_stanfit = Fit, verbose = verbose, num_level =
## num_level + : Fit has not converged as evaluated by the Rhat statistic. You
## might try a larger number of warmup iterations, different priors, or different
## initial values. See vignette for more on evaluating convergence.
```

There are several options to control the behavior of the HMC sampler
within `run_banocc`

. This is simply a call to `rstan::sampling`

, and
so many of the parameters are the same.

The number of chains, iterations, and warmup iterations as well as the
rate of thinning for `run_banocc`

can be specified using the same
parameters as for `rstan::sampling`

and `rstan::stan`

. For example,
the following code gives a total of three iterations from each of two
chains. These parameters are used only for brevity and are NOT
recommended in practice.

```
b_fit_sampling <- banocc::run_banocc(C = compositions_null,
compiled_banocc_model = compiled_banocc_model,
chains = 2,
iter = 11,
warmup = 5,
thin = 2)
```

```
## Warning: There were 4 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: There were 6 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning in evaluate_convergence(b_stanfit = Fit, verbose = verbose, num_level =
## num_level + : Fit has not converged as evaluated by the Rhat statistic. You
## might try a larger number of warmup iterations, different priors, or different
## initial values. See vignette for more on evaluating convergence.
```

The number of cores used for sampling on a multi-processor machine can also be specified, which allows chains to run in parallel and therefore decreases computation time. Since its purpose is running chains in parallel, computation time will decrease as cores are added up to when the number of cores and the number of chains are equal.

```
# This code is not run
b_fit_cores <- banocc::run_banocc(C = compositions_null,
compiled_banocc_model = compiled_banocc_model,
chains = 2,
cores = 2)
```

By default, the initial values for **\(m\)** and \(\lambda\) are sampled
from the priors and the initial values for **\(O\)** are set to the
identity matrix of dimension \(P\). Setting the initial values for
**\(O\)** to the identity helps ensure a parsimonious model fit. The
initial values can also be set to a particular value by using a list
whose length is the number of chains and whose elements are lists of
initial values for each parameter:

```
init <- list(list(m = rep(0, p),
O = diag(p),
lambda = 0.02),
list(m = runif(p),
O = 10 * diag(p),
lambda = runif(1, 0.1, 2)))
b_fit_init <- banocc::run_banocc(C = compositions_null,
compiled_banocc_model = compiled_banocc_model,
chains = 2,
init = init)
```

```
## Warning: The largest R-hat is 2.14, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

```
## Warning in evaluate_convergence(b_stanfit = Fit, verbose = verbose, num_level =
## num_level + : Fit has not converged as evaluated by the Rhat statistic. You
## might try a larger number of warmup iterations, different priors, or different
## initial values. See vignette for more on evaluating convergence.
```

More specific control of the samplerâ€™s behavior comes from the
`control`

argument to `rstan::sampling`

. Details about this argument
can be found in the help for the `rstan::stan`

function:

`?stan`

There are several parameters that control the type of output which is
returned by `get_banocc_output`

.

The width of the returned credible intervals is controlled by
`conf_alpha`

. A \(100\% * (1-\alpha_\text{conf})\) credible interval is
returned:

```
# Get 90% credible intervals
b_out_90 <- banocc::get_banocc_output(banoccfit=b_fit,
conf_alpha = 0.1)
```

```
## Warning in evaluate_convergence(b_stanfit = b_stanfit, verbose = verbose, : Fit
## has not converged as evaluated by the Rhat statistic. You might try a larger
## number of warmup iterations, different priors, or different initial values. See
## vignette for more on evaluating convergence.
```

```
# Get 99% credible intervals
b_out_99 <- banocc::get_banocc_output(banoccfit=b_fit,
conf_alpha = 0.01)
```

```
## Warning in evaluate_convergence(b_stanfit = b_stanfit, verbose = verbose, : Fit
## has not converged as evaluated by the Rhat statistic. You might try a larger
## number of warmup iterations, different priors, or different initial values. See
## vignette for more on evaluating convergence.
```

Convergence is evaluated automatically, and in this case the credible
intervals, estimates, and any additional output in section Additional
Output is missing. This behavior can be turned off using the
`eval_convergence`

option. But be careful!

```
# Default is to evaluate convergence
b_out_ec <- banocc::get_banocc_output(banoccfit=b_fit)
```

```
## Warning in evaluate_convergence(b_stanfit = b_stanfit, verbose = verbose, : Fit
## has not converged as evaluated by the Rhat statistic. You might try a larger
## number of warmup iterations, different priors, or different initial values. See
## vignette for more on evaluating convergence.
```

```
# This can be turned off using `eval_convergence`
b_out_nec <- banocc::get_banocc_output(banoccfit=b_fit,
eval_convergence = FALSE)
```

```
# Iterations are too few, so estimates are missing
b_out_ec$Estimates.median
```

```
## f_n_1 f_n_2 f_n_3 f_n_4 f_n_5 f_n_6 f_n_7 f_n_8 f_n_9
## f_n_1 NA NA NA NA NA NA NA NA NA
## f_n_2 NA NA NA NA NA NA NA NA NA
## f_n_3 NA NA NA NA NA NA NA NA NA
## f_n_4 NA NA NA NA NA NA NA NA NA
## f_n_5 NA NA NA NA NA NA NA NA NA
## f_n_6 NA NA NA NA NA NA NA NA NA
## f_n_7 NA NA NA NA NA NA NA NA NA
## f_n_8 NA NA NA NA NA NA NA NA NA
## f_n_9 NA NA NA NA NA NA NA NA NA
```

```
# Convergence was not evaluated, so estimates are not missing
b_out_nec$Estimates.median
```

```
## f_n_1 f_n_2 f_n_3 f_n_4 f_n_5
## f_n_1 1.000000000 -0.011265936 0.005928568 0.041879538 -0.008471193
## f_n_2 -0.011265936 1.000000000 0.006124928 0.011277351 0.002609208
## f_n_3 0.005928568 0.006124928 1.000000000 -0.015673194 0.045837820
## f_n_4 0.041879538 0.011277351 -0.015673194 1.000000000 -0.012022846
## f_n_5 -0.008471193 0.002609208 0.045837820 -0.012022846 1.000000000
## f_n_6 0.004334103 0.007571303 -0.005468911 -0.019430805 -0.007371658
## f_n_7 0.011617324 0.056860072 0.016275510 -0.012609530 0.015898043
## f_n_8 -0.007236235 -0.004835512 -0.030649899 -0.005453959 0.007408225
## f_n_9 -0.004141106 0.003219396 0.008350241 0.044851439 -0.009019997
## f_n_6 f_n_7 f_n_8 f_n_9
## f_n_1 0.004334103 0.011617324 -0.007236235 -0.004141106
## f_n_2 0.007571303 0.056860072 -0.004835512 0.003219396
## f_n_3 -0.005468911 0.016275510 -0.030649899 0.008350241
## f_n_4 -0.019430805 -0.012609530 -0.005453959 0.044851439
## f_n_5 -0.007371658 0.015898043 0.007408225 -0.009019997
## f_n_6 1.000000000 0.011492596 0.002911376 -0.011064775
## f_n_7 0.011492596 1.000000000 -0.031875620 -0.006561892
## f_n_8 0.002911376 -0.031875620 1.000000000 0.003722756
## f_n_9 -0.011064775 -0.006561892 0.003722756 1.000000000
```

Two types of output can be requested for each correlation that are not included by default:

- The smallest credible interval width that includes zero
- The scaled neighborhood criterion, or SNC (Li and Lin 2010)

```
# Get the smallest credible interval width that includes zero
b_out_min_width <- banocc::get_banocc_output(banoccfit=b_fit,
get_min_width = TRUE)
```

```
## Warning in evaluate_convergence(b_stanfit = b_stanfit, verbose = verbose, : Fit
## has not converged as evaluated by the Rhat statistic. You might try a larger
## number of warmup iterations, different priors, or different initial values. See
## vignette for more on evaluating convergence.
```

```
# Get the scaled neighborhood criterion
b_out_snc <- banocc::get_banocc_output(banoccfit=b_fit,
calc_snc = TRUE)
```

```
## Warning in evaluate_convergence(b_stanfit = b_stanfit, verbose = verbose, : Fit
## has not converged as evaluated by the Rhat statistic. You might try a larger
## number of warmup iterations, different priors, or different initial values. See
## vignette for more on evaluating convergence.
```

Detailed statements about the functionâ€™s execution can also be printed
using the `verbose`

argument. The relative indentation of the verbose
output indicates the nesting level of the function. The starting
indentation can be set with `num_level`

.

There are many ways of assessing convergence, but the two most easily implemented using BAnOCC are:

Traceplots of parameters, which show visually what values of a parameter have been sampled across all iterations. At convergence, the sampler should be moving rapidly across the space, and the chains should overlap well. In other words, it should look like grass.

The Rhat statistic (Gelman and Rubin 1992), which measures agreement between all the chains. It should be close to one at convergence.

Traceplots can be directly accessed using the `traceplot`

function in
the `rstan`

package, which creates a `ggplot2`

object that can be
further maniuplated to â€˜prettifyâ€™ the plot. The traceplots so
generated are for the samples drawn *after* the warmup period. For
example, we could plot the traceplots for the inverse covariances of
feature 1 with all other features. There is overlap between some of
the chains, but not all and so we conclude that we need more samples
from the posterior to be confident of convergence.

```
# The inverse covariances of feature 1 with all other features
rstan::traceplot(b_fit$Fit, pars=paste0("O[1,", 2:9, "]"))
```