`waddR` package

`waddR` is an R package that provides a 2-Wasserstein distance based statistical test for detecting and describing differential distributions in one-dimensional data. It provides functions for Wasserstein distance calculation, differential distribution testing, and a specialized test for differential expression in scRNA-seq data.

The package `waddR` provides three sets of utilities to cover distinct use cases, each described in a separate vignette:

- Fast and accurate calculation of the 2-Wasserstein distance
- Two-sample test to check for differences between two distributions
- Detection of differential gene expression distributions in scRNA-seq data

These are bundled into the same package because they depend on each other internally: the procedure for detecting differential distributions in single-cell data is a refinement of the general two-sample test, which itself uses the 2-Wasserstein distance to compare two distributions.

The 2-Wasserstein distance is a metric that describes the distance between two distributions, here representing two different conditions A and B. This package specifically considers the squared 2-Wasserstein distance d := W^2, which offers a decomposition into location, size, and shape terms.
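The decomposition can be illustrated with a small base R sketch. It is not the package's implementation: the function name `squared_wass_sketch`, the choice of `K` equidistant quantiles, and the simulated samples are all assumptions made for illustration. The sketch approximates d by the mean squared difference of empirical quantiles and splits it into a location term (difference of means), a size term (difference of standard deviations), and a shape term (governed by the correlation of the two quantile functions).

```r
# Illustrative sketch (hypothetical helper, not part of waddR):
# approximate the squared 2-Wasserstein distance from K equidistant
# quantiles and split it into location, size, and shape terms.
squared_wass_sketch <- function(a, b, K = 1000) {
  probs <- (seq_len(K) - 0.5) / K
  qa <- quantile(a, probs, names = FALSE)
  qb <- quantile(b, probs, names = FALSE)

  # Population standard deviation (divide by K), so that the three
  # terms add up exactly to the mean squared quantile difference
  pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
  sa <- pop_sd(qa)
  sb <- pop_sd(qb)
  rho <- cor(qa, qb)  # correlation between the two quantile functions

  c(distance = mean((qa - qb)^2),
    location = (mean(qa) - mean(qb))^2,
    size     = (sa - sb)^2,
    shape    = 2 * sa * sb * (1 - rho))
}

set.seed(24)
x <- rnorm(100, mean = 0, sd = 1)  # condition A
y <- rnorm(100, mean = 2, sd = 1)  # condition B, shifted in location
res <- squared_wass_sketch(x, y)
print(res)
```

For two samples that differ mainly by a shift in the mean, as here, the location term should account for most of the distance.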

The package `waddR` offers three functions to calculate the 2-Wasserstein distance, all of which are implemented in C++ and exported to R with Rcpp for better performance. The function `wasserstein_metric` is a C++ reimplementation of the function `wasserstein1d` from the package `transport` and offers the most exact results. The functions `squared_wass_approx` and `squared_wass_decomp` compute approximations of the squared 2-Wasserstein distance, with `squared_wass_decomp` also returning the decomposition terms for location, size, and shape. See `?wasserstein_metric`, `?squared_wass_approx`, and `?squared_wass_decomp`.
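A minimal usage sketch of the three functions follows. The simulated samples `x` and `y` are assumptions for illustration, and the calls are guarded so that the snippet also runs when `waddR` is not installed. The argument `p = 2` is assumed to select the 2-Wasserstein distance, as in `transport::wasserstein1d`; consult the help pages for the authoritative signatures.

```r
set.seed(24)
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 2, sd = 1)

if (requireNamespace("waddR", quietly = TRUE)) {
  # 2-Wasserstein distance, squared to be comparable
  # with the two approximations below
  w2_squared <- waddR::wasserstein_metric(x, y, p = 2)^2
  print(w2_squared)

  # Approximation of the squared 2-Wasserstein distance
  print(waddR::squared_wass_approx(x, y))

  # Approximation together with the decomposition terms
  str(waddR::squared_wass_decomp(x, y))
}
```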

This package provides two testing procedures using the 2-Wasserstein distance to test whether two distributions F_A and F_B, given in the form of samples, are different, by specifically testing the null hypothesis H0: F_A = F_B against the alternative hypothesis H1: F_A != F_B.

The first, semi-parametric (SP) procedure uses a permutation-based test combined with a generalized Pareto distribution approximation to estimate small p-values accurately.

The second procedure (ASY) uses a test based on asymptotic theory, which is valid only if the samples can be assumed to come from continuous distributions.

See `?wasserstein.test` for more details.
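The permutation idea behind the semi-parametric procedure can be sketched in base R as follows. This is a simplified illustration, not the package's code: the helper `perm_test_sketch`, the quantile-based test statistic, and the simulated data are assumptions, and the generalized Pareto refinement for very small p-values is deliberately left out.

```r
# Illustrative sketch (hypothetical helper, not part of waddR):
# plain permutation test with a quantile-based approximation of the
# squared 2-Wasserstein distance as test statistic.
perm_test_sketch <- function(a, b, permnum = 1000, K = 200) {
  probs <- (seq_len(K) - 0.5) / K
  stat <- function(u, v) {
    mean((quantile(u, probs, names = FALSE) -
          quantile(v, probs, names = FALSE))^2)
  }

  observed <- stat(a, b)
  pooled <- c(a, b)
  n_a <- length(a)

  # Reassign the condition labels at random and recompute the statistic
  perm_stats <- replicate(permnum, {
    idx <- sample(length(pooled), n_a)
    stat(pooled[idx], pooled[-idx])
  })

  # Empirical p-value with the usual +1 correction
  (sum(perm_stats >= observed) + 1) / (permnum + 1)
}

set.seed(24)
p_shift <- perm_test_sketch(rnorm(50), rnorm(50, mean = 2), permnum = 500)
print(p_shift)
```

With a plain permutation test, the smallest attainable p-value is 1 / (permnum + 1). This limitation is what the generalized Pareto approximation in the SP procedure is meant to address.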

The package also provides a semi-parametric testing procedure based on the 2-Wasserstein distance that is specifically tailored to identify differential distributions in single-cell RNA-sequencing (scRNA-seq) data. In particular, a two-stage (TS) approach has been implemented that takes account of the specific nature of scRNA-seq data by separately testing for differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test) between two conditions.

See the documentation of the single-cell procedure `?wasserstein.sc` and the test for zero expression levels `?testZeroes` for more details.
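The split into two stages can be illustrated for a single gene with base R. This is a sketch under stated assumptions, not the model used by `testZeroes`: the simulated expression vectors `expr_a` and `expr_b` are hypothetical, and a plain logistic regression with a likelihood-ratio test stands in for the package's zero-proportion test, whose model details may differ.

```r
set.seed(24)
# Hypothetical expression values of one gene in two conditions:
# 30% zeros in condition A, 60% zeros in condition B
expr_a <- c(rep(0, 30), rexp(70, rate = 1))
expr_b <- c(rep(0, 60), rexp(40, rate = 1))

# Stage 1: test for differential proportions of zero expression
is_zero <- c(expr_a == 0, expr_b == 0)
condition <- factor(rep(c("A", "B"), c(length(expr_a), length(expr_b))))
fit_full <- glm(is_zero ~ condition, family = binomial())
fit_null <- glm(is_zero ~ 1, family = binomial())
p_zero <- anova(fit_null, fit_full, test = "Chisq")[2, "Pr(>Chi)"]
print(p_zero)

# Stage 2: only the non-zero values enter the distribution test
nonzero_a <- expr_a[expr_a > 0]
nonzero_b <- expr_b[expr_b > 0]
# nonzero_a and nonzero_b would then be compared with the
# semi-parametric 2-Wasserstein distance-based test
```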

To install `waddR` from Bioconductor, use `BiocManager` with the following commands:

```
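# Install BiocManager first if it is not available yet
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install waddR from Bioconductor
BiocManager::install("waddR")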
```

Using `BiocManager`, the package can also be installed directly from its GitHub repository.

The package `waddR` can then be loaded in R with `library(waddR)`. The session information from the build of this document is shown below:

```
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] waddR_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] MatrixGenerics_1.2.0 Biobase_2.50.0
#> [3] httr_1.4.2 bit64_4.0.5
#> [5] splines_4.0.3 Formula_1.2-4
#> [7] assertthat_0.2.1 statmod_1.4.35
#> [9] stats4_4.0.3 BiocFileCache_1.14.0
#> [11] latticeExtra_0.6-29 blob_1.2.1
#> [13] GenomeInfoDbData_1.2.4 yaml_2.2.1
#> [15] backports_1.1.10 pillar_1.4.6
#> [17] RSQLite_2.2.1 lattice_0.20-41
#> [19] glue_1.4.2 digest_0.6.27
#> [21] GenomicRanges_1.42.0 RColorBrewer_1.1-2
#> [23] XVector_0.30.0 checkmate_2.0.0
#> [25] minqa_1.2.4 colorspace_1.4-1
#> [27] htmltools_0.5.0 Matrix_1.2-18
#> [29] pkgconfig_2.0.3 zlibbioc_1.36.0
#> [31] purrr_0.3.4 scales_1.1.1
#> [33] jpeg_0.1-8.1 lme4_1.1-25
#> [35] BiocParallel_1.24.0 arm_1.11-2
#> [37] tibble_3.0.4 htmlTable_2.1.0
#> [39] generics_0.0.2 IRanges_2.24.0
#> [41] ggplot2_3.3.2 ellipsis_0.3.1
#> [43] SummarizedExperiment_1.20.0 nnet_7.3-14
#> [45] BiocGenerics_0.36.0 survival_3.2-7
#> [47] magrittr_1.5 crayon_1.3.4
#> [49] memoise_1.1.0 evaluate_0.14
#> [51] nlme_3.1-150 MASS_7.3-53
#> [53] foreign_0.8-80 data.table_1.13.2
#> [55] tools_4.0.3 lifecycle_0.2.0
#> [57] matrixStats_0.57.0 stringr_1.4.0
#> [59] S4Vectors_0.28.0 munsell_0.5.0
#> [61] cluster_2.1.0 DelayedArray_0.16.0
#> [63] compiler_4.0.3 GenomeInfoDb_1.26.0
#> [65] rlang_0.4.8 nloptr_1.2.2.2
#> [67] grid_4.0.3 RCurl_1.98-1.2
#> [69] rstudioapi_0.11 htmlwidgets_1.5.2
#> [71] rappdirs_0.3.1 SingleCellExperiment_1.12.0
#> [73] bitops_1.0-6 base64enc_0.1-3
#> [75] rmarkdown_2.5 boot_1.3-25
#> [77] gtable_0.3.0 abind_1.4-5
#> [79] DBI_1.1.0 curl_4.3
#> [81] R6_2.4.1 gridExtra_2.3
#> [83] knitr_1.30 dplyr_1.0.2
#> [85] bit_4.0.4 Hmisc_4.4-1
#> [87] stringi_1.5.3 parallel_4.0.3
#> [89] Rcpp_1.0.5 vctrs_0.3.4
#> [91] rpart_4.1-15 png_0.1-7
#> [93] coda_0.19-4 dbplyr_1.4.4
#> [95] tidyselect_1.1.0 xfun_0.18
```