BiocNeighbors 1.21.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 721 8023 4801 9646 5951 6617 285 3988 838 3115
## [2,] 7521 6892 226 7790 7123 6024 5935 3421 4141 5838
## [3,] 2665 8087 5208 1719 9951 6887 6458 707 3885 1584
## [4,] 5862 263 2028 6839 4148 6114 3276 5446 7236 3916
## [5,] 4237 3409 123 6269 1679 176 7847 7039 5233 9520
## [6,] 529 9075 547 6515 5381 8205 3693 7316 6095 2611
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9417115 0.9950517 0.9958048 0.9971197 1.0165982 1.0181322 1.0280851
## [2,] 0.8460645 0.8566931 0.8856097 0.9042277 0.9212898 0.9335468 0.9464830
## [3,] 0.9089214 0.9196798 1.0198131 1.0452418 1.0595542 1.0792850 1.0859764
## [4,] 0.8430012 0.8734008 0.9066945 0.9180705 0.9180801 0.9494844 0.9607598
## [5,] 0.8335814 0.8614795 0.8916872 0.9547298 0.9591728 0.9783390 0.9829778
## [6,] 0.8221679 0.8768271 0.9101445 1.0719161 1.0752848 1.0768983 1.0838666
## [,8] [,9] [,10]
## [1,] 1.0372286 1.0403241 1.0413691
## [2,] 0.9611377 0.9690827 0.9925194
## [3,] 1.0870292 1.0994574 1.1009396
## [4,] 0.9665997 0.9791939 0.9825258
## [5,] 1.0041276 1.0187839 1.0219136
## [6,] 1.0907190 1.0966433 1.1435466
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8225 1158 3262 5710 2028
## [2,] 5652 6985 4640 1878 581
## [3,] 9861 3141 1976 7353 4713
## [4,] 3933 6450 8317 8333 5970
## [5,] 7105 5121 7795 6388 3093
## [6,] 6284 6909 8369 2133 2981
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9273236 0.9818574 0.9831229 1.0257785 1.042869
## [2,] 0.9344687 0.9714112 1.0039803 1.0078968 1.010282
## [3,] 0.9953610 1.0133026 1.0391417 1.0400078 1.048978
## [4,] 1.0752112 1.1016754 1.1174171 1.1382046 1.146982
## [5,] 0.9110844 0.9277552 0.9481819 0.9916597 1.016271
## [6,] 0.9159690 1.0490739 1.0814068 1.0865263 1.093359
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmp5Gg2ey/file735d348e7ccc2.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.21.2 knitr_1.46 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.41.6 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.37.1
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.49.1 cachem_1.0.8