Filtering and Subsetting

The step_filter_taxa function is the general-purpose filter: it keeps or drops OTUs according to user-supplied across-sample abundance criteria. The other functions, step_filter_by_prevalence, step_filter_by_variance, step_filter_by_abundance, and step_filter_by_rarity, are convenience wrappers around step_filter_taxa, each filtering OTUs on a single criterion: prevalence, variance, abundance, and rarity, respectively.

The step_subset_taxa function keeps only the taxa whose value at a chosen taxonomic level matches a supplied set of names.

The phyloseq or TreeSummarizedExperiment (TSE) object used as input can be pre-filtered with whichever methods the user finds most convenient. The dar package also provides several functions that perform this filtering directly on the recipe object.

step_filter_taxa

The step_filter_taxa function applies an arbitrary set of functions to OTUs as across-sample criteria. It takes a phyloseq object as input and returns a logical vector indicating whether each OTU passed the criteria. If the “prune” option is set to TRUE, it instead returns the already-trimmed version of the phyloseq object.

library(dar)
data("metaHIV_phy")

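# Build a recipe: "RiskGroup2" is the variable of interest, "Species" the taxonomic level.
rec <- recipe(metaHIV_phy, "RiskGroup2", "Species")
# .f is evaluated on each taxon's abundance vector. With a multiplier of 0 every
# taxon passes; raise it (e.g. 0.1) to require detection in 10% of the samples.
rec <-
  step_filter_taxa(rec, .f = "function(x) sum(x > 0) >= (0 * length(x))") |>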
  prep()
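
To make the role of the predicate passed through .f more concrete, the following minimal sketch applies the same kind of function to a small, made-up count matrix in base R. It only illustrates the across-sample criterion and is not dar's internal implementation; the matrix counts, the helper keep_fun, and the 0.5 multiplier are assumptions chosen for the example.

```r
# Toy count matrix (hypothetical data): rows are taxa, columns are samples.
counts <- matrix(
  c(5, 0, 3, 2,
    0, 0, 0, 1,
    9, 4, 0, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("otu_a", "otu_b", "otu_c"), paste0("s", 1:4))
)

# Predicate: keep a taxon if it is detected in at least half of the samples.
keep_fun <- function(x) sum(x > 0) >= (0.5 * length(x))

# Evaluate the predicate on each row (taxon) to get one logical value per taxon.
keep <- apply(counts, 1, keep_fun)

# Drop the taxa that did not pass.
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Here otu_b is detected in only one of four samples, so it is the only taxon removed.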

Convenience Wrappers

step_filter_by_abundance

This function filters OTUs based on their abundance. The taxa retained in the dataset are those where the sum of their abundance is greater than the product of the total abundance and the provided threshold.

rec <- recipe(metaHIV_phy, "RiskGroup2", "Species")
rec <- 
  step_filter_by_abundance(rec, threshold = 0.01) |> 
  prep()
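
The abundance rule described above can be sketched in base R on a toy matrix. This is an illustration of the stated criterion under the assumption that abundances are summed over all samples; the data and variable names are hypothetical and the code is not taken from dar.

```r
# Toy count matrix (hypothetical data): rows are taxa, columns are samples.
counts <- matrix(
  c(50, 30, 20,
     1,  0,  0,
    10,  5,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("otu_a", "otu_b", "otu_c"), paste0("s", 1:3))
)

threshold <- 0.01

# Total abundance of each taxon across samples.
taxon_totals <- rowSums(counts)

# Keep taxa whose total exceeds threshold * total abundance of the dataset.
cutoff <- sum(counts) * threshold
keep <- taxon_totals > cutoff

filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

With a grand total of 120 counts the cutoff is 1.2, so otu_b (total 1) is dropped while the other two taxa are kept.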

step_filter_by_prevalence

This function filters OTUs based on their prevalence. The taxa retained in the dataset are those where the prevalence is greater than the provided threshold.

rec <- recipe(metaHIV_phy, "RiskGroup2", "Species")
rec <- 
  step_filter_by_prevalence(rec, threshold = 0.01) |> 
  prep()
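
As a rough illustration of a prevalence criterion, the sketch below computes prevalence as the fraction of samples in which a taxon has a non-zero count. That definition, the toy data, and the 0.3 threshold are assumptions made for the example; the exact computation inside dar may differ.

```r
# Toy count matrix (hypothetical data): rows are taxa, columns are samples.
counts <- matrix(
  c(1, 2, 0, 4, 5,
    0, 0, 0, 0, 3,
    0, 1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("otu_a", "otu_b", "otu_c"), paste0("s", 1:5))
)

threshold <- 0.3

# Prevalence: proportion of samples in which each taxon is detected.
prevalence <- rowMeans(counts > 0)

# Keep taxa whose prevalence exceeds the threshold.
keep <- prevalence > threshold

filtered <- counts[keep, , drop = FALSE]
print(round(prevalence, 2))
print(rownames(filtered))
```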

step_filter_by_rarity

This function filters OTUs based on their rarity. The taxa retained in the dataset are those where the sum of their rarity is less than the provided threshold.

rec <- recipe(metaHIV_phy, "RiskGroup2", "Species")
rec <- 
  step_filter_by_rarity(rec, threshold = 0.01) |> 
  prep()

step_filter_by_variance

This function filters OTUs based on their variance. The taxa retained in the dataset are those where the variance of their abundance is greater than the provided threshold.

rec <- recipe(metaHIV_phy, "RiskGroup2", "Species")
rec <- 
  step_filter_by_variance(rec, threshold = 0.01) |> 
  prep()
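
The variance criterion can likewise be sketched on a toy matrix. The example assumes the sample variance of raw counts (as computed by var) and a threshold of 1; both are choices made for illustration, and the data are hypothetical rather than drawn from metaHIV_phy.

```r
# Toy count matrix (hypothetical data): rows are taxa, columns are samples.
counts <- matrix(
  c(10, 12, 11,  9,
     3,  3,  3,  3,
     0, 20,  0, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("otu_a", "otu_b", "otu_c"), paste0("s", 1:4))
)

threshold <- 1

# Sample variance of each taxon's abundance across samples.
taxon_var <- apply(counts, 1, var)

# Keep taxa whose variance exceeds the threshold.
keep <- taxon_var > threshold

filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

The constant taxon otu_b has zero variance and is removed; taxa that vary between samples are retained.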

step_subset_taxa

The step_subset_taxa function subsets taxa based on their taxonomic level. The taxa retained in the dataset are those whose value at the given taxonomic level (tax_level) matches one of the provided taxa.

rec <- recipe(metaHIV_phy, "RiskGroup2", "Species")
rec <-
  step_subset_taxa(rec, tax_level = "Kingdom", taxa = c("Bacteria", "Archaea")) |>
  prep()
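
The matching logic behind this kind of subsetting can be sketched with a small taxonomy table in base R. The table, its row names, and the species labels are hypothetical; the sketch only shows how a tax_level and taxa pair selects rows and does not reproduce dar's implementation.

```r
# Toy taxonomy table (hypothetical data): one row per taxon.
taxonomy <- data.frame(
  Kingdom = c("Bacteria", "Archaea", "Eukaryota", "Bacteria"),
  Species = c("sp_1", "sp_2", "sp_3", "sp_4"),
  row.names = c("otu_a", "otu_b", "otu_c", "otu_d"),
  stringsAsFactors = FALSE
)

tax_level <- "Kingdom"
taxa <- c("Bacteria", "Archaea")

# Keep rows whose value at the chosen level is one of the requested taxa.
keep <- taxonomy[[tax_level]] %in% taxa

subset_tax <- taxonomy[keep, , drop = FALSE]
print(rownames(subset_tax))
```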

Conclusion

These functions provide a flexible way to filter and subset OTUs in the phyloseq objects contained within a recipe object. Because the operations are applied directly to the recipe, filtering becomes part of the same workflow as the rest of the analysis, which makes it easier to focus on the taxa most relevant to your research questions.

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 RC (2024-04-16 r86468)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-05-01
#>  pandoc   2.7.3 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                  * version  date (UTC) lib source
#>  abind                      1.4-5    2016-07-21 [2] CRAN (R 4.4.0)
#>  ade4                       1.7-22   2023-02-06 [2] CRAN (R 4.4.0)
#>  ape                        5.8      2024-04-11 [2] CRAN (R 4.4.0)
#>  assertthat                 0.2.1    2019-03-21 [2] CRAN (R 4.4.0)
#>  beachmat                   2.21.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  beeswarm                   0.4.0    2021-06-01 [2] CRAN (R 4.4.0)
#>  Biobase                  * 2.65.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  BiocGenerics             * 0.51.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  BiocNeighbors              1.23.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  BiocParallel               1.39.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  BiocSingular               1.21.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  biomformat                 1.33.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  Biostrings               * 2.73.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  bluster                    1.15.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  brio                       1.1.5    2024-04-24 [2] CRAN (R 4.4.0)
#>  bslib                      0.7.0    2024-03-29 [2] CRAN (R 4.4.0)
#>  ca                         0.71.1   2020-01-24 [2] CRAN (R 4.4.0)
#>  cachem                     1.0.8    2023-05-01 [2] CRAN (R 4.4.0)
#>  cli                        3.6.2    2023-12-11 [2] CRAN (R 4.4.0)
#>  cluster                    2.1.6    2023-12-01 [3] CRAN (R 4.4.0)
#>  codetools                  0.2-20   2024-03-31 [3] CRAN (R 4.4.0)
#>  colorspace                 2.1-0    2023-01-23 [2] CRAN (R 4.4.0)
#>  crayon                     1.5.2    2022-09-29 [2] CRAN (R 4.4.0)
#>  crosstalk                  1.2.1    2023-11-23 [2] CRAN (R 4.4.0)
#>  dar                      * 1.1.0    2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  data.table                 1.15.4   2024-03-30 [2] CRAN (R 4.4.0)
#>  DBI                        1.2.2    2024-02-16 [2] CRAN (R 4.4.0)
#>  DECIPHER                   3.1.0    2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  decontam                   1.25.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  DelayedArray               0.31.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  DelayedMatrixStats         1.27.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  dendextend                 1.17.1   2023-03-25 [2] CRAN (R 4.4.0)
#>  devtools                   2.4.5    2022-10-11 [2] CRAN (R 4.4.0)
#>  digest                     0.6.35   2024-03-11 [2] CRAN (R 4.4.0)
#>  DirichletMultinomial       1.47.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  dplyr                      1.1.4    2023-11-17 [2] CRAN (R 4.4.0)
#>  ellipsis                   0.3.2    2021-04-29 [2] CRAN (R 4.4.0)
#>  evaluate                   0.23     2023-11-01 [2] CRAN (R 4.4.0)
#>  fansi                      1.0.6    2023-12-08 [2] CRAN (R 4.4.0)
#>  farver                     2.1.1    2022-07-06 [2] CRAN (R 4.4.0)
#>  fastmap                    1.1.1    2023-02-24 [2] CRAN (R 4.4.0)
#>  foreach                    1.5.2    2022-02-02 [2] CRAN (R 4.4.0)
#>  fs                         1.6.4    2024-04-25 [2] CRAN (R 4.4.0)
#>  furrr                      0.3.1    2022-08-15 [2] CRAN (R 4.4.0)
#>  future                     1.33.2   2024-03-26 [2] CRAN (R 4.4.0)
#>  generics                   0.1.3    2022-07-05 [2] CRAN (R 4.4.0)
#>  GenomeInfoDb             * 1.41.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  GenomeInfoDbData           1.2.12   2024-04-23 [2] Bioconductor
#>  GenomicRanges            * 1.57.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  ggbeeswarm                 0.7.2    2023-04-29 [2] CRAN (R 4.4.0)
#>  ggplot2                    3.5.1    2024-04-23 [2] CRAN (R 4.4.0)
#>  ggrepel                    0.9.5    2024-01-10 [2] CRAN (R 4.4.0)
#>  globals                    0.16.3   2024-03-08 [2] CRAN (R 4.4.0)
#>  glue                       1.7.0    2024-01-09 [2] CRAN (R 4.4.0)
#>  gridExtra                  2.3      2017-09-09 [2] CRAN (R 4.4.0)
#>  gtable                     0.3.5    2024-04-22 [2] CRAN (R 4.4.0)
#>  heatmaply                  1.5.0    2023-10-06 [2] CRAN (R 4.4.0)
#>  highr                      0.10     2022-12-22 [2] CRAN (R 4.4.0)
#>  htmltools                  0.5.8.1  2024-04-04 [2] CRAN (R 4.4.0)
#>  htmlwidgets                1.6.4    2023-12-06 [2] CRAN (R 4.4.0)
#>  httpuv                     1.6.15   2024-03-26 [2] CRAN (R 4.4.0)
#>  httr                       1.4.7    2023-08-15 [2] CRAN (R 4.4.0)
#>  igraph                     2.0.3    2024-03-13 [2] CRAN (R 4.4.0)
#>  IRanges                  * 2.39.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  irlba                      2.3.5.1  2022-10-03 [2] CRAN (R 4.4.0)
#>  iterators                  1.0.14   2022-02-05 [2] CRAN (R 4.4.0)
#>  jquerylib                  0.1.4    2021-04-26 [2] CRAN (R 4.4.0)
#>  jsonlite                   1.8.8    2023-12-04 [2] CRAN (R 4.4.0)
#>  knitr                      1.46     2024-04-06 [2] CRAN (R 4.4.0)
#>  labeling                   0.4.3    2023-08-29 [2] CRAN (R 4.4.0)
#>  later                      1.3.2    2023-12-06 [2] CRAN (R 4.4.0)
#>  lattice                    0.22-6   2024-03-20 [3] CRAN (R 4.4.0)
#>  lazyeval                   0.2.2    2019-03-15 [2] CRAN (R 4.4.0)
#>  lifecycle                  1.0.4    2023-11-07 [2] CRAN (R 4.4.0)
#>  listenv                    0.9.1    2024-01-29 [2] CRAN (R 4.4.0)
#>  magrittr                   2.0.3    2022-03-30 [2] CRAN (R 4.4.0)
#>  MASS                       7.3-60.2 2024-04-23 [3] local
#>  Matrix                     1.7-0    2024-03-22 [3] CRAN (R 4.4.0)
#>  MatrixGenerics           * 1.17.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  matrixStats              * 1.3.0    2024-04-11 [2] CRAN (R 4.4.0)
#>  memoise                    2.0.1    2021-11-26 [2] CRAN (R 4.4.0)
#>  mgcv                       1.9-1    2023-12-21 [3] CRAN (R 4.4.0)
#>  mia                      * 1.13.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  microbiome                 1.27.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  mime                       0.12     2021-09-28 [2] CRAN (R 4.4.0)
#>  miniUI                     0.1.1.1  2018-05-18 [2] CRAN (R 4.4.0)
#>  MultiAssayExperiment     * 1.31.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  multtest                   2.61.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  munsell                    0.5.1    2024-04-01 [2] CRAN (R 4.4.0)
#>  nlme                       3.1-164  2023-11-27 [3] CRAN (R 4.4.0)
#>  parallelly                 1.37.1   2024-02-29 [2] CRAN (R 4.4.0)
#>  permute                    0.9-7    2022-01-27 [2] CRAN (R 4.4.0)
#>  phyloseq                 * 1.49.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  pillar                     1.9.0    2023-03-22 [2] CRAN (R 4.4.0)
#>  pkgbuild                   1.4.4    2024-03-17 [2] CRAN (R 4.4.0)
#>  pkgconfig                  2.0.3    2019-09-22 [2] CRAN (R 4.4.0)
#>  pkgload                    1.3.4    2024-01-16 [2] CRAN (R 4.4.0)
#>  plotly                     4.10.4   2024-01-13 [2] CRAN (R 4.4.0)
#>  plyr                       1.8.9    2023-10-02 [2] CRAN (R 4.4.0)
#>  profvis                    0.3.8    2023-05-02 [2] CRAN (R 4.4.0)
#>  promises                   1.3.0    2024-04-05 [2] CRAN (R 4.4.0)
#>  purrr                      1.0.2    2023-08-10 [2] CRAN (R 4.4.0)
#>  R6                         2.5.1    2021-08-19 [2] CRAN (R 4.4.0)
#>  RColorBrewer               1.1-3    2022-04-03 [2] CRAN (R 4.4.0)
#>  Rcpp                       1.0.12   2024-01-09 [2] CRAN (R 4.4.0)
#>  registry                   0.5-1    2019-03-05 [2] CRAN (R 4.4.0)
#>  remotes                    2.5.0    2024-03-17 [2] CRAN (R 4.4.0)
#>  reshape2                   1.4.4    2020-04-09 [2] CRAN (R 4.4.0)
#>  rhdf5                      2.49.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  rhdf5filters               1.17.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  Rhdf5lib                   1.27.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  rlang                      1.1.3    2024-01-10 [2] CRAN (R 4.4.0)
#>  rmarkdown                  2.26     2024-03-05 [2] CRAN (R 4.4.0)
#>  rsvd                       1.0.5    2021-04-16 [2] CRAN (R 4.4.0)
#>  Rtsne                      0.17     2023-12-07 [2] CRAN (R 4.4.0)
#>  S4Arrays                   1.5.0    2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  S4Vectors                * 0.43.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  sass                       0.4.9    2024-03-15 [2] CRAN (R 4.4.0)
#>  ScaledMatrix               1.13.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  scales                     1.3.0    2023-11-28 [2] CRAN (R 4.4.0)
#>  scater                     1.33.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  scuttle                    1.15.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  seriation                  1.5.5    2024-04-17 [2] CRAN (R 4.4.0)
#>  sessioninfo                1.2.2    2021-12-06 [2] CRAN (R 4.4.0)
#>  shiny                      1.8.1.1  2024-04-02 [2] CRAN (R 4.4.0)
#>  SingleCellExperiment     * 1.27.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  SparseArray                1.5.0    2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  sparseMatrixStats          1.17.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  stringi                    1.8.3    2023-12-11 [2] CRAN (R 4.4.0)
#>  stringr                    1.5.1    2023-11-14 [2] CRAN (R 4.4.0)
#>  SummarizedExperiment     * 1.35.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  survival                   3.6-4    2024-04-24 [3] CRAN (R 4.4.0)
#>  testthat                   3.2.1.1  2024-04-14 [2] CRAN (R 4.4.0)
#>  tibble                     3.2.1    2023-03-20 [2] CRAN (R 4.4.0)
#>  tidyr                      1.3.1    2024-01-24 [2] CRAN (R 4.4.0)
#>  tidyselect                 1.2.1    2024-03-11 [2] CRAN (R 4.4.0)
#>  tidytree                   0.4.6    2023-12-12 [2] CRAN (R 4.4.0)
#>  treeio                     1.29.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  TreeSummarizedExperiment * 2.13.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  TSP                        1.2-4    2023-04-04 [2] CRAN (R 4.4.0)
#>  UCSC.utils                 1.1.0    2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  UpSetR                     1.4.0    2019-05-22 [2] CRAN (R 4.4.0)
#>  urlchecker                 1.0.1    2021-11-30 [2] CRAN (R 4.4.0)
#>  usethis                    2.2.3    2024-02-19 [2] CRAN (R 4.4.0)
#>  utf8                       1.2.4    2023-10-22 [2] CRAN (R 4.4.0)
#>  vctrs                      0.6.5    2023-12-01 [2] CRAN (R 4.4.0)
#>  vegan                      2.6-4    2022-10-11 [2] CRAN (R 4.4.0)
#>  vipor                      0.4.7    2023-12-18 [2] CRAN (R 4.4.0)
#>  viridis                    0.6.5    2024-01-29 [2] CRAN (R 4.4.0)
#>  viridisLite                0.4.2    2023-05-02 [2] CRAN (R 4.4.0)
#>  webshot                    0.5.5    2023-06-26 [2] CRAN (R 4.4.0)
#>  withr                      3.0.0    2024-01-16 [2] CRAN (R 4.4.0)
#>  xfun                       0.43     2024-03-25 [2] CRAN (R 4.4.0)
#>  xtable                     1.8-4    2019-04-21 [2] CRAN (R 4.4.0)
#>  XVector                  * 0.45.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#>  yaml                       2.3.8    2023-12-11 [2] CRAN (R 4.4.0)
#>  yulab.utils                0.1.4    2024-01-28 [2] CRAN (R 4.4.0)
#>  zlibbioc                   1.51.0   2024-05-01 [2] Bioconductor 3.20 (R 4.4.0)
#> 
#>  [1] /tmp/RtmppiEMBP/Rinst3ffcc45cffd9ea
#>  [2] /home/biocbuild/bbs-3.20-bioc/R/site-library
#>  [3] /home/biocbuild/bbs-3.20-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────