1 Introduction

One of the most common metrics to assess the quality of genome assemblies is BUSCO (Benchmarking Universal Single-Copy Orthologs) (Simão et al. 2015). cogeqc allows users to run BUSCO from an R session and visualize the results. BUSCO summary statistics help you identify high-quality assemblies based on the percentage of complete BUSCOs.

2 Installation

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)

3 Running BUSCO

To run BUSCO from R, use the function run_busco(). Here, we will use an example FASTA file containing the first 1,000 lines of the Herbaspirillum seropedicae SmR1 genome (GCA_000143225), which was downloaded from Ensembl Bacteria. We will run BUSCO using burkholderiales_odb10 as the lineage dataset. To view all available datasets, run list_busco_datasets().

NOTE: You must have BUSCO installed and in your PATH to use run_busco(). You can check whether BUSCO is installed by running busco_is_installed(). If you don’t have it, you can install it manually or use a conda virtual environment with the Bioconductor package Herper (Paul, Carroll, and Barrows 2021).

# Path to FASTA file
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc")

# Path to directory where BUSCO datasets will be stored
download_path <- file.path(tempdir(), "datasets")

# Run BUSCO if it is installed
if(busco_is_installed()) {
  run_busco(sequence, outlabel = "Hse", mode = "genome",
            lineage = "burkholderiales_odb10",
            outpath = tempdir(), download_path = download_path)
}
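
The PATH requirement mentioned above can also be inspected with base R alone. The sketch below assumes the BUSCO executable is named busco (the usual name, but treat it as an assumption for your installation); it does not show how busco_is_installed() works internally.

```r
# Look up an executable on the PATH with base R.
# Assumption: the BUSCO executable is called "busco".
# Sys.which() returns "" for commands it cannot find.
busco_path <- Sys.which("busco")

if (nzchar(busco_path)) {
  message("BUSCO found at: ", busco_path)
} else {
  message("BUSCO not found in PATH; install it before calling run_busco().")
}
```

If the path comes back empty even though BUSCO is installed, the R session may have been started with a different PATH than your shell (for example, outside an activated conda environment).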

The output will be stored in the directory specified in outpath. You can read and parse BUSCO’s output with the function read_busco(). For example, let’s read the output of a BUSCO run on the genome of the green alga Ostreococcus tauri. The output directory is the /extdata directory of the package.

# Path to output directory
output_dir <- system.file("extdata", package = "cogeqc")

busco_summary <- read_busco(output_dir)
busco_summary
#>                Class Frequency           Lineage
#> 1        Complete_SC      1412 chlorophyta_odb10
#> 2 Complete_duplicate         4 chlorophyta_odb10
#> 3         Fragmented        35 chlorophyta_odb10
#> 4            Missing        68 chlorophyta_odb10

This is an example output for a BUSCO run with a single FASTA file. You can also specify a directory containing multiple FASTA files in the sequence argument of run_busco(). This way, BUSCO will be run in batch mode. Let’s see what the output of BUSCO in batch mode looks like:

data(batch_summary)
batch_summary
#>                Class Frequency               Lineage   File
#> 1        Complete_SC      98.5 burkholderiales_odb10 Hse.fa
#> 2        Complete_SC      98.8 burkholderiales_odb10 Hru.fa
#> 3 Complete_duplicate       0.7 burkholderiales_odb10 Hse.fa
#> 4 Complete_duplicate       0.7 burkholderiales_odb10 Hru.fa
#> 5         Fragmented       0.4 burkholderiales_odb10 Hse.fa
#> 6         Fragmented       0.3 burkholderiales_odb10 Hru.fa
#> 7            Missing       0.4 burkholderiales_odb10 Hse.fa
#> 8            Missing       0.2 burkholderiales_odb10 Hru.fa

Apart from the column File, which identifies the FASTA file, this data frame has the same structure as the previous one. Note that the Frequency values in this example appear to be percentages rather than counts (they sum to 100 for each file). The example dataset batch_summary contains the output of run_busco() when a directory containing two genomes (Herbaspirillum seropedicae SmR1 and Herbaspirillum rubrisubalbicans M1) is passed to the sequence argument.
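
For side-by-side comparison of assemblies, the long-format batch summary can be reshaped so that each FASTA file occupies one row. The following is a minimal sketch using base R only; the data frame batch is a hypothetical stand-in rebuilt by hand from the printed output above so that the example runs without cogeqc.

```r
# Recreate the batch summary shown above (values copied from the printed output).
# `batch` is a hypothetical stand-in for the batch_summary dataset.
batch <- data.frame(
  Class = rep(c("Complete_SC", "Complete_duplicate", "Fragmented", "Missing"),
              each = 2),
  Frequency = c(98.5, 98.8, 0.7, 0.7, 0.4, 0.3, 0.4, 0.2),
  Lineage = "burkholderiales_odb10",
  File = rep(c("Hse.fa", "Hru.fa"), times = 4)
)

# Cross-tabulate: one row per FASTA file, one column per BUSCO class
wide <- xtabs(Frequency ~ File + Class, data = batch)
wide
```

Because each File and Class combination occurs exactly once, xtabs() simply rearranges the values without aggregating anything.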

4 Visualizing summary statistics

After using run_busco() and parsing its output with read_busco(), users can visualize summary statistics with plot_busco().

# Single FASTA file - Ostreococcus tauri
plot_busco(busco_summary)


# Batch mode - Herbaspirillum seropedicae and H. rubrisubalbicans
plot_busco(batch_summary)

Genomes with more than 90% complete BUSCOs are usually considered high quality. By that criterion, all three genomes analyzed here (Ostreococcus tauri, H. seropedicae, and H. rubrisubalbicans) are high-quality genomes.
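
The 90% criterion can be checked numerically from the summaries shown earlier. The sketch below copies the values by hand from the printed outputs and counts both single-copy and duplicated BUSCOs as complete, which is the usual BUSCO convention. It also assumes that the batch summary frequencies are percentages, which the source does not state explicitly.

```r
# O. tauri counts, copied from the read_busco() output shown earlier
counts <- c(Complete_SC = 1412, Complete_duplicate = 4,
            Fragmented = 35, Missing = 68)

# Percentage of complete BUSCOs = (single-copy + duplicated) / total
pct_otauri <- 100 * sum(counts[c("Complete_SC", "Complete_duplicate")]) /
  sum(counts)

# Assumption: batch summary values are already percentages,
# so the two "complete" classes can be added directly
pct_complete <- c(
  Otauri = pct_otauri,
  Hse = 98.5 + 0.7,
  Hru = 98.8 + 0.7
)

round(pct_complete, 1)

# TRUE for every genome that passes the 90% threshold
pct_complete > 90
```

For O. tauri this gives 1416 complete BUSCOs out of 1519, or roughly 93.2%, so all three genomes clear the threshold.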

Session information

This document was created under the following conditions:

sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] cogeqc_1.0.0     BiocStyle_2.24.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3           ape_5.6-2              lattice_0.20-45       
#>  [4] tidyr_1.2.0            Biostrings_2.64.0      assertthat_0.2.1      
#>  [7] digest_0.6.29          utf8_1.2.2             plyr_1.8.7            
#> [10] R6_2.5.1               GenomeInfoDb_1.32.0    stats4_4.2.0          
#> [13] evaluate_0.15          highr_0.9              ggplot2_3.3.5         
#> [16] pillar_1.7.0           ggfun_0.0.6            yulab.utils_0.0.4     
#> [19] zlibbioc_1.42.0        rlang_1.0.2            lazyeval_0.2.2        
#> [22] jquerylib_0.1.4        S4Vectors_0.34.0       rmarkdown_2.14        
#> [25] labeling_0.4.2         stringr_1.4.0          igraph_1.3.1          
#> [28] RCurl_1.98-1.6         munsell_0.5.0          compiler_4.2.0        
#> [31] xfun_0.30              pkgconfig_2.0.3        BiocGenerics_0.42.0   
#> [34] gridGraphics_0.5-1     htmltools_0.5.2        tidyselect_1.1.2      
#> [37] tibble_3.1.6           GenomeInfoDbData_1.2.8 bookdown_0.26         
#> [40] IRanges_2.30.0         fansi_1.0.3            crayon_1.5.1          
#> [43] dplyr_1.0.8            bitops_1.0-7           grid_4.2.0            
#> [46] nlme_3.1-157           jsonlite_1.8.0         gtable_0.3.0          
#> [49] lifecycle_1.0.1        DBI_1.1.2              magrittr_2.0.3        
#> [52] scales_1.2.0           tidytree_0.3.9         cli_3.3.0             
#> [55] stringi_1.7.6          farver_2.1.0           reshape2_1.4.4        
#> [58] XVector_0.36.0         ggtree_3.4.0           bslib_0.3.1           
#> [61] ellipsis_0.3.2         generics_0.1.2         vctrs_0.4.1           
#> [64] tools_4.2.0            treeio_1.20.0          ggplotify_0.1.0       
#> [67] glue_1.6.2             purrr_0.3.4            parallel_4.2.0        
#> [70] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
#> [73] BiocManager_1.30.17    aplot_0.1.3            knitr_1.39            
#> [76] patchwork_1.1.1        sass_0.4.1

References

Paul, Matt, Thomas Carroll, and Doug Barrows. 2021. Herper: The Herper Package Is a Simple Toolset to Install and Manage Conda Packages and Environments from R. https://github.com/RockefellerUniversity/Herper.

Simão, Felipe A, Robert M Waterhouse, Panagiotis Ioannidis, Evgenia V Kriventseva, and Evgeny M Zdobnov. 2015. “BUSCO: Assessing Genome Assembly and Annotation Completeness with Single-Copy Orthologs.” Bioinformatics 31 (19): 3210–2.