1 Retrieval of the gDNA-contaminated RNA-seq data by Li et al. (2022)

Here we show how to download a subset of the RNA-seq data published in:

Li, X., Zhang, P., and Yu, Y. Genes expressed at low levels raise false discovery rates in RNA samples contaminated with genomic DNA. BMC Genomics, 23:554, 2022. https://doi.org/10.1186/s12864-022-08785-1

The subset of the data available through this package consists of BAM files, each containing about 100,000 alignments sampled uniformly at random from the corresponding complete BAM file. The complete BAM files were obtained by aligning RNA-seq reads sequenced from total RNA libraries mixed with different concentrations of gDNA, specifically 0% (no contamination), 1% and 10%; see Fig. 2 of Li et al. (2022). The original RNA-seq data are publicly available at https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA007961, and the pipeline used to generate this subset is in the file gDNAinRNAseqData/inst/scripts/make-data_LiYu22subsetBAMfiles.R stored in this package.

To download these subsetted BAM files, and the corresponding index (.bai) files, we load this package and call the function LiYu22subsetBAMfiles():

library(gDNAinRNAseqData)

bamfiles <- LiYu22subsetBAMfiles()
bamfiles
## [1] "/tmp/RtmpLpdHMM/s32gDNA0.bam"  "/tmp/RtmpLpdHMM/s33gDNA0.bam" 
## [3] "/tmp/RtmpLpdHMM/s34gDNA0.bam"  "/tmp/RtmpLpdHMM/s26gDNA1.bam" 
## [5] "/tmp/RtmpLpdHMM/s27gDNA1.bam"  "/tmp/RtmpLpdHMM/s28gDNA1.bam" 
## [7] "/tmp/RtmpLpdHMM/s23gDNA10.bam" "/tmp/RtmpLpdHMM/s24gDNA10.bam"
## [9] "/tmp/RtmpLpdHMM/s25gDNA10.bam"

The previous function call accepts a path argument that specifies where in the filesystem to store the downloaded BAM files. By default, this is a temporary directory of the current R session. Consult the help page of LiYu22subsetBAMfiles() for full details.
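As an illustration of the path argument, the following minimal sketch stores the files in a directory of our choosing. The directory name LiYu22bams and the variable bamdir are hypothetical, and we assume the argument is passed by the name path. The pattern used to list the index files is also an assumption about how they are named. The call needs an internet connection.

```r
library(gDNAinRNAseqData)

## Hypothetical destination directory; any writable path would do.
bamdir <- file.path(tempdir(), "LiYu22bams")
dir.create(bamdir, showWarnings=FALSE, recursive=TRUE)

## Download the subsetted BAM files into 'bamdir' instead of the default
## temporary path (assumes the argument is named 'path').
bamfiles <- LiYu22subsetBAMfiles(path=bamdir)

## List the BAM files and their index files found in that directory.
list.files(bamdir, pattern="\\.(bam|bai)$")
```

Choosing a directory outside the R session's temporary path should keep the files available after the session ends.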

We can also retrieve the gDNA concentrations associated with each BAM file with the following function call:

pdat <- LiYu22phenoData(bamfiles)
pdat
##           gDNA
## s32gDNA0     0
## s33gDNA0     0
## s34gDNA0     0
## s26gDNA1     1
## s27gDNA1     1
## s28gDNA1     1
## s23gDNA10   10
## s24gDNA10   10
## s25gDNA10   10
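In the output above, the row names and gDNA values appear to mirror the BAM file names, where the number following gDNA is the contamination percentage. The sketch below rebuilds an equivalent table from the file names alone, assuming that naming convention holds. It is only a sanity check and not a description of how LiYu22phenoData() works internally; the object names bamnames, samples, gdna and pdat_check are hypothetical.

```r
## File names copied from the output shown earlier; the directory part
## is session-specific and therefore omitted here.
bamnames <- c("s32gDNA0.bam", "s33gDNA0.bam", "s34gDNA0.bam",
              "s26gDNA1.bam", "s27gDNA1.bam", "s28gDNA1.bam",
              "s23gDNA10.bam", "s24gDNA10.bam", "s25gDNA10.bam")

## Sample identifiers are the file names without the .bam extension.
samples <- sub("\\.bam$", "", bamnames)

## Assumed convention: the digits after 'gDNA' give the gDNA percentage.
gdna <- as.integer(sub("^.*gDNA", "", samples))

pdat_check <- data.frame(gDNA=gdna, row.names=samples)

## Number of samples per gDNA concentration.
table(pdat_check$gDNA)
```

With real data, applying basename() to bamfiles would give the same file names, so the row names of pdat can be compared against them in the same way.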

2 Session information

sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Rsamtools_2.20.0       Biostrings_2.72.0      XVector_0.44.0        
##  [4] GenomicRanges_1.56.0   GenomeInfoDb_1.40.0    IRanges_2.38.0        
##  [7] S4Vectors_0.42.0       BiocGenerics_0.50.0    gDNAinRNAseqData_1.4.0
## [10] BiocStyle_2.32.0      
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.44.0         xfun_0.43               bslib_0.7.0            
##  [4] Biobase_2.64.0          vctrs_0.6.5             tools_4.4.0            
##  [7] bitops_1.0-7            generics_0.1.3          parallel_4.4.0         
## [10] curl_5.2.1              tibble_3.2.1            fansi_1.0.6            
## [13] AnnotationDbi_1.66.0    RSQLite_2.3.6           blob_1.2.4             
## [16] pkgconfig_2.0.3         dbplyr_2.5.0            lifecycle_1.0.4        
## [19] GenomeInfoDbData_1.2.12 compiler_4.4.0          codetools_0.2-20       
## [22] htmltools_0.5.8.1       sass_0.4.9              RCurl_1.98-1.14        
## [25] yaml_2.3.8              pillar_1.9.0            crayon_1.5.2           
## [28] jquerylib_0.1.4         BiocParallel_1.38.0     cachem_1.0.8           
## [31] mime_0.12               ExperimentHub_2.12.0    AnnotationHub_3.12.0   
## [34] tidyselect_1.2.1        digest_0.6.35           purrr_1.0.2            
## [37] dplyr_1.1.4             bookdown_0.39           BiocVersion_3.19.1     
## [40] fastmap_1.1.1           cli_3.6.2               magrittr_2.0.3         
## [43] XML_3.99-0.16.1         utf8_1.2.4              withr_3.0.0            
## [46] filelock_1.0.3          UCSC.utils_1.0.0        rappdirs_0.3.3         
## [49] bit64_4.0.5             rmarkdown_2.26          httr_1.4.7             
## [52] bit_4.0.5               png_0.1-8               memoise_2.0.1          
## [55] evaluate_0.23           knitr_1.46              BiocFileCache_2.12.0   
## [58] rlang_1.1.3             glue_1.7.0              DBI_1.2.2              
## [61] BiocManager_1.30.22     jsonlite_1.8.8          R6_2.5.1               
## [64] zlibbioc_1.50.0