cypress Package User’s Guide

Shilin Yu1*, Guanqun Meng2** and Wen Tang2***

1Department of Quantitative Health Sciences, Cleveland Clinic
2Department of Population and Quantitative Health Sciences, Case Western Reserve University

*sy597@georgetown.edu
**gxm324@case.edu
***wxt175@case.edu

1 May 2024

Abstract

This vignette of cypress (cell-type-specific power assessment), which is specifically designed to perform comprehensive power assessment for cell-type-specific differential expression (csDE) analysis using RNA-sequencing experiments. It could accept real bulk RNA-seq data as the input for parameter estimation and simulation, or use program-provided parameters to achieve the same goal. This flexible tool allows users to customize sample sizes, percentage of csDE genes, number of cell types, and effect size values. Additionally, it computes statistical power, true discovery rate (TDR), and false discovery cost (FDC) under different scenarios as results. The package also offers functions to generate basic line plots illustrating stratified power, TDR and FDC. DE methods that can be used with this tool include TOAST, CeDAR, and DESeq2.

Package

cypress 1.1.0

1 Background
2 Installation
3 Quickstart
4 Input data
5 Simulation & Evaluation
6 Results
7 Visualization
Session info

htmltools::img(
    src=knitr::image_uri("cypress_official.png"),
    alt="logo",
    style="position:absolute; top:0; right:0; padding:10px; height:280px"
)

1 Background

Bulk RNA-sequencing technology has now been routinely adopted in clinical studies to investigate transcriptome profiles alterations associated with phenotypes-of-interest. In the past several years, computational deconvolution methods were developed to solve for cell mixture proportions. More recently, identifying csDE genes has been made possible by methodology extensions of cell-type deconvolution. Currently, due to the lack of experimental design and statistical power assessment tool for csDE analysis, clinical researchers are facing difficulties in using csDE gene detection methods in practice. One of the major difficulties is to determine a proper sample sizes required for a well-powered experimental designs. In a real RNA-sequencing study, effect size, gene expression level, biological and technical variation all impact statistical power determinations. Rigorous experimental design requires the statistical power assessment method that considers these factors. However, such a method is yet to be developed.

Here we present a statistical framework and tool cypress (cell-type-specific differential expression power assessment) to guide experimental design, assess statistical power, and optimize sample size requirements for csDE analysis. We adopt a rigorous statistical framework to reliably model purified cell-type-specific profiles, cell type compositions, biological and technical variations; provide thorough assessment using multiple statistical metrics through simulations; and provide a user-friendly tool to researchers for their own data simulation and power evaluation.

To use the cypress package, users have the option to provide their own bulk-level RNA-sequencing data, allowing the package to estimate distribution parameters. Alternatively, the package includes three predefined distributional parameter sets from which users can choose from, should the users have no pilot datasets. Specifically, the RNA-sequencing data simulation framework is based on a Gamma-Poisson compound that aligns with our previous benchmark study. The csDE genes calling process is implemented by TOAST, a leading method in this domain. We will stratify genes into distinct strata based on their expression levels and evaluate the power within each stratum. We particularly focus on examining the influence and the impact of sample size and sequencing depth, giving special consideration to genes with a low signal-to-noise ratio in sequencing data.

2 Installation

From Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("cypress")

To view the package vignette in HTML format, run the following lines in R:

library(cypress)
vignette("cypress")

3 Quickstart

cypress offers the function simFromData when users want to provide their own bulk RNA-seq data for parameter estimation and simulation. simFromData needs the input file organized into SummarizedExperiment objects including a count matrix, study design, and an optional cell type proportion matrix. An example data ASD_prop_se is included:

library(cypress)
library(BiocParallel)
data(ASD_prop_se)
result1 <- simFromData(INPUTdata=ASD_prop, #SummarizedExperiment object.
                      CT_index=(seq_len(6) + 2), #Column index for cell types proportion matrix.
                      CT_unk=FALSE, #CT_unk should be True if no cell type proportion matrix.
                      n_sim=2, #Total number of iterations users wish to conduct.
                      n_gene=1000,  #Total number of genetic features users with to conduct.
                      DE_pct=0.05, #Percentage of DEG on each cell type.
                      ss_group_set=c(8,10), #Sample sizes per group users wish to simulate.
                      lfc_set=c(1,1.5), #Effect sizes users wish to simulate
                      DEmethod = "TOAST" # DE methods we used
                      )

Alternatively, the simFromData function needs to provide real data, quickPower() outputs power calculation results without the need to input pilot RNA-seq data. This function is designed to provide quick access to pre-calculated results.

library(cypress)
data(quickPowerGSE60424)
result2<-quickPower(data="IAD") # Options include 'IAD', 'IBD', or 'ASD'.

The following function produces basic line plots of power evaluation results conducted by quickPower() for visualization:

### plot statistical power results
plotPower(simulation_results=result2,# Simulation results generated by quickPower() or simFromData()
          effect.size=1,# A numerical value indicating which effect size is to be fixed.
          sample_size=10 )# A numerical value indicating which sample size to be fixed.

### plot TDR results
plotTDR(simulation_results=result2,  
        effect.size=1,                
        sample_size=10)

### plot FDC results
plotFDC(simulation_results=result2,  
        sample_size=10)

4 Input data

There are 3 options for users to utilize cypress functions for parameter estimation and simulation:

Option 1: Users wish to use their own real data for estimating parameters. This situation fits well to research project where a small pilot RNA-seq dataset has been collected, and the users want to use the information embedded in the pilot dataset to conduct rigorous experimental design. The input file needs to be organized into a SummarizedExperiment object. The object should contain a feature by sample count matrix stored in the counts slot. It should also use the first column in the colData slot with a column name of disease to store the group status (i.e. case/control), where the controls are indicated by 1 and the cases are indicated by 2. The second column in the colData slot to stores the subject IDs mapping to each sample. The remaining columns in the colData slot should store the cell type proportions. If provided, the cell type proportions should sum up to 1 for each sample. This cell type proportion matrix is helpful but optional.

data(ASD_prop_se)
ASD_prop

## class: SummarizedExperiment 
## dim: 3000 48 
## metadata(0):
## assays(1): counts
## rownames(3000): ENSG00000214727 ENSG00000214941 ... ENSG00000150471
##   ENSG00000198759
## rowData names(0):
## colnames(48): SRR1576466 SRR1576467 ... SRR1576512 SRR1576513
## colData names(8): disease sample_id ... Celltype5 Celltype6

Option 2: Users do not have pilot RNA-seq data for csDE analysis, and wish to use existing studies for parameter estimation and conduct customized downstream simulation. cypress provides 3 real datasets for estimating the parameters:

IAD(GSE60424): Whole transcriptome signatures of 6 immune cell types, including neutrophils, monocytes, B cells, CD4 T cells, CD8 T cells, and natural killer cells.
ASD: The bulk RNA-seq data in a large autism spectrum disorder (ASD) study. The study includes 251 samples of frontal/temporal cortex and cerebellum brain regions. Samples are from 48 ASD subjects versus 49 controls.
IBD(GSE57945): The ileal transcriptome in pediatric inflammatory bowel disease (IBD). This dataset includes a cohort of 359 treatment-naïve pediatric patients with Crohn’s disease (CD, n=213), ulcerative colitis (UC, n=60) and healthy controls (n=41).

Option 3: Users have no pilot data to provide but wish to see results generative based on our own simulation, and have some quick experimental design references to check.

5 Simulation & Evaluation

cypress offers 3 functions for simulation and power evaluation: simFromData() ,simFromParam() and quickPower(). If users have their own bulk RNA-seq count data, they can use the simFromData() function, otherwise, they can use simFromParam() estimating parameters from 3 existing studies to perform power evaluation on customized the simulation settings. If users prefer to quickly examine the power evaluation results using the built-in datasets, they can use the quickPower() function. The output of these 3 functions is a S4 object with a list of power measurements under various experimental settings, such as Statistical Power, TDR, and FDC.

5.1 Power evaluation with simFromData()

simFromData() enables users to provide their own data or use the example data data(ASD_prop_se) , they can also specify the number of simulations (nsim), number of genes(n_gene), percentage of differential expressed genes for each cell type(DE_pct), sample sizes (ss_group_set), and log fold change (lfc_set).

data(ASD_prop_se)
result1 <- simFromData(INPUTdata=ASD_prop,  
                      CT_index=(seq_len(6)+2), CT_unk=FALSE,
                      n_sim=2,n_gene=1000,DE_pct=0.05,
                      ss_group_set=c(8,10),
                      lfc_set=c(1, 1.5))

5.2 Power evaluation with simFromParam()

Unlike simFromData() which needs user to provide their own count matrix and often takes a while to run, simFromParam() produces results faster by extracting parameters from three existing studies. simFromParam() has 3 pre-estimated datasets for users to choose, they are IAD, ASD and IBD. Additionally, users can also define their own simulation settings.

data(quickParaGSE60424)
result2 <- simFromParam(sim_param="IAD",   #Options for 'IAD','ASD' and 'IBD'
                       n_sim=2,DE_pct=0.05,n_gene=1000,
                       ss_group_set=c(8, 10),lfc_set=c(1, 1.5),
                       lfc_target=0.5, fdr_thred=0.1)

5.3 Power evaluation with quickPower()

quickPower() runs even faster than simFromData() and simFromParam(), as it outputs the pre-evaluated results directly. Users only need to choose which study result they want the program to output in the data argument.

data(quickPowerGSE60424)
quickPower<-quickPower(data="IAD") ###Options include 'IAD', 'IBD', or 'ASD'.

6 Results

cypress uses power evaluation metrics, for each experimental scenario, of the following:

Power: Statistical power.
TDR: The ratio of the number of true positives to the number of significant discoveries.
FDC: The ratio of the number of false positives to the number of true positives, among significant discoveries.

7 Visualization

Once users have obtained an S4 object with a list of power measurements using either simFromData, simFromParam or quickPower(), they can use the following functions to generate basic line plots: plotPower(), plotTDR() and plotFDC().

7.1 Statistical power figure

plotPower(): Generates 6 plots in a 2x3 panel. The illustration of each plot from left to right and from up to bottom is as follows:

1: Statistical power by effect size, each line represents sample size. Statistical power was the average value across cell types.

2: Statistical power by effect size, each line represents cell type. Statistical power was calculated under the scenario of sample size to be fixed at 10 if sample_size=10.

3: Statistical power by sample size, each line represents cell type. Statistical power was calculated under the scenario of effect size to be fixed at 1 if effect.size=1.

4: Statistical power by strata, each line represents cell type. Statistical power was calculated under the scenario of sample size to be fixed at 10 and effect size to be fixed at 1 if sample_size=10 and effect.size=1.

5: Statistical power by strata, each line represents sample size. Statistical power was the average value across cell types and under the scenario of effect size to be fixed at 1 if effect.size=1.

6: Statistical power by strata, each line represents effect size. Statistical power was the average value across cell types and under the scenario of sample size to be fixed at 10 if sample_size=10.

### plot statistical power results
plotPower(simulation_results=quickPower,effect.size=1,sample_size=10 )

7.2 True discovery rate (TDR) figure

plotTDR(): Generates 4 plots in a 2x2 panel. The illustration of each plot from left to right and from up to bottom is as follow:

1: True discovery rate(TDR) by top-rank genes, each line represents one cell type. TDR was calculated under the scenario of sample size to be fixed at 10 and effect size to be fixed at 1 if sample_size=10 and effect.size=1.

2: True discovery rate(TDR) by top-rank genes, each line represents one effect size. TDR was the average value across cell types and under the scenario of sample size to be fixed at 10 if sample_size=10.

3: True discovery rate(TDR) by top-rank genes, each line represents one sample size. TDR was the average value across cell types and under the scenario of effect size to be fixed at 1 if effect.size=1.

4: True discovery rate(TDR) by effect size, each line represents one sample size. TDR was calculated under the scenario of top rank gene equals 350.

### plot TDR results
plotTDR(simulation_results=quickPower,effect.size=1,sample_size=10)

7.3 False discovery cost (FDC) figure

plotFDC(): Generates 2 plots in a 1x2 panel. The illustration of each plot from left to right is as follow:

1: False discovery cost(FDC) by effect size, each line represents one cell type. FDC was calculated under the scenario of sample size to be fixed at 10 if sample_size=10.

2: False discovery cost(FDC) by top effect size, each line represent one sample size. FDC was the average value across cell types.

### plot FDC results
plotFDC(simulation_results=quickPower,sample_size=10)

Session info

## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.39.0 cypress_1.1.0       BiocStyle_2.33.0   
## 
## loaded via a namespace (and not attached):
##   [1] pbapply_1.7-2               formatR_1.14               
##   [3] CDM_8.2-6                   rlang_1.1.3                
##   [5] magrittr_2.0.3              matrixStats_1.3.0          
##   [7] e1071_1.7-14                compiler_4.4.0             
##   [9] gdata_3.0.0                 vctrs_0.6.5                
##  [11] quadprog_1.5-8              stringr_1.5.1              
##  [13] pkgconfig_2.0.3             crayon_1.5.2               
##  [15] fastmap_1.1.1               magick_2.8.3               
##  [17] backports_1.4.1             XVector_0.45.0             
##  [19] utf8_1.2.4                  rmarkdown_2.26             
##  [21] pracma_2.4.4                preprocessCore_1.67.0      
##  [23] nloptr_2.0.3                UCSC.utils_1.1.0           
##  [25] tinytex_0.50                purrr_1.0.2                
##  [27] xfun_0.43                   zlibbioc_1.51.0            
##  [29] cachem_1.0.8                GenomeInfoDb_1.41.0        
##  [31] jsonlite_1.8.8              sirt_4.1-15                
##  [33] EpiDISH_2.21.0              highr_0.10                 
##  [35] DelayedArray_0.31.0         gmodels_2.19.1             
##  [37] parallel_4.4.0              R6_2.5.1                   
##  [39] bslib_0.7.0                 stringi_1.8.3              
##  [41] RColorBrewer_1.1-3          limma_3.61.0               
##  [43] GGally_2.2.1                GenomicRanges_1.57.0       
##  [45] jquerylib_0.1.4             Rcpp_1.0.12                
##  [47] bookdown_0.39               SummarizedExperiment_1.35.0
##  [49] iterators_1.0.14            knitr_1.46                 
##  [51] IRanges_2.39.0              polycor_0.8-1              
##  [53] Matrix_1.7-0                nnls_1.5                   
##  [55] splines_4.4.0               tidyselect_1.2.1           
##  [57] abind_1.4-5                 yaml_2.3.8                 
##  [59] doParallel_1.0.17           codetools_0.2-20           
##  [61] admisc_0.35                 lattice_0.22-6             
##  [63] tibble_3.2.1                plyr_1.8.9                 
##  [65] Biobase_2.65.0              evaluate_0.23              
##  [67] lambda.r_1.2.4              futile.logger_1.4.3        
##  [69] ggstats_0.6.0               proxy_0.4-27               
##  [71] pillar_1.9.0                BiocManager_1.30.22        
##  [73] MatrixGenerics_1.17.0       checkmate_2.3.1            
##  [75] foreach_1.5.2               stats4_4.4.0               
##  [77] generics_0.1.3              S4Vectors_0.43.0           
##  [79] ggplot2_3.5.1               munsell_0.5.1              
##  [81] TOAST_1.19.0                scales_1.3.0               
##  [83] gtools_3.9.5                class_7.3-22               
##  [85] glue_1.7.0                  tools_4.4.0                
##  [87] data.table_1.15.4           locfit_1.5-9.9             
##  [89] mvtnorm_1.2-4               grid_4.4.0                 
##  [91] tidyr_1.3.1                 matrixcalc_1.0-6           
##  [93] edgeR_4.3.0                 PROPER_1.37.0              
##  [95] TAM_4.2-21                  colorspace_2.1-0           
##  [97] GenomeInfoDbData_1.2.12     config_0.3.2               
##  [99] rsvd_1.0.5                  cli_3.6.2                  
## [101] futile.options_1.0.1        fansi_1.0.6                
## [103] S4Arrays_1.5.0              dplyr_1.1.4                
## [105] corpcor_1.6.10              gtable_0.3.5               
## [107] DESeq2_1.45.0               sass_0.4.9                 
## [109] digest_0.6.35               BiocGenerics_0.51.0        
## [111] SparseArray_1.5.0           htmltools_0.5.8.1          
## [113] lifecycle_1.0.4             httr_1.4.7                 
## [115] TCA_1.2.1                   locfdr_1.1-8               
## [117] statmod_1.5.0               mime_0.12                  
## [119] MASS_7.3-60.2