1 Introduction

In comparative RNA sequencing (RNA-seq) experiments, selecting the appropriate sample size is an important optimization step [1]. Empirical RNA-seq Sample Size Analysis (ERSSA) is a R software package designed to test whether an existing RNA-seq dataset has sufficient biological replicates to detect a majority of the differentially expressed genes (DEGs) between two conditions. In contrast to existing RNA-seq sample size analysis algorithms, ERSSA does not rely on any a priori assumptions about the data [2]. Rather, ERSSA takes a user-supplied RNA-seq sample set and evaluates the incremental improvement in identification of DEGs with each increase in sample size up to the total samples provided, enabling the user to determine whether sufficient biological replicates have been included.

Based on the number of replicates available (N for each of the two conditions), the algorithm subsamples at each step-wise replicate levels (n= 2, 3, 4, …, N-1) and uses existing differential expression (DE) analysis software (e.g., edgeR [8] and DESeq2 [9]) to measure the number of DEGs. As N increases, the set of all distinct subsamples for a particular n can be very large and experience with ERSSA shows that it is not necessary to evaluate the entire set. Instead, 30-50 subsamples at each replicate level typically provide sufficient evidence to evaluate the marginal return for each increase in sample size. If the number of DEGs identified is similar for n = N-2, N-1 and N, there may be little to be gained by analyzing further replicates. Conversely, if the number of DEGs identified is still increasing strongly as n approaches N, the user can expect to identify significantly more DEGs if they acquire additional samples.

When applied to a diverse set of RNA-seq experimental settings (human tissue, human population study and in vitro cell culture), ERSSA demonstrated proficiency in determining whether sufficient biological replicates have been included. Overall, ERSSA can be used as a flexible and easy-to-use tool that offers an alternative approach to identify the appropriate sample size in comparative RNA-seq studies.

2 Installation

Install the latest stable version of ERSSA by entering the following commands in R console:

install.packages("BiocManager")
BiocManager::install("ERSSA")

3 Usage

3.1 Utility

In this vignette, we demonstrate ERSSA’s analytical approach using an RNA-seq dataset containing 10 human heart samples and 10 skeletal muscle samples from GTEx [3] and ask whether 10 replicates are sufficient to identify a majority of DE genes (adjusted p-value < 0.05 and |log2(fold-change)| > 1). At the end of the ERSSA run, four plots are generated to summarize the results. For now, let’s briefly focus on the most important of these, the number of DEGs identified as a function of the number of replicates included in the analysis. In the present example, the average number of DEGs discovered increases approximately 2% from n=6 to n=7 with no improvement as n increases to 8 and beyond. This suggests that our example dataset with N=10 replicates is sufficient to identify the vast majority of DEGs. To verify this conclusion, an additional 15 human heart and 15 skeletal muscle samples from GTEx were added and the analysis was repeated with N=25. The results for n<10 obtained with N=25 gave similar mean and distribution of the number of DEGs identified as those obtained with N=10, validating the utility of the statistical subsampling approach. The rest of this vignette will further explore ERSSA’s functionalities using the 10-replicate GTEx heart vs. muscle dataset. We will also briefly go through two additional examples that help to illustrate the variety of experimental settings where ERSSA can be applied.

3.2 Load example data

First, let’s load the N=10 GTEx heart and muscle dataset into the R workspace. The data can be loaded into R directly from the ERSSA package using:

library(ERSSA)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
# GTEx dataset with 10 heart and 10 muscle samples
# "full"" dataset contains all ensembl genes
data(condition_table.full)
data(count_table.full)

# For test purposes and faster run time, we will use a smaller "partial" dataset
# 4 heart and 4 muscle samples
# partial dataset contains 1000 genes
data(condition_table.partial)
data(count_table.partial)

# NOTE: the figures are generated from the "full" dataset

The original dataset was obtained from the recount2 project [6], which is a systematic effort to generate gene expression count tables from thousands of RNA-seq studies. To generate the objects loaded above, GTEx heart and muscle count tables were manually downloaded from the recount2 website and processed by the recount package. The first 10 samples were then selected and a corresponding condition table generated to complete this example dataset.

For any ERSSA analysis, we need a few essential inputs. First is the RNA-seq count table that contains genes on each row and samples on each column. ERSSA expects the input count table as a dataframe object with gene names as the index and sample IDs as column names. For example, the first few cells of our example count table looks like this:

head(count_table.full[,1:2])
##                    heart_SRR598148.txt heart_SRR598509.txt
## ENSG00000000003.14                 147                 153
## ENSG00000000005.5                    0                   5
## ENSG00000000419.12                 333                 222
## ENSG00000000457.13                 226                  95
## ENSG00000000460.16                 103                  64
## ENSG00000000938.12                 835                1093

Next, we need to supply a condition table in the form of a dataframe object that contains two columns and the same number of rows as the total number of samples. Column one contains the sample IDs exactly as they appear in the count tables and column two contains the corresponding condition names. Only two conditions are supported. Our full condition table is shown below:

condition_table.full
##                    name condition
## 1   heart_SRR598148.txt     heart
## 2   heart_SRR598509.txt     heart
## 3   heart_SRR598589.txt     heart
## 4   heart_SRR599025.txt     heart
## 5   heart_SRR599086.txt     heart
## 6   heart_SRR599249.txt     heart
## 7   heart_SRR599380.txt     heart
## 8   heart_SRR600474.txt     heart
## 9   heart_SRR600829.txt     heart
## 10  heart_SRR600852.txt     heart
## 11 muscle_SRR598044.txt    muscle
## 12 muscle_SRR598452.txt    muscle
## 13 muscle_SRR600656.txt    muscle
## 14 muscle_SRR600981.txt    muscle
## 15 muscle_SRR601387.txt    muscle
## 16 muscle_SRR601671.txt    muscle
## 17 muscle_SRR601695.txt    muscle
## 18 muscle_SRR601815.txt    muscle
## 19 muscle_SRR602010.txt    muscle
## 20 muscle_SRR603116.txt    muscle

Finally, we need to specify which condition to use as the control in the DE comparison. In this case, let’s set “heart” as the control condition.

3.3 Run ERSSA

With the count and condition tables prepared, we can now call the main erssa wrapper function to start the sample size analysis:

set.seed(1) # for reproducible subsample generation

ssa = erssa(count_table.partial, condition_table.partial, DE_ctrl_cond='heart')

# Running full dataset is skipped in the interest of run time
# ssa = erssa(count_table.full, condition_table.full, DE_ctrl_cond='heart')

With this command, the erssa wrapper function calls various ERSSA functions to perform the following calculations in sequence:

  1. Filter the count table by gene expression level to remove low-expressing genes.

  2. Generate unique subsamples (sample combinations) at each replicate level.

  3. Call one of the DE packages to perform DE analysis. Identify the genes that pass test statistic and fold change cutoffs.

  4. Generate plots to visualize the results.

By default, the erssa wrapper function will save the generated plots plus all of the generated data in the current working directory. An alternative path can be set using the path argument.

Note that, under default setting, the ERSSA calculations may require runtime in order of minutes. This is because for each subsample, a DE analysis is performed by calling the DE software. Thus, runtime is scaled linearly to the number of calls to the DE software and when hundreds of comparisons (in our example dataset: 8 replicate levels * 30 subsamples per level) need to be done, the entire calculation can take some time to complete. Fortunately, this issue can be mitigated by running the DE calculations in parallel (ERSSA uses the BiocParallel package to manage this [7]). This along with other ERSSA capabilities are further explained in the next section.

3.4 ERSSA in more detail

In this section, the steps erssa take are explained in more detail. Additionally, we will also cover the parameters that can be set to optimize the analysis for each user’s specific needs. Full descriptions and usage examples can be found in each function’s manual.

3.4.1 Filter count table

First, erssa calls the function count_filter to remove low-expressing genes from the count table. Filtering away low-expressing genes before differential expression comparison is a widely accepted practice and should be performed here to maximize discovery [4]. A gene-wise average Count Per Million (CPM) calculation is performed and at default, all genes with average CPM below 1 are removed from further analysis. The default CPM cutoff can be changed by adjusting the argument filter_cutoff. Additionally, if the user prefers to perform their own gene filtering prior to ERSSA (e.g., with the genefilter package [5]), a pre-filtered count table can be supplied and gene filter by ERSSA turned off by setting the argument counts_filtered=TRUE.

3.4.2 Generate subsample combinations

Next, ERSSA runs comb_gen function to randomly generate the subsample combinations. Briefly, at each replicate level (n=2 to N-1), this function employs a random process to sample from the entire dataset to generate at most 30 (at default) unique subsample combinations per condition. Note that only unique sample combinations are kept; so in the case where we select 5 samples out of 6 total replicates, only 6 unique combinations will be generated. Finally, at most 30 unique pairs of control vs. experimental samples are randomly selected for DE analysis by combining the l