Contents

1 Introduction

saseR is a R package to perform aberrant expression and splicing analyses in bulk RNA-seq datasets. It uses adapted offsets to model aberrant splicing with the negative binomial framework, and estimates the parameters on a scalable and efficient manner. In this vignette, we first show how the different functions of the saseR package can be used, after which we show how to perform differential usage analysis with adapted offsets in conventional bulk RNA-seq methods DESeq2 and edgeR.

All details about the saseR model, statistical tests, the use of adapted offsets to model proportions and the adapted parameter estimation are described in our preprint (Segers et al. 2023).

In this vignette, we analyse the data from the ASpli package (Mancini et al. 2021), to show how a workflow for aberrant expression and splicing is done. However, we note that this dataset is not specific for aberrant expression or splicing analyses. The data consists of BAM-files which will be converted into gene, bin,and junction reads, on which three different saseR analysis will be performed (aberrant expression and two times an aberrant splicing analysis). Next, we show how a differential usage/splicing analysis can be done using DESeq2 (Love, Huber, and Anders 2014) and edgeR (Robinson, McCarthy, and Smyth 2009) with adapted offsets.

2 Aberrant expression and splicing workflow

The package has two different categories of functions, with respectively two and three functions.

First, BAM-files need to be transformed to gene, bin or junctions counts. This is done by:

  1. BamtoAspliCounts, which transforms BAM-files to different counts using adapted functions of the ASpli package. The different counts are given in a SummarizedExperiment object in the metadata slots.

  2. convertAspli selects the relevant counts (gene, bins or junctions) to be further used in the analyses.

Then, the chosen counts can be further used to perform an aberrant expression or splicing analysis with the following functions:

  1. calculateOffsets, which calculates the offset matrix used in the analysis. These offsets can be ordinary edgeR or DESeq2 offsets when performing an aberrant expression analysis, or can be the total bin or junction counts aggregated per gene when performing an aberrant splicing analysis.

  2. saseRfindEncodingDim estimates the optimal number of latent factors to control for unknown confounders. This number of latent factors is then further used in the final fit.

  3. saseRfit calculates the mean expression and feature-specific dispersions for each feature in each sample, controlling for both known and unknown confounders. It calculates p-values which represent how extreme each observation is compared to its estimated distribution.

Note that one can do a saseR analysis by using readily available count matrices, therefore not requiring to start from BAM-files.

2.1 Package installation

saseR can be installed from Bioconductor with:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("saseR")

Alternatively, the development version of saseR can be downloaded with:

library(devtools)
devtools::install_github("statOmics/saseR")

2.2 Load libraries

library(saseR)
suppressWarnings(library(ASpli))
library(edgeR)
library(DESeq2)
library(limma)
library(GenomicFeatures)
library(dplyr)
library(pracma)
library(PRROC)
library(BiocParallel)
library(data.table)
library(igraph)

2.3 Transforming BAM to count matrices.

The following data corresponds to the vignette data from the ASpli package (Mancini et al. 2021).

The input data required for saseR are BAM-files with corresponding BAI files and a genome annotation in TxDb format. These BAM-files will be transformed to gene/bin/junction count matrices. When a count matrix is readily available, one can just use this to perform an aberrant expression or splicing analysis (see section 5 and 6) .

First, we specify the path to our GTF file, which we will later transform to the TxDb format. Also, we include the pathnames of the BAM-files we want to include in the analysis.

gtfFileName <- ASpli::aspliExampleGTF()
BAMFiles <- ASpli::aspliExampleBamList()

gtfFileName
## [1] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/genes.mini.gtf"
head(BAMFiles)
## [1] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_C_0.bam"
## [2] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_C_1.bam"
## [3] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_C_2.bam"
## [4] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_D_0.bam"
## [5] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_D_1.bam"
## [6] "/home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_D_2.bam"

Then, we define a targets dataframe, which contains the sample names, BAM-file pathsand experimental factors column. For the case of rare Mendelian disorders and outlier detection, we choose these experimental factors to be all equal to A. Note that the experimental factors included here will not further be used in the final analysis, but are needed to compute the count matrices with functions from ASpli.

targets <- data.frame( 
  row.names = paste0('Sample',c(1:12)),
  bam = BAMFiles,
  f1 = rep("A",12),
  stringsAsFactors = FALSE)

head(targets)
##                                                                          bam f1
## Sample1 /home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_C_0.bam  A
## Sample2 /home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_C_1.bam  A
## Sample3 /home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_C_2.bam  A
## Sample4 /home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_D_0.bam  A
## Sample5 /home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_D_1.bam  A
## Sample6 /home/biocbuild/bbs-3.20-bioc/R/site-library/ASpli/extdata/A_D_2.bam  A

Next, we transform our GTF file to TxDb format, and extract the features from the annotation using the binGenome function from saseR, which is an adapted copy from ASpli as this function is deprecated. Note that the binGenome function can take a while.

genomeTxDb <- txdbmaker::makeTxDbFromGFF(gtfFileName)
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
features <- binGenome(genomeTxDb, logTo = NULL)
## * Number of extracted Genes = 10
## * Number of extracted Genes = 10
## * Number of extracted Exon Bins = 38
## * Number of extracted Exon Bins = 38
## * Number of extracted intron bins = 34
## * Number of extracted intron bins = 34
## * Number of extracted trascripts = 25
## * Number of extracted trascripts = 25
## * Number of extracted junctions = 23
## * Number of extracted junctions = 23
## * Number of AS bins (not include external) = 17
## * Number of AS bins (include external) = 17
## * Classified as: 
##  ES bins = 1 (6%)
##  IR bins = 3 (18%)
##  Alt5'ss bins = 2    (12%)
##  Alt3'ss bins = 2    (12%)
##  Multiple AS bins = 9    (53%)
##  classified as:
##          ES bins = 1 (11%)
##          IR bins = 4 (44%)
##          Alt5'ss bins = 2    (22%)
##          Alt3'ss bins = 1    (11%)
## * Number of AS bins (not include external) = 17
## * Number of AS bins (include external) = 17
## * Classified as: 
##  ES bins = 1 (6%)
##  IR bins = 3 (18%)
##  Alt5'ss bins = 2    (12%)
##  Alt3'ss bins = 2    (12%)
##  Multiple AS bins = 9    (53%)
##  classified as:
##          ES bins = 1 (11%)
##          IR bins = 4 (44%)
##          Alt5'ss bins = 2    (22%)
##          Alt3'ss bins = 1    (11%)
## Correcting Io ends, this might take a while...
## Genome binning completed

Now that we have the targets data.frame containing the samples and BAM-file names, and the features extracted from the genome annotation, we can obtain the count matrices that will be used to detect aberrant expression. This can done with BamtoAspliCounts. Note that the counting procedure (featurecounts) can take a while. If only one type of analysis is preferred (e.g. only aberrant expression analysis on the gene counts), one can do the feature counting with a different method, and start from a count matrix at section 5 and 6. Make sure to select the right type of libType (SE for single-end reads, PE for paired-end reads), and the minReadLength, which is the minimal number of bases a read needs to be counted.

ASpliSE <- BamtoAspliCounts(
                    features = features,
                    targets = targets,
                    minReadLength = 100,
                    libType = "SE",
                    BPPARAM = MulticoreParam(1L)
                    )
## Junctions PJU completed
## Junctions PIR completed
## Junctions IR PIR completed
## Junctions AltSS PSI completed
## Junctions ES PSI completed

ASpliSE is a SummarizedExperiment object that contains the gene, bin and junction counts, together with different annotations in the metadata slots. We will extract these counts using the convertASpli function.

SEgenes <- convertASpli(ASpliSE, type = "gene")
SEbins <- convertASpli(ASpliSE, type = "bin")
SEjunctions <- convertASpli(ASpliSE, type = "junction")

2.4 Aberrant expression analysis

We will show how to perform an aberrant expression analysis based on the gene level counts. Note that other feature counts (e.g. bin counts) can also be used to search for aberrant expression. We start from a SummarizedExperiment that contains the feature counts of interest.

First, a design formula is included in the metadata slot design. In this example, we do not include known covariates, but only introduce an intercept. If one includes known covariates in the formula, these covariates should be included in the colData slot.

metadata(SEgenes)$design <- ~1

For an aberrant expression analysis, we first retain only the genes that have sufficiently large counts. Then, we need to calculate the offsets for our analysis. For aberrant expression analysis, we recommend using TMM or geommean for edgeR and DESeq2 normalisation respectively. This results in an offsets matrix in the assays slot.

Note that for aberrant splicing detection, first offsets are calculated before filtering features that have low counts.

filtergenes <- filterByExpr(SEgenes)
## Warning in filterByExpr.DGEList(y, design = design, group = group, lib.size =
## lib.size, : All samples appear to belong to the same group.
SEgenes <- SEgenes[filtergenes,]
SEgenes <- calculateOffsets(SEgenes, method = "TMM")

Having defined the design and the offsets of the analysis, we can continue to estimate the optimal number of latent confounders included in the regression framework. This is often needed when performing aberrant expression or splicing analysis to improve power. We obtain the latent factors by using RUV (Risso et al. 2014), which performs a singular value decomposition on the deviance residuals. Then, we recommend using the Gavish and Donoho (Gavish and Donoho 2014) threshold for singular values (method = GD) to choose upon the number of latent factors included in the final regression model. This is much faster, while having similar performance compared to using a denoising autoencoder (method = DAE).

SEgenes <- saseRfindEncodingDim(SEgenes, method = "GD")

Having estimated the optimal number of latent factors to include in the regression framework, we can fit our final model, controlling for both known confounders (specified in metadata slot design) and the latent factors based on the deviance residuals. The final regression fit can be done with edgeR (fit = edgeR), but is set to default to our fast parameter estimation (fit = fast). This fast parameter estimation uses a overdispersed quadratic mean-variance relationship, which enables matrix multiplication while keeping parameter estimation unbiased. This is often required in the context of rare diseases that need many latent factors to control for, which leads to ordinary fitting procedures such as DESeq2 and edgeR becoming slow. If the number of confounders included in the regression model is small (known confounders + latent factors), edgeR parameter estimation could be used, although it is not expected to give an increase in performance. analysis is specified to AE to choose an aberrant expression analysis.

SEgenes <- saseRfit(SEgenes,
                    analysis = "AE",
                    padjust = "BH",
                    fit = "fast")

This results in estimates in mean expression, feature specific dispersions and p-values calculated for each observation in each sample. Further analysing the top prioritised genes within each sample is possible using for example:

order <- apply(assays(SEgenes)$pValue, 2, order, decreasing = FALSE)

topPvalsGenes <- sapply(1:ncol(SEgenes), 
                   function(x) {assays(SEgenes)$pValue[order[1:5,x],x]})
rownames(topPvalsGenes) <- NULL

topGenes <- sapply(1:ncol(SEgenes), 
                   function(x) {rownames(SEgenes)[order[1:5,x]]})

significantgenes <- sapply(1:ncol(SEgenes), 
              function(x) {
                  rownames(SEgenes)[assays(SEgenes)$pValueAdjust[,x] < 0.05]})


head(topPvalsGenes)
##            [,1]      [,2]      [,3]      [,4]      [,5]       [,6]      [,7]
## [1,] 0.03102749 0.1159464 0.3990040 0.1286522 0.2695281 0.04821107 0.1310854
## [2,] 0.17011462 0.1335962 0.4043607 0.4014844 0.2751492 0.36674082 0.3675906
## [3,] 0.31623422 0.2436769 0.5734204 0.4410169 0.7156347 0.70575353 0.4863269
## [4,] 0.37490428 0.3439380 0.7622561 0.4970871 0.7222973 0.70865388 0.5156083
## [5,] 0.39977621 0.5405994 0.8160304 0.5674733 0.7552254 0.75059475 0.6228496
##           [,8]      [,9]     [,10]      [,11]     [,12]
## [1,] 0.1115451 0.1421563 0.2208241 0.09214189 0.2371822
## [2,] 0.2951036 0.2621759 0.3283615 0.23906661 0.2664025
## [3,] 0.5051797 0.2869709 0.4140313 0.29306342 0.4882280
## [4,] 0.5307402 0.3739860 0.4719410 0.55818088 0.6595572
## [5,] 0.5677993 0.3777315 0.5070343 0.65837916 0.6978440
head(topGenes)
##      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]    
## [1,] "GENE07" "GENE09" "GENE04" "GENE09" "GENE03" "GENE09" "GENE05" "GENE05"
## [2,] "GENE05" "GENE05" "GENE07" "GENE01" "GENE05" "GENE01" "GENE02" "GENE07"
## [3,] "GENE10" "GENE07" "GENE09" "GENE05" "GENE06" "GENE08" "GENE06" "GENE02"
## [4,] "GENE04" "GENE01" "GENE02" "GENE03" "GENE09" "GENE03" "GENE08" "GENE04"
## [5,] "GENE09" "GENE03" "GENE10" "GENE08" "GENE04" "GENE10" "GENE07" "GENE03"
##      [,9]     [,10]    [,11]    [,12]   
## [1,] "GENE07" "GENE02" "GENE05" "GENE10"
## [2,] "GENE06" "GENE10" "GENE02" "GENE05"
## [3,] "GENE04" "GENE07" "GENE08" "GENE07"
## [4,] "GENE08" "GENE04" "GENE09" "GENE01"
## [5,] "GENE09" "GENE08" "GENE03" "GENE03"
head(significantgenes)
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)

2.5 Aberrant splicing analysis

We will show how to perform an aberrant splicing analysis based on the bin and junction level counts. Note that other feature counts can also be used if proportions of interest are to be modelled, which only requires the use of the correct offsets. We start from a SummarizedExperiment that contains the feature counts of interest.

First, a design formula is included in the metadata slot design. In this example, we do not include known covariates, but only introduce an intercept. If one includes known covariates in the formula, these covariates should be included in the colData slot.

metadata(SEbins)$design <- ~1
metadata(SEjunctions)$design <- ~1

Then, we need to calculate the offsets for our analysis. For an aberrant splicing analysis, AS or proportion should be used as method to calculate offsets that allow to model proportions. This results in an offsets matrix in the assays slot. We specificy the column in the rowData slot which is used to aggregate feature counts to obtain proper offsets (aggregation). This is the locus column for the bins analysis, and the symbol or ASpliCluster column for a junctions analysis. For the junctions analysis, it is also possible to use aggregation based on symbol (= gene) whenever an annotated gene is available, and switch to ASpliCluster when the junction is not linked to a specific gene. This is done by setting mergeGeneASpli equal to TRUE. Note that a new column in the rowData will be made, locus, which specifies the used aggregation.

Note that also other rowData columns can be used to aggregate feature counts, and also a predefined offsets matrix can be included as offsets in the assays slot.

SEbins <- calculateOffsets(SEbins, 
                            method = "AS", 
                            aggregation = "locus")
## Using locus column in rowData to calculate offsets.
## 0 features filtered due to NA, '-', or 'noHit' locus.
SEjunctions <- calculateOffsets(SEjunctions, 
                            method = "AS", 
                            aggregation = "symbol",
                            mergeGeneASpli = TRUE)
## Using symbol column in rowData to calculate offsets.
## 0 features filtered due to NA, '-', or 'noHit' locus.

Then, after calculating the offsets for the aberrant splicing analysis, we filter the features that have very low counts. Note that for aberrant expression analyses, first the features are filtered, followed by calculating the offsets.

filterbins <- filterByExpr(SEbins)
## Warning in filterByExpr.DGEList(y, design = design, group = group, lib.size =
## lib.size, : All samples appear to belong to the same group.
SEbins <- SEbins[filterbins,]

filterjunctions <- filterByExpr(SEjunctions)
## Warning in filterByExpr.DGEList(y, design = design, group = group, lib.size =
## lib.size, : All samples appear to belong to the same group.
SEjunctions <- SEjunctions[filterjunctions,]

Having defined the design and the offsets of the analysis, we can continue to estimate the optimal number of latent confounders included in the regression framework. This is often needed when performing aberrant expression or splicing analysis to obtain better power. We obtain the latent factors by using RUV (Risso et al. 2014), which performs a singular value decomposition on the deviance residuals. Then, we recommend using the Gavish and Donoho (Gavish and Donoho 2014) threshold for singular values (method = GD) to choose upon the number of latent factors included in the final regression model. This is much faster, while having similar performance compared to using a denoising autoencoder (method = DAE).

SEbins <- saseRfindEncodingDim(SEbins, method = "GD")
SEjunctions <- saseRfindEncodingDim(SEjunctions, method = "GD")

Having estimated the optimal number of latent factors to include in the regression framework, we can fit our final model, controlling for both known confounders (specified in metadata slot design) and the latent factors based on the deviance residuals. The final regression fit can be done with edgeR (fit = edgeR), but is set to default to our fast parameter estimation (fit = fast). This fast parameter estimation uses a overdispersed quadratic mean-variance relationship, which enables matrix multiplication while keeping parameter estimation unbiased.This is often required in the context of rare diseases that need many latent factors to control for, which leads to ordinary fitting procedures such as edgeR becoming slow. If the number of confounders included in the regression model is small (known confounders + latent factors), edgeR parameter estimation could be used, although it is not expected to give an increase in performance. analysis is specified on AS to choose an aberrant splicing analysis.

SEbins <- saseRfit(SEbins,
                    analysis = "AS",
                    padjust = "BH",
                    fit = "fast")

SEjunctions <- saseRfit(SEjunctions,
                    analysis = "AS",
                    padjust = "BH",
                    fit = "fast")

This results in estimates in mean expression, feature specific dispersions and p-values calculated for each observation in each sample. Further analysing the top prioritised features within each sample is possible using for example:

order <- apply(assays(SEbins)$pValue, 2, order, decreasing = FALSE)

topPvalsBins <- sapply(1:ncol(SEbins), 
                   function(x) {assays(SEbins)$pValue[order[1:5,x],x]})

rownames(topPvalsBins) <- NULL

topBins <- sapply(1:ncol(SEbins), 
                   function(x) {rownames(SEbins)[order[1:5,x]]})

significantBins <- sapply(1:ncol(SEbins), 
                   function(x) {
                    rownames(SEbins)[assays(SEgenes)$pValueAdjust[,x] < 0.05]})

head(topPvalsBins)
##           [,1]      [,2]      [,3]       [,4]       [,5]       [,6]      [,7]
## [1,] 0.2920730 0.2479812 0.1773876 0.03983227 0.03029549 0.07030136 0.2820690
## [2,] 0.3452435 0.2606600 0.3482568 0.12819629 0.20612234 0.09166173 0.3182864
## [3,] 0.4411531 0.2685777 0.3999498 0.13273306 0.28950883 0.27493653 0.3284258
## [4,] 0.4670477 0.3588860 0.4597824 0.16778278 0.29595504 0.28393816 0.4128227
## [5,] 0.5075236 0.4212592 0.4953552 0.18583522 0.31699366 0.28819200 0.4463418
##           [,8]       [,9]     [,10]     [,11]      [,12]
## [1,] 0.1505300 0.06543611 0.1561171 0.1020822 0.01881900
## [2,] 0.2690009 0.14449012 0.2302083 0.1439282 0.06356613
## [3,] 0.3026018 0.14901032 0.2543287 0.3428522 0.16000229
## [4,] 0.3408827 0.19215776 0.2752078 0.4655831 0.17319650
## [5,] 0.3827884 0.28273415 0.3054174 0.4674561 0.21068788
head(topBins)
##      [,1]          [,2]          [,3]          [,4]          [,5]         
## [1,] "GENE08:E005" "GENE08:E005" "GENE08:E004" "GENE06:E004" "GENE08:E004"
## [2,] "GENE08:E004" "GENE08:E001" "GENE09:E004" "GENE03:E001" "GENE01:E002"
## [3,] "GENE01:E002" "GENE06:E004" "GENE08:E001" "GENE08:E004" "GENE01:E003"
## [4,] "GENE02:E002" "GENE03:E001" "GENE08:E003" "GENE09:E002" "GENE08:E002"
## [5,] "GENE05:E004" "GENE02:E002" "GENE07:E003" "GENE10:E004" "GENE03:E002"
##      [,6]          [,7]          [,8]          [,9]          [,10]        
## [1,] "GENE03:E001" "GENE10:E003" "GENE08:E002" "GENE07:E002" "GENE02:E002"
## [2,] "GENE03:E002" "GENE02:E001" "GENE08:E004" "GENE07:E001" "GENE02:E003"
## [3,] "GENE06:E004" "GENE06:E002" "GENE10:E004" "GENE10:E001" "GENE06:E001"
## [4,] "GENE09:E003" "GENE06:E003" "GENE07:E002" "GENE08:E002" "GENE03:E002"
## [5,] "GENE09:E001" "GENE09:E005" "GENE10:E001" "GENE05:E004" "GENE06:E003"
##      [,11]         [,12]        
## [1,] "GENE02:E002" "GENE02:E002"
## [2,] "GENE05:E002" "GENE03:E002"
## [3,] "GENE10:E001" "GENE05:E002"
## [4,] "GENE09:E001" "GENE02:E003"
## [5,] "GENE10:E003" "GENE06:E001"
head(significantBins)
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)

Furthermore, also aggregated p-values are calculated based on the newly formed locus column in the rowData slot. These are currently the minimal p-value of each feature belonging to the same aggregation group, and can be found in the metadata slot under pValuesLocus.

order <- apply(metadata(SEbins)$pValuesLocus, 2, order, decreasing = FALSE)

topPvalsBinsAggregated <- sapply(1:ncol(SEbins), 
                   function(x) {metadata(SEbins)$pValuesLocus[order[1:5,x],x]})
rownames(topPvalsBinsAggregated) <- NULL

topBinsAggregated <- sapply(1:ncol(SEbins), 
                   function(x) {
                     rownames(metadata(SEbins)$pValuesLocus)[order[1:5,x]]})

significantBinsAggregated <- sapply(1:ncol(SEbins), 
              function(x) {
                  rownames(SEbins)[metadata(SEgenes)$pValuesLocus[,x] < 0.05]})

head(topPvalsBinsAggregated)
##           [,1]      [,2]      [,3]       [,4]       [,5]       [,6]      [,7]
## [1,] 0.2920730 0.2479812 0.1773876 0.03983227 0.03029549 0.07030136 0.2820690
## [2,] 0.4411531 0.2685777 0.3482568 0.12819629 0.20612234 0.27493653 0.3182864
## [3,] 0.4670477 0.3588860 0.4953552 0.13273306 0.31699366 0.28393816 0.3284258
## [4,] 0.5075236 0.4212592 0.5428406 0.16778278 0.32869031 0.39572189 0.4463418
## [5,] 0.5397520 0.4431609 0.5934711 0.18583522 0.34831266 0.47232361 0.4636188
##           [,8]       [,9]     [,10]     [,11]      [,12]
## [1,] 0.1505300 0.06543611 0.1561171 0.1020822 0.01881900
## [2,] 0.3026018 0.14901032 0.2543287 0.1439282 0.06356613
## [3,] 0.3408827 0.19215776 0.2752078 0.3428522 0.16000229
## [4,] 0.4037764 0.28273415 0.4817044 0.4655831 0.21068788
## [5,] 0.4178444 0.29755667 0.5166177 0.4705838 0.43419989
head(topBinsAggregated)
##      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]    
## [1,] "GENE08" "GENE08" "GENE08" "GENE06" "GENE08" "GENE03" "GENE10" "GENE08"
## [2,] "GENE01" "GENE06" "GENE09" "GENE03" "GENE01" "GENE06" "GENE02" "GENE10"
## [3,] "GENE02" "GENE03" "GENE07" "GENE08" "GENE03" "GENE09" "GENE06" "GENE07"
## [4,] "GENE05" "GENE02" "GENE10" "GENE09" "GENE10" "GENE08" "GENE09" "GENE02"
## [5,] "GENE09" "GENE01" "GENE02" "GENE10" "GENE06" "GENE02" "GENE04" "GENE01"
##      [,9]     [,10]    [,11]    [,12]   
## [1,] "GENE07" "GENE02" "GENE02" "GENE02"
## [2,] "GENE10" "GENE06" "GENE05" "GENE03"
## [3,] "GENE08" "GENE03" "GENE10" "GENE05"
## [4,] "GENE05" "GENE05" "GENE09" "GENE06"
## [5,] "GENE06" "GENE08" "GENE08" "GENE08"
head(significantBinsAggregated)
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)

3 Differential usage/splicing using adapted offsets in DESeq2 and edgeR

Here, we show how to perform a differential usage/splicing analysis, using adapted offsets in conventual bulk RNA-seq tools DESeq2/edgeR. This way, we avoid modelling both the counts and the other counts, which means that no sample-specific regression coefficients are estimated. Note that these steps are similar compared to a conventual DESeq2/edgeR analysis, where different offsets or normalisation is used. We refer to the DESeq2/edgeR vignettes for more information upon their package and functions. Also, we refer to the DEXSeq (Anders, Reyes, and Huber, n.d.) for more information upon the preprocessing of the annotation.

3.1 Load libraries

library(ASpli)
library(DESeq2)
library(DEXSeq)
library(edgeR)
library(dplyr)
library(GenomicAlignments)
library(GenomicFeatures)

3.2 Load example data

We use the same example dataset as in the aberrant analyses, and are loaded from the ASpli package. This includes an annotation file and BAM-files.

gtfFileName <- aspliExampleGTF()
BAMFiles <- aspliExampleBamList()

Next, we extract the exonic parts from the annotation file using the GenomicFeatures package.

genomeTxDb <- makeTxDbFromGFF(gtfFileName)
## Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...): makeTxDbFromGFF() has moved to the txdbmaker package. Please call
##   txdbmaker::makeTxDbFromGFF() to get rid of this warning.
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
flattenedAnnotation = exonicParts(genomeTxDb, 
                                  linked.to.single.gene.only=TRUE )

Using summarizeOverlaps, we count the number of overlaps with each part of exonic parts of the annotation file. This way, we obtain a SummarizedExperiment object with read counts for each bin. These bin counts can then be further used to model differential usage/splicing. Make sure to use the correct settings for singleEnd and ignore.strand.

se = summarizeOverlaps(
    flattenedAnnotation,
    BAMFiles, 
    singleEnd=TRUE, 
    ignore.strand=TRUE )

Now that we have a count matrix, we can further specify the known covariates, whichin this case is the condition between which we will test differential bin usage.

colData(se)$condition <- factor(c(rep(c("control", "case"), each = 6)))

Instead of modelling differential usage/splicing by modelling both the counts and the other counts, we rather calculate the total bin-counts per gene. This will be further used as offsets in the negative binomial regression, allowing to model the proportion of bin counts over the total counts overlapping that gene. We here aggregate the read counts per gene, although the offset can be chosen flexible depending on the analysis of interest (e.g. aggregating transcript counts per gene to model differential transcript usage).

counts <- assays(se)$counts
offsetsGene <- aggregate(counts,
                     by = list("gene" = rowData(se)$gene_id), FUN = sum) 
offsets <- offsetsGene[match(rowData(se)$gene_id, offsetsGene$gene), ] %>% 
                     mutate("gene" = NULL) %>% 
                     as.matrix() 

Because currently it is not possible to include offsets that are equal to 0 in DESeq2/edgeR, we have to change these offsets. Here, we change the counts that correspond to an offset of 0 (these counts are also equal to 0 in a differential usage/splicing analysis) to 1, together with the corresponding offset. It is possible to change the offsets and/or counts to a number of choice, as long as the offsets are not equal to 0.

index <- offsets == 0
counts[index] <- 1
offsets[index] <- 1

We can now perform a conventual DESeq2/edgeR analysis, using the custom offsets. For DESeq2, we include the offsets on the original scale, while for edgeR, we include these on the log-scale. For the DESeq2 analysis we can use the following code:

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData(se),
                              rowData = rowData(se),
                              design= ~ condition)
## converting counts to integer mode
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
normalizationFactors(dds) <- offsets

dds <- DESeq2::estimateDispersionsGeneEst(dds)
dispersions(dds) <- mcols(dds)$dispGeneEst
dds <- DESeq2::nbinomWaldTest(dds)
results_dds <- DESeq2::results(dds)

While for the edgeR analysis we can use the following code:

DGE <- DGEList(counts = counts)
DGE$offset <- log(offsets)
design <- model.matrix(~condition, data = colData(se))

DGE <- edgeR::estimateDisp(DGE, design = design)
fitDGE <- edgeR::glmFit(DGE, design= design)
results_DGE <- edgeR::glmLRT(fitDGE, coef = 2)

These results now have an interpretation in terms of proportions, with the interpretation dependent on the offsets used.

4 Session info

sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicAlignments_1.41.0    Rsamtools_2.21.0           
##  [3] Biostrings_2.73.0           XVector_0.45.0             
##  [5] DEXSeq_1.51.0               RColorBrewer_1.1-3         
##  [7] igraph_2.0.3                data.table_1.15.4          
##  [9] BiocParallel_1.39.0         PRROC_1.3.1                
## [11] pracma_2.4.4                dplyr_1.1.4                
## [13] GenomicFeatures_1.57.0      DESeq2_1.45.0              
## [15] SummarizedExperiment_1.35.0 MatrixGenerics_1.17.0      
## [17] matrixStats_1.3.0           GenomicRanges_1.57.0       
## [19] GenomeInfoDb_1.41.0         ASpli_2.15.0               
## [21] AnnotationDbi_1.67.0        IRanges_2.39.0             
## [23] S4Vectors_0.43.0            Biobase_2.65.0             
## [25] BiocGenerics_0.51.0         edgeR_4.3.0                
## [27] limma_3.61.0                saseR_1.1.0                
## [29] knitr_1.46                  BiocStyle_2.33.0           
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.0            BiocIO_1.15.0            bitops_1.0-7            
##   [4] filelock_1.0.3           tibble_3.2.1             XML_3.99-0.16.1         
##   [7] rpart_4.1.23             lifecycle_1.0.4          httr2_1.0.1             
##  [10] pbmcapply_1.5.1          lattice_0.22-6           ensembldb_2.29.0        
##  [13] MASS_7.3-60.2            backports_1.4.1          magrittr_2.0.3          
##  [16] Hmisc_5.1-2              sass_0.4.9               rmarkdown_2.26          
##  [19] jquerylib_0.1.4          yaml_2.3.8               Gviz_1.49.0             
##  [22] DBI_1.2.2                abind_1.4-5              zlibbioc_1.51.0         
##  [25] purrr_1.0.2              AnnotationFilter_1.29.0  biovizBase_1.53.0       
##  [28] RCurl_1.98-1.14          nnet_7.3-19              VariantAnnotation_1.51.0
##  [31] rappdirs_0.3.3           GenomeInfoDbData_1.2.12  genefilter_1.87.0       
##  [34] annotate_1.83.0          codetools_0.2-20         DelayedArray_0.31.0     
##  [37] DT_0.33                  xml2_1.3.6               precrec_0.14.4          
##  [40] tidyselect_1.2.1         UCSC.utils_1.1.0         BiocFileCache_2.13.0    
##  [43] base64enc_0.1-3          jsonlite_1.8.8           Formula_1.2-5           
##  [46] survival_3.6-4           tools_4.4.0              progress_1.2.3          
##  [49] Rcpp_1.0.12              glue_1.7.0               gridExtra_2.3           
##  [52] SparseArray_1.5.0        xfun_0.43                withr_3.0.0             
##  [55] BiocManager_1.30.22      fastmap_1.1.1            latticeExtra_0.6-30     
##  [58] fansi_1.0.6              digest_0.6.35            R6_2.5.1                
##  [61] colorspace_2.1-0         jpeg_0.1-10              dichromat_2.0-0.1       
##  [64] biomaRt_2.61.0           RSQLite_2.3.6            UpSetR_1.4.0            
##  [67] utf8_1.2.4               tidyr_1.3.1              generics_0.1.3          
##  [70] robustbase_0.99-2        rtracklayer_1.65.0       prettyunits_1.2.0       
##  [73] httr_1.4.7               htmlwidgets_1.6.4        S4Arrays_1.5.0          
##  [76] pkgconfig_2.0.3          gtable_0.3.5             blob_1.2.4              
##  [79] hwriter_1.3.2.1          pcaPP_2.0-4              htmltools_0.5.8.1       
##  [82] bookdown_0.39            geneplotter_1.83.0       ProtGenerics_1.37.0     
##  [85] scales_1.3.0             png_0.1-8                rstudioapi_0.16.0       
##  [88] rjson_0.2.21             checkmate_2.3.1          curl_5.2.1              
##  [91] cachem_1.0.8             stringr_1.5.1            foreign_0.8-86          
##  [94] restfulr_0.0.15          pillar_1.9.0             grid_4.4.0              
##  [97] vctrs_0.6.5              dbplyr_2.5.0             xtable_1.8-4            
## [100] cluster_2.1.6            htmlTable_2.4.2          evaluate_0.23           
## [103] mvtnorm_1.2-4            cli_3.6.2                locfit_1.5-9.9          
## [106] compiler_4.4.0           rlang_1.1.3              crayon_1.5.2            
## [109] rrcov_1.7-5              interp_1.1-6             plyr_1.8.9              
## [112] stringi_1.8.3            deldir_2.0-4             txdbmaker_1.1.0         
## [115] munsell_0.5.1            lazyeval_0.2.2           Matrix_1.7-0            
## [118] BSgenome_1.73.0          hms_1.1.3                bit64_4.0.5             
## [121] ggplot2_3.5.1            KEGGREST_1.45.0          statmod_1.5.0           
## [124] memoise_2.0.1            bslib_0.7.0              DEoptimR_1.1-3          
## [127] bit_4.0.5

References

Anders, Simon, Alejandro Reyes, and Wolfgang Huber. n.d. “Detecting Differential Usage of Exons from Rna-Seq Data.” Genome Res 22 (10): 2008–17. https://doi.org/10.1101/gr.133744.111.

Gavish, Matan, and David L. Donoho. 2014. “The Optimal Hard Threshold for Singular Values Is \(4/\sqrt {3}\).” IEEE Trans. Inf. Theory 60 (8): 5040–53. https://doi.org/10.1109/TIT.2014.2323359.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Mancini, Estefania, Andres Rabinovich, Javier Iserte, Marcelo Yanovsky, and Ariel Chernomoretz. 2021. “ASpli: integrative analysis of splicing landscapes through RNA-Seq assays.” Bioinformatics 37 (17): 2609–16. https://doi.org/10.1093/bioinformatics/btab141.

Risso, Davide, John Ngai, Terence P Speed, and Sandrine Dudoit. 2014. “Normalization of Rna-Seq Data Using Factor Analysis of Control Genes or Samples.” Nat. Biotechnol. 32 (9): 896–902. https://doi.org/10.1038/nbt.2931.

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2009. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” 10.1093/bioinformatics/btp616. Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Segers, Alexandre, Jeroen Gilis, Mattias Van Heetvelde, Elfride De Baere, and Lieven Clement. 2023. “Juggling Offsets Unlocks Rna-Seq Tools for Fast Scalable Differential Usage, Aberrant Splicing and Expression Analyses.” bioRxiv. https://doi.org/10.1101/2023.06.29.547014.