Contents

1 Introduction

INSPEcT is an R/Bioconductor compliant solution for the study of RNA transcriptional dynamics from RNA-seq data (De Pretis et al. 2015, Furlan et al. (2019)). It is based on a system of two ordinary differential equations (ODEs) that describes the synthesis (\(k_1\)) and processing (\(k_2\)) of premature RNA (\(P\)) and the degradation (\(k_3\)) of mature RNA (\(M=T-P\)):

\[\begin{equation} \left\{ \begin{array}{l l} \dot{P}=k_1 - k_2 \, \cdot \, P \\ \dot{T}=k_1 - k_3 \, \cdot \, (T - P) \end{array} \right. \tag{1} \end{equation}\]

The total RNA (\(T\)) is measured from the exonic quantification of transcripts in RNA-seq experiment, while premature RNA (\(P\)) is measured by the intronic quantification. The model is based on two commonly accepted assumptions: nuclear degradation is deniable, and nuclear export occurs at a rate considerably faster than other rates and can be disregarded.

The package supports the analysis of several experimental designs, such as steady-state conditions or time-course data, and the presence or absence of the nascent RNA library. According to the experimental design, the software returns different outputs. In particular, when the nascent RNA is present (INSPEcT+), the rates of synthesis, processing and degradation can be calculated both in time-course and in steady-state state conditions and tested for differential regulation. Without the information from the nascent RNA (INSPEcT-), the rates of synthesis, processing and degradation, as well as their significant variation, can be calculated only in time-course, while only a variation in the ratio between processing and degradation (post-transcriptional ratio) can be assessed between steady states. A schema of the possible configurations and related outputs is reported in Table 1 and 2.


Table 1: INSPEcT+ experimental design
Design Synthesis Processing Degradation
Steady-state Yes Yes Yes
Time-course Yes Yes Yes

Table 2: INSPEcT- experimental design
Design Synthesis Processing Degradation
Steady-state No Ratio Ratio
Time-course Yes Yes Yes

2 Quantification of Exon and Intron features

The INSPEcT analysis starts with the quantification of exonic and intronic quantifications and the estimation of their variance from the replicates in the different conditions of the analysis. INSPEcT can estimate transcripts abundances and associated variances starting from different entry points according to the source of data available: SAM (or BAM) files, read counts or transcripts abundances. The user can choose to estimate the variance by means of two different softwares: DESeq2 or plgem. By default, DESeq2 is used when starting from BAM or read counts, while plgem is used when starting from transcripts abundances.

library(INSPEcT)

2.1 From BAM or SAM files

INSPEcT can quantify transcripts abundaces and variances directly from SAM or BAM files. In this case, transcrips annotations are retrieved from a TxDb object. By default, INSPEcT collapses the exons of the transcripts that belong to the same gene (argument by=“gene”). Alternatively, it can work also at the transcript level (argument by=’tx). When working at the transcript level, INSPEcT cannot discriminate betweeen transcript, therefore we suggest to assign reads to all the overlapping features by setting the argument allowMultiOverlap to TRUE. INSPEcT manage strandness of the reads with the argument strandSpecific, which can be set to 0 in case of unstranded reads, 1 for stranded and 2 for reverse-stranded experiments. INSPEcT prioritizes the exon annotation, meaning that only the reads that do not overlap with any of the annotated exon are eventually accounted for introns. Here we report an example where the data from 4 sample BAM files (2 time points, 2 replicates each, as specified with the argument experimentalDesign) is quantified in a non-stranded specific way, at the gene level, not assigning the reads mapping to more than one feature.

require(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

bamfiles_paths <- system.file('extdata/',
  c('bamRep1.bam','bamRep2.bam','bamRep3.bam','bamRep4.bam'), 
  package='INSPEcT')

exprFromBAM <- quantifyExpressionsFromBAMs(txdb=txdb
           , BAMfiles=bamfiles_paths
           , by = 'gene'
           , allowMultiOverlap = FALSE
           , strandSpecific = 0
           , experimentalDesign=c(0,0,1,1))

The function returns a list containing exonic/intronic expressions and variances, as well as exonic/intronic feature widths extracted from the annotation, exonic/intronic counts and counts statitics.

names(exprFromBAM)
## [1] "exonsExpressions"   "intronsExpressions" "exonsVariance"     
## [4] "intronsVariance"    "exonsWidths"        "intronsWidths"     
## [7] "exonsCounts"        "intronsCounts"      "countsStats"
exprFromBAM$countsStats
##                       0_rep1 0_rep2 1_rep1 1_rep2
## Unassigned_Ambiguity       5      0      0      0
## Assigned_Exons          2515   2763   2729   2672
## Assigned_Introns         862    609    654    706
## Unassigned_NoFeatures      0      0      0      0

2.2 From read counts

In case the matices of intronic and exonic reads have been already calulated, INSPEcT calulates the mean expression and the associated variance using, by default, the software DESeq2. The package INSPEcT contains the read counts associated to the intronic and exonic features of 500 genes profiled both for the nascent and total RNA in 11 time-points and replicated 3 times each following the induction of Myc in 3T9 cells (De Pretis et al. 2017). As a complementary information, the width of the intronic and exonic features for the same set of genes, as well as the sequencing depth of all the samples is provided within the package. The nascent and total quantification evaluated in this section of the vignette will be used later to generate the synthetic dataset for the benchmarking of the method.

data('allcounts', package='INSPEcT')
data('featureWidths', package='INSPEcT')
data('libsizes', package='INSPEcT')

nascentCounts<-allcounts$nascent
matureCounts<-allcounts$mature
expDes<-rep(c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16),3)

nasExp_DESeq2<-quantifyExpressionsFromTrCounts(
        allcounts=nascentCounts
        ,libsize=nascentLS
        ,exonsWidths=exWdths
        ,intronsWidths=intWdths
        ,experimentalDesign=expDes)

matExp_DESeq2<-quantifyExpressionsFromTrCounts(
        allcounts=matureCounts
        ,libsize=totalLS
        ,exonsWidths=exWdths
        ,intronsWidths=intWdths
        ,experimentalDesign=expDes)

2.3 From transcripts abundances

In order to exemplify the quantification of mean transcripts abundances and variances starting from their replicated quantification, we transform the read counts loaded from the package in the section above into RPKMs using the width of introns and exons and the library sizes.

trAbundancesFromCounts <- function(counts, widths, libsize)
        t(t(counts/widths)/libsize*10^9)
nascentTrAbundance <- list(
  exonsAbundances=trAbundancesFromCounts(
        nascentCounts$exonsCounts, exWdths, nascentLS),
  intronsAbundances=trAbundancesFromCounts(
        nascentCounts$intronsCounts, intWdths, nascentLS))

Once we have obtained transcripts abundances for introns and exons in the different conditions and replicates of the experimental design, we can calulate the mean expression and variance (using plgem) by:

nasExp_plgem<-quantifyExpressionsFromTrAbundance(
        trAbundaces = nascentTrAbundance
        , experimentalDesign = expDes)

3 Analysis of RNA dynamics in time-course experiments

3.1 Analysis of Total and Nascent RNA (INSPEcT+)

In case of the joint analysis of total and nascent RNA data, for each transcript with at least one intron, the exonic and intronic quantifications are available in the Total and in the Nascent RNA libraries. The system of equations (1) describes the dynamics of the total RNA population, but when integrated between zero and the time of labeling (\(t_L\)) it can be used to describe nascent RNA. For matter of simplicity INSPEcT assumes that, during the labeling time, synthesis and processing rates are constant and nascent transcripts are not degradaed. The set of equations to describe Total and Nascent RNA is:

\[\begin{equation} \left\{ \begin{array}{l l} \dot{P}_{R}=k_1 - k_2 \, \cdot \, P_{R} \\ \dot{T}_{R}=k_1 - k_3 \, \cdot \, (T_{R} - P_{R}) \\ s_F \, \cdot \, P_{L}=\frac{k_1}{k_2} - ( 1 - e^{k_2 \, \cdot \, t_L} ) \\ s_F \, \cdot \, T_{L}=k_1 \, \cdot \, t_L \end{array} \right. \tag{2} \end{equation}\]

where \(P_{R}\) and \(T_{R}\) are the premature and total RNA levels, respectively, estimated from the total RNA library, \(P_{L}\) and \(T_{L}\) are premature and total RNA levels, respectively, estimated from the nascent RNA library at each condition or time-point. First of all, \(\dot{P}_{R}\) and \(\dot{T}_{R}\) are estimated from the interpolation of \(P_R(t)\) and \(T_R(t)\) via cubic splines. INSPEcT exploits the overdetermination of the system (three unknowns: \(k_1\), \(k_2\) adn \(k_3\); and four equations) to calculate a scaling factor (\(s_F\)) between the Total and Nascent RNA for each condition or time-point (De Pretis et al. 2015).

3.1.1 First guess estimation of the rates

The rates of synthesis, processing and degradation are calculated from the Total and Nascent RNA quantifications by the newINSPEcT function:

  tpts<-c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16)
  tL<-1/6
  nascentInspObj<-newINSPEcT(tpts=tpts
                            ,labeling_time=tL
                            ,nascentExpressions=nasExp_DESeq2
                            ,matureExpressions=matExp_DESeq2)

During this step, INSPEcT calculates a scaling factor between each time-point of Total and Nascent expressions (as described in the section above) and integrates the differential equations in (2) assuming that, between experimental observations, total RNA, premature RNA and synthesis rate, behave linearly (linear piecewise), and that processing and degradation rates are constant (constant piecewise). Rates estimated by this procedure can be accessed from the INSPEcT object by the method ratesFirstGuess, which allows to access also RNA concentrations and variances associated to the experimental observations:

  round(ratesFirstGuess(nascentInspObj,'total')[1:5,1:3],3)
##           total_0 total_0.166666667 total_0.333333333
## 100503670 605.805           609.366           601.865
## 101206      5.107             5.356             5.196
## 101489     15.629            15.767            15.606
## 102436      1.543             1.107             1.305
## 104458     86.966            93.632            88.837
  round(ratesFirstGuessVar(nascentInspObj,'total')[1:5,1:3],3)
##           total_0 total_0.166666667 total_0.333333333
## 100503670 775.569           784.570           764.586
## 101206      0.382             0.417             0.392
## 101489      1.192             1.208             1.179
## 102436      0.281             0.145             0.202
## 104458     24.917            28.736            25.866
  round(ratesFirstGuess(nascentInspObj,'synthesis')[1:5,1:3],3)
##           synthesis_0 synthesis_0.166666667 synthesis_0.333333333
## 100503670     231.119               211.859               210.099
## 101206          4.875                 4.412                 3.964
## 101489          7.749                 7.412                 7.421
## 102436         32.087                28.312                28.645
## 104458         47.333                46.424                50.867

The INSPEcT object can be subsetted to focus on a specific set of genes, and for the sake of speeding up the downstream analysis, we will focus on the first 10 genes of the INSPEcT object.

nascentInspObj10<-nascentInspObj[1:10]

The complete pattern of total and pre-mRNA concentrations variation together with the synthesis, degradation and processing rates can be visualized using the inHeatmap method (Figure 1):

inHeatmap(nascentInspObj10, clustering=FALSE)
inHeatmap of the ratesFirstGuess. Heatmap representing the concentrations of total RNA, of premature RNA and the first guess of the rates of the RNA life cycle

Figure 1: inHeatmap of the ratesFirstGuess
Heatmap representing the concentrations of total RNA, of premature RNA and the first guess of the rates of the RNA life cycle

In case of a long \(t_L\) (more than 30 minutes), it can be useful to set the argument degDuringPulse equal to TRUE to estimate the rates of the RNA life-cycle without assuming that nascent transcripts are not degradaed during the labeling time. The longer the labelling time is the weaker this assumption gets, however, taking into account nascent RNA degradation involves the solution of a more complicated system of equations and can originate unrealistic high rates of synthesis, processing and degradation.

  nascentInspObjDDP<-newINSPEcT(tpts=tpts
                            ,labeling_time=tL
                            ,nascentExpressions=nasExp_DESeq2
                            ,matureExpressions=matExp_DESeq2
                            ,degDuringPulse=TRUE)

For short labeling times, as 10 minutes can be considered, the absence of degradation during the pulse is a good assumption and similar rates are estimated in both cases. In Figure 2 we compared the degradation rates estimated with or without the assumption that degradation of mature RNA occurs during the labeling time.

k3 <- ratesFirstGuess(nascentInspObj, 'degradation')
k3ddp <- ratesFirstGuess(nascentInspObjDDP, 'degradation')
plot(rowMeans(k3), rowMeans(k3ddp), log='xy', 
     xlim=c(.2,10), ylim=c(.2,10), 
     xlab='no degradation during pulse',
     ylab='degradation during pulse',
     main='first guess degradation rates')
abline(0,1,col='red')
Dotplot of degradation rates calculated with or without the assumption of no degradation during pulse. First guess of the degradation rates was calcualted with or without degDuringPulse option. In red is represented the line of equation y = x.

Figure 2: Dotplot of degradation rates calculated with or without the assumption of no degradation during pulse
First guess of the degradation rates was calcualted with or without degDuringPulse option. In red is represented the line of equation y = x.

3.1.2 Modeling with sigmoid and impulse functions

The rates evaluated during the “first guess” are highly exposed to the experimental noise. In particular, processing and degradation rates sum up the noise coming from different experimental data (i.e. the Total and the Nascent libraries), and could be challenging to distinguish real variations from noise. For this reason, the method modelRates fits either sigmoid or impulse functions to total RNA, processing and degradation rates, that minimize the error on experimental data. In this approach, premature RNA and synthesis rate functional forms direclty result from the system of equations in (1).

nascentInspObj10<-modelRates(nascentInspObj10, seed=1)
## [1] "Mature RNA fit."
## [1] "VVV modeling."
## [1] "Confidence intervals."

The rates computed through this modeling procedure are accessible via the method viewModelRates and can visualized via the method inHeatmap, setting the argument type=‘model’} (Figure 3).

round(viewModelRates(nascentInspObj10, 'synthesis')[1:5,1:3],3)
##           synthesis_0 synthesis_0.17 synthesis_0.33
## 100503670     208.693        208.693        208.693
## 101206          4.694          4.512          4.276
## 101489          6.794          6.793          6.791
## 102436         23.816         23.816         23.816
## 104458         47.443         47.459         47.478
inHeatmap(nascentInspObj10, type='model', clustering=FALSE)
inHeatmap of the rates after the second step of modeling. Heatmap representing the concentrations of total RNA, of premature RNA and the rates of the RNA life cycle after the second step of modeling.

Figure 3: inHeatmap of the rates after the second step of modeling
Heatmap representing the concentrations of total RNA, of premature RNA and the rates of the RNA life cycle after the second step of modeling.

3.1.3 Confidence intervals estimation and selection of the regulatory scenario

For each rate and time point, 95% confidence intervals are calculated and used to test the variability of a rate. Specifically, a chi-squared goodness-of-fit test is performed to assess how a constant rate fit within confidence intervals. Confidence intervals can be accessed by the method viewConfidenceIntervals, the p-values of the rates variability tests can be accessed by the method ratePvals and the regulatory class assigned to each gene by the method geneClass. In particular, each gene is assigned to a class named after the set of varying rates: “0” denotes a gene whose rates are constant over time, “a” denotes a gene whose synthesis changes over time, “b” denotes a gene whose degradation changes over time, “c” denotes a gene whose processing changes over time.

head(ratePvals(nascentInspObj10),3)
##              synthesis processing degradation
## 100503670 9.303012e-01  0.9993862   0.9912472
## 101206    2.156341e-10  0.9998447   0.9684174
## 101489    9.969447e-01  0.7952281   1.0000000
head(geneClass(nascentInspObj10),3)
## 100503670    101206    101489 
##       "0"       "a"       "0"

The metohd plotGene can be used to fully investigate the profiles of RNA concentrations and rates of a single gene. Estimated synthesis, degradation and processing rates, premature RNA and total RNA concentrations are displayed with solid thin lines, while their standard deviations or confidence intervals are in dashed lines and the modelled rates and concentrations are in thick solid lines. This example shows a gene of class “a”, indicating that its expression levels are controlled just by synthesis rate (Figure 4).

plotGene(nascentInspObj10, ix="101206")