# 1 Introduction

LineagePulse is a differential expression algorithm for single-cell RNA-seq (scRNA-seq) data. LineagePulse is based on zero-inflated negative binomial noise model and can capture both discrete and continuous population structures: Discrete population structures are groups of cells (e.g. condition of an experiment or tSNE clusters). Continous population structures can for example be pseudotemporal orderings of cells or temporal orderings of cells. The main use and novelty of LineagePulse lies in its ability to fit gene expression trajectories on pseudotemporal orderings of cells well. Note that LineagePulse does not infer a pseudotemporal ordering but is a downstream analytic tool to analyse gene expression trajectories on a given pseudotemporal ordering (such as from diffusion pseudotime or monocle2).

To run LineagPulse on scRNA-seq data, the user needs to use a minimal input parameter set for the wrapper function runLineagePulse, which then performs all normalisation, model fitting and differential expression analysis steps without any more user interaction required:

• counts the count matrix (genes x cells) which MUST NOT be normalised in any way. A valid input option is expected counts from an aligner. Note that TPM or depth normalised expected counts are NOT count data. The statistical framework of LineagePulse rests on the assumption that matCounts contain count data. The count matrix can also be supplied as the path to a .mtx file (sparse matrix format file) or as a SummarizedExperiment or SingleCellExperiment object.
• dfAnnotation data frame that contains cell-wise annotation. The rownames of dfAnnotation must be equal to the column names of matCounts as both correspond to cells. Note that if counts is a SummarizedExperiment or SingleCellExperiment object, the annotation data frame is taken to be colData(counts) if dfAnnotation is NULL. dfAnnotation must contain
• a column named “continuous” if a continuous model is fit (e.g. strMuModel is impulse or splines and expression is fit as a function of time or pseudotime coordinates)
• a column named “groups” if a discrete population model is fit (e.g. strMuModel is groups and expression is fit as a function of group assignment, e.g. clusters or experimental conditions) that contains the assignemnt of cells to these groups as strings
• columns that describe the batch structure (if any).

• matPiConstPredictors a matrix of gene-specific predictors of the drop-out rate (genes x predictors). We suggest to use the log average expression (unless strDropModel is “logistic_ofMu”) and potentially parameters which may affect sequencing efficiency such as GC content of the gene.
• strMuModel the type of expression model to use as an alternative model for differential expression analysis: “impulse” for an impulse model and “splines” for a natural cubic spline model.
• vecConfoundersMu a vector of strings which corresond to column names in dfAnnotation which describe the batch structure to be corrected for.
• scaDFSplinesMu the degrees of freedom of the spline-based model for the mean parameter if strMuModel was set to “splines”.
• vecNormConstExternal cell-wise normalisation constants to be used (e.g. sequencing depth correction factors). the names of the elements have to be the column names of matCounts (cells).
• scaNProc to set the number of processes for parallelization.
• boolVerbose output basic progress reports while the wrapper functions runs.
• boolSuperVerbose output detailed progress reports for each step of the wrapper function.

Lastly, the experienced user who has a solid grasp of the mathematical and algorithmic basis of LineagePulse may change the defaults of these advanced input options:

• vecConfoundersDisp batch variables to be used to correct the dispersion (variance).
• strDispModelFull the dispersion model to be used for the full model.
• strDispModelRed the dispersion model to be used for the reduced model.
• strDropModel the drop-out model to be used.
• strDropFitGroup the groups of cells which receive one parameterisation of the drop-out model.
• scaDFSplinesDisp the degrees of freedom of the spline-based model for the dispersion parameter if strDispModel was set to “splines”.
• boolEstimateNoiseBasedOnH0 whether to estimate the drop-out model on the null or alternative expression model. Note that setting this to FALSE strongly increases the run time.
• scaMaxEstimationCycles maximum number of drop-out and expression model estimation iteration cycles.

# 2 Differential expression analysis

Here, we present a differential expression analysis scenario on a longitudinal ordering. The differential expression results are in a data frame which can be accessed from the output object via list like properties ($). The core differential expression analysis result are p-value and false-discovery-rate corrected p-value of differential expression which are the result of a gene-wise hypothesis test of a non-constant expression model (impulse, splines or groups) versus a constant expression model. library(LineagePulse) lsSimulatedData <- simulateContinuousDataSet( scaNCells = 100, scaNConst = 10, scaNLin = 10, scaNImp = 10, scaMumax = 100, scaSDMuAmplitude = 3, vecNormConstExternal=NULL, vecDispExternal=rep(20, 30), vecGeneWiseDropoutRates = rep(0.1, 30)) ## Draw mean trajectories ## Setting size factors uniformly =1 ## Draw dispersion ## Simulate negative binomial noise ## Simulate drop-out objLP <- runLineagePulse( counts = lsSimulatedData$counts,
dfAnnotation = lsSimulatedData$annot) ## LineagePulse for count data: v1.9.0 ## --- Data preprocessing ## # 0 out of 100 cells did not have a continuous covariate and were excluded. ## # 0 out of 30 genes did not contain non-zero observations and are excluded from analysis. ## # 0 out of 100 cells did not contain non-zero observations and are excluded from analysis. ## --- Compute normalisation constants: ## # All size factors are set to one. ## --- Fit ZINB model for both H1 and H0. ## ### a) Fit ZINB model A (H0: mu=constant disp=constant) with noise model. ## # . Initialisation: ll -25940.4612566106 ## # 1. Iteration with ll -13803.7930277264 in 0.03 min. ## # 2. Iteration with ll -13724.2947686562 in 0.04 min. ## # 3. Iteration with ll -13724.2943327694 in 0.02 min. ## Finished fitting zero-inflated negative binomial model A with noise model in 0.12 min. ## ### b) Fit ZINB model B (H1: mu=splines disp=constant). ## # . Initialisation: ll -15125.3869039773 ## # 1. Iteration with ll -13243.9091574384 in 0.03 min. ## Finished fitting zero-inflated negative binomial model B in 0.04 min. ## ### c) Fit NB model A (H0: mu=constant disp=constant). ## # . Initialisation: ll -14971.6379798301 ## # 1. Iteration with ll -14774.165800185 in 0.01 min. ## Finished fitting NB model B in 0.02 min. ## ### d) Fit NB model B (H1: mu=splines disp=constant). ## # . Initialisation: ll -15157.3043312836 ## # 1. Iteration with ll -14658.8595371101 in 0.02 min. ## Finished fitting NB model B in 0.03 min. ## Time elapsed during ZINB fitting: 0.24 min ## --- Run differential expression analysis. ## Finished runLineagePulse(). head(objLP$dfResults)
##          gene         p      padj  mean_H0      p_nb   padj_nb df_full_zinb
## gene_1 gene_1 0.1553172 0.2740892 59.08028 0.1553172 0.9570756            7
## gene_2 gene_2 0.3689709 0.5534564 75.37529 0.3689709 0.9570756            7
## gene_3 gene_3 0.4785398 0.6241824 27.03991 0.4785398 0.9964618            7
## gene_4 gene_4 0.2872319 0.4787198 16.06026 0.2872319 0.9964618            7
## gene_5 gene_5 0.9385945 1.0000000 79.82076 0.9385945 0.9964618            7
## gene_6 gene_6 0.6176504 0.7720630 63.54965 0.6176504 0.9964618            7
##        df_red_zinb df_full_nb df_red_nb loglik_full_zinb loglik_red_zinb
## gene_1           2          7         2        -425.3500       -429.3583
## gene_2           2          7         2        -459.4673       -462.1676
## gene_3           2          7         2        -388.1034       -390.3585
## gene_4           2          7         2        -355.4147       -358.5147
## gene_5           2          7         2        -476.8176       -477.4496
## gene_6           2          7         2        -426.6864       -428.4554
##        loglik_full_nb loglik_red_nb allZero
## gene_1      -486.8682     -489.1232   FALSE
## gene_2      -522.5857     -524.8436   FALSE
## gene_3      -421.9647     -422.9899   FALSE
## gene_4      -370.9114     -372.6123   FALSE
## gene_5      -533.8240     -534.0892   FALSE
## gene_6      -470.6703     -471.8207   FALSE

In addition to the raw p-values, one may be interested in further details of the expression models such as shape of the expression mean as a function of pseudotime, log fold changes (LFC) and global expression trends as function of pseudotime. We address each of these follow-up questions with separate sections in the following. Note that all of these follow-up questions are answered based on the model that were fit to compute the p-value of differential expression. Therefore, once runLineagePulse() was called once, no further model fitting is required.

# Further inspection of results ## Plot gene-wise trajectories

Multiple options are available for gene-wise expression trajectory plotting: Observations can be coloured by the posterior probability of drop-out (boolColourByDropout). Observations can be normalized based on the alternative expression model or taken as raw observerations for the scatter plot (boolH1NormCounts). Lineage contours can be added to aid visual interpretation of non-uniform population density in pseudotime related effects (boolLineageContour). Log counts can be displayed instead of counts if the fold changes are large (boolLogPlot). In any case, the output object of the gene-wise expression trajectors plotting function plotGene is a ggplot2 object which can then be printed or modified.

# plot the gene with the lowest p-value of differential expression
gplotExprProfile <- plotGene(
objLP = objLP, boolLogPlot = FALSE,
strGeneID = objLP$dfResults[which.min(objLP$dfResults$p),]$gene,
boolLineageContour = FALSE)
gplotExprProfile

The function plotGene also shows the H1 model fit under a negative binomial noise model (“H1(NB)”) as a reference to show what the model fit looks like if drop-out is not accounted for.

## 2.1 Manual analysis of expression trajectories

LineagePulse provides the user with parameter extraction functions that allow the user to interact directly with the raw model fits for analytic tasks or questions not addressed above.

# extract the mean parameter fits per cell of the gene with the lowest p-value.
matMeanParamFit <- getFitsMean(
lsMuModel = lsMuModelH1(objLP),
vecGeneIDs = objLP$dfResults[which.min(objLP$dfResults$p),]$gene)
cat("Minimum fitted mean parameter: ", round(min(matMeanParamFit),1) )
## Minimum fitted mean parameter:  59.7
cat("Mean fitted mean parameter: ", round(mean(matMeanParamFit),1) )
## Mean fitted mean parameter:  206.3

## 2.2 Fold changes

Given a discrete population structure, such as tSNE cluster or experimental conditions, a fold change is the ratio of the mean expression value of both groups. The definition of a fold change is less clear if a continous expression trajector is considered: Of interest may be for example the fold change from the first to the last cell on the expression trajectory or from the minimum to the maximum expression value. Note that in both cases, we compute fold changes on the model fit of the expression mean parameter which is corrected for noise and therefore more stable than the estimate based on the raw expression count observation.

# first, extract the model fits for a given gene again
vecMeanParamFit <- getFitsMean(
lsMuModel = lsMuModelH1(objLP),
vecGeneIDs = objLP$dfResults[which.min(objLP$dfResults$p),]$gene)
# compute log2-fold change from first to last cell on trajectory
idxFirstCell <- which.min(dfAnnotationProc(objLP)$pseudotime) idxLastCell <- which.max(dfAnnotationProc(objLP)$pseudotime)
cat("LFC first to last cell on trajectory: ",
round( (log(vecMeanParamFit[idxLastCell]) -
log(vecMeanParamFit[idxFirstCell])) / log(2) ,1) )
## LFC first to last cell on trajectory:
# compute log2-fold change from minimum to maximum value of expression trajectory
cat("LFC minimum to maximum expression value of model fit: ",
round( (log(max(vecMeanParamFit)) -
log(min(vecMeanParamFit))) / log(2),1) )
## LFC minimum to maximum expression value of model fit:  2.6

## 2.3 Global expression profiles

Global expression profiles or expression profiles across large groups of genes can be visualised via heatmaps of expression z-scores. One could extract the expression mean parameter fits as described above and create such heatmaps from scratch. LineaegePulse also offers a wrapper for creating such a heatmap:

# create heatmap with all differentially expressed genes
lsHeatmaps <- sortGeneTrajectories(
vecIDs = objLP$dfResults[which(objLP$dfResults$padj < 0.01),]$gene,
lsMuModel = lsMuModelH1(objLP),
dirHeatmap=NULL)
print(lsHeatmaps\$hmGeneSorted)

# 3 Session information

sessionInfo()
## R version 4.0.0 RC (2020-04-19 r78255)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] LineagePulse_1.9.0 BiocStyle_2.17.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6                lattice_0.20-41
##  [3] circlize_0.4.8              gtools_3.8.2
##  [5] png_0.1-7                   assertthat_0.2.1
##  [7] digest_0.6.25               SingleCellExperiment_1.11.0
##  [9] R6_2.4.1                    GenomeInfoDb_1.25.0
## [11] stats4_4.0.0                evaluate_0.14
## [13] ggplot2_3.3.0               pillar_1.4.3
## [15] gplots_3.0.3                GlobalOptions_0.1.1
## [17] zlibbioc_1.35.0             rlang_0.4.5
## [19] gdata_2.18.0                magick_2.3
## [21] S4Vectors_0.27.0            GetoptLong_0.1.8
## [23] Matrix_1.2-18               rmarkdown_2.1
## [25] labeling_0.3                splines_4.0.0
## [27] BiocParallel_1.23.0         stringr_1.4.0
## [29] RCurl_1.98-1.2              munsell_0.5.0
## [31] DelayedArray_0.15.0         compiler_4.0.0
## [33] xfun_0.13                   pkgconfig_2.0.3
## [35] BiocGenerics_0.35.0         shape_1.4.4
## [37] htmltools_0.4.0             tidyselect_1.0.0
## [39] SummarizedExperiment_1.19.0 tibble_3.0.1
## [41] GenomeInfoDbData_1.2.3      bookdown_0.18
## [43] IRanges_2.23.0              matrixStats_0.56.0
## [45] crayon_1.3.4                dplyr_0.8.5
## [47] bitops_1.0-6                grid_4.0.0
## [49] gtable_0.3.0                lifecycle_0.2.0
## [51] magrittr_1.5                scales_1.1.0
## [53] KernSmooth_2.23-17          stringi_1.4.6
## [55] farver_2.0.3                XVector_0.29.0
## [57] ellipsis_0.3.0              vctrs_0.2.4
## [59] rjson_0.2.20                RColorBrewer_1.1-2
## [61] tools_4.0.0                 Biobase_2.49.0
## [63] glue_1.4.0                  purrr_0.3.4
## [65] parallel_4.0.0              yaml_2.2.1
## [67] clue_0.3-57                 colorspace_1.4-1
## [69] cluster_2.1.0               BiocManager_1.30.10
## [71] caTools_1.18.0              GenomicRanges_1.41.0
## [73] ComplexHeatmap_2.5.0        knitr_1.28