Contents

1 Introduction

mfa is an R package for fitting a Bayesian mixture of factor analysers to infer developmental trajectories with bifurcations from single-cell gene expression data. It is able to jointly infer pseudotimes, branching, and genes differentially regulated across branches using a generative, Bayesian hierarchical model. Inference is performed using fast Gibbs sampling.

2 Installation

mfa can be installed in one of two ways:

2.1 From Bioconductor

source("https://bioconductor.org/biocLite.R")
biocLite("mfa")
library(mfa)

2.2 From Github

This requires the devtools package to be installed first

install.packages("devtools") # If not already installed
devtools::install_github("kieranrcampbell/mfa")
library(mfa)

3 An example on synthetic data

3.1 Generating synthetic data

We first create some synthetic data for 100 cells and 40 genes calling the mfa function create_synthetic. This returns a list with gene expression, pseudotime, branch allocation, and various parameter estimates:

synth <- create_synthetic(C = 100, G = 40)
print(str(synth))
## List of 7
##  $ X          : num [1:100, 1:40] 6.028 6.369 0 1.874 0.603 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
##   .. ..$ : chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
##  $ branch     : int [1:100] 0 1 0 0 0 0 0 1 0 1 ...
##  $ pst        : num [1:100] 0.177 0.196 0.935 0.961 0.408 ...
##  $ k          : num [1:40, 1:2] -9.12 6.95 7.42 6.4 -9.16 ...
##  $ phi        : num [1:40, 1:2] 8.19 9.63 7.05 6.13 8.05 ...
##  $ delta      : num [1:40, 1:2] 0.119 0.18 0.127 0.422 0.327 ...
##  $ p_transient: num 0
## NULL

We can then PCA and put into a tidy format:

df_synth <- as_data_frame(prcomp(synth$X)$x[,1:2]) %>% 
  mutate(pseudotime = synth$pst,
        branch = factor(synth$branch))

and have a look at a PCA representation, coloured by both pseudotime and branch allocation:

ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point()

ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point()

3.2 Calling mfa

The input to mfa is either an ExpressionSet (e.g. from using the package Scater) or a cell-by-gene expression matrix. If an ExpressionSet is provided then the values in the exprs slot are used for gene expression.

We invoke mfa with a call to the mfa(...) function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time:

m <- mfa(synth$X)
print(m)
## MFA fit with
##  100 cells and 40 genes
##  ( 2000 iterations )

Particular care must be paid to the initialisation of the pseudotimes: by default they are initialised to the first principal component, though if the researcher suspects (based on plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the pc_initialise argument.

3.3 MCMC diagnostics

As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. Cowles and Carlin (1996) for a full discussion).

For a quick summary of these, mfa provides two functions: plot_mfa_trace and plot_mfa_autocorr for quick plotting of the trace and autocorrelation of the posterior log-likelihood:

plot_mfa_trace(m)

plot_mfa_autocorr(m)

3.4 Plotting results

We can extract posterior mean estimates along with credible intervals using the summary function:

ms <- summary(m)
print(head(ms))
## # A tibble: 6 x 5
##    pseudotime branch branch_certainty pseudotime_lower pseudotime_upper
##         <dbl> <fctr>            <dbl>            <dbl>            <dbl>
## 1 -0.54376885      2                1       -0.8606464       -0.3808977
## 2 -0.62685025      2                1       -0.8779826       -0.4015615
## 3  1.03879308      1                1        0.8334416        1.3256056
## 4  1.03366282      1                1        0.8403877        1.3320335
## 5 -0.08539131      1                1       -0.3965906        0.1127942
## 6  0.85058607      1                1        0.6117504        1.0773458

This has six entries:

  • pseudotime The MAP pseudotime estimate
  • branch The MAP branch estimate
  • branch_certainty The proportion of MCMC traces (after burn-in) for which the cell was assigned to the MAP branch
  • pseudotime_lower and pseudotime_upper: the lower and upper 95% highest-probability-density posterior credible intervals

We can compare the inferred pseudotimes to the true values:

qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) +
  xlab('True pseudotime') + ylab('Inferred pseudotime') +
  scale_color_discrete(name = 'True\nbranch')