1 Introduction

OncoSimulR is an individual- or clone-based forward-time genetic simulator for biallelic markers (wildtype vs. mutated) in asexually reproducing populations without spatial structure (perfect mixing). Its design emphasizes flexible specification of fitness and mutator effects.

OncoSimulR was originally developed to simulate tumor progression with emphasis on allowing users to set restrictions in the accumulation of mutations as specified, for example, by Oncogenetic Trees (OT: Desper et al. 1999; Szabo and Boucher 2008) or Conjunctive Bayesian Networks (CBN: Beerenwinkel, Eriksson, and Sturmfels 2007; Gerstung et al. 2009; Gerstung et al. 2011), with the possibility of adding passenger mutations to the simulations and allowing for several types of sampling.

Since then, OncoSimulR has been extended substantially to allow you to specify other types of restrictions in the accumulation of mutations, such as the XOR models of Korsunsky et al. (2014) or the “semimonotone” model of Farahani and Lagergren (2013). Moreover, different fitness effects related to the order in which mutations appear can also be incorporated, involving arbitrary numbers of genes. These order effects are very different from “restrictions in the order of accumulation of mutations”: with order effects, described in a recent cancer paper by Ortmann and collaborators (Ortmann et al. 2015), the effect of having both mutations “A” and “B” differs depending on whether “A” appeared before or after “B” (the actual case involves genes JAK2 and TET2).
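To make the idea of order effects concrete, here is a minimal Python sketch. It is not OncoSimulR's interface: the function name `order_fitness`, the effect sizes, and the choice to combine effects as a product of (1 + s) terms are all assumptions made for this illustration.

```python
def order_fitness(history, order_effects, base=1.0):
    """Fitness of a clone whose mutations appeared in the order given by `history`.

    `order_effects` maps a tuple such as ("A", "B"), meaning "A before B",
    to a selection coefficient s. Matching effects are combined as a
    product of (1 + s) terms (an assumption of this sketch).
    """
    position = {gene: i for i, gene in enumerate(history)}
    fitness = base
    for (first, second), s in order_effects.items():
        # An effect applies only if both genes are mutated, in this order.
        if (first in position and second in position
                and position[first] < position[second]):
            fitness *= 1.0 + s
    return fitness


# Hypothetical effect sizes, for illustration only.
effects = {("A", "B"): 0.4, ("B", "A"): -0.3}

print(order_fitness(["A", "B"], effects))  # A mutated first
print(order_fitness(["B", "A"], effects))  # B mutated first
print(order_fitness(["A"], effects))       # only A mutated: no order effect
```

Both double mutants carry the same set of mutations, yet they receive different fitness values because the order of appearance differs. A restriction model, in contrast, would only say which orders are allowed.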

More generally, OncoSimulR now also allows you to specify arbitrary epistatic interactions between arbitrary collections of genes and to model, for example, synthetic lethality or synthetic viability (again, involving an arbitrary number of genes, some of which might also depend on other genes, or show order effects with other genes). Moreover, it is possible to specify the above interactions in terms of modules, not genes. This idea is discussed in, for example, Raphael and Vandin (2015) and Gerstung et al. (2011): the restrictions encoded in, say, CBNs or OTs can be considered to apply not to genes, but to modules, where each module is a set of genes (and the intersection between modules is the empty set) that performs a specific biological function. Modules, then, play the role of a “union operation” over the set of genes in a module. In addition, arbitrary numbers of genes without interactions (and with fitness effects coming from any distribution you might want) are also possible.
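The “union operation” played by modules can be sketched in a few lines of Python. This is only an illustration of the concept, not the package's implementation. The gene names, the module names, and both helper functions are hypothetical.

```python
def mutated_modules(genotype, gene_to_module):
    """Return the set of modules altered in `genotype`.

    A module counts as mutated as soon as any one of its genes is mutated:
    this is the "union operation" over the genes of the module.
    """
    return {gene_to_module[g] for g in genotype if g in gene_to_module}


def satisfies_restriction(genotype, parent, child, gene_to_module):
    """Check a restriction "module `child` requires module `parent`"."""
    modules = mutated_modules(genotype, gene_to_module)
    return child not in modules or parent in modules


# Hypothetical mapping; modules do not share genes.
gene_to_module = {"a1": "A", "a2": "A", "b1": "B"}

print(mutated_modules({"a1"}, gene_to_module))        # module A is hit
print(mutated_modules({"a1", "a2"}, gene_to_module))  # still just module A
print(satisfies_restriction({"a2", "b1"}, "A", "B", gene_to_module))
print(satisfies_restriction({"b1"}, "A", "B", gene_to_module))
```

In this sketch, mutating either `a1` or `a2` is enough to satisfy a restriction that requires module A, and mutating both adds nothing further at the module level.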

Mutator/antimutator genes, genes that alter the mutation rate of other genes (Gerrish et al. 2007; Tomlinson, Novelli, and Bodmer 1996), can also be simulated with OncoSimulR and specified with most of the mechanisms above (you can have, for instance, interactions between mutator genes). And, regardless of the presence or not of other mutator/antimutator genes, different genes can have different mutation rates.

Simulations can be stopped as a function of total population size, number of mutated driver genes, or number of time periods. Simulations can also be stopped with a stochastic detection mechanism where the probability of detecting a tumor increases with total population size. Simulations return the number of cells of every genotype/clone at each of the sampling periods, and we can take samples from these results with single-cell or whole-tumor resolution, adding noise if we want. If we ask for them, simulations also store and return the genealogical relationships of all clones generated during the simulation.
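As an illustration of a stochastic detection mechanism, the Python sketch below uses one possible function whose value increases with population size. The exponential form, the parameter names `baseline` and `c`, and the two function names are assumptions of this sketch. They should not be read as the exact rule the package applies.

```python
import math
import random


def detection_probability(pop_size, baseline, c):
    """Probability of detecting the tumor at a given sampling time.

    Zero at or below `baseline`, then increasing towards 1 as the
    population grows. The functional form is an assumption of this sketch.
    """
    if pop_size <= baseline:
        return 0.0
    return 1.0 - math.exp(-c * (pop_size - baseline) / baseline)


def is_detected(pop_size, baseline, c, rng=random):
    """Bernoulli draw: True means the simulation would stop at this time."""
    return rng.random() < detection_probability(pop_size, baseline, c)


# Larger populations are more likely to be detected.
for n in (500, 1000, 2000, 10000):
    print(n, detection_probability(n, baseline=1000, c=0.5))
```

Because detection is a random draw at each sampling time, two runs with identical parameters can stop at different population sizes, unlike a fixed population-size threshold.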

The models so far implemented are all continuous time models, which are simulated using the BNB algorithm of Mather, Hasty, and Tsimring (2012). The core of the code is implemented in C++, providing for fast execution. To help with simulation studies, code to simulate random graphs of the kind often seen in CBNs, OTs, etc., is also available. Finally, OncoSimulR can also generate random fitness landscapes, plot fitness landscapes, and provide statistics of evolutionary predictability.

1.1 Key features of OncoSimulR

As mentioned above, OncoSimulR is now a very general package for forward genetic simulation, with applicability well beyond tumor progression. This is a summary of some of its key features:

  • You can specify arbitrary interactions between genes, with arbitrary fitness effects, with explicit support for:
    • Restrictions in the accumulation of mutations, as specified by Oncogenetic Trees (OTs), Conjunctive Bayesian Networks (CBNs), semimonotone progression networks, and XOR relationships.

    • Epistatic interactions including, but not limited to, synthetic viability and synthetic lethality.
    • Order effects.

  • You can add passenger mutations.
  • You can add mutator/antimutator effects.
  • Fitness and mutation rates can be gene-specific.
  • You can add arbitrary numbers of non-interacting genes with arbitrary fitness effects.

  • You can allow for deviations from the OT, CBN, semimonotone, and XOR models, specifying a penalty for such deviations (the \(s_h\) parameter).

  • You can conduct multiple simulations, and sample from them with different temporal schemes and using either whole-tumor or single-cell sampling.

  • You can stop the simulations using a flexible combination of conditions: final time, number of drivers, population size, fixation of certain genotypes, and a stochastic stopping mechanism that depends on population size.

  • Right now, three different models are available: two that lead to exponential growth, one of which is loosely based on Bozic et al. (2010), and a third that leads to logistic-like growth, based on McFarland et al. (2013).

  • You can use large numbers of genes (e.g., see an example with 50000 genes in section 6.5.3).

  • Simulations are generally very fast: I use C++ to implement the BNB algorithm (see sections 12.5 and 12.6 for more detailed comments on the usage of this algorithm).

  • You can obtain the true sequence of events and the phylogenetic relationships between clones (see section 12.1 for the details of what we mean by “clone”).

  • You can generate random fitness landscapes (under the House of Cards, Rough Mount Fuji, or additive models, or combinations of the former) and use those landscapes as input to the simulation functions.

  • You can plot fitness landscapes.

  • You can obtain statistics of evolutionary predictability from the simulations.

The table below, modified from the table at the Genetics Simulation Resources (GSR) page, provides a summary of the key features of OncoSimulR. (An explanation of the meaning of terms specific to the GSR table is available from https://popmodels.cancercontrol.cancer.gov/gsr/search/ or from the Genetics Simulation Resources table itself, by moving the mouse over each term.)

Table 1.1: Key features of OncoSimulR. Modified from the original table at https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed .
Attribute Category: Attribute
Target
  Type of Simulated Data: Haploid DNA Sequence
  Variations: Biallelic Marker, Genotype or Sequencing Error
Simulation Method: Forward-time
  Type of Dynamical Model: Continuous time
  Entities Tracked: Clones (see 12.2)
Input: Program specific (R data frames and matrices specifying genotypes’ fitness, gene effects, and starting genotype)
Output
  Data Type: Genotype or Sequence, Individual Relationship (complete parent-child relationships between clones), Demographic (population sizes of all clones at sampling times), Diversity Measures (LOD, POM, diversity of genotypes), Fitness
  Sample Type: Random or Independent, Longitudinal, Other (proportional to population size)
Evolutionary Features
  Mating Scheme: Asexual Reproduction
  Demographic
    Population Size Changes: Exponential (two models), Logistic (McFarland et al., 2013)
  Fitness Components
    Birth Rate: Individually Determined from Genotype (models “Exp” and “McFL”)
    Death Rate: Individually Determined from Genotype (model “Bozic”), Influenced by Environment, i.e., population size (model “McFL”)
  Natural Selection
    Determinant: Single and Multi-locus, Fitness of Offspring, Environmental Factors (population size)
    Models: Directional Selection, Multi-locus models, Epistasis, Random Fitness Effects
  Mutation Models: Two-allele Mutation Model (wildtype, mutant), without back mutation
  Events Allowed: Varying Genetic Features: change of individual mutation rates (mutator/antimutator genes)
  Spatial Structure: No Spatial Structure (perfectly mixed and no migration)

Further details about the original motivation for wanting to simulate data this way in the context of tumor progression can be found in Diaz-Uriarte (2015), where additional comments about model parameters and caveats are discussed.

Are there similar programs? The Java program by Reiter et al. (2013), TTP, offers somewhat similar functionality to the previous version of OncoSimulR, but it is restricted to at most four drivers (whereas v.1 of OncoSimulR allowed for up to 64), you cannot use arbitrary CBNs or OTs (or XORs or semimonotone graphs) to specify restrictions, there is no allowance for passengers, and only a single type of model (a discrete time Galton-Watson process) is implemented. The current functionality of OncoSimulR goes well beyond the previous version (and, thus, also the TTP of Reiter et al. (2013)).