# 1 Introduction

OncoSimulR is an individual- or clone-based forward-time genetic simulator for biallelic markers (wildtype vs. mutated) in asexually reproducing populations without spatial structure (perfect mixing). Its design emphasizes flexible specification of fitness and mutator effects.

OncoSimulR was originally developed to simulate tumor progression with emphasis on allowing users to set restrictions in the accumulation of mutations as specified, for example, by Oncogenetic Trees (OT: Desper et al. 1999; Szabo and Boucher 2008) or Conjunctive Bayesian Networks (CBN: Beerenwinkel, Eriksson, and Sturmfels 2007; Gerstung et al. 2009, 2011), with the possibility of adding passenger mutations to the simulations and allowing for several types of sampling.

Since then, OncoSimulR has been vastly extended to allow you to specify other types of restrictions in the accumulation of genes, such as the XOR models of Korsunsky et al. (2014) or the “semimonotone” model of Farahani and Lagergren (2013). Moreover, different fitness effects related to the order in which mutations appear can also be incorporated, involving arbitrary numbers of genes. This is very different from “restrictions in the order of accumulation of mutations”. With order effects, described in a recent cancer paper by Ortmann and collaborators (Ortmann et al. 2015), the effect of having both mutations “A” and “B” differs depending on whether “A” appeared before or after “B” (the actual case involves genes JAK2 and TET2).

More generally, OncoSimulR now also allows you to specify arbitrary epistatic interactions between arbitrary collections of genes and to model, for example, synthetic mortality or synthetic viability (again, involving an arbitrary number of genes, some of which might also depend on other genes, or show order effects with other genes). Moreover, it is possible to specify the above interactions in terms of modules, not genes. This idea is discussed in, for example, Raphael and Vandin (2015) and Gerstung et al. (2011): the restrictions encoded in, say, CBNs or OT can be considered to apply not to genes, but to modules, where each module is a set of genes (and the intersection between modules is the empty set) that performs a specific biological function. Modules, then, play the role of a “union operation” over the set of genes in a module. In addition, arbitrary numbers of genes without interactions (and with fitness effects coming from any distribution you might want) are also possible.

You can also directly specify the mapping between genotypes and fitness and, thus, you can simulate on fitness landscapes of arbitrary complexity.

It is now (released initially in this repo as the freq-dep-fitness branch on February 2019) also possible to simulate scenarios with frequency-dependent fitness, where the fitness of one or more genotypes depends on the relative or absolute frequencies of other genotypes, as in game theory and adaptive dynamics. This makes it possible to model predation and parasitism, cooperation and mutualism, and commensalism. It also allows to model therapeutic interventions (where fitness changes at specified time points or as a function of the total populations size).

Simulations can start from arbitrary initial population compositions and it is also possible to simulate multiple species. Thus, simulations that involve both ecological and evolutionary processes are possible.

Mutator/antimutator genes, genes that alter the mutation rate of other genes (Gerrish et al. 2007; Tomlinson, Novelli, and Bodmer 1996), can also be simulated with OncoSimulR and specified with most of the mechanisms above (you can have, for instance, interactions between mutator genes). And, regardless of the presence or not of other mutator/antimutator genes, different genes can have different mutation rates.

Simulations can be stopped as a function of total population size, number of mutated driver genes, or number of time periods. Simulations can also be stopped with a stochastic detection mechanism where the probability of detecting a tumor increases with total population size. Simulations return the number of cells of every genotype/clone at each of the sampling periods and we can take samples from the former with single-cell or whole- tumor resolution, adding noise if we want. If we ask for them, simulations also store and return the genealogical relationships of all clones generated during the simulation.

The models so far implemented are all continuous time models, which are simulated using the BNB algorithm of Mather, Hasty, and Tsimring (2012). The core of the code is implemented in C++, providing for fast execution. To help with simulation studies, code to simulate random graphs of the kind often seen in CBNs, OTs, etc, is also available. Finally, OncoSimulR also allows for the generation of random fitness landscapes and the representation of fitness landscapes and provides statistics of evolutionary predictability.

## 1.1 Key features of OncoSimulR

As mentioned above, OncoSimulR is now a very general package for forward genetic simulation, with applicability well beyond tumor progression. This is a summary of some of its key features:

• You can specify arbitrary interactions between genes, with arbitrary fitness effects, with explicit support for:
• Restrictions in the accumulations of mutations, as specified by Oncogenetic Trees (OTs), Conjunctive Bayesian Networks (CBNs), semimonotone progression networks, and XOR relationships.

• Epistatic interactions including, but not limited to, synthetic viability and synthetic lethality.
• Order effects.

• You can add passenger mutations.
• You can add mutator/antimutator effects.
• Fitness and mutation rates can be gene-specific.
• You can add arbitrary numbers of non-interacting genes with arbitrary fitness effects.

• you can allow for deviations from the OT, CBN, semimonotone, and XOR models, specifying a penalty for such deviations (the $$s_h$$ parameter).

• You can conduct multiple simulations, and sample from them with different temporal schemes and using both whole tumor or single cell sampling.

• You can stop the simulations using a flexible combination of conditions: final time, number of drivers, population size, fixation of certain genotypes, and a stochastic stopping mechanism that depends on population size.

• Right now, three different models are available, two that lead to exponential growth, one of them loosely based on Bozic et al. (2010), and another that leads to logistic-like growth, based on McFarland et al. (2013).

• You can use large numbers of genes (e.g., see an example of 50000 in section 6.5.3).

• Simulations are generally very fast: I use C++ to implement the BNB algorithm (see sections 15.5 and 15.6 for more detailed comments on the usage of this algorithm).

• You can obtain the true sequence of events and the phylogenetic relationships between clones (see section 15.1 for the details of what we mean by “clone”).

• You can generate random fitness landscapes (under the House of Cards, Rough Mount Fuji, or additive models, or combinations of the former and under the NK model) and use those landscapes as input to the simulation functions.

• You can plot fitness landscapes.

• You can obtain statistics of evolutionary predictability from the simulations.

• You can now also use simulations with frequency-dependent fitness: fitness (birth rate) is not fixed for a genotype, but can be a function of the frequecies of the clones (see section 10). We can therefore use OncoSimulR to examine, via simulations, results from game theory and adaptive dynamics and study complex scenarios that are not amenable to analytical solutions. More generally, we can model predation and parasitism, cooperation and mutualism, and commensalism.

• It is possible to start the simulation with arbitrary initial composition (section 6.7) and to simulate multiple species (section 6.8). You can thus run simulations that involve both ecological and evolutionary processes involving inter-species relationships plus genetic restrictions in evolution.

• It is possible to simulate many different therapeutic interventions. Section 12 shows examples of interventions where certain genotypes change fitness (because of chemotherapy) at specified times. More generally, since fitness (birth rates) can be made a function of total populations sizes and/or frequencies (see section 10), many different arbitrary intervention schemes can be simulated. Possible models are, of course, not limited to cancer chemotherapy, but could include antibiotic treatment of bacteria, antiviral therapy, etc.

The table below, modified from the table at the Genetics Simulation Resources (GSR) page, provides a summary of the key features of OncoSimulR. (An explanation of the meaning of terms specific to the GSR table is available from https://popmodels.cancercontrol.cancer.gov/gsr/search/ or from the Genetics Simulation Resources table itself, by moving the mouse over each term).

Table 1.1: Key features of OncoSimulR. Modified from the original table from https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed .
Attribute Category Attribute
Target
Type of Simulated Data Haploid DNA Sequence
Variations Biallelic Marker, Genotype or Sequencing Error
Simulation Method Forward-time
Type of Dynamical Model Continuous time
Entities Tracked Clones (see 15.2)
Input Program specific (R data frames and matrices specifying genotypes’ fitness, gene effects, and starting genotype)
Output
Data Type Genotype or Sequence, Individual Relationship (complete parent-child relationships between clones), Demographic (populations sizes of all clones at sampling times), Diversity Measures (LOD, POM, diversity of genotypes), Fitness
Sample Type Random or Independent, Longitudinal, Other (proportional to population size)
Evolutionary Features
Mating Scheme Asexual Reproduction
Demographic
Population Size Changes Exponential (two models), Logistic (McFarland et al., 2013)
Fitness Components
Birth Rate Individually Determined from Genotype (models “Exp”, “McFL”, “McFLD”). Frequency-Dependently Determined from Genotype (models “Exp”, “McFL”, “McFLD”)
Death Rate Individually Determined from Genotype (model “Bozic”), Influenced by Environment —population size (models “McFL” and “McFLD”)
Natural Selection
Determinant Single and Multi-locus, Fitness of Offspring, Environmental Factors (population size, genotype frequencies)
Models Directional Selection, Multi-locus models, Epistasis, Random Fitness Effects, Frequency-Dependent
Mutation Models Two-allele Mutation Model (wildtype, mutant), without back mutation
Events Allowed Varying Genetic Features: change of individual mutation rates (mutator/antimutator genes)
Spatial Structure No Spatial Structure (perfectly mixed and no migration)

Further details about the original motivation for wanting to simulate data this way in the context of tumor progression can be found in Diaz-Uriarte (2015), where additional comments about model parameters and caveats are discussed.

Are there similar programs? The Java program by Reiter et al. (2013), TTP, offers somewhat similar functionality to the previous version of OncoSimulR, but it is restricted to at most four drivers (whereas v.1 of OncoSimulR allowed for up to 64), you cannot use arbitrary CBNs or OTs (or XORs or semimonotone graphs) to specify restrictions, there is no allowance for passengers, and a single type of model (a discrete time Galton-Watson process) is implemented. The current functionality of OncoSimulR goes well beyond the the previous version (and, thus, also the TPT of Reiter et al. (2013)). We now allow you to specify all types of fitness effects in other general forward genetic simulators such as FFPopSim (Zanini and Neher 2012), and some that, to our knowledge (e.g., order effects) are not available from any genetics simulator. In addition, the “Lego system” to flexibly combine different fitness specifications is also unique; by “Lego system” I mean that we can combine different pieces and blocks, similarly to what we do with Lego bricks. (I find this an intuitive and very graphical analogy, which I have copied from Hothorn et al. (2006) and Hothorn et al. (2008)). In a nutshell, salient features of OncoSimulR compared to other simulators are the unparalleled flexibility to specify fitness and mutator effects, with modules and order effects as particularly unique, and the options for sampling and stopping the simulations, particularly convenient in cancer evolution models. Also unique in this type of software is the addition of functions for simulating fitness landscapes and assessing evolutionary predictability.

## 1.2 What kinds of questions is OncoSimulR suited for?

OncoSimulR can be used to address questions that span from the effect of mutator genes in cancer to the interplay between fitness landscapes and mutation rates. The main types of questions that OncoSimulR can help address involve combinations of:

• Simulating asexual evolution (the oncoSimul* functions) where:
• Fitness is:
• A function of specific epistatic effects between genes
• A function of order effects
• A function of epistatic effects specified using DAGs/posets where these DAGs/posets:
• Are user-specified
• Generated randomly (simOGraph)
• Any mapping between genotypes and fitness where this mapping is:
• User-specified
• Generated randomly from families of random fitness landscapes (rfitness)
• A function of the frequency of other genotypes (i.e., frequency-dependent fitness), such as in adaptive dynamics (see section 10 for more details). This also allows you to model competition, cooperation and mutualism, parasitism and predation, and commensalism between clones.
• Mutation rates can:
• Vary between genes
• Be affected by other genes
• Examining times to evolutionarily or biomedically relevant events (fixation of genotypes, reaching a minimal size, acquiring a minimal number of driver genes, etc —specified with the stopping conditions to the oncoSimul* functions).

• Using different sampling schemes (samplePop) that are related to:
• Assessing genotypes from single-cell vs. whole tumor (or whole population) with the typeSample argument
• Genotyping error (propError argument)
• Timing of samples (timeSample argument)
• … and assessing the consequences of those on the observed genotypes and their diversity (sampledGenotypes) and any other inferences that depend on the observational process.
• (OncoSimulR returns the abundances of all genotypes at each of the sampling points, so you are not restricted by what the samplePop function provides.)
• Tracking the genealogical relationships of clones (plotClonePhylog) and assessing evolutionary predictability (LOD, POM).

Some specific questions that you can address with the help of OncoSimulR are discussed in section 1.3.

A quick overview of the main functions and their relationships is shown in Figure 1.1, where we use italics for the type/class of R object and courier font for the name of the functions.