1. Introduction

1.1 Motivation for Developing Oncomix

The advent of large, well-curated databases, such as the Genomic Data Commons, that contain RNA sequencing data from hundreds of patient tumors has made it possible to identify oncogene candidates based solely on patterns in mRNA expression data. Oncomix is the first method developed to identify oncogenes in a visually interpretable manner from RNA-sequencing data in large cohorts of patients.

Oncomix is an R package for identifying oncogene candidates based on 2-component Gaussian mixture models. It estimates the model parameters using the expectation-maximization (EM) procedure as implemented in the R package mclust. This tutorial demonstrates how to identify oncogene candidates from a set of mRNA sequencing data. We start by loading the package:

# Uncomment to install the development version from GitHub:
# devtools::install_github("dpique/oncomix", build_vignettes = TRUE)
library(oncomix)

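To make the mixture-model idea concrete, the following minimal sketch fits a 2-component Gaussian mixture to one gene's expression values by calling mclust directly. The simulated values, the variable names `expr` and `fit`, and the equal-variance setting (`modelNames = "E"`) are assumptions made for illustration; they are not taken from the oncomix source.

```r
library(mclust)

set.seed(1)
# Hypothetical log-scale expression values for one gene in 200 tumors:
# about 80% of samples near a low baseline, about 20% overexpressed.
expr <- c(rnorm(160, mean = 2, sd = 0.5),
          rnorm(40,  mean = 6, sd = 0.5))

# Fit a univariate 2-component Gaussian mixture by EM.
# modelNames = "E" (one variance shared by both components) is an
# assumption of this sketch.
fit <- Mclust(expr, G = 2, modelNames = "E", verbose = FALSE)

fit$parameters$mean              # the two component means
fit$parameters$variance$sigmasq  # the shared variance
fit$parameters$pro               # the mixing proportions
```

The two means locate the baseline and overexpressed groups, and the mixing proportions estimate the fraction of samples in each group.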
1.2 Distribution of Oncogene mRNA Expression

We first consider what the distribution of gene expression values for an oncogene should look like. Oncogenes such as ERBB2 are known to be overexpressed in 15-20% of all breast cancer patients. In addition, oncogenes should not be expressed in normal tissue. Based on this line of reasoning, we formulate a model for the distribution of oncogene mRNA expression values in a population of both tumor (teal curves) and normal (red-orange curves) tissue:

library(ggplot2)
oncoMixIdeal()
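
The reasoning above can also be sketched by hand. The example below builds an idealized tumor density as a mixture of a baseline mode and an overexpressed mode, and a normal-tissue density with a single low mode. All parameter values (`p_over`, the means, the standard deviation) are hypothetical and chosen only for illustration; this is not how `oncoMixIdeal()` is implemented.

```r
library(ggplot2)

# Hypothetical parameters, chosen only for illustration
x      <- seq(-2, 10, length.out = 500)  # expression axis
p_over <- 0.2                            # fraction of tumors overexpressing

# Normal tissue: a single mode at low expression
dens_normal <- dnorm(x, mean = 1, sd = 0.6)

# Tumor tissue: baseline mode plus an overexpressed mode
dens_tumor <- (1 - p_over) * dnorm(x, mean = 1, sd = 0.6) +
              p_over       * dnorm(x, mean = 6, sd = 0.6)

ideal <- data.frame(
  expression = rep(x, 2),
  density    = c(dens_tumor, dens_normal),
  tissue     = rep(c("tumor", "normal"), each = length(x))
)

# One curve per tissue type
p <- ggplot(ideal, aes(expression, density, colour = tissue)) +
  geom_line()
print(p)
```

Under these assumptions the tumor curve is bimodal while the normal curve is unimodal, which is the pattern the model treats as the signature of an oncogene candidate.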