In this example, we will use clinical data and three types of ’omic data for binary classification of breast tumours. We also use several strategies and definitions of similarity to create features.
For this we will use data from The Cancer Genome Atlas (TCGA) and integrate four types of ’omic data:
- gene expression from Agilent mRNA microarrays
- DNA methylation (Illumina HumanMethylation 27K microarrays)
- proteomic measures from reverse-phase protein arrays, and
- miRNA sequencing
Figure 1 shows the rules for converting patient data into similarity networks, which serve as units of input (or “features”) for the model.
- Gene expression: Features are defined at the level of pathways; i.e. each feature groups the genes belonging to one pathway. Similarity is defined as pairwise Pearson correlation.
- Clinical variables: Each variable is its own feature and similarity is defined as normalized difference.
- Proteomic and methylation data: Features are defined at the level of the entire data layer; a single feature is created for all of the proteomic data, and another for all of the methylation data. Similarity is defined as pairwise Pearson correlation.
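
To make the two similarity definitions above concrete, here is a minimal Python sketch. It is an illustration only, not the implementation used in this example: the function names and toy patient data are hypothetical, and the normalized-difference formula (one minus the absolute difference divided by the range of the variable) is an assumption, since the text does not spell it out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def pearson_network(profiles):
    """Pairwise patient similarity for one feature.

    `profiles` maps a patient ID to that patient's measurements, e.g. the
    genes of one pathway, or every probe in a whole data layer.
    Returns {(patient_a, patient_b): similarity} for each unordered pair.
    """
    ids = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b])
            for i, a in enumerate(ids) for b in ids[i + 1:]}


def normalized_difference_network(values):
    """Pairwise patient similarity for a single clinical variable.

    ASSUMED formula: 1 - |x_a - x_b| / (max - min), so identical values
    score 1 and the two most different patients score 0.
    """
    ids = sorted(values)
    span = max(values.values()) - min(values.values())
    if span == 0:
        # Every patient has the same value, so all pairs are maximally similar.
        return {(a, b): 1.0 for i, a in enumerate(ids) for b in ids[i + 1:]}
    return {(a, b): 1 - abs(values[a] - values[b]) / span
            for i, a in enumerate(ids) for b in ids[i + 1:]}


# Hypothetical toy data: three patients, one three-gene pathway, one clinical variable.
pathway_expression = {
    "P1": [1.0, 2.0, 3.0],
    "P2": [2.0, 4.0, 6.0],
    "P3": [3.0, 2.0, 1.0],
}
age = {"P1": 40, "P2": 50, "P3": 80}

pathway_feature = pearson_network(pathway_expression)
age_feature = normalized_difference_network(age)
```

Each returned dictionary plays the role of one similarity network, i.e. one feature. A pathway-level design calls `pearson_network` once per pathway, while a layer-level design calls it once with all measurements in that data layer.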