June 15, 2017

Road map

  • use cases
  • user interface concepts
  • cluster analysis components
    • primitive sensitivity analysis
  • classifier components
    • role of metapackages like caret/mlr/MLInterfaces

Use case 1: transcript profiles to distinguish tissue source

  • Illumina BodyMap in GEO
  • another application: adequacy of mouse models of human biology

Species and organ of origin: microarrays and orthologues (McCall et al., NAR 2012)

Species, organ of origin, and batch: RNA-seq and orthologues (Lin et al., PNAS 2014)

  • Between-species disparity stronger than within-organ similarity

Conflict

  • Distinguishing organ of origin through gene expression patterns
    • McCall et al., NAR 2011
    • adjusted arrays yield 85 samples, each a vector of 22215 expression values
    • barcode transformation: transcriptomes cluster by organ
  • Comparison of human and mouse transcriptomes
    • Lin et al., PNAS 2014
    • mRNA abundance for orthologous genes by RNA-seq: 30 samples, each a vector of 15106 values
    • transcriptomes cluster by species

Use case 2: Oncotype DX gene signature for breast cancer survival

  • 21 genes useful for prediction of breast cancer recurrence
  • Paik, Shak, Tang et al. NEJM 2004
  • the genefu package encodes the signature (sig.oncotypedx)
  • We'll assess the capacity of the gene set to predict overall survival in a classic breast cancer dataset (van de Vijver 2002), as packaged in genefu

Setup for NKI breast cancer expression/clinical data

library(genefu); library(survival)
data(nkis)   # loads expression (data.nkis), clinical (demo.nkis), and annotation (annot.nkis)
# map probe identifiers to NCBI gene symbols
map = as.character(annot.nkis$NCBI.gene.symbol)
names(map) = as.character(annot.nkis$probe)
ndata.nkis = data.nkis
colnames(ndata.nkis) = map[colnames(data.nkis)]   # relabel expression columns with symbols
cbind(ndata.nkis[1:4,1:4], demo.nkis[1:4,5:8])
##           ESR1 TBC1D9  GATA3   CA12 grade node size age
## NKI_123  0.195 -0.114  0.202  0.158     3    0  2.0  48
## NKI_327  0.034  0.033  0.158  0.103     2    1  2.0  49
## NKI_291 -0.417  0.140  0.006 -0.266     2    1  1.2  39
## NKI_370  0.429  0.352 -0.050  0.236     1    1  1.8  51

Label expression columns with appropriate symbols; test

# overall survival: time (t.os) and event indicator (e.os)
nkSurv = Surv(demo.nkis$t.os, demo.nkis$e.os)
# restrict expression columns to Oncotype DX genes present in the NKI data
odata = ndata.nkis[, intersect(as.character(sig.oncotypedx$symbol), 
    colnames(ndata.nkis))]
fullnk = cbind(demo.nkis, odata)
# baseline Cox model using clinical covariates only
coxph(nkSurv~er+age, data=fullnk)
## Call:
## coxph(formula = nkSurv ~ er + age, data = fullnk)
## 
##        coef exp(coef) se(coef)     z      p
## er  -1.0018    0.3672   0.3425 -2.92 0.0034
## age -0.0328    0.9677   0.0271 -1.21 0.2268
## 
## Likelihood ratio test=10.1  on 2 df, p=0.00657
## n= 129, number of events= 36 
##    (21 observations deleted due to missingness)

Create a survival tree using all available clinical and expression data

# drop columns not used as predictors in the tree
rfullnk = fullnk[,-c(1,2,3,9,10,11,12,13,14,17,18,19)]
library(rpart); r1 = rpart(nkSurv~.,data=rfullnk)   # Surv response => survival tree
r1
## n=129 (21 observations deleted due to missingness)
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 129 146.652400 1.00000000  
##    2) BIRC5< -0.0365 85  62.712830 0.47436610  
##      4) BIRC5< -0.3975 32   1.801804 0.09909801 *
##      5) BIRC5>=-0.3975 53  52.568420 0.70984040  
##       10) BAG1< -0.219 14   1.660224 0.16988820 *
##       11) BAG1>=-0.219 39  44.603630 0.96814410  
##         22) GSTM1< 0.1565 30  22.464060 0.58792190  
##           44) MKI67>=-0.0655 19   8.070774 0.23294560 *
##           45) MKI67< -0.0655 11   7.582306 1.38868000 *
##         23) GSTM1>=0.1565 9  12.691410 2.77622500 *
##    3) BIRC5>=-0.0365 44  58.962600 2.35960200  
##      6) PGR>=-0.1625 17  16.872130 1.05016300 *
##      7) PGR< -0.1625 27  34.118410 3.40043200  
##       14) GSTM1< -0.1235 7   5.180967 1.32643500 *
##       15) GSTM1>=-0.1235 20  23.712420 4.39730500 *

The CRAN package partykit enhances support for trees fit with rpart and provides many additional models

library(partykit)
p1p = as.party(prune(r1, cp=.05))   # prune at cp = 0.05, then convert to a party object for display

Visualize the pruned tree along with K-M curves for leaves
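
No code for this step appears above; a minimal sketch follows, assuming the 21 cases rpart dropped are those with a missing survival outcome:

r1p = prune(r1, cp = .05)                 # same pruning as p1p above
plot(p1p)                                 # draw the pruned tree via partykit
leaf = factor(r1p$where)                  # terminal-node membership for the fitted cases
ok = !is.na(demo.nkis$t.os) & !is.na(demo.nkis$e.os)   # assumed complete-case criterion
plot(survfit(nkSurv[ok] ~ leaf), col = seq_len(nlevels(leaf)), lwd = 2)   # K-M curve per leaf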

Use case 3: Cell fate signatures from the fruitfly blastocyst

Data setup

library(drosmap) # biocLite("vjcitn/drosmap")
data(expressionPatterns)                  # spatial expression patterns, one column per gene
data(template); template=template[,-1]    # display template for the embryo; drop the first column
data(uniqueGenes)                         # names of the unique gene set
uex = expressionPatterns[,uniqueGenes]    # restrict patterns to the unique genes
uex[1:5,1:5]
##           pnr      Abd.B        lama      Mkp3       fz2
## 1 0.014123479 0.05531271 0.014584370 0.2086337 0.3759253
## 2 0.009015973 0.01234864 0.014212999 0.3222693 0.5585198
## 3 0.023047258 0.01486692 0.013431432 0.3599486 0.5329454
## 4 0.013179102 0.03184486 0.005370888 0.2365888 0.2585371
## 5 0.008820991 0.06811459 0.016528382 0.1136623 0.1034636

Spatial gene-specific patterns

imageBatchDisplay(uex[,1:16], nrow=4, ncol=4, template=template)

Can we transform spatial patterns for 701 genes to cohere with this fate map?

An assignment of "principal patterns"

Comments

  • Curse of dimensionality: as the number of features increases, distance metrics become less useful for grouping objects (the space is mostly empty and pairwise distances become nearly indistinguishable); see the small simulation below
  • Bet-on-sparsity principle: favor procedures that can prune features/dimensions, because in the non-sparse case nothing works well
  • All the results displayed are tunable and could be made interactive
  • Sensitivity analysis: enhance the capacity of reports to demonstrate their own robustness
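
A small illustration of the first point (not from the talk): as the dimension grows, the ratio of the largest to the smallest pairwise distance among random points shrinks toward 1, so distance-based grouping loses contrast.

set.seed(1)
sapply(c(2, 10, 100, 1000), function(p) {
  d = dist(matrix(runif(50 * p), nrow = 50))   # 50 random points in [0,1]^p
  max(d) / min(d)                              # contrast between farthest and nearest pairs
})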

Remainder of talk

  • Bioconductor strategies: user interface and object designs
  • Cluster analysis formalities; hclustWidget
  • Classifier formalities; mlearnWidget

On the user interface

  • The method is primary (constituents of CRAN task view "MachineLearning")
  • What does the learner consume?
    • data in a specific format, tuning parameters
  • What does the learner emit?
    • an object with scores, assignments, metadata about the run
  • Aims
    • reduce complexity of user tasks
    • capitalize on formal structuring of containers for inputs and outputs
    • foster sensitivity analysis
  • We'll now use a modified MLInterfaces::hclustWidget that capitalizes on these notions
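
As a concrete illustration of this input/output formalism, here is a minimal sketch on a standard dataset (not the talk's data), assuming the MLInterfaces MLearn/knnI/confuMat calling pattern:

library(MLInterfaces)
data(iris)
# formula + data + learner schema + training indices in; structured object out
fit = MLearn(Species ~ ., data = iris, knnI(k = 3), trainInd = seq(1, 150, by = 2))
fit                # classifierOutput: scores, assignments, metadata about the run
confuMat(fit)      # confusion matrix for the held-out cases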

Exploring clusters with tissue-of-origin data

(Interactive hclustWidget demonstration; Shiny applications cannot run in this static document.)

Some definitions: general distance
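
In general, a distance (metric) \(d\) on a set of objects is required to satisfy:

  • \(d(x, y) \ge 0\), with \(d(x, y) = 0\) if and only if \(x = y\)
  • symmetry: \(d(x, y) = d(y, x)\)
  • triangle inequality: \(d(x, z) \le d(x, y) + d(y, z)\)

Many useful dissimilarities (e.g., one minus a correlation) relax some of these conditions.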

Examples:

Euclidean distance

  • High-school analytic geometry: distance between two points in \(R^3\)
  • \(p_1 = (x_1, y_1, z_1)\), \(p_2 = (x_2, y_2, z_2)\)
  • \(\Delta x = x_1 - x_2\), etc.
  • \(d(p_1, p_2) = \sqrt{(\Delta x)^2 + (\Delta y)^2 + (\Delta z)^2}\)

Manhattan distance

  • \(d(p_1, p_2) = |\Delta x| + |\Delta y| + |\Delta z|\)
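
Both metrics are available through R's dist(); a quick check of the two formulas:

p = rbind(p1 = c(1, 2, 3), p2 = c(4, 6, 3))   # two points in R^3
dist(p, method = "euclidean")   # sqrt(3^2 + 4^2 + 0^2) = 5
dist(p, method = "manhattan")   # |3| + |4| + |0| = 7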

New concept of distance for categorical vectors:

Sam Buttrey and Lyn Whitaker's treeClust (R Journal article)
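
A hedged sketch of how the package might be used, assuming the CRAN treeClust interface in which d.num selects one of four tree-based dissimilarities and treeClust.control(return.dists = TRUE) requests a dist object:

library(treeClust)
# works on mixed data frames (numeric and factor columns); no metric is chosen by the user,
# dissimilarities are derived from rpart trees predicting each variable from the others
tc = treeClust(iris, d.num = 2, control = treeClust.control(return.dists = TRUE))
hc = hclust(tc$dists)     # the tree-based dissimilarity feeds standard clustering tools
plot(hc, labels = FALSE)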

What is the ward.D2 agglomeration method?

  • Enables very rapid update upon change of distance or # genes
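
ward.D2 implements Ward's minimum-variance criterion: each fusion merges the pair of clusters giving the smallest increase in total within-cluster squared error, squaring the dissimilarities internally (unlike the older ward.D), so it is intended for use with unsquared Euclidean distances. A generic example:

hc = hclust(dist(scale(mtcars)), method = "ward.D2")   # Euclidean distances, Ward fusion
plot(hc)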

What is the Jaccard similarity coefficient?
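
For two sets (or binary vectors) it is the size of the intersection divided by the size of the union; it is often used to compare cluster memberships across runs. A quick example:

a = c(1, 1, 0, 1, 0)
b = c(1, 0, 0, 1, 1)
sum(a & b) / sum(a | b)   # 2 shared features / 4 present in either = 0.5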

Summary

  • Hierarchical clustering is tunable; distance, fusion method, feature selection all have impact
  • There are other principles/algorithms: divisive, semi-supervised, model-based
  • Other figures of merit: consensus, gap statistic
  • See the mlr package for a structured interface

On classification methods with genomic data

  • Vast topic
  • Key resources in R: the CRAN "MachineLearning" task view
  • In Bioconductor, consider
    • the 'StatisticalMethod' biocView (next slide)
    • MLInterfaces (a kind of metapackage)

BiocViews: StatisticalMethod

Conceptual basis for methods covered in the talk

  • "Two cultures" of statistical analysis (Leo Breiman)
    • model-based
    • algorithmic
  • Ideally you will understand and use both
    • \(X \sim N_p(\mu, \Sigma)\), seek and use structure in \(\mu\), \(\Sigma\) as estimated from data; pursue weakening of model assumptions
    • \(y \approx f(x)\) with response \(y\) and features \(x\), apply agnostic algorithms to the data to choose \(f\) and assess the quality of the prediction/classification

Linear discriminant analysis

  • Use a linear combination of features to define a score for each object
  • The value of the score determines the class assignment
  • This assumes that the features are quantitative and are measured consistently for all objects
  • for \(p\)-dimensional feature vector \(x\) with prior probability \(\pi_k\), mean \(\mu_k\) for class \(k\), and common covariance matrix for all classes \[ \delta_k(x) = x^t\Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^t \Sigma^{-1} \mu_k + \log \pi_k \] is the discriminant function; \(x\) is assigned to the class for which \(\delta_k(x)\) is largest
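
A minimal illustration with MASS::lda on a standard dataset (not the talk's genomic data):

library(MASS)
ld = lda(Species ~ ., data = iris)   # estimates class means and a pooled covariance matrix
pr = predict(ld)
head(pr$class)   # class assignments (largest discriminant score wins)
head(pr$x)       # discriminant scores: linear combinations of the features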

Other approaches, issues

  • Direct "learning" of statistical parameters in regression or neural network models
  • Recursive partitioning of classes, repeatedly searching all features for optimal discrimination
  • Ensemble methods in which votes are assembled among different learners or over perturbations of the data
  • Unifying loss-function framework: see Elements of statistical learning by Hastie, Tibshirani and Friedman
  • Figures of merit: misclassification rate (cross-validated), AUROC
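
An illustrative ensemble fit, sketched with the CRAN randomForest package and a standard dataset (not the talk's data): votes are pooled over many trees grown on bootstrap resamples, and the out-of-bag (OOB) error approximates the cross-validated misclassification rate.

library(randomForest)
set.seed(1)
rf = randomForest(Species ~ ., data = iris)   # ensemble of recursively partitioned trees
rf$confusion                                  # out-of-bag confusion matrix
mean(predict(rf) != iris$Species)             # OOB misclassification rate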

A demonstration with tissue-of-origin expression data follows

(Interactive mlearnWidget demonstration; Shiny applications cannot run in this static document.)

Remarks

  • all examples here use mature, reduced data
  • statistical learning is also important at earlier processing stages, but data volume poses challenges there
  • interactive modeling/learning is offered as the product, in opposition to a single, potentially overoptimistic selection
  • new work on post-selection inference is available in the selectiveInference package