Introduction to Machine Learning

Let's say that we are interested in predicting the run time of an athlete from their shoe size, height and weight in a study of 100 people. We can do so using a simple linear regression model where

y = beta0 + beta1 * height + beta2 * weight + beta3 * shoe_size 

Here y is the response variable (run time), n is the number of observations (100 people), p is the number of variables/features/predictors (3, i.e. height, weight and shoe size), and X is the n x p matrix of predictors.
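
As a concrete sketch, we could simulate data for 100 hypothetical athletes and fit this model with lm(). The variable names and coefficients below are made up purely for illustration:

set.seed(1)
n <- 100
height    <- rnorm(n, mean = 170, sd = 10)   # cm
weight    <- rnorm(n, mean = 70, sd = 8)     # kg
shoe_size <- rnorm(n, mean = 42, sd = 2)     # EU size
# hypothetical "true" coefficients plus noise
run_time  <- 50 - 0.05 * height + 0.1 * weight - 0.2 * shoe_size + rnorm(n, sd = 2)
fit <- lm(run_time ~ height + weight + shoe_size)
coef(fit)   # estimates of beta0, beta1, beta2, beta3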

This data set is low dimensional, with n >> p. Most data sets coming out of modern biological techniques, however, are high dimensional, i.e. n << p. This poses a statistical challenge, and simple linear regression can no longer help us.

For example,

In all of the 3 examples listed above, n, the number of observations, is 30-40 patients, whereas p, the number of features, is approximately 30,000 genes. Try writing a linear regression formula for the outcome variable, y, in any of the above three scenarios.

Listed below are things that can go wrong with high dimensional data:
- some of these predictors are useful, some are not
- if we include too many predictors, we can overfit the data

This is why we need Machine Learning. Let's first introduce some basic concepts and then dive into examples and a lab session.

Supervised Learning - Use a data set X to predict a response variable Y. The response variable can be continuous or categorical. For example: predicting the chances of breast cancer survival for a patient.

Unsupervised Learning - Discover the associations or patterns in X. No response variable is present. For example: Cluster similar genes into groups.

Training & Test Datasets - Usually we split the observations into training and test data sets. We fit the model on the training data set and evaluate it on the test data set. The test set error rate is an estimate of the model's performance on future data sets.
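
A minimal base-R sketch of such a split, using a small simulated data frame (the data and variable names here are hypothetical, for illustration only):

set.seed(42)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 1.5 * dat$x1 - 0.5 * dat$x2 + rnorm(100)

train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))  # 70% of rows for training
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit      <- lm(y ~ ., data = train)          # fit on the training set
pred     <- predict(fit, newdata = test)     # predict the held-out test set
test_mse <- mean((test$y - pred)^2)          # test set error, an estimate of future performance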

Model Selection - We usually consider numerous models for a given problem. For example, if we are trying to identify the genes responsible for a given disease using a gene expression data set, we could have the following models:
a) model 1 - use all 30000 genes from the array to build a model
b) model 2 - include only genes belonging to the pathway that we know is upregulated in that disease
c) model 3 - include genes found in the literature which are known to influence this disease
It is highly recommended to use the test set only on our final model, to see how that model will do with new, unseen data. So how do we pick the best model to be tested on the test data set?

Cross-validation - We can use different approaches to find the best model. Let's look at the commonly used approaches, namely the validation set approach, leave-one-out cross-validation and k-fold cross-validation.

Briefly, the validation set approach divides the full data set into 3 groups - training set, validation set and test set. We train the models on the training set, evaluate their performance on the validation set, and the best model is then evaluated on the test set.

Leave-one-out cross-validation fits n models (where n is the number of observations in the training data set), each on n-1 observations, and evaluates each model on the left-out observation. The best model is the one for which the total test error is smallest, and that model is then used to predict the test set.

Lastly, k-fold cross-validation (here k = 5) splits the training data set into 5 folds; the model is repeatedly trained on 4 of the folds and its performance is evaluated on the fifth.
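
A base-R sketch of 5-fold cross-validation, continuing with the simulated dat data frame from the split example above:

set.seed(7)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))     # randomly assign each row to a fold
cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ ., data = dat[folds != i, ])         # train on the other 4 folds
  pred <- predict(fit, newdata = dat[folds == i, ])   # evaluate on the held-out fold
  mean((dat$y[folds == i] - pred)^2)
})
mean(cv_mse)   # cross-validation estimate of the test error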

Bias, Variance, Overfitting - Bias refers to the average difference between the estimated betas and the true betas; variance refers to the amount by which the estimated betas would differ across repeated experiments. As model complexity (the number of variables) increases, the bias decreases and the variance increases. This is known as the Bias-Variance Tradeoff, and a model that has too much variance is said to be overfit.
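
A toy illustration with simulated data: as the polynomial degree (model complexity) grows, the training error keeps shrinking while the error on held-out data typically starts to rise again, which is overfitting in action.

set.seed(3)
x <- runif(100, -2, 2)
y <- sin(x) + rnorm(100, sd = 0.3)
train <- 1:70                                  # first 70 points for training, the rest held out
errs <- sapply(1:8, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)
  test_pred <- predict(fit, newdata = data.frame(x = x[-train]))
  c(train_mse = mean(resid(fit)^2),
    test_mse  = mean((y[-train] - test_pred)^2))
})
round(errs, 3)   # columns correspond to polynomial degrees 1 to 8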

Datasets

For Unsupervised learning, we will use RNA-Seq count data from the Bioconductor package, airway. From the abstract, a brief description of the RNA-Seq experiment on airway smooth muscle (ASM) cell lines: “Using RNA-Seq, a high-throughput sequencing method, we characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18 hours).”

library(airway)
data("airway")
se <- airway
colData(se)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)  

For Supervised learning, we will use cervical count data from the Bioconductor package, MLSeq. This data set contains the expression of 714 miRNAs in human samples. There are 29 tumor and 29 non-tumor cervical samples. For learning purposes, we can treat these as two separate groups and run various classification algorithms.

library(MLSeq)
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
cervical = read.table(filepath, header = TRUE)

Unsupervised Learning

Unsupervised Learning is a set of statistical tools intended for the setting in which we have only a set of ‘p’ features measured on ‘n’ observations. We are primarily interested in discovering interesting things about the ‘p’ features.

Unsupervised Learning is often performed as a part of Exploratory Data Analysis. These tools help us to get a good idea about the data set. Unlike a supervised learning problem, where we can use prediction to gain some confidence about our learning algorithm, there is no way to check our model. The learning algorithm is thus, aptly named “unsupervised”.

RLOG TRANSFORMATION

Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean.

In RNA-Seq data, the variance grows with the mean. If one performs PCA (principal components analysis) directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples.

As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog for short. See the help for ?rlog for more information and options.

The function rlog returns a SummarizedExperiment object which contains the rlog-transformed values in its assay slot:

rld <- rlog(dds)   
head(assay(rld))    
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003   9.399151   9.142478   9.501695   9.320796   9.757212
## ENSG00000000005   0.000000   0.000000   0.000000   0.000000   0.000000
## ENSG00000000419   8.901283   9.113976   9.032567   9.063925   8.981930
## ENSG00000000457   7.949897   7.882371   7.834273   7.916459   7.773819
## ENSG00000000460   5.849521   5.882363   5.486937   5.770334   5.940407
## ENSG00000000938  -1.638084  -1.637483  -1.558248  -1.636072  -1.597606
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003   9.512183   9.617378   9.315309
## ENSG00000000005   0.000000   0.000000   0.000000
## ENSG00000000419   9.108531   8.894830   9.052303
## ENSG00000000457   7.886645   7.946411   7.908338
## ENSG00000000460   5.663847   6.107733   5.907824
## ENSG00000000938  -1.639362  -1.637608  -1.637724

To assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit the expectation from the experiment’s design? We use the R function dist to calculate the Euclidean distance between samples. To avoid the distance measure being dominated by a few highly variable genes, and to have a roughly equal contribution from all genes, we use it on the rlog-transformed data.

sampleDists <- dist( t( assay(rld) ) )
sampleDists
##            SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## SRR1039509   40.89060                                            
## SRR1039512   37.35231   50.07638                                 
## SRR1039513   55.74569   41.49280   43.61052                      
## SRR1039516   41.98797   53.58929   40.99513   57.10447           
## SRR1039517   57.69438   47.59326   53.52310   46.13742   42.10583
## SRR1039520   37.06633   51.80994   34.86653   52.54968   43.21786
## SRR1039521   56.04254   41.46514   51.90045   34.82975   58.40428
##            SRR1039517 SRR1039520
## SRR1039509                      
## SRR1039512                      
## SRR1039513                      
## SRR1039516                      
## SRR1039517                      
## SRR1039520   57.13688           
## SRR1039521   47.90244   44.78171

Note the use of the function t to transpose the data matrix. We need this because dist calculates distances between data rows and our samples constitute the columns.

HEATMAP

We visualize the sample-to-sample distances in a heatmap, using the function heatmap.2 from the gplots package. Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap.

library("gplots")
library("RColorBrewer")

sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)
heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
          symm=TRUE, trace="none", col=colors,
          margins=c(2,10), labCol=FALSE )

PCA

Another way to visualize sample-to-sample distances is a principal-components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out in the two directions which explain most of the differences in the data. The x-axis is the direction (or principal component) which separates the data points the most. The amount of the total variance which is contained in the direction is printed in the axis label. Here, we have used the function plotPCA which comes with DESeq2. The two terms specified by intgroup are the interesting groups for labelling the samples; they tell the function to use them to choose colors.

plotPCA(rld, intgroup = c("dex", "cell"))

From both visualizations, we see that the differences between cells are considerable, though not stronger than the differences due to treatment with dexamethasone. This shows why it will be important to account for this in differential testing by using a paired design (“paired”, because each dex treated sample is paired with one untreated sample from the same cell line). We are already set up for this by using the design formula ~ cell + dex when setting up the data object in the beginning.

MDS

Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don’t have the original data, but only a matrix of distances. Here we have the MDS plot for the distances calculated from the rlog-transformed counts:

library(ggplot2)
mds <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mds, colData(rld))
qplot(X1,X2,color=dex,shape=cell,data=as.data.frame(mds))

Exercise:

Use the plotMDS function from the limma package to make a similar plot. What is the advantage of using this function over base R’s cmdscale?

Solutions:

A similar plot can be made using the plotMDS() function in limma, where the input is a matrix of log-expression values. The advantage is that the distances on the plot reflect leading log2-fold changes between samples, and not only is the plot created, but an object (containing the distance matrix) is also returned.

suppressPackageStartupMessages({
   library(limma)
   library(DESeq2)
   library(airway)
})
plotMDS(assay(rld), col=as.integer(dds$dex), pch=as.integer(dds$cell))
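
As noted above, plotMDS() not only draws the plot but also returns an object describing the MDS solution. A quick sketch to confirm this (the call simply repeats the plot and stores the result; str() inspects the components of the returned object):

mdsObj <- plotMDS(assay(rld), col=as.integer(dds$dex), pch=as.integer(dds$cell))
str(mdsObj, max.level = 2)   # inspect the returned MDS object, including its distance matrix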

Supervised Learning

In supervised learning, along with the ‘p’ features, we also have a response Y measured on the same n observations. The goal is then to predict Y using X (an n x p matrix) for new observations.

For the cervical data, we know that the first 29 are non-Tumor samples whereas the last 29 are Tumor samples. We will code these as 0 and 1 respectively. We will randomly sample about 20% of the samples (as in the code below) and use them as a test set. The remaining 80% of the data will be used as training data.

set.seed(9)

# class labels: first 29 samples are non-tumor (0), last 29 are tumor (1)
class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))

# hold out ~20% of the samples (columns) as a test set
nTest = ceiling(ncol(cervical) * 0.2)
ind = sample(ncol(cervical), nTest, FALSE)

# training set: add 1 to every count and convert to a matrix
cervical.train = cervical[, -ind]
cervical.train = as.matrix(cervical.train + 1)
classtr = data.frame(condition = class[-ind, ])

# test set, treated the same way
cervical.test = cervical[, ind]
cervical.test = as.matrix(cervical.test + 1)
classts = data.frame(condition = class[ind, ])

MLSeq aims to make computation less complicated for the user and allows one to fit a model using various classifiers with a single function.

The main function of this package is classify which requires data in the form of a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment, used to store the input values, intermediate calculations and results of an analysis of differential expression.

So let's create DESeqDataSet objects for both the training and test sets, and run DESeq on them.

cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train, 
        colData = classtr, formula(~condition))
## converting counts to integer mode
cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 72 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test, colData = classts,
formula(~condition))
## converting counts to integer mode
cervical.testS4 = DESeq(cervical.testS4, fitType = "local")
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 55 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

Classify using Support Vector Machines.

svm = classify(data = cervical.trainS4, method = "svm", normalize = "deseq",
deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:Biostrings':
## 
##     type
svm
## 
##   An object of class  MLSeq 
## 
##             Method  :  svm 
## 
##        Accuracy(%)  :  97.83 
##     Sensitivity(%)  :  100 
##     Specificity(%)  :  95 
## 
##   Reference Class   :  1

It returns an object of class ‘MLSeq’, and we observe that it successfully fitted a model with 97.8% accuracy. We can access the slots of this S4 object with

getSlots("MLSeq")
##            method    deseqTransform     normalization      confusionMat 
##       "character"       "character"       "character" "confusionMatrix" 
##           trained               ref 
##           "train"       "character"

We can also inspect the trained model:

trained(svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
##  46 samples
## 714 predictors
##   2 classes: '1', '0' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## 
## Summary of sample sizes: 37, 37, 36, 37, 37, 37, ... 
## 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa      Accuracy SD  Kappa SD 
##   0.25  0.5644444  0.0000000  0.01840175   0.0000000
##   0.50  0.7955556  0.5553664  0.10836284   0.2471095
##   1.00  0.8770370  0.7329642  0.11396855   0.2544595
## 
## Tuning parameter 'sigma' was held constant at a value of 0.001113943
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.001113943 and C = 1.

We can predict the class labels of our test data using predictClassify:

pred.svm = predictClassify(svm, cervical.testS4)
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
table(pred.svm, relevel(cervical.testS4$condition, 2))
##         
## pred.svm 1 0
##        1 3 1
##        0 0 8

The other classification methods available are ‘randomforest’, ‘cart’ and ‘bagsvm’.
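
For instance, a CART model could be fit and evaluated with the same arguments as the svm call above (a sketch, assuming the identical interface; only the method string changes):

cart = classify(data = cervical.trainS4, method = "cart", normalize = "deseq",
        deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
cart
pred.cart = predictClassify(cart, cervical.testS4)
table(pred.cart, relevel(cervical.testS4$condition, 2))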

Exercise:

Train a classifier on the same training data using the random forest method and evaluate it on the same test data.

Solutions:

rf = classify(data = cervical.trainS4, method = "randomforest", 
        normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
trained(rf)
## Random Forest 
## 
##  46 samples
## 714 predictors
##   2 classes: '1', '0' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## 
## Summary of sample sizes: 37, 37, 36, 37, 37, 37, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD 
##     2   0.8044444  0.5877801  0.11061795   0.2326000
##    37   0.8614815  0.7105296  0.11399949   0.2379795
##   713   0.8474074  0.6871242  0.09186167   0.1882186
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 37.
pred.rf = predictClassify(rf, cervical.testS4)
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
table(pred.rf, relevel(cervical.testS4$condition, 2))
##        
## pred.rf 1 0
##       1 3 1
##       0 0 8

SessionInfo

sessionInfo()
## R version 3.2.0 alpha (2015-03-25 r68090)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] kernlab_0.9-20                         
##  [2] RColorBrewer_1.1-2                     
##  [3] gplots_2.16.0                          
##  [4] MLSeq_1.3.0                            
##  [5] edgeR_3.9.15                           
##  [6] randomForest_4.6-10                    
##  [7] limma_3.23.12                          
##  [8] caret_6.0-41                           
##  [9] ggplot2_1.0.1                          
## [10] lattice_0.20-31                        
## [11] DESeq2_1.7.50                          
## [12] RcppArmadillo_0.4.650.1.1              
## [13] Rcpp_0.11.5                            
## [14] airway_0.101.1                         
## [15] genefilter_1.49.2                      
## [16] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [17] GenomicFeatures_1.19.36                
## [18] AnnotationDbi_1.29.21                  
## [19] Biobase_2.27.3                         
## [20] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
## [21] BSgenome_1.35.20                       
## [22] rtracklayer_1.27.11                    
## [23] GenomicAlignments_1.3.33               
## [24] Rsamtools_1.19.49                      
## [25] Biostrings_2.35.12                     
## [26] XVector_0.7.4                          
## [27] GenomicRanges_1.19.52                  
## [28] GenomeInfoDb_1.3.16                    
## [29] IRanges_2.1.43                         
## [30] S4Vectors_0.5.22                       
## [31] BiocGenerics_0.13.11                   
## [32] BiocStyle_1.5.3                        
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-120         bitops_1.0-6         pbkrtest_0.4-2      
##  [4] tools_3.2.0          rpart_4.1-9          KernSmooth_2.23-14  
##  [7] Hmisc_3.15-0         DBI_0.3.1            mgcv_1.8-6          
## [10] colorspace_1.2-6     nnet_7.3-9           BradleyTerry2_1.0-6 
## [13] compiler_3.2.0       quantreg_5.11        formatR_1.1         
## [16] SparseM_1.6          labeling_0.3         caTools_1.17.1      
## [19] scales_0.2.4         brglm_0.5-9          stringr_0.6.2       
## [22] digest_0.6.8         foreign_0.8-63       minqa_1.2.4         
## [25] rmarkdown_0.5.1      htmltools_0.2.6      lme4_1.1-7          
## [28] RSQLite_1.0.0        BiocParallel_1.1.21  gtools_3.4.1        
## [31] acepack_1.3-3.3      car_2.0-25           RCurl_1.95-4.5      
## [34] Formula_1.2-0        futile.logger_1.4    Matrix_1.2-0        
## [37] munsell_0.4.2        proto_0.3-10         yaml_2.1.13         
## [40] MASS_7.3-40          zlibbioc_1.13.3      plyr_1.8.1          
## [43] grid_3.2.0           gdata_2.13.3         splines_3.2.0       
## [46] annotate_1.45.4      locfit_1.5-9.1       knitr_1.9           
## [49] geneplotter_1.45.0   reshape2_1.4.1       codetools_0.2-11    
## [52] biomaRt_2.23.5       futile.options_1.0.0 XML_3.98-1.1        
## [55] evaluate_0.5.5       latticeExtra_0.6-26  lambda.r_1.1.7      
## [58] nloptr_1.0.4         foreach_1.4.2        gtable_0.1.2        
## [61] xtable_1.7-4         e1071_1.6-4          class_7.3-12        
## [64] survival_2.38-1      snow_0.3-13          iterators_1.0.7     
## [67] cluster_2.0.1

References

  1. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T and Ozturk A (2014). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.3.0.
  2. Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri RA, Tantisira KG, Weiss ST, Lu Q (2014). “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE, 9(6), e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665.
  3. An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
  4. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman