Sonali Arora (sarora@fredhutch.org)

Lets say that we are interested in predicting the run time of an athlete depending on his shoe size, height and weight in a study of 100 people. We can do so using a simple linear regression model where

`y = beta0 + beta1 * height + beta2 * weight + beta3 * shoe_size `

Here y is the response variable (run time), n is the number of observations (100 people), p is the number of variables/ features/ predictors (3 IE height, weight, shoe size), X is a nxp matrix

This data set is a low dimensional data where n >> p but most of the biological data sets coming out of modern biological techniques are high dimensional IE n << p This poses statistical challenge and simple linear regression can no longer help us.

For example,

- Identify the risk factors(genes) for prostrate cancer based on gene expression data
- Predict the chances of breast cancer survival in a patient.
- Identify patterns of gene expression among different sub types of breast cancer

In all of the 3 examples, listed above n, number of observations, is 30-40 patients whereas p, number of features, is approximately 30,000 genes. Try writing a linear regression formula for the outcome variable, y, in any of the above three scenarios..

Listed below are things that can go wrong with high dimensional data - some of these predictors are useful, some are not - if we include too many predictors, we can over fit the data

This is why we need Machine Learning. Lets first introduce some basic concepts and then dive into examples and a lab session.

**Supervised Learning** - Use a data set X to predict the association with a response variable Y. The response variable can be continuous or categorical. For example: Predicting the chances of breast cancer survival in a patient.

**Unsupervised Learning** - Discover the associations or patterns in X. No response variable is present. For example: Cluster similar genes into groups.

**Training & Test Datasets** - Usually we split observation into test and training data sets. We fit the model on the training data set and evaluate on the test data set. The test set error rate is an estimate of the models performance on future data sets.

**Model Selection** - We usually consider numerous models for a given problem. For example, we are trying to identify the genes responsible for a given disease using gene expression data set- we could have the following models a) model 1 - Use all 30000 genes from the array to build a model b) model 2 - we include only genes related to the pathway that we know is upregulated in that disease to build a model c) model 3 - include genes found in literature which are known to influence this disease It is highly recommended to use the test set only on our final model to see how our model will do with new, unseen data. So how do we pick the best model which can be tested on the test data set?

**Cross-validation** We can use different approaches to find the best model. Lets look at the commonly used approaches, namely, validation set, leave one out cross-validation, k-fold cross validation.

Briefly, the **validation set approach** deals with diving the full data sets into 3 groups - training set, validation set and the test set. We train the models on the training set, evaluate their performance on the validation set and then the best model is chosen to fit on the test set.

The **leave one out cross validation** starts with fitting n models (where n is number of observations in the training data set), each on n-1 observations, evaluating each model on the left-out observation. The best model is the one for which the total test error is the smallest and that is then used to predict the test set.

Lastly the **5 fold cross validation** (here k=5), is splitting the training data set into 5 sets and repeatedly training the model on the other 4 sets and evaluating the performance on the fifth.

**Bias, Variance, Overfitting** - Bias refers to the average difference between the actual betas and the predicted betas, Variance refers to the amount by which the betas differ across experiments. As the model complexity(no of variables) increases, the bias decreases and the variance increases. This is know as the Bias-Variance Tradeoff and a model that has too much of variance, is said to be over fit.

For **Unsupervised learning**, we will use RNA-Seq count data from the Biocoductor package, *airway*. From the abstract, a brief description of the RNA-Seq experiment on airway smooth muscle (ASM) cell lines: “Using RNA-Seq, a high-throughput sequencing method, we characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18 hours).”

```
library(airway)
data("airway")
se <- airway
colData(se)
```

```
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
```

```
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)
```

For **Supervised learning**, we will use cervical count data from the Biocoductor package, *MLSeq*. This data set contains expressions of 714 miRNA’s of human samples. There are 29 tumor and 29 non-tumor cervical samples. For learning purposes, we can treat these as two separate groups and run various classification algorithms.

```
library(MLSeq)
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
cervical = read.table(filepath, header = TRUE)
```

Unsupervised Learning is a set of statistical tools intended for the setting in which we have only a set of ‘p’ features measured on ‘n’ observations. We are primarily interested in discovering interesting things about the ‘p’ features.

Unsupervised Learning is often performed as a part of Exploratory Data Analysis. These tools help us to get a good idea about the data set. Unlike a supervised learning problem, where we can use prediction to gain some confidence about our learning algorithm, there is no way to check our model. The learning algorithm is thus, aptly named “unsupervised”.

**RLOG TRANSFORMATION**

Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean.

In RNA-Seq data, the variance grows with the mean.If one performs PCA (principal components analysis) directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples.

As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog for short. See the help for ?rlog for more information and options.

The function rlog returns a SummarizedExperiment object which contains the rlog-transformed values in its assay slot:

```
rld <- rlog(dds)
head(assay(rld))
```

```
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 9.399151 9.142478 9.501695 9.320796 9.757212
## ENSG00000000005 0.000000 0.000000 0.000000 0.000000 0.000000
## ENSG00000000419 8.901283 9.113976 9.032567 9.063925 8.981930
## ENSG00000000457 7.949897 7.882371 7.834273 7.916459 7.773819
## ENSG00000000460 5.849521 5.882363 5.486937 5.770334 5.940407
## ENSG00000000938 -1.638084 -1.637483 -1.558248 -1.636072 -1.597606
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 9.512183 9.617378 9.315309
## ENSG00000000005 0.000000 0.000000 0.000000
## ENSG00000000419 9.108531 8.894830 9.052303
## ENSG00000000457 7.886645 7.946411 7.908338
## ENSG00000000460 5.663847 6.107733 5.907824
## ENSG00000000938 -1.639362 -1.637608 -1.637724
```

To assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment’s design? We use the R function dist to calculate the Euclidean distance between samples. To avoid that the distance measure is dominated by a few highly variable genes, and have a roughly equal contribution from all genes, we use it on the rlog-transformed data

```
sampleDists <- dist( t( assay(rld) ) )
sampleDists
```

```
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## SRR1039509 40.89060
## SRR1039512 37.35231 50.07638
## SRR1039513 55.74569 41.49280 43.61052
## SRR1039516 41.98797 53.58929 40.99513 57.10447
## SRR1039517 57.69438 47.59326 53.52310 46.13742 42.10583
## SRR1039520 37.06633 51.80994 34.86653 52.54968 43.21786
## SRR1039521 56.04254 41.46514 51.90045 34.82975 58.40428
## SRR1039517 SRR1039520
## SRR1039509
## SRR1039512
## SRR1039513
## SRR1039516
## SRR1039517
## SRR1039520 57.13688
## SRR1039521 47.90244 44.78171
```

Note the use of the function t to transpose the data matrix. We need this because dist calculates distances between data rows and our samples constitute the columns.

**HEATMAP**

We visualize the sample-to-sample distances in a heatmap, using the function heatmap.2 from the gplots package. Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap.

```
library("gplots")
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)
heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
symm=TRUE, trace="none", col=colors,
margins=c(2,10), labCol=FALSE )
```

**PCA**

Another way to visualize sample-to-sample distances is a principal-components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out in the two directions which explain most of the differences in the data. The x-axis is the direction (or principal component) which separates the data points the most. The amount of the total variance which is contained in the direction is printed in the axis label. Here, we have used the function plotPCA which comes with DESeq2. The two terms specified by intgroup are the interesting groups for labelling the samples; they tell the function to use them to choose colors.

`plotPCA(rld, intgroup = c("dex", "cell"))`

From both visualizations, we see that the differences between cells are considerable, though not stronger than the differences due to treatment with dexamethasone. This shows why it will be important to account for this in differential testing by using a paired design (“paired”, because each dex treated sample is paired with one untreated sample from the same cell line). We are already set up for this by using the design formula ~ cell + dex when setting up the data object in the beginning.

**MDS** Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don’t have the original data, but only a matrix of distances. Here we have the MDS plot for the distances calculated from the rlog transformed counts:

```
library(ggplot2)
mds <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mds, colData(rld))
qplot(X1,X2,color=dex,shape=cell,data=as.data.frame(mds))
```

Use the plotMDS function from the limma package to make a simila plot. What is the advtange of using this function over base R’s cmdscale?

**Solutions:**

A similar plot can be made using the plotMDS() function in limma where the input is a matrix of log-fold expression values. Here the advantage is that the distances on plot are proportional to log2-fold change and not only is the plot created, but the object (with distance matrix) is also returned.

```
suppressPackageStartupMessages({
library(limma)
library(DESeq2)
library(airway)
})
plotMDS(assay(rld), col=as.integer(dds$dex), pch=as.integer(dds$cell))
```

In supervised learning, along with the ‘p’ features, we also have the a response Y measured on the same n observations. The goal is then to predict Y using X (n x p matrix) for new observations.

For the cervical data, we know that the first 29 are non-Tumor samples whereas the last 29 are Tumor samples. We will code these as 0 and 1 respectively. We will randomly sample 30% of our data and use that as a test set. The remaining 70% of the data will be used as training data

```
set.seed(9)
class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))
nTest = ceiling(ncol(cervical) * 0.2)
ind = sample(ncol(cervical), nTest, FALSE)
cervical.train = cervical[, -ind]
cervical.train = as.matrix(cervical.train + 1)
classtr = data.frame(condition = class[-ind, ])
cervical.test = cervical[, ind]
cervical.test = as.matrix(cervical.test + 1)
classts = data.frame(condition = class[ind, ])
```

MLSeq aims to make computation less complicated for a user and allows one to learn a model using various classifier’s with one single function.

The main function of this package is classify which requires data in the form of a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment, used to store the input values, intermediate calculations and results of an analysis of differential expression.

So lets create DESeqDataSet object for both the training and test set, and run DESeq on it.

```
cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train,
colData = classtr, formula(~condition))
```

`## converting counts to integer mode`

`cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")`

```
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 72 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
```

```
cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test, colData = classts,
formula(~condition))
```

`## converting counts to integer mode`

`cervical.testS4 = DESeq(cervical.testS4, fitType = "local")`

```
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 55 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
```

Classify using Support Vector Machines.

```
svm = classify(data = cervical.trainS4, method = "svm", normalize = "deseq",
deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
```

```
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:Biostrings':
##
## type
```

`svm`

```
##
## An object of class MLSeq
##
## Method : svm
##
## Accuracy(%) : 97.83
## Sensitivity(%) : 100
## Specificity(%) : 95
##
## Reference Class : 1
```

It returns an object of class ‘MLseq’ and we observe that it successfully fitted a model with 97.8% accuracy. We can access the slots of this S4 object by

`getSlots("MLSeq")`

```
## method deseqTransform normalization confusionMat
## "character" "character" "character" "confusionMatrix"
## trained ref
## "train" "character"
```

And also, ask about the model trained.

`trained(svm)`

```
## Support Vector Machines with Radial Basis Function Kernel
##
## 46 samples
## 714 predictors
## 2 classes: '1', '0'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
##
## Summary of sample sizes: 37, 37, 36, 37, 37, 37, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 0.25 0.5644444 0.0000000 0.01840175 0.0000000
## 0.50 0.7955556 0.5553664 0.10836284 0.2471095
## 1.00 0.8770370 0.7329642 0.11396855 0.2544595
##
## Tuning parameter 'sigma' was held constant at a value of 0.001113943
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.001113943 and C = 1.
```

We can predict the class labels of our test data using “predict”

`pred.svm = predictClassify(svm, cervical.testS4)`

```
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
```

`table(pred.svm, relevel(cervical.testS4$condition, 2))`

```
##
## pred.svm 1 0
## 1 3 1
## 0 0 8
```

The other classification methods available are ‘randomforest’, ‘cart’ and ‘bagsvm’.

Train the same training data and test data using randomForest.

**Solutions:**

```
rf = classify(data = cervical.trainS4, method = "randomforest",
normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
```

```
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
```

`trained(rf)`

```
## Random Forest
##
## 46 samples
## 714 predictors
## 2 classes: '1', '0'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
##
## Summary of sample sizes: 37, 37, 36, 37, 37, 37, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.8044444 0.5877801 0.11061795 0.2326000
## 37 0.8614815 0.7105296 0.11399949 0.2379795
## 713 0.8474074 0.6871242 0.09186167 0.1882186
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 37.
```

`pred.rf = predictClassify(rf, cervical.testS4)`

```
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
```

`table(pred.rf, relevel(cervical.testS4$condition, 2))`

```
##
## pred.rf 1 0
## 1 3 1
## 0 0 8
```

`sessionInfo()`

```
## R version 3.2.0 alpha (2015-03-25 r68090)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] kernlab_0.9-20
## [2] RColorBrewer_1.1-2
## [3] gplots_2.16.0
## [4] MLSeq_1.3.0
## [5] edgeR_3.9.15
## [6] randomForest_4.6-10
## [7] limma_3.23.12
## [8] caret_6.0-41
## [9] ggplot2_1.0.1
## [10] lattice_0.20-31
## [11] DESeq2_1.7.50
## [12] RcppArmadillo_0.4.650.1.1
## [13] Rcpp_0.11.5
## [14] airway_0.101.1
## [15] genefilter_1.49.2
## [16] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [17] GenomicFeatures_1.19.36
## [18] AnnotationDbi_1.29.21
## [19] Biobase_2.27.3
## [20] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [21] BSgenome_1.35.20
## [22] rtracklayer_1.27.11
## [23] GenomicAlignments_1.3.33
## [24] Rsamtools_1.19.49
## [25] Biostrings_2.35.12
## [26] XVector_0.7.4
## [27] GenomicRanges_1.19.52
## [28] GenomeInfoDb_1.3.16
## [29] IRanges_2.1.43
## [30] S4Vectors_0.5.22
## [31] BiocGenerics_0.13.11
## [32] BiocStyle_1.5.3
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-120 bitops_1.0-6 pbkrtest_0.4-2
## [4] tools_3.2.0 rpart_4.1-9 KernSmooth_2.23-14
## [7] Hmisc_3.15-0 DBI_0.3.1 mgcv_1.8-6
## [10] colorspace_1.2-6 nnet_7.3-9 BradleyTerry2_1.0-6
## [13] compiler_3.2.0 quantreg_5.11 formatR_1.1
## [16] SparseM_1.6 labeling_0.3 caTools_1.17.1
## [19] scales_0.2.4 brglm_0.5-9 stringr_0.6.2
## [22] digest_0.6.8 foreign_0.8-63 minqa_1.2.4
## [25] rmarkdown_0.5.1 htmltools_0.2.6 lme4_1.1-7
## [28] RSQLite_1.0.0 BiocParallel_1.1.21 gtools_3.4.1
## [31] acepack_1.3-3.3 car_2.0-25 RCurl_1.95-4.5
## [34] Formula_1.2-0 futile.logger_1.4 Matrix_1.2-0
## [37] munsell_0.4.2 proto_0.3-10 yaml_2.1.13
## [40] MASS_7.3-40 zlibbioc_1.13.3 plyr_1.8.1
## [43] grid_3.2.0 gdata_2.13.3 splines_3.2.0
## [46] annotate_1.45.4 locfit_1.5-9.1 knitr_1.9
## [49] geneplotter_1.45.0 reshape2_1.4.1 codetools_0.2-11
## [52] biomaRt_2.23.5 futile.options_1.0.0 XML_3.98-1.1
## [55] evaluate_0.5.5 latticeExtra_0.6-26 lambda.r_1.1.7
## [58] nloptr_1.0.4 foreach_1.4.2 gtable_0.1.2
## [61] xtable_1.7-4 e1071_1.6-4 class_7.3-12
## [64] survival_2.38-1 snow_0.3-13 iterators_1.0.7
## [67] cluster_2.0.1
```

- Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T and Ozturk A (2014). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.3.0.
- Himes, E. B, Jiang, X., Wagner, P., Hu, R., Wang, Q., Klanderman, B., Whitaker, M. R, Duan, Q., Lasky-Su, J., Nikolos, C., Jester, W., Johnson, M., Panettieri, A. R, Tantisira, G. K, Weiss, T. S, Lu and Q. (2014). “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE, 9(6), pp. e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665.
- An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman