- 1 Overview
- 2 Getting acquainted with machine learning via the crabs data
- 3 Learning with expression arrays
- 4 Embedding features selection in cross-validation
- 5 Session information

The term *machine learning* refers to a family of computational methods
for analyzing multivariate datasets. Each data point has a vector of
*features* in a shared *feature space*, and may have a *class label*
from some fixed finite set.

*Supervised learning* refers to processes that help articulate rules
that map *feature vectors* to *class labels*. The class labels are known
and function as supervisory information to guide rule construction.
*Unsupervised learning* refers to processes that discover structure in
collections of feature vectors. Typically the structure consists of a
grouping of objects into clusters.
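To fix ideas, here is a minimal sketch of each mode using R's built-in `iris` data (an illustration of the definitions only, not part of the analyses below):

```
library(MASS)   # for lda()

## Supervised: use the known labels to learn a rule from features to classes
fit = lda(Species ~ ., data = iris)
table(predict(fit)$class, iris$Species)  # resubstitution confusion matrix

## Unsupervised: look for grouping structure without consulting the labels
set.seed(1)
cl = kmeans(iris[, 1:4], centers = 3)
table(cl$cluster, iris$Species)          # compare discovered clusters to labels
```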

This practical introduction to machine learning will begin with a survey of a low-dimensional dataset to fix concepts, and will then address problems coming from genomic data analysis, using RNA expression and chromatin state data.

Some basic points to consider at the start:

- Distinguish predictive modeling from inference on model parameters. Typical work in epidemiology focuses on estimation of relative risks, and random samples are not required. Typical work with machine learning tools targets estimation (and minimization) of the misclassification rate; representative samples are required for this task.
- "Two cultures": model fitters vs. algorithmic predictors. If statistical models are correct, parameter estimation based on the mass of data can yield optimal discriminators (e.g., LDA). Algorithmic discriminators tend to prefer to identify boundary cases and downweight the mass of data (e.g., boosting, SVM).
- Different learning tools have different capabilities. There is little *a priori* guidance on matching learning algorithms to aspects of problems. While it is convenient to sift through a variety of approaches, one must pay a price for the model search.
- Data and model/learner visualization are important, but visualization of higher-dimensional data structures is hard. Dynamic graphics can help; look at ggobi and Rggobi for this.

These notes provide very little mathematical background on the methods; see, for example, Ripley (*Pattern Recognition and Neural Networks*, 1996), Duda, Hart, and Stork (*Pattern Classification*), and Hastie, Tibshirani, and Friedman (*The Elements of Statistical Learning*) for copious background.

The following steps bring the crabs data into scope and illustrate aspects of its structure.

```
library("MASS")
data("crabs")
dim(crabs)
```

```
## [1] 200 8
```

`crabs[1:4,] `

```
## sp sex index FL RW CL CW BD
## 1 B M 1 8.1 6.7 16.1 19.0 7.0
## 2 B M 2 8.8 7.7 18.1 20.8 7.4
## 3 B M 3 9.2 7.8 19.0 22.4 7.7
## 4 B M 4 9.6 7.9 20.1 23.1 8.2
```

`table(crabs$sex)`

```
##
## F M
## 100 100
```

The plot is shown in Figure 1.
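The code that generated Figure 1 is not reproduced here; a minimal sketch that produces comparable boxplots of RW by species within each sex, using base R's `boxplot` (the original figure may have used different graphics machinery):

```
## Boxplots of rear width (RW) by species (B, O) within each sex (F, M)
boxplot(RW ~ sp + sex, data = crabs,
        col = c("lightblue", "orange"),
        xlab = "species.sex", ylab = "RW")
```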

We will regard these data as providing five quantitative features
(FL, RW, CL, CW, BD; consult the manual page for `crabs` for an
explanation of these abbreviations) and a pair of class labels (sex, sp = species).
We may regard this as a four-class problem, or as two two-class
problems.
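For instance, the four-class view crosses the two binary labels; a quick check of the resulting class sizes:

```
## Four classes formed by crossing species (sp) and sex
table(interaction(crabs$sp, crabs$sex))
```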

Our first problem does not involve any computation. If you want to write R code to solve the problem, do so, but answer in prose first.

*Question 1*. On the basis of the boxplots in Figure 1, comment on the prospects for predicting species on the basis of RW. State a rule for computing the predictions. Describe how to assess the performance of your rule.

A simple approach to prediction involves logistic regression.

```
m1 = glm(sp~RW, data=crabs, family=binomial)
summary(m1)
```

```
##
## Call:
## glm(formula = sp ~ RW, family = binomial, data = crabs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44908 0.82210 -4.195 2.72e-05 ***
## RW 0.27080 0.06349 4.265 2.00e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 277.26 on 199 degrees of freedom
## Residual deviance: 256.35 on 198 degrees of freedom
## AIC: 260.35
##
## Number of Fisher Scoring iterations: 4
```

*Question 2*. Write down the statistical model corresponding to the R expression above. How can we derive a classifier from this model?

*Question 3*. Perform the following computations. Discuss their interpretation. What are the estimated error rates of the two models? Is the second model, on the subset, better?

```
plot(predict(m1, type = "response"), crabs$sp)
table(predict(m1, type = "response") > 0.5, crabs$sp)
m2 = update(m1, subset = (sex == "F"))
table(predict(m2, type = "response") > 0.5, crabs$sp[crabs$sex == "F"])
```

Cross-validation is a technique that is widely used for
reducing bias in the estimation of predictive accuracy. If no precautions are taken,
bias can arise from *overfitting* a classification algorithm to a particular
dataset; the algorithm learns the classification "by heart", but performs poorly
when asked to generalise to new, unseen examples.
Briefly, in cross-validation the dataset is deterministically partitioned into
a series of training and test sets. The model is built
on each training set and evaluated on the corresponding test set,
and the accuracy measures are averaged over this series
of fits. Leave-one-out cross-validation consists of N
fits, with N training sets of size N-1 and N test sets
of size 1.
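As an illustration, leave-one-out cross-validation for the logistic model `sp ~ RW` fitted above can be coded directly in base R; a minimal sketch (plain loops, not the *MLInterfaces* machinery introduced next):

```
## Leave-one-out cross-validation of the logistic model sp ~ RW:
## N fits, each holding out a single record, which is then predicted.
n = nrow(crabs)
pred = character(n)
for (i in 1:n) {
  fit = glm(sp ~ RW, data = crabs[-i, ], family = binomial)
  p = predict(fit, newdata = crabs[i, , drop = FALSE], type = "response")
  pred[i] = ifelse(p > 0.5, "O", "B")
}
mean(pred != as.character(crabs$sp))  # LOO estimate of the misclassification rate
```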

First let us use `MLearn` from the
*MLInterfaces* package to fit a single logistic model.
`MLearn` requires you to specify an index set for training.
We use `c(1:30, 51:80)` to choose a training set of
size 60, balanced between the two species (because we know the
ordering of records). This procedure also requires you
to specify a probability threshold for classification.
We use a typical default of 0.5. If the predicted probability
of being “O” exceeds 0.5, we classify to “O”, otherwise to “B”.

```
library(MLInterfaces)
fcrabs = crabs[crabs$sex == "F", ]
ml1 = MLearn(sp ~ RW, fcrabs, glmI.logistic(thresh = 0.5),
             c(1:30, 51:80), family = binomial)
ml1
```

```
## MLInterfaces classification output container
## The call was:
## MLearn(formula = sp ~ RW, data = fcrabs, .method = glmI.logistic(thresh = 0.5),
## trainInd = c(1:30, 51:80), family = binomial)
## Predicted outcome distribution for test set:
## O
## 40
## Summary of scores on test set (use testScores() method for details):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7553 0.8861 0.9803 0.9355 0.9917 0.9997
```

`confuMat(ml1)`

```
## predicted
## given B O
## B 0 20
## O 0 20
```

*Question 4.* What does the report on `ml1` tell you about predictions with this model? Can you reconcile this with the results in model `m2`? [Hint: non-randomness of the selection of the training set is a problem.]

*Question 5.* Modify the `MLearn` call to obtain a predictor that is more successful on the test set.

Now we will illustrate cross-validation. First, we scramble the order of
records in the data frame so that sequentially formed groups are
approximately random samples.

```
set.seed(123)
sfcrabs = fcrabs[ sample(nrow(fcrabs)), ]
```

We invoke the `MLearn` method in two ways: first specifying a training
index set, then specifying a five-fold cross-validation in which each
iteration leaves out one consecutive block of 20 records, so that every
record is predicted exactly once.

```
sml1 = MLearn(sp ~ RW, sfcrabs, glmI.logistic(thresh = 0.5),
              c(1:30, 51:80), family = binomial)
confuMat(sml1)
```

```
## predicted
## given B O
## B 15 6
## O 8 11
```

```
smx1 = MLearn(sp ~ RW, sfcrabs, glmI.logistic(thresh = 0.5),
              xvalSpec("LOG", 5, function(data, clab, iternum) {
                  which(rep(1:5, each = 20) == iternum)
              }), family = binomial)
confuMat(smx1)
```

```
## predicted
## given B O
## B 36 14
## O 14 36
```

*Question 6.* Define clearly the difference between models `sml1` and `smx1` and state the misclassification rate estimates associated with each model.
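A misclassification rate can be read off a confusion matrix as the off-diagonal proportion; a minimal sketch (the helper `misclass` is our own shorthand, not part of *MLInterfaces*):

```
## Fraction of off-diagonal entries in a confusion matrix
misclass = function(cm) 1 - sum(diag(cm)) / sum(cm)
misclass(confuMat(sml1))  # held-out test set estimate for sml1
misclass(confuMat(smx1))  # five-fold cross-validated estimate for smx1
```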

*Question 7.* Interpret the following code, whose result is shown in Figure 2. Modify it to depict the pairwise configurations with different colors for the two crab sexes.

`pairs(crabs[,-c(1:3)], col=ifelse(crabs$sp=="B", "blue", "orange"))`