Today we’ll cover statistical concepts and tests commonly used in cancer research. The dataset we’ll access is a subset of the ALL expression data whose patient information we worked with in the first day’s material. In addition to that information we’ll access 1000 associated expression microarray features that present the highest variance across the patient samples. The data have been saved in a binary format to reduce file sizes.

The following continues to explore the ‘ALL’ phenotypic data from this earlier in the course. Start by loading the data, as before.

`path <- file.choose() # look for ALLphenoData.tsv`

```
stopifnot(file.exists(path))
pdata <- read.csv(path)
```

Let’s look a little more closely at patient information in the `pdata`

object:

`median(pdata$age)`

`## [1] NA`

The value `NA`

shows up because some of the `pdata$age`

values are NA. We can verify this by asking *R* if there are any NA values

`any(is.na(pdata$age))`

`## [1] TRUE`

`anyNA(pdata$age) # same, but more efficient`

`## [1] TRUE`

Consulting the help page for `?median`

suggests a solution – specify the argument `na.rm=TRUE`

. Explore other aspects of age, like `range()`

and `quantile()`

.

`median(pdata$age, na.rm=TRUE)`

`## [1] 29`

`range(pdata$age, na.rm=TRUE)`

`## [1] 5 58`

`quantile(pdata$age, na.rm=TRUE)`

```
## 0% 25% 50% 75% 100%
## 5.0 19.0 29.0 45.5 58.0
```

Some simple plots of patient ages – note the nested functions!

```
plot(pdata$age)
plot(sort(pdata$age))
sortedAge = sort(pdata$age)
?plot
```

**Exercise**: Plot the`sortedAge`

with markers at each data point and connect the points with red lines. You’ll need to use the graphics parameters`type="b"`

and`col="red"`

, see`?plot.default`

.**Exercise**: Plot one variable (e.g.,`age`

) as a function of another (e.g.,`sex`

). Since`sex`

is a factor,*R*chooses to create a box plot; does this make sense?`plot(age ~ sex, pdata)`

Histograms, and their display options:

```
hist(pdata$age)
?hist
hist(pdata$age, br=25)
```

Cross tables use `formulas`

to describe the relationship between the data they present:

`xtabs(~sex, data=pdata, exclude=NA)`

```
## sex
## F M
## 42 83
```

`xtabs(~sex + remission, data=pdata, exclude=NA)`

```
## remission
## sex CR REF
## F 27 7
## M 71 8
```

**Exercise:**How many hyperdiploid (`kinet`

) males (`sex`

) are refractory (`remission`

)?

Use `plot()`

to visualize the distribution of female and male ages in the `pdata`

data set.

`plot(age ~ sex, pdata)`

It looks like females are on average older than males. Use `t.test()`

to find out.

`t.test(age ~ sex, pdata)`

```
##
## Welch Two Sample t-test
##
## data: age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.022660 9.504142
## sample estimates:
## mean in group F mean in group M
## 35.16667 30.92593
```

Check out the help page for `t.test()`

`?t.test`

What are all those additional arguments to `t.test()`

? For example, what is the meaning of the `var.equal`

argument? Why are there 79.88 degrees of freedom?

`t.test(age ~ sex, pdata, var.equal=TRUE)`

```
##
## Two Sample t-test
##
## data: age by sex
## t = 1.6266, df = 121, p-value = 0.1064
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9207109 9.4021924
## sample estimates:
## mean in group F mean in group M
## 35.16667 30.92593
```

A t-test can also be viewed as an analysis of variance (ANOVA); analysis of variance is a form of linear model. Use `lm()`

to fit a linear model that describes how age changes with sex; the `anova()`

function summarizes the linear model in a perhaps more familiar ANOVA table.

`(fit <- lm(age ~ sex, pdata))`

```
##
## Call:
## lm(formula = age ~ sex, data = pdata)
##
## Coefficients:
## (Intercept) sexM
## 35.167 -4.241
```

`anova(fit)`

```
## Analysis of Variance Table
##
## Response: age
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 497.4 497.41 2.6459 0.1064
## Residuals 121 22747.4 187.99
```

What kinds of assumptions are being made in the linear model, e.g., about equality of variances? Try plotting `fit`

; what are the figures trying to tell you?

`plot(fit)`

`fit`

is an example of an *R* *object*. Find out it’s class

`class(fit)`

`## [1] "lm"`

`plot`

is an example of an *R* *generic*; it has different *methods* implemented for different classes of objects. Use `methods()`

to see available methods

`methods(plot)`

```
## [1] plot.HoltWinters* plot.TukeyHSD* plot.acf* plot.data.frame*
## [5] plot.decomposed.ts* plot.default plot.dendrogram* plot.density*
## [9] plot.ecdf plot.factor* plot.formula* plot.function
## [13] plot.hclust* plot.histogram* plot.isoreg* plot.lm*
## [17] plot.medpolish* plot.mlm* plot.ppr* plot.prcomp*
## [21] plot.princomp* plot.profile.nls* plot.raster* plot.spec*
## [25] plot.stepfun plot.stl* plot.table* plot.ts
## [29] plot.tskernel*
## see '?methods' for accessing help and source code
```

Look up the help page for the `plot`

generic, `lm`

method with

`?plot.lm`

Fitted models can be used in other functions, for instance to predict values for new data. Construct a `data.frame`

with a single column `sex`

with values `"M"`

and `"F"`

. Consult the help page for the `predict.lm`

method, and calculate the expected value of the fitted model for males and for females.

```
df = data.frame(sex=c("F", "M"))
predict(fit, df)
```

```
## 1 2
## 35.16667 30.92593
```

What do the predicted values correspond to in the `t.test()`

? Use `coefficients()`

to extract the coefficients of the fitted model.

`coefficients(fit)`

```
## (Intercept) sexM
## 35.166667 -4.240741
```

Interpret the `(Intercept)`

and `sexM`

coefficients in terms of female and male ages.

The article from which the `pdata`

object is derived states that “Although chromosome translocations and molecular rearrangements are relatively infrequent in T-lineage ALL, these events occur commonly in B-lineage ALL and reflect distinct mechanisms of transformation”. Let’s investigate this statement.

The relevant columns of data are summarized as

`summary(pdata[,c("BT", "cyto.normal")])`

```
## BT cyto.normal
## B2 :36 Mode :logical
## B3 :23 FALSE:69
## B1 :19 TRUE :24
## T2 :15 NA's :35
## B4 :12
## T3 :10
## (Other):13
```

Simplify the number of `BT`

levels by creating a new column with `B`

or `T`

. Do this by creating a `substr()`

ing from the original column, consisting of the first letter of each entry, and turning this vector into a `factor()`

. Assign it to a new column name

`pdata$BorT <- factor(substr(pdata$BT, 1, 1))`

Cross-tabulate the data

`xtabs(~ BorT + cyto.normal, pdata)`

```
## cyto.normal
## BorT FALSE TRUE
## B 56 17
## T 13 7
```

The data are qualitatively consistent with the statement that molecular rearrangements are more common in B-lineage ALL. Let’s test this with a chi-squared test

`chisq.test(pdata$BorT, pdata$cyto.normal)`

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: pdata$BorT and pdata$cyto.normal
## X-squared = 0.59622, df = 1, p-value = 0.44
```

Interpret the results. What about additional parameters documented on `?chisq.test`

?

The 128 samples we’ve been exploring are from a micro-array experiment. Microarrays consist of ‘probesets’ that interogate genes for their level of expression. In the experiment we’re looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing, the data we’ll work with has been pre-processed.

Start by finding the expression data file on disk.

```
path <- file.choose() # look for ALL-expression.csv
stopifnot(file.exists(path))
```

The data is stored in ‘comma-separate value’ format, with each probeset occupying a line, and the expression value for each sample in that probeset separated by a comma. Input the data using `read.csv()`

. There are three challenges:

- The row names are present in the first column of the data. Tell
*R*this by adding the argument`row.names=1`

to`read.csv()`

. - By default,
*R*checks that column names do not look like numbers, but our column names*do*look like numbers. Use the argument`check.colnames=FALSE`

to over-ride*R*’s default. `read.csv()`

returns a`data.frame`

. We could use a`data.frame`

to work with our data, but really it is a`matrix()`

– the columns are of the same type and measure the same thing. Use`as.matrix()`

to coerce the`data.frame`

we input to a`matrix`

.

```
exprs <- read.csv(path, row.names=1, check.names=FALSE)
exprs <- as.matrix(exprs)
```

We’ll also make use of the data describing the samples

```
path <- file.choose() # look for ALL-phenoData.csv
stopifnot(file.exists(path))
pdata <- read.csv(path, row.names=1)
```

We’ll add a column to `pdata`

, derived from the `BT`

column, to indicate whether the sample is B-cell or T-cell ALL.

```
pdata$BorT <- factor(substr(pdata$BT, 1, 1))
xtabs(~BorT + BT, pdata)
```

```
## BT
## BorT B B1 B2 B3 B4 T T1 T2 T3 T4
## B 5 19 36 23 12 0 0 0 0 0
## T 0 0 0 0 0 5 1 15 10 2
```

Some of the results below involve plots, and it’s convenient to choose pretty and functional colors. We use the RColorBrewer package; see colorbrewer.com

```
library(RColorBrewer)
divergent <- brewer.pal(11, "RdBu")
highlight <- brewer.pal(3, "Set2")[1:2]
```

‘divergent’ is a vector of colors that go from red (negative) to blue (positive). `highlight’ is a vector of length 2, light and dark green.

For more options see `?RColorBrewer`

and to view the predefined palettes `display.brewer.all()`

Verify that we have a matrix of appropriate `class()`

and `dim()`

ensions; take a peak at the first five rows and columns of the data.

`class(exprs)`

`## [1] "matrix"`

`dim(exprs)`

`## [1] 12625 128`

`exprs[1:5, 1:5]`

```
## 01005 01010 03002 04006 04007
## 1000_at 7.597323 7.479445 7.567593 7.384684 7.905312
## 1001_at 5.046194 4.932537 4.799294 4.922627 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
## 1004_at 5.925260 5.912780 5.893209 6.170245 5.615210
```

We’ll work with a subset of the data, specifically the 1000 probesets that show the most variance across samples. The variance of a single row can be calculated as

`var(exprs[1,])`

`## [1] 0.06768013`

The variance across each row can be calculated using the `apply()`

function. The first argument is the rectangular data structure that we’d like to summarize, the second argument is the dimension (`1`

for rows; `2`

for columns) over which we’d like to summarize, the third argument is the function that we’d like to apply to each row or column. Visualize the distribution of probeset variances using `hist()`

, perhaps on transformed data.

```
v <- apply(exprs, 1, var)
hist(sqrt(v))
```

Use `order(v, decreasing=TRUE)`

to determine how the elements of `v`

should be selected so that they are ordered from highest to lowest, visually verify that the values are actually ordered from highest to lowest…

```
o <- order(v, decreasing=TRUE)
plot(sqrt(v[o]))
```

Use `head()`

to select the indexes of the 1000 most variable probesets, and to create a subset of the `exprs`

matrix containing these probesets

`exprs1000 <- exprs[head(o, 1000), ]`

**Class exercise**: Use`apply()`

on`exprs1000`

to verify that we appear to have the 1000 most variable probesets.

`dist()`

and `hclust()`

We’d like to understand whether there are patterns in the data, e.g., consistent differences in overall patterns of expression between samples. One approach is *hierarchical clustering*

Calculate the Euclidean distance between samples using `dist()`

on the `t()`

ranspose of `exprs1000`

(check out `?dist`

for other distance metrics).

`d <- dist(t(exprs1000))`

Use `hclust()`

to cluster the data, and `plot()`

to visualize the clusters.

`plot(hclust(d))`

It’s informative to label the columns of `exprs1000`

with the ALL subtype (e.g., B1, B2, T1, etc). First, use `identical()`

to verify that the `colnames()`

of `exprs1000`

are the same as the `rownames()`

of `pdata`

.

`identical(colnames(exprs1000), rownames(pdata))`

`## [1] TRUE`

Now replace the column names of `exprs1000`

with `pdata$BT`

, and repeat the clustering.

```
colnames(exprs1000) <- pdata$BT
plot(hclust(dist(t(exprs1000))))
```

`colnames(exprs1000) <- rownames(pdata)`

It’s clear that overall gene expression patterns differ between B- and T-cell ALL.

The data supporting the dendrogram produced by `hclust()`

can be visualized as a `heatmap()`

. Here we calculate a correlation matrix between each sample; we will use this as the basis for our distance metric in ordering rows and columns of the heatmap.

`d <- dist(t(exprs1000))`

The following `heatmap()`

command performs hierarchical clustering of the distances, and adds a color bar to indicate whether the sample came from a B or T lineage sample.

```
color <- highlight[pdata$BorT]
heatmap(as.matrix(d), ColSideColor=color, col=divergent, symm=TRUE)
```

`cmdscale()`

and other approaches to dimension reductionWe have characterized our 128 samples in 1000 dimensions. Hierarchical clustering suggests that somehow there is pattern in this high-dimensional data. There are a number of statistical techniques that we can use to effectively reduce the dimensions of the data. The reduced-dimension representation might be used for visualization or downstream analysis.

One method for reducing the dimension of the data is classical *multi-dimensional scaling* (principle coordinates analysis). The idea is to take a measure of dissimilarity (like the distance matrix calculated above) and reduce it to a set of points (e.g., in 2 dimensions) such that the distance between points is approximately equal to the dissimilarity. This is very easy to implement in *R*, using the `cmdscale()`

function.

```
mds <- cmdscale(dist(t(exprs1000)))
plot(mds, col=color)
```

`mds`

is a 128 x 2 matrix, and the columns can be used as a reduced-dimension replacement for the 1000 expression values.

Principle components analysis is a similar approach to visualization and down-stream analysis in reduced dimensions; see `prcomp()`

.

`knn()`

Classification (‘supervised machine learning’) assigns ‘test’ samples into a group based on some measure of similarity to a set of ‘training’ samples. There are many varieties of classification. *knn* (‘k’ nearest neighbors) classification assigns the test sample to the majority-vote of it’s k nearest neighbors. `knn()`

and other common classifiers are defined in the `class`

package, so we start by adding that to our *R* session.

`library(class)`

We’ll start by creating a training set. We’ll choose the first half of the ‘B’ individuals and the first half of the ‘T’ individuals as members. First, we can get the index of each row using `seq_along()`

`seq_along(pdata$BorT)`

```
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## [70] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [93] 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
## [116] 116 117 118 119 120 121 122 123 124 125 126 127 128
```

These values can be split into groups (a list-of-integers) using `split()`

```
idxByType <- split(seq_along(pdata$BorT), pdata$BorT)
idxByType
```

```
## $B
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## [65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
##
## $T
## [1] 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [25] 120 121 122 123 124 125 126 127 128
```

We can ask about the number of each element of `idxByType`

by extracting the element and using the `length()`

function, e.g., `length(idxByType[["B"]])`

. We can then use `head()`

to select the first half of the elements

```
Bidx <- idxByType[["B"]]
BTrainIdx <- head(Bidx, length(Bidx) / 2)
```

Likewise for T:

```
Tidx <- idxByType[["T"]]
TTrainIdx <- head(Tidx, length(Tidx) / 2)
```

There are two fun things to notice, and these help us expand our knowledge of *R*.

Notice that we are applying the same sequence of operations to the B and to the T indexes.

*R*lets us write a**function**to capture this repeated operation`firstHalf <- function(x) { head(x, length(x) / 2) }`

and we can verify that this works

`firstHalf(idxByType[["B"]])`

`## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 ## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47`

`firstHalf(idxByType[["T"]])`

`## [1] 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111`

We’d really like to apply this to function to each element of the list

`idxByType`

. This can be done using`lapply()`

, where the first argument is a vector and the second argument is the function we’d like to apply`trainIdx <- lapply(idxByType, firstHalf) trainIdx`

`## $B ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 ## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ## ## $T ## [1] 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111`

OK, let’s compose our training set from the first half of each group using `cbind()`

to create a single matrix of training individuals

```
Bhalf <- exprs1000[, trainIdx[["B"]] ]
Thalf <- exprs1000[, trainIdx[["T"]] ]
train <- cbind(Bhalf, Thalf)
```

Our test set is the individuals not included in the training set. One way to select these individuals is to write a `secondHalf()`

function; another way is to subset `exprs1000`

to include the columns that are not in `train`

`test <- exprs1000[, ! colnames(exprs1000) %in% colnames(train) ]`

We’ll need to know the classification of the training and testing sets

```
cl_train <- pdata[colnames(train), "BorT"]
cl_test <- pdata[colnames(test), "BorT"]
```

`knn()`

is invoked with the training and test sets, and the known classification of the training set. The argument `k`

is the number of nearest neighbors to consider; the default is `k=1`

. The return value is the estimated classification of each test sample

`cl_test_est <- knn(t(train), t(test), cl_train)`

A characterization of how well the classifier performs is the ‘confusion matrix’, which summarizes the known versus estimated classification of the test individuals in a cross-tabulation

`xtabs(~cl_test + cl_test_est)`

```
## cl_test_est
## cl_test B T
## B 48 0
## T 0 17
```

Here, the classifier is performing very well – there are no incorrect (off-diagonal) assignments.

We’ve been using the known data to create a classifier and then to evaluate it’s performance. The decisions to use half of the B and half of the T samples, and to include the first half of each group in the ‘training’ group, were arbitrary, and it turns out that a better strategy is ‘leave-one-out’ cross-validation. This works by using all but one sample (e.g., the first sample) in the training group, and then classifying the sample that was left out. One then repeats for the second, third, … samples. *R* implements this for knn with the `knn.cv()`

function (again with a default `k=1`

nearest neighbors).

```
knn_loo <- knn.cv(t(exprs1000), pdata$BorT)
xtabs(~pdata$BorT + knn_loo)
```

```
## knn_loo
## pdata$BorT B T
## B 95 0
## T 0 33
```

As an exercise, use `knn.cv()`

to develop a classifier for T-cell subtypes; explore the consequences of different values of `k`

. Here’s a sketch of the approach:

Create a subset of the expression and phenotype data with only the T cell types; remember to use

`factor()`

to remove unused levels`keep <- pdata$BorT == "T" exprT <- exprs1000[, keep] classT <- factor(pdata$BT[keep])`

Use

`knn.cv()`

to perform the cross-validation, and`xtabs()`

to summarize the confusion matrix. Interpret the confusion matrix.`xtabs(~classT + knn.cv(t(exprT), classT))`

`## knn.cv(t(exprT), classT) ## classT T T1 T2 T3 T4 ## T 0 0 4 1 0 ## T1 0 0 1 0 0 ## T2 5 0 4 3 3 ## T3 1 0 2 6 1 ## T4 0 0 0 1 1`

Experiment with different values of

`k`

. Can you interpret the consequences of`k`

for the statistical principles of*precision*and*accuracy*?

There are very flexible ways of working with data, e.g., 1- and 2-dimensional subsetting with

`[`

, accessing columns of a data.frame with`$`

or with two-dimensional subsetting, creating`factor()`

s, etc.Univariate statistical functions, e.g.,

`mean()`

,`median()`

,`t.test()`

,`lm()`

,`chisq.test()`

.Multivariate statistical functions, e.g.,

`dist()`

,`hclust()`

,`heatmap()`

,`knn()`

,`knn.cv()`

.Packages provide specialized functionality, e.g., RColorBrewer, [class][].

Built-in functions can process rows or columns of a matrix (

`apply()`

) or list elements (`lapply()`

; see also`sapply()`

,`vapply()`

,`mapply()`

,`tappyl()`

).We can write our own functions!

The help system is our friend!