Michael Love

June 8, 2015

(from PH525x notes and Barry, Nobel and Wright, 2008)

Consider the average *t*-statistic from \( N \) genes in a set \( G \):

\[ \bar{t}=\frac{1}{N} \sum_{i \in G} t_i \]

This statistic \( \bar{t} \) combines the information about DE from the set and might be a useful test statistic.

Under the null hypothesis, the \( t \) have mean 0. If the \( t \) are independent then \( \sqrt{N} \bar{t} \) has standard deviation 1 and is approximately normal:

\[ \sqrt{N} \bar{t} \sim N(0,1) \]

This comes from the following decomposition of the variance:

\[ \begin{eqnarray} \text{Var}(\bar{t}) &=& \frac{1}{N^2} \text{Var}(t_1 + \dots + t_N) \\ &=& \frac{1}{N^2}( \text{Var}(t_1) + \dots + \text{Var}(t_N) ) \\ &=& \frac{1}{N} \end{eqnarray} \]

Now consider the case that the test statistics \( t_i \) in a gene set are not independent but have correlation \( \rho \) *under the null hypothesis*.

\[ \bar{t}=\frac{1}{N} \sum_{i \in G} t_i \]

\[ \text{corr}(t_i, t_{i'}) = \rho, \quad i, i' \in G \]

The variance of the average *t*-statistics will be:

\[ \begin{eqnarray} \text{Var}(\bar{t}) &=& \frac{1}{N^2} \text{Var}( (1 \dots 1) (t_1 \dots t_N)' ) \\ &=& \frac{1}{N^2}(1 \dots 1) \begin{pmatrix} 1 & \rho & \dots & \rho & \rho \\ \rho & 1 & \rho & \dots & \rho \\ \dots & \dots & \dots & \dots & \dots \\ \rho & \rho & \dots & \rho & 1 \\ \end{pmatrix} (1 \dots 1) ' \\ &=& \frac{1}{N^2}\{N + (N-1) N \rho \} \\ &=& \frac{1}{N}\{1 + (N-1) \rho \} \end{eqnarray} \]

So the variance inflation factor (VIF) comparing the independent case to the case with correlation is:

\[ VIF = 1 + (N-1) \bar{\rho} \]

So the increased width (standard deviation) of the null distribution for a gene set with 20 genes and average correlation 0.1 will be:

```
sqrt(1 + 19 * 0.1)
```

```
[1] 1.702939
```

This VIF is approximately true also for testing the set statistics against the complement: the genes not in the set (see Barry, Nobel and Wright 2008).

Here, the expression of 5 samples vs 5 samples, no true difference but a correlation of gene expression.

If the test statistic \( T \) is a linear form of the data \( X \) (e.g. log fold change), then:

\[ \rho^T_{i,i'} = \rho^X_{i,i'} \]

For t-test, the relationship is monotone, approximately linear and:

\[ \rho^T_{i,i'} \approx \rho^X_{i,i'} \]

(Barry, Nobel and Wright, 2008)