Vince Carey

June 14, 2017

Road map:

- simulation of test size and power
- effects of sample contamination
- resistant statistics: robustification of key statistical procedures
- measuring influence of observations in DESeq2

Mystery functions?

`ts_zpm = function(d) sqrt(length(d))*mean(d)/sd(d)`

`contam1 = function(x, slip=5) {x[1] = x[1]+slip; x}`

```
set.seed(12345)
X = rnorm(50)
ts_zpm(X)
```

`## [1] 1.157889`

`ts_zpm(contam1(X))`

`## [1] 1.479476`

`ts_zpm(contam1(X,100))`

`## [1] 1.082075`

- Wolfgang: count events and check whether the count falls into a region prescribed to have low probability under the null hypothesis
- The probabilities of regions are prescribed by the binomial distribution
- We need probabilities and principles for choosing the critical region
    - symmetric about the null value?
    - representing a specific direction of effect (one-sided)?
    - based on theory or simulation?
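A minimal sketch of the counting idea above, using the binomial distribution; the trial count `n = 20`, null probability `p0 = 0.5`, and target size `alpha = 0.05` are illustrative choices, not values from the notes.

```r
# One-sided critical region for an event count under H0: p = 0.5
n <- 20; p0 <- 0.5; alpha <- 0.05
crit <- qbinom(1 - alpha, n, p0) + 1        # smallest k with P(X >= k) <= alpha
crit                                        # reject H0 when the count reaches this
pbinom(crit - 1, n, p0, lower.tail = FALSE) # achieved size, at or below alpha
```

Because the binomial is discrete, the achieved size is typically below the nominal level.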

*Exact* critical regions based on the finite-sample distribution of a statistic (\(t\), Wilcoxon)

- the parent distribution must satisfy certain conditions
    - data are a random sample from a homogeneous population (iid)
    - for \(t\): parent is Gaussian
    - for Wilcoxon: parent is continuous (and symmetric about the null value, for the signed-rank test)

*Approximate* critical regions based on large-sample theory

- how large is large enough?
- data are a random sample from a homogeneous population (iid)

*Approximate* critical regions based on simulation

- the generative distribution must accurately represent the variation underlying the experiment or observations

```
set.seed(12345)
X = rnorm(50)
tst = t.test(X)
tst
```

```
##
## One Sample t-test
##
## data: X
## t = 1.1579, df = 49, p-value = 0.2525
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.1320800 0.4912126
## sample estimates:
## mean of x
## 0.1795663
```

`str(tst)`

```
## List of 9
## $ statistic : Named num 1.16
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 49
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.253
## $ conf.int : atomic [1:2] -0.132 0.491
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num 0.18
## ..- attr(*, "names")= chr "mean of x"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "mean"
## $ alternative: chr "two.sided"
## $ method : chr "One Sample t-test"
## $ data.name : chr "X"
## - attr(*, "class")= chr "htest"
```

`tst$statistic # from R`

```
## t
## 1.157889
```

`sqrt(50)*mean(X)/sd(X) # by hand`

`## [1] 1.157889`

```
set.seed(12345)
simdist = replicate(10000, ts_zpm( rnorm(50) ))
head(simdist)
```

`## [1] 1.15788926 1.92818028 -0.08173422 0.82143100 0.39880144 -1.21706025`

```
hist(simdist, freq=FALSE)
lines(seq(-3,3,.01), dt(seq(-3,3,.01), 49))
```

simulated:

`mean(simdist > 1.1579)`

`## [1] 0.123`

theoretical (exact):

`integrate(function(x) dt(x,49), 1.1579, Inf)$value`

`## [1] 0.1262589`

```
contsim = replicate(10000,
  ts_zpm( contam1( rnorm(50) ) ) )
critval_1sided = qt(.95, 49)
mean(simdist > critval_1sided) # uncontaminated
```

`## [1] 0.0483`

`mean(contsim > critval_1sided) # contaminated`

`## [1] 0.0946`

`library(parody)`

`## Loading required package: tools`

```
robust_t = function(x) {
  # drop observations flagged by calibrated outlier detection (GESD),
  # then compute the usual one-sample t statistic on what remains
  outchk = calout.detect(x, method="GESD")
  if (!is.na(outchk$ind[1])) x = x[-outchk$ind]
  sqrt(length(x))*mean(x)/sd(x)
}
set.seed(12345)
contsim_r = replicate(10000,
  robust_t( contam1( rnorm(50) ) ) )
mean(contsim_r > critval_1sided) # robust test on contaminated
```

`## [1] 0.0531`

- the one-sided one-sample t test for \(H_0: \mu = 0\) involves
    - \(n\), \(\bar{X}\), \(s\) to form the test statistic, and
    - the \(t\) density to form critical values

- simulation from the null distribution can be used to obtain an empirical p-value
- simulation from the contaminated distribution, with a single observation shifted by \(5\sigma\), leads to a Type I error rate (using the standard \(t\) critical value for \(\alpha = 0.05\)) of approximately 0.095
- robustification of the test statistic using calibrated outlier removal can restore (approximately) the nominal Type I error rate
- can’t conclude too much from this example, of course … there is considerable theoretical literature on sensitivity of tests to contamination
- the t test is described as ‘robust’ in many texts, referring to
    - insensitivity to violation of the assumption of a Gaussian population
    - insensitivity to violation of the assumption of equal variances in the two-sample case

```
power.t.test(n=50, type="one.sample",
  alt="one.sided", delta=.4)
```

```
##
## One-sample t test power calculation
##
## n = 50
## delta = 0.4
## sd = 1
## sig.level = 0.05
## power = 0.8737242
## alternative = one.sided
```

```
# empirical
mean( replicate(10000,
  ts_zpm(rnorm(50, .4))>qt(.95, 49)) )
```

`## [1] 0.8764`

```
mean( replicate(10000, ts_zpm(
  contam1(rnorm(50, .4), slip=25))>qt(.95, 49)) )
```

`## [1] 0.5834`

Exercises:

- plot power against size of slip
- show that calibrated outlier removal can restore power lost with contamination
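A sketch of the first exercise; `ts_zpm()` and `contam1()` are restated from above so the chunk is self-contained, and the grid of slips and the 2000 replicates are arbitrary choices made for speed.

```r
# Empirical power of the one-sided t test as the contaminating shift grows
ts_zpm  <- function(d) sqrt(length(d)) * mean(d) / sd(d)
contam1 <- function(x, slip=5) { x[1] <- x[1] + slip; x }
set.seed(12345)
slips <- seq(0, 25, 5)
pow <- sapply(slips, function(s)
  mean(replicate(2000, ts_zpm(contam1(rnorm(50, .4), slip=s)) > qt(.95, 49))))
plot(slips, pow, type="b", xlab="size of slip", ylab="empirical power")
```

The left end of the curve should sit near the uncontaminated power computed above; large slips inflate the sample SD and erode power.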

We’ll use a series of outlier magnitudes (`0:10`) and summarize the distributions of the estimators.

```
set.seed(12345)
mns <- sapply(0:10, function(o)
  median(replicate(5000, mean(contam1(rnorm(50),o)))))
set.seed(12345)
meds <- sapply(0:10, function(o)
  median(replicate(5000, median(contam1(rnorm(50),o)))))
```

```
plot(0:10, mns, xlab="outlier magnitude", ylab="median of stat",
pch=19)
points(0:10, meds, pch=19, col="blue")
legend(0, .1, pch=19, col=c("black", "blue"), legend=c("mean", "median"))
```

```
set.seed(12345)
sds <- sapply(0:10, function(o)
  median(replicate(5000, sd(contam1(rnorm(50),o)))))
set.seed(12345)
mads <- sapply(0:10, function(o)
  median(replicate(5000, mad(contam1(rnorm(50),o)))))
```

```
plot(0:10, sds, xlab="outlier magnitude", ylab="median of stat",
pch=19)
points(0:10, mads, pch=19, col="blue")
legend(0, 1.2, pch=19, col=c("black", "blue"), legend=c("SD", "MAD"))
```

- Resistant estimators “peel away” extreme values
- These estimators achieve a high “breakdown bound”
    - if up to 50% of the data are corrupted, the median continues to estimate the population median
    - if up to 50% of the data are corrupted, the MAD continues to estimate a (scaled) SD
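A quick numeric illustration of breakdown; the contamination fraction (40%) and the corrupting value (1e6) are arbitrary choices.

```r
# Corrupt 40 of 100 observations with an absurd value: the mean and SD are
# destroyed, while the median and MAD still track the clean sample
set.seed(12345)
x <- rnorm(100)
bad <- x
bad[1:40] <- 1e6
c(mean = mean(bad), median = median(bad))
c(sd = sd(bad), mad = mad(bad))
```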

- Generalize `contam1()` to help demonstrate the breakdown concept
- Assess the robustness of size and power of rank-based tests (such as Wilcoxon’s signed rank test) to contamination by outliers
    - note that `dsignrank` and `qsignrank` are available

`qsignrank(.95, 50)`

`## [1] 808`

```
set.seed(12345)
wilcox.test(rnorm(50, .4))$statistic
```

```
## V
## 977
```
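One way to start the rank-test exercise, reusing `contam1()` (restated here); the 2000 replicates are a speed-driven choice, and comparing the result with the contaminated t-test size found earlier is left to you.

```r
# Empirical size of the one-sided signed-rank test, clean vs contaminated
contam1 <- function(x, slip=5) { x[1] <- x[1] + slip; x }
pval <- function(x) wilcox.test(x, alternative="greater")$p.value
set.seed(12345)
size_clean  <- mean(replicate(2000, pval(rnorm(50)) < .05))
size_contam <- mean(replicate(2000, pval(contam1(rnorm(50))) < .05))
c(clean = size_clean, contaminated = size_contam)
```

Because ranks are bounded, a single aberrant observation can shift the signed-rank statistic by only a limited amount.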

Show that the (x,y) pairs have identical

- marginal means
- marginal SDs
- correlation coefficients
- linear regressions of y on x

Use `MASS::rlm` to get a model for `y3`, `x3` that fits the majority of points exactly
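A sketch, assuming the pairs are from the built-in `anscombe` data frame, whose third pair has ten collinear points plus one gross outlier; `rlm`’s default Huber estimate gets close to the majority line, and `psi = psi.bisquare` pushes further toward fitting it exactly.

```r
# Least squares vs a robust M-estimate on Anscombe's third (x, y) pair
library(MASS)
fit_ls  <- lm(y3 ~ x3, data = anscombe)
fit_rlm <- rlm(y3 ~ x3, data = anscombe)
coef(fit_ls)   # slope pulled toward the outlier
coef(fit_rlm)  # close to the line through the other ten points
round(sort(abs(residuals(fit_rlm))), 3)  # one residual stands far apart
```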

- DESeq2 has extensive discussion of Cook’s distance for identifying and reducing the effects of apparent outliers
- Let’s get a feel for what Cook’s distance is: `source("shi.R"); docook()`
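A small base-R illustration of the concept; the simulated regression and the injected +10 shift are invented for illustration.

```r
# Cook's distance: how much the fit changes when one observation is deleted
set.seed(12345)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
y[20] <- y[20] + 10            # inject one influential observation
fit <- lm(y ~ x)
cd <- cooks.distance(fit)
which.max(cd)                  # flags the injected observation
cd[20] > 4 / length(y)         # exceeds the common 4/n rule of thumb
```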

```
suppressMessages({
  set.seed(12345)
  library(DESeq2)
  S1 = makeExampleDESeqDataSet(betaSD=.75)
  D1 = DESeq(S1)
  R1 = results(D1)
})
```

`summary(R1)`

```
##
## out of 997 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 99, 9.9%
## LFC < 0 (down) : 95, 9.5%
## outliers [1] : 2, 0.2%
## low counts [2] : 211, 21%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
```

- Find the outliers labeled by DESeq2
- Are these really suspect?
- For the homogeneous simulation process demonstrated here, estimate the frequency of outlier labeling by the default rules
- Define a process for injecting aberrant counts and assess the accuracy of the default rules for identifying them. How do the rules contribute to validity and power of the basic testing procedure?

- Robustness is commonly cited as a property of new methodologies
    - the term is often used vaguely
- In statistics, there is robustness of validity (the Type I error rate is maintained despite failure of certain assumptions) and robustness of efficiency (power does not decline when certain assumptions fail)
- *Resistance* is a related concept: a fraction of *arbitrarily aberrant* values may be present, but their effect on estimation or inference is bounded
- Outlier labeling has been studied carefully for generations; the outliers may be the most interesting points in your data
- Learn how to use simulation, with specific attention to realism and flexibility of implementation, so that you can explore a variety of scenarios