# Basic robustness concepts of statistics

June 14, 2017

• simulation of test size and power
• effects of sample contamination
• resistant statistics: robustification of key statistical procedures
• measuring influence of observations in DESeq2

# Some code to enter in your workspace

Mystery functions?

ts_zpm = function(d) sqrt(length(d))*mean(d)/sd(d)
contam1 = function(x, slip=5) {x[1] = x[1]+slip; x}

# Try them out

set.seed(12345)
X = rnorm(50)
ts_zpm(X)
## [1] 1.157889
ts_zpm(contam1(X))
## [1] 1.479476
ts_zpm(contam1(X,100))
## [1] 1.082075

# The concept of the critical region

• Wolfgang: count events and check whether the count falls into a region prescribed to have low probability under the null hypothesis
• The probabilities of regions are prescribed by the binomial distribution
• We need probabilities and principles for choosing the critical region
• represent a specific direction of effect (one-sided?)
• based on theory or simulation?

# Assumptions underlying a test procedure

• Exact critical regions based on finite sample distribution of a statistic
• $$t$$, Wilcoxon
• the parent distribution must satisfy certain conditions
• data are a random sample from homogeneous population (iid)
• for $$t$$: parent is Gaussian
• for Wilcoxon: parent is continuous, finite variance (?)
• Approximate critical regions based on large sample theory
• how large is large enough?
• data are a random sample from homogeneous population (iid)
• Approximate critical regions based on simulation
• the generative distribution accurately represents variation underlying the experiment or observations

# Computing and reporting a t test for a simple hypothesis

set.seed(12345)
X = rnorm(50)
tst = t.test(X)
tst
##
##  One Sample t-test
##
## data:  X
## t = 1.1579, df = 49, p-value = 0.2525
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1320800  0.4912126
## sample estimates:
## mean of x
## 0.1795663

# Components of the result

str(tst)
## List of 9
##  $statistic : Named num 1.16 ## ..- attr(*, "names")= chr "t" ##$ parameter  : Named num 49
##   ..- attr(*, "names")= chr "df"
##  $p.value : num 0.253 ##$ conf.int   : atomic [1:2] -0.132 0.491
##   ..- attr(*, "conf.level")= num 0.95
##  $estimate : Named num 0.18 ## ..- attr(*, "names")= chr "mean of x" ##$ null.value : Named num 0
##   ..- attr(*, "names")= chr "mean"
##  $alternative: chr "two.sided" ##$ method     : chr "One Sample t-test"
##  $data.name : chr "X" ## - attr(*, "class")= chr "htest" # The structure of the t statistic: a ratio of measures of location and dispersion tst$statistic  # from R
##        t
## 1.157889
sqrt(50)*mean(X)/sd(X) # by hand
## [1] 1.157889

# Simulating the distribution of t under the null hypothesis: $$X \sim N(0,\sigma^2)$$

set.seed(12345)
simdist = replicate(10000,  ts_zpm( rnorm(50) ))
head(simdist)
## [1]  1.15788926  1.92818028 -0.08173422  0.82143100  0.39880144 -1.21706025

# Simulated data and theoretical density

hist(simdist, freq=FALSE)
lines(seq(-3,3,.01), dt(seq(-3,3,.01), 49))

# The one-sided p-value for $$\hat{t}$$ = 1.1579

simulated:

mean(simdist > 1.1579)
## [1] 0.123

theoretical (exact):

integrate(function(x) dt(x,49), 1.1579, Inf)$value ## [1] 0.1262589 # What if there is a contaminated observation? contsim = replicate(10000, ts_zpm( contam1( rnorm(50) ) ) ) critval_1sided = qt(.95, 49) mean(simdist > critval_1sided) # uncontaminated ## [1] 0.0483 mean(contsim > critval_1sided) # contaminated ## [1] 0.0946 # Can we fix this? library(parody) ## Loading required package: tools robust_t = function(x) { outchk = calout.detect(x, method="GESD") if (!is.na(outchk$ind[1])) x = x[-outchk$ind] sqrt(length(x))*mean(x)/sd(x) } set.seed(12345) contsim_r = replicate(10000, robust_t( contam1( rnorm(50) ) ) ) mean(contsim_r > critval_1sided) # robust test on contaminated ## [1] 0.0531 # Recap • the one-sided one sample t test for $$H_0: \mu = 0$$ involves • $$n$$, $$\bar{X}$$, $$s$$ to form the test statistic, and • the $$t$$ density to form critical values • simulation from the null distribution can be used to obtain an empirical p value • simulation from the contaminated distribution with a single observation shifted by $$5\sigma$$ leads to a Type I error rate (using the standard $$t$$ critical value for $$\alpha = 0.05$$) of 0.095 • robustification of the test statistic using calibrated outlier removal can restore (approximately) the nominal Type I error rate • can’t conclude too much from this example, of course … there is considerable theoretical literature on sensitivity of tests to contamination • the t test is described as ‘robust’ in many texts, referring to • insensitivity to violation of the assumption of Gaussian population • insensitivity to violation of the assumption of equal variances in the two-sample case # Power to reject $$H_0:\mu = 0$$ when $$\mu = 0.4$$ power.t.test(n=50, type="one.sample", alt="one.sided", delta=.4) ## ## One-sample t test power calculation ## ## n = 50 ## delta = 0.4 ## sd = 1 ## sig.level = 0.05 ## power = 0.8737242 ## alternative = one.sided # empirical mean( replicate(10000, ts_zpm(rnorm(50, .4))>qt(.95, 49)) ) ## [1] 0.8764 # Effect of contamination: a single large contaminant can slash power mean( replicate(10000, ts_zpm( contam1(rnorm(50, .4), slip=25))>qt(.95, 49)) ) ## [1] 0.5834 Exercises: • plot power against size of slip • show that calibrated outlier removal can restore power lost with contamination # Sample median is a resistant estimator of location We’ll use a series of outlier magnitudes (0:10) and summarize distributions of estimators set.seed(12345) mns <- sapply(0:10, function(o) median(replicate(5000, mean(contam1(rnorm(50),o))))) set.seed(12345) meds <- sapply(0:10, function(o) median(replicate(5000, median(contam1(rnorm(50),o)))))  # Note sensitivity of mean plot(0:10, mns, xlab="outlier magnitude", ylab="median of stat", pch=19) points(0:10, meds, pch=19, col="blue") legend(0, .1, pch=19, col=c("black", "blue"), legend=c("mean", "median")) # Sample MAD is a resistant estimator of dispersion set.seed(12345) sds <- sapply(0:10, function(o) median(replicate(5000, sd(contam1(rnorm(50),o))))) set.seed(12345) mads <- sapply(0:10, function(o) median(replicate(5000, mad(contam1(rnorm(50),o)))))  # Note sensitivity of SD plot(0:10, sds, xlab="outlier magnitude", ylab="median of stat", pch=19) points(0:10, mads, pch=19, col="blue") legend(0, 1.2, pch=19, col=c("black", "blue"), legend=c("SD", "MAD")) # Recap • Resistant estimators “peel away” extreme values • These estimators achieve “high breakdown bound” • if up to 50% of data are corrupted, median continues to estimate population median • if up to 25% of data are corrupted, MAD continues to estimate a scaled SD # Exercises • Generalize contam1() to help demonstrate the breakdown concept • Assess the robustness of size and power of rank-based tests (such as Wilcoxon’s signed rank test) to contamination by outliers. • Note that dsignrank, qsignrank are available. qsignrank(.95, 50) ## [1] 808 set.seed(12345) wilcox.test(rnorm(50, .4))$statistic
##   V
## 977

# Anscombe’s data

Show that the (x,y) pairs have identical

- marginal means
- marginal SDs
- correlation coefficients
- linear regressions of y on x

Use MASS::rlm to get a model for y3, x3 that fits the majority of points exactly

# Outliers in RNA-seq analysis

• DESeq2 has extensive discussion of Cook’s distance for identifying and reducing effects of apparent outliers
• Let’s get a feel for what Cook’s distance is: source(“shi.R”); docook()

# Simulating an RNA-seq experiment

suppressMessages({set.seed(12345)
library(DESeq2)
S1 = makeExampleDESeqDataSet(betaSD=.75)
D1 = DESeq(S1)
R1 = results(D1)})
summary(R1)
##
## out of 997 with nonzero total read count
## LFC > 0 (up)     : 99, 9.9%
## LFC < 0 (down)   : 95, 9.5%
## outliers [1]     : 2, 0.2%
## low counts [2]   : 211, 21%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

# Exercises

• Find the outliers labeled by DESeq2
• Are these really suspect?
• For the homogeneous simulation process demonstrated here, estimate the frequency of outlier labeling by the default rules
• Define a process for injecting aberrant counts and assess the accuracy of the default rules for identifying them. How do the rules contribute to validity and power of the basic testing procedure?

# Conclusion

• Robustness is commonly cited as a property of new methodologies
• the term is often used vaguely
• In statistics, there is robustness of validity (Type I error rate is maintained despite failure of certain assumptions) and robustness of efficiency (power does not decline in the presence of failure of certain assumptions)
• Resistance is a related concept: a fraction of arbitrarily aberrant values may be present, but their effect on estimation or inference is bounded
• Outlier labeling is carefully studied for generations; the outliers may be the most interesting points in your data
• Learn how to use simulation, with specific attention to realism and flexibility of implementation, so that you can explore a variety of scenarios