Susan Holmes, Wolfgang Huber
2022-06-24
Design of High Throughput Experiments
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
(R. A. Fisher, Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14-17)
Goals for this Lecture
- Resource allocation and experimental design: an iterative
process.
- Dealing with the different types of variability; partitioning
variability
- Transformations
- Types of experiments, studies, …
- Power, sample size and efficiency.
- Things to worry about: dependencies, batch effects, unwanted
variation.
- Compression, redundancy and sufficiency
- Computational best practices
The Art of “Good Enough”
- Experimental design rationalizes the tradeoffs imposed by having
finite resources.
- Our measurement instruments have limited resolution and precision;
often we don’t know these at the outset and have to collect preliminary
data providing estimates.
- Sample sizes are limited for practical, economic, and sometimes
ethical reasons.
- We may only be able to observe the phenomenon of interest indirectly
rather than directly.
- Our measurements may be overlaid with nuisance factors over which we
have limited control.
- There is little point in prescribing unrealistic ideals: we need to
make pragmatic choices that are feasible.
Types of studies / experiments
Experiment
- everything is exquisitely controlled
Retrospective observational studies
- we take what we get, opportunistic; no control over study participants, assignment of important factors, confounding
- we did not design the experiments or studies ourselves, nor collect the data
- retrospective analysis of data that already happen to exist
Prospective, controlled studies
- e.g. clinical trials
- randomization, blinding
- ethical constraints (incl. money and time).
Illustration: experiment
A well-characterized cell line growing under laboratory conditions in defined media, at controlled temperature and atmosphere.
We administer a precise amount of a drug, and after 72 h we measure the activity of a specific pathway reporter.
Illustration: challenges with studies
We recruit 200 patients who have a disease and fulfill the inclusion criteria (e.g. age, comorbidities, mental capacity), and we ask them to take a drug every day at exactly 6 am. After 3 months, we acquire an MRI scan and many other biomarkers to see whether and how the disease has changed, and whether there were any side effects.
- People may forget to take the pill or take it at the wrong
time.
- Some may feel that the disease got worse and stop taking the
drug.
- Some may feel that the disease got better and stop taking the
drug.
- Some may lead a healthy life-style, others eat junk food.
- They have varying levels of disease to start with.
- And all of these factors may be correlated with each other in
unpredictable ways.
What to do about this?
Examples
- We modeled the sampling noise in RNA-Seq and 16S rRNA data with a
Gamma-Poisson distribution (see the simulation sketch below).
- We estimated sequencing depth bias with the library size
factors.
- We modeled sampling biases caused by the two different RNA-Seq
protocols in the pasilla data (single-, paired-end) by introducing a
blocking factor into our (generalized) linear model.
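To make the first two points concrete, here is a minimal simulation sketch (not part of the lecture; the true depth factors, gene means and dispersion are invented for illustration): counts for four samples are drawn from a Gamma-Poisson model with sample-specific sequencing depths, and DESeq2's median-of-ratios estimator, estimateSizeFactorsForMatrix, is used to recover those depths.
library("DESeq2")
set.seed(1)
ngenes = 5000
truesf = c(0.5, 1, 1.5, 2)                # assumed "true" sequencing-depth factors
mu     = rexp(ngenes, rate = 1/100)       # gene-wise mean expression levels
simcounts = sapply(truesf, function(s)
              rnbinom(ngenes, mu = s * mu, size = 1/0.1))  # Gamma-Poisson, dispersion 0.1
estsf  = estimateSizeFactorsForMatrix(simcounts)
# size factors are only defined up to a common scale (geometric mean 1),
# so rescale the true factors before comparing
round(rbind(true = truesf / exp(mean(log(truesf))), estimated = estsf), 2)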
What is a good normalization method?
library("DESeq2")
library("airway")
library("ggplot2")
library("dplyr")
library("gridExtra")
data("airway")
aw = DESeqDataSet(airway, design = ~ cell + dex) %>% estimateSizeFactors
sizeFactors(aw)
samples = c("SRR1039513", "SRR1039517")
myScatterplot = function(x) {
as_tibble(x) %>%
mutate(rs = rowSums(x)) %>%
filter(rs >= 2) %>%
ggplot(aes(x = asinh(SRR1039513),
y = asinh(SRR1039517))) + geom_hex(bins = 50) +
coord_fixed() +
geom_abline(slope = 1, intercept = 0, col = "orange") +
theme(legend.position = "none")
}
grid.arrange(
myScatterplot(counts(aw)),
myScatterplot(counts(aw, normalized = TRUE)),
ncol = 2)

- If the normalization is ‘off’, we can have increased variability
between replicates
- … and/or apparent systematic differences between different
conditions that are not real
- → false positives, false negatives
What do we want from a good normalization method:
- remove technical variation
- but keep biological variation
Possible figure of merit?
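One possibility, offered as a sketch rather than a prescription (it reuses the aw object and the two samples from the code above; the helper logRatioSpread is ad hoc): quantify how tightly the two samples agree, e.g. by the median absolute deviation of their gene-wise log-ratios. A normalization that removes technical variation should tighten this agreement; one that removes too much would also flatten genuine biological differences, which this simple summary cannot distinguish.
# ad-hoc figure of merit: spread of gene-wise log-ratios between the two samples
logRatioSpread = function(x) {
  keep = rowSums(x) >= 2
  mad(log2(x[keep, "SRR1039513"] + 1) - log2(x[keep, "SRR1039517"] + 1))
}
c(raw        = logRatioSpread(counts(aw)),
  normalized = logRatioSpread(counts(aw, normalized = TRUE)))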
Occam’s razor
William of Ockham
If one can explain a phenomenon without assuming this or that
hypothetical entity, there is no ground for assuming it.
One should always opt for an explanation in terms of the fewest
possible causes, factors, or variables.
Error models: Noise is in the eye of the beholder
The efficiency of most biochemical or physical processes involving
DNA polymers depends on their sequence content, for instance, on the
occurrence of long homopolymer stretches, palindromes, or the GC content.
These effects are not universal, but can also depend on
factors like concentration, temperature, which enzyme
is used, etc.
When looking at RNA-Seq data, should we treat GC content as noise or
as bias?
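A conceptual simulation may help (purely illustrative; the GC fractions, the efficiency curve and the dispersion are made up): a smooth, GC-dependent capture efficiency distorts the observed counts. If GC content is recorded, we can fit and remove the trend, i.e. treat it as bias; if it is not, the same variation is indistinguishable from noise.
set.seed(2)
ngenes = 2000
gc   = runif(ngenes, 0.3, 0.7)            # per-gene GC fraction (invented)
eff  = exp(-8 * (gc - 0.5)^2)             # hypothetical GC-dependent efficiency
mu   = rexp(ngenes, rate = 1/100)         # true expression levels
obs  = rnbinom(ngenes, mu = mu * eff, size = 100)
lfc  = log2((obs + 1) / (mu + 1))         # deviation of observed from expected
trend = loess(lfc ~ gc)                   # with GC known, model the systematic part ...
# ... what remains after removing the fitted trend is a smaller, random component
c(sd_ignoring_gc = sd(lfc), sd_after_gc_fit = sd(residuals(trend)))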
One person’s noise can be another’s bias
We may think that the outcome of tossing a coin
is completely random.
If we meticulously registered the initial conditions of the coin flip
and solved the mechanical equations, we could predict which side has a
higher probability of coming up: noise becomes bias.