June 14, 2017

# Outline for Introduction to Linear Models

• Multiple linear regression
• Continuous and categorical predictors
• Interactions
• Model formulae
• Design matrix
• Generalized Linear Models
• Poisson, Negative Binomial error distributions
• Zero inflation

Based on:

1. Love and Irizarry, Data Analysis for the Life Sciences, Chapter 5
2. Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models (Statistics for Biology and Health). Springer; 2005.

# Example: friction of spider legs

• (A) Barplot showing total claw tuft area of the corresponding legs.
• (B) Boxplot presenting friction coefficient data illustrating median, interquartile range and extreme values.

# Example: friction of spider legs

• Are the pulling and pushing friction coefficients different?
• Are the friction coefficients different for the different leg pairs?
• Does the difference between pulling and pushing friction coefficients vary by leg pair?

# Example: friction of spider legs

table(spider$leg,spider$type)
##
##      pull push
##   L1   34   34
##   L2   15   15
##   L3   52   52
##   L4   40   40
summary(spider)
##  leg        type        friction
##  L1: 68   pull:141   Min.   :0.1700
##  L2: 30   push:141   1st Qu.:0.3900
##  L3:104              Median :0.7600
##  L4: 80              Mean   :0.8217
##                      3rd Qu.:1.2400
##                      Max.   :1.8400
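The spider dataset above comes from Wolff & Gorb (2013) and is distributed with the dagdata package that accompanies Love and Irizarry's book; one way to load it is shown below (the GitHub path is an assumption based on where the book's labs host the file):

```r
# Load the spider leg friction data; URL assumed from the genomicsclass
# dagdata repository accompanying Love & Irizarry's book
url <- paste0("https://raw.githubusercontent.com/genomicsclass/",
              "dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv")
spider <- read.csv(url, skip=1)   # first line of the file is a comment
# ensure leg and type are factors (automatic in older versions of R)
spider$leg  <- factor(spider$leg)
spider$type <- factor(spider$type)
```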

# What are linear models?

• Linear models model a response variable $$y_i$$ as a linear combination of predictors, plus randomly distributed noise.
• A common use case is where $$y_i$$ are log-transformed microarray data
• predictors are treatment vs. control, male vs. female, amount of exposure, etc

# Multiple linear regression model

• Linear models can have any number of predictors
• Systematic part of model:

$E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p$

• $$E[y|x]$$ is the expected value of $$y$$ given $$x$$
• $$y$$ is the outcome, response, or dependent variable
• $$x$$ is the vector of predictors / independent variables
• $$x_p$$ are the individual predictors or independent variables
• $$\beta_p$$ are the regression coefficients

# Multiple linear regression model

Systematic plus random components of model:

$$y_i = E[y_i|x_i] + \epsilon_i$$

Assumptions of linear models: $$\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)$$

• Normal distribution
• Mean zero at every value of predictors
• Constant variance at every value of predictors
• Values that are statistically independent

# Continuous predictors

• Coding: as-is, or may be scaled to unit variance (which yields standardized regression coefficients)
• Interpretation for linear regression: a one-unit increase in the predictor corresponds to this much difference in the mean of the continuous outcome variable

# Binary predictors (2 levels)

• Coding: indicator or dummy variable (0-1 coding)
• Interpretation for linear regression: the increase or decrease in average outcome levels in the group coded “1”, compared to the reference category (“0”)
• e.g. $$E(y|x) = \beta_0 + \beta_1 x$$
• where x={ 1 if push friction, 0 if pull friction }
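Plugging in the estimates reported later in this deck for the L1 legs ($$\beta_0 = 0.92$$, $$\beta_1 = -0.51$$), the model reproduces the two group means:

$$E(y \mid \text{pull}) = \beta_0 = 0.92$$

$$E(y \mid \text{push}) = \beta_0 + \beta_1 = 0.92 - 0.51 = 0.41$$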

# Multilevel categorical predictors (ordinal or nominal)

• Coding: $$K-1$$ dummy variables for $$K$$-level categorical variable
• Comparisons with respect to a reference category, e.g. L1:
• L2={1 if $$2^{nd}$$ leg pair, 0 otherwise},
• L3={1 if $$3^{rd}$$ leg pair, 0 otherwise},
• L4={1 if $$4^{th}$$ leg pair, 0 otherwise}.
• R re-codes factors to dummy variables automatically.
• Note that factors can be ordered or unordered

# Model formulae in R

Model formulae tutorial

• regression functions in R such as aov(), lm(), glm(), and coxph() use a “model formula” interface.
• The formula determines the model that will be built (and tested) by the R procedure. The basic format is:

> response variable ~ explanatory variables

• The tilde means “is modeled by” or “is modeled as a function of.”

# Regression with a single predictor

Model formula for simple linear regression:

> y ~ x

• where “x” is the explanatory (independent) variable
• “y” is the response (dependent) variable.

Friction coefficient for leg type of first leg pair:

spider.sub <- spider[spider$leg=="L1", ]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
##
## Call:
## lm(formula = friction ~ type, data = spider.sub)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.33147 -0.10735 -0.04941 -0.00147  0.76853
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
## typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711
## F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14

# Regression on spider leg type

Regression coefficients for friction ~ type for first set of spider legs:

fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

##             Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)   0.9215      0.0383    24.08    0.0000
## typepush     -0.5141      0.0541    -9.50    0.0000

• How to interpret this table?
• Coefficients for (Intercept) and typepush
• Coefficients are t-distributed when assumptions are correct
• Standard Error is the standard deviation of the sampling distribution of the estimate

# Interpretation of coefficients

# Regression on spider leg position

Remember there are positions 1-4

fit <- lm(friction ~ leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

##             Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)   0.6644      0.0538    12.34    0.0000
## legL2         0.1719      0.0973     1.77    0.0784
## legL3         0.1605      0.0693     2.32    0.0212
## legL4         0.2813      0.0732     3.84    0.0002

• Interpretation of the dummy variables legL2, legL3, legL4 ?
# Regression with multiple predictors

Additional explanatory variables can be added as follows:

> y ~ x + z

Note that “+” does not have its usual arithmetic meaning, which would be achieved by:

> y ~ I(x + z)

# Regression on spider leg type and position

Remember there are positions 1-4

fit <- lm(friction ~ type + leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

##             Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)   1.0539      0.0282    37.43    0.0000
## typepush     -0.7790      0.0248   -31.38    0.0000
## legL2         0.1719      0.0457     3.76    0.0002
## legL3         0.1605      0.0325     4.94    0.0000
## legL4         0.2813      0.0344     8.18    0.0000

• this model still doesn’t represent how the friction differences between leg positions are modified by whether the leg is pulling or pushing

# Interaction (effect modification)

Image credit: http://personal.stevens.edu/~ysakamot/

# Interaction (effect modification)

Interaction is modeled as the product of two covariates:

$E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$

# Summary: model formulae

symbol   example          meaning
+        + x              include this variable
-        - x              delete this variable
:        x : z            include the interaction between these variables
*        x * z            include these variables and their interactions
^        (u + v + w)^3    include these variables and all interactions up to three-way
1        -1               intercept: delete the intercept

# Summary: types of standard linear models

lm( y ~ u + v)

• u and v factors: ANOVA
• u and v numeric: multiple regression
• one factor, one numeric: ANCOVA

• R does a lot for you based on your variable classes
• be sure you know the classes of your variables
• be sure all rows of your regression output make sense

# The Design Matrix

# The Design Matrix

Recall the multiple linear regression model:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i$$

• $$x_{ji}$$ is the value of predictor $$x_j$$ for observation $$i$$

# The Design Matrix

Matrix notation for the linear regression model (shown here with a single predictor):

$\,
\begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix}
=
\begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_N \end{pmatrix}
\begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix}
$

or simply:

$\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$

• The design matrix is $$\mathbf{X}$$
• which the computer will take as a given when solving for $$\boldsymbol{\beta}$$ by minimizing the sum of squares of the residuals $$\boldsymbol{\varepsilon}$$

# Choice of design matrix

• there are multiple possible and reasonable design matrices for a given study design
• the model formula encodes a default model matrix, e.g.:

group <- factor( c(1, 1, 2, 2) )
model.matrix(~ group)
##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

# Choice of design matrix

What if we forgot to code group as a factor?

group <- c(1, 1, 2, 2)
model.matrix(~ group)
##   (Intercept) group
## 1           1     1
## 2           1     1
## 3           1     2
## 4           1     2
## attr(,"assign")
## [1] 0 1

# More groups, still one variable

group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
##   (Intercept) group2 group3
## 1           1      0      0
## 2           1      0      0
## 3           1      1      0
## 4           1      1      0
## 5           1      0      1
## 6           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

# Changing the baseline group

group <- factor(c(1,1,2,2,3,3))
group <- relevel(x=group, ref=3)
model.matrix(~ group)
##   (Intercept) group1 group2
## 1           1      1      0
## 2           1      1      0
## 3           1      0      1
## 4           1      0      1
## 5           1      0      0
## 6           1      0      0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

# More than one variable

diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
##   (Intercept) diet2 sexm
## 1           1     0    0
## 2           1     0    0
## 3           1     0    1
## 4           1     0    1
## 5           1     1    0
## 6           1     1    0
## 7           1     1    1
## 8           1     1    1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"

# With an interaction term

model.matrix(~ diet + sex + diet:sex)
##   (Intercept) diet2 sexm diet2:sexm
## 1           1     0    0          0
## 2           1     0    0          0
## 3           1     0    1          0
## 4           1     0    1          0
## 5           1     1    0          0
## 6           1     1    0          0
## 7           1     1    1          1
## 8           1     1    1          1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"

# Design matrix to contrast what we want

• Spider leg friction example:
• The question of whether push vs. pull difference is different in L2 compared to L1 is answered by the term typepush:legL2 in a model with interaction terms:
fitX <- lm(friction ~ type * leg, data=spider)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9215 0.0327 28.21 0.0000
typepush -0.5141 0.0462 -11.13 0.0000
legL2 0.2239 0.0590 3.79 0.0002
legL3 0.3524 0.0420 8.39 0.0000
legL4 0.4793 0.0444 10.79 0.0000
typepush:legL2 -0.1039 0.0835 -1.24 0.2144
typepush:legL3 -0.3838 0.0594 -6.46 0.0000
typepush:legL4 -0.3959 0.0628 -6.30 0.0000

**What if we want to ask this question for L3 vs L2?**

# Design matrix to contrast what we want

What if we want to contrast…

typepush:legL3 - typepush:legL2

There are many ways to construct this design, one is with library(multcomp):

names(coef(fitX))
## [1] "(Intercept)"    "typepush"       "legL2"          "legL3"
## [5] "legL4"          "typepush:legL2" "typepush:legL3" "typepush:legL4"
C <- matrix(c(0,0,0,0,0,-1,1,0), 1)
L3vsL2interaction <- multcomp::glht(fitX, linfct=C) 
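The single row of $$C$$ puts weight $$-1$$ on the sixth coefficient and $$+1$$ on the seventh, so the tested linear combination is:

$$C\boldsymbol{\beta} = \beta_{\text{typepush:legL3}} - \beta_{\text{typepush:legL2}} = -0.3838 - (-0.1039) = -0.2799$$

which matches the estimate reported by summary() on the next slide.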

# Design matrix to contrast what we want

Is there a difference in pushing friction for L3 vs L2?

summary(L3vsL2interaction)
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = friction ~ type * leg, data = spider)
##
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.27988    0.07893  -3.546  0.00046 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

# Summary: applications of model matrices

• Major differential expression packages recognize them:
• LIMMA (VOOM for RNA-seq)
• DESeq2 for all kinds of count data
• EdgeR
• Can fit coefficients directly to your contrast of interest
• e.g.: what is the difference between push/pull friction for each spider-leg pair?

# Generalized Linear Models

• Linear regression is a special case of a broad family of models called “Generalized Linear Models” (GLM)
• This unifying approach allows a large set of models to be fit using maximum likelihood estimation (MLE) (Nelder & Wedderburn, 1972)
• Can model many types of data directly using appropriate distributions, e.g. Poisson distribution for count data
• Transformations of $$y$$ not needed

# Components of a GLM

$g\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}$

• Random component specifies the conditional distribution for the response variable
• doesn’t have to be normal
• can be any distribution in the “exponential” family of distributions
• Systematic component specifies linear function of predictors (linear predictor)
• Link [denoted by g(.)] specifies the relationship between the expected value of the random component and the systematic component
• can be linear or nonlinear

# Linear Regression as GLM

• Useful for log-transformed microarray data

• The model: $$y_i = E[y|x] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i$$

• Random component of $$y_i$$ is normally distributed: $$\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)$$

• Systematic component (linear predictor): $$\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}$$

• Link function here is the identity link: $$g(E(y | x)) = E(y | x)$$. We are modeling the mean directly, no transformation.

# Logistic Regression as GLM

• Useful for binary outcomes, e.g. Single Nucleotide Polymorphisms or somatic variants

• The model: $Logit(P(x)) = log \left( \frac{P(x)}{1-P(x)} \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}$

• Random component: $$y_i$$ follows a Binomial distribution (outcome is a binary variable)

• Systematic component: linear predictor $\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}$

• Link function: logit (log of the odds that the event occurs)

$g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right)$

$P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right)$
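As a sketch, R’s built-in plogis() is the inverse logit $$g^{-1}$$ (the linear predictor values below are made up for illustration):

```r
# plogis() is the inverse logit: exp(eta) / (1 + exp(eta))
eta <- c(-2, 0, 2)   # hypothetical linear predictor values
plogis(eta)          # probabilities; plogis(0) gives 0.5
all.equal(plogis(eta), exp(eta) / (1 + exp(eta)))
```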

# Log-linear GLM

The systematic part of the GLM is:

$log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + log(t_i)$

• Common for count data
• can account for differences in sequencing depth by an offset $$log(t_i)$$
• guarantees non-negative expected number of counts
• often used in conjunction with Poisson or Negative Binomial error models
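A minimal sketch of such a model in R uses glm() with an offset term; the counts, group labels, and library sizes here are made up for illustration:

```r
# Hypothetical example data: read counts, condition, and library size
counts <- c(12, 15, 40, 48)
group  <- factor(c("ctrl", "ctrl", "trt", "trt"))
depth  <- c(1.0e6, 1.1e6, 0.9e6, 1.2e6)

# Poisson log-linear model; log(depth) enters as an offset, i.e. a term
# with coefficient fixed at 1, adjusting for sequencing depth
fit <- glm(counts ~ group + offset(log(depth)), family=poisson)
coef(fit)   # grouptrt is the log rate ratio of trt vs ctrl
```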

# Poisson error model

$f(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!}$

• where $$f$$ is the probability of $$k$$ events (e.g. # of reads counted), and
• $$\lambda$$ is the mean number of events, so $$E[y|x]$$
• $$\lambda$$ is also the variance of the number of events
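The mean-variance equality can be checked by simulation (a quick sketch):

```r
set.seed(1)
k <- rpois(1e5, lambda=5)   # 100,000 simulated Poisson counts with mean 5
mean(k)                     # close to 5
var(k)                      # also close to 5: the variance equals the mean
```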

# Negative Binomial error model

• aka gamma–Poisson mixture distribution

$f(k, \lambda, \theta) = \frac{\Gamma(k + \frac{1}{\theta})}{k! \, \Gamma(\frac{1}{\theta})} \left(\frac{\theta \lambda}{1+\theta \lambda}\right)^k \left(1+\theta \lambda\right)^{-1/\theta} \quad\text{for }k = 0, 1, 2, \dotsc$

• where $$f$$ is still the probability of $$k$$ events (e.g. # of reads counted),
• $$\lambda$$ is still the mean number of events, so $$E[y|x]$$
• An additional dispersion parameter $$\theta$$ is estimated:
• $$\theta \rightarrow 0$$: Poisson distribution
• $$\theta \rightarrow \infty$$: Gamma distribution
• The Poisson model can be considered as nested within the Negative Binomial model
• A likelihood ratio test comparing the two models is possible
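A sketch of this likelihood ratio test in R, on simulated overdispersed counts (assumes the MASS package is installed; note that glm.nb()’s theta is the size parameter, i.e. the inverse of the dispersion $$\theta$$ on this slide):

```r
library(MASS)
set.seed(1)
y <- rnbinom(200, mu=10, size=2)         # simulated overdispersed counts
fit.pois <- glm(y ~ 1, family=poisson)   # Poisson (null) model
fit.nb   <- glm.nb(y ~ 1)                # Negative Binomial model
lrt <- 2 * (logLik(fit.nb) - logLik(fit.pois))
# The null (no overdispersion) lies on the boundary of the parameter
# space, so this chi-squared p-value is conservative
pchisq(as.numeric(lrt), df=1, lower.tail=FALSE)
```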

# Zero Inflation

• Can be used when count data have extra zeros generated by something other than the Poisson / Negative Binomial distribution
• Can think of it as a two-step model:
1. logistic model to determine whether count is zero or Poisson/NB
2. Poisson or Negative Binomial distribution for observations not set to zero by 1.
• Or as a mixture of:
1. a point mass at zero, and
2. a Poisson or Negative Binomial count distribution
• Warning: be aware of what your logistic model is
• in some implementations default includes same predictors as log-linear model, but
• more than intercept-only becomes hard to interpret and is probably not justified
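One common implementation is pscl::zeroinfl(); a sketch with an intercept-only logistic component, as recommended above (the data frame and variable names here are hypothetical):

```r
# Sketch (assumes the pscl package is installed); 'y', 'group', and
# 'mydata' are hypothetical. The part of the formula after '|' specifies
# the zero (logistic) model; '| 1' makes it intercept-only.
library(pscl)
fit <- zeroinfl(y ~ group | 1, dist="negbin", data=mydata)
summary(fit)
```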

# Additive versus multiplicative models

• Linear regression is an additive model
• e.g. for two binary variables $$\beta_1 = 1.5$$, $$\beta_2 = 1.5$$.
• If $$x_1=1$$ and $$x_2=1$$, this adds 3.0 to $$E(y|x)$$
• Logistic and log-linear models are additive on the transformed scale, hence multiplicative on the original scale:
• If $$x_1=1$$ and $$x_2=1$$, this adds 3.0 to $$log(\frac{P}{1-P})$$
• The odds $$\frac{P}{1-P}$$ increase about 20-fold: $$exp(1.5+1.5)$$ or $$exp(1.5) * exp(1.5)$$
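The multiplicative effect on the odds can be verified directly in R:

```r
exp(1.5 + 1.5)        # additive on the log-odds scale
exp(1.5) * exp(1.5)   # multiplicative on the odds scale: same value, ~20.1
```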

# Summary

• Generalized Linear Models extend linear regression to:
• binary $$y$$ (logistic regression)
• count $$y$$ (log-linear regression with e.g. Poisson or Negative Binomial error distributions)
• The basis for identifying differential expression / differential abundance
• Know the model formula interface, but
• use model matrices to directly fit coefficients that you want to interpret