# 1 Introduction

Gene expression technologies are useful for the study of transcriptomics and their associated profiles amongst biological samples of interest. The technology is used worldwide to examine complex relationships between gene expression (which we will refer to as the response variable when performing statistical modelling) and the variables that influence the expression (referred to as explanatory variables). From the resulting datasets, careful statistical analysis can be used to find relationships that are of biological interest through the choice of appropriate statistical models applied to the data. The modelling process requires the use of a design matrix (or model matrix) that has two roles: 1) it defines the form of the model, or structure of the relationship between genes and explanatory variables, and 2) it is used to store values of the explanatory variable(s)(Smyth 2004, 2005; Glonek and Solomon 2004). Although design matrices are fundamental concepts that are covered in many undergraduate mathematics and statistics courses, their specific and multi-disciplinary application to the analysis of genomic data types through the use of the R programming language adds several layers of complexity, both theoretically and in practice.

This article describes the appropriate design matrix set up for differential expression analyses specific to using the limma(Ritchie et al. 2015) software package, one of the most popular open-source software packages for such analysis worldwide. Our examples have been written for gene expression data, specifically with the assumption that the expression values are genewise log-count per million (log-CPM) measurements from an RNA-sequencing (RNA-seq) experiment. However, most of the concepts and R code covered in this article can also be applied to differential analyses of other genomic data types, including microarrays, ChIP-seq, ATAC-seq, BS-seq, Hi-C and proteomics. The main requirements are that the response data represents abundance on a log-scale and that each row corresponds to an appropriate genomic feature. Typically, the data table from an RNA-seq experiment contains the gene expression measurements for tens of thousands of genes and a small number of samples (usually no more than 10 or 20, although much larger sample sizes are possible). In the modelling process, a single design matrix is defined and then simultaneously applied to each and every gene in the dataset. Rather than demonstrating the application of design matrices across multiple genes, where the modelling concepts are consistent between genes, we simply describe the process for a single gene in our examples. This allows us to illustrate clearly differences between varying models and the implications of adding or removing model parameters.

The article begins by introducing the basic concepts associated with design and contrast matrices. We cover common experimental designs used in genomics research, and move onto more complex study designs as we progress through the sections. We have approached our work from a practical stand-point, with a focus on the R programming inputs and outputs, accompanied by associated plots to illustrate the observed data that we begin with and the fitted models that are produced from a graphical perspective. By omitting most of the theory associated with design matrices, our article allows readers from various backgrounds to gain a better understanding of design matrices, without having statistics as a prerequisite. To enable readers to select the most appropriate design matrix set up for their study design, we also discuss the interpretation of the models and the differences between them.

In each of our examples, we will explicitly display the observed data and include the R code for associated design and contrast matrices that are used in the modelling process. This allows readers to quickly grasp modelling concepts and to apply the R code in their own datasets. Each example is also accompanied by a figure displaying the design matrix and both a written and graphical representation of the statistical model. Whilst the complete data analysis process, from pre-processing data to variance modelling and parameter estimation is not discussed in this article, the design matrices we describe can be implemented in conjunction with the “RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR” differential expression workflow article(Law et al. 2016) for an RNA-seq analysis beginning with a table of counts.

Other complementary work focusing on design matrices includes that of the CRAN codingMatrices package vignette(Venables 2018), which describes the theoretical aspects of design matrices, and the ExploreModelMatrix software package(Soneson et al. 2020), which allows interactive exploration of design matrices for specified explanatory variables. Although not focusing purely on design matrices, the user’s guides for the limma and edgeR(Robinson, McCarthy, and Smyth 2010; McCarthy, Chen, and Smyth 2012) software packages also contain many example case studies for different experimental designs.

# 2 Basic models

## 2.1 Background

In this section, we outline the general form of some basic models and introduce terminology that will be used in the remainder of the article. The concept of model equations and associated graphical illustrations for fitted models are also introduced here.

## 2.2 Regression model for covariates

To begin with, let us consider two types of explanatory variables: covariates and factors. Covariates contain numerical values that are quantitative measurements associated with samples in the experiment. These can be the age or weight of an individual, or other molecular or cellular phenotypes on a sample, such as measurements obtained from a polymerase chain reaction (PCR) experiment or fluorescence activated cell sorting (FACS). For covariates, it is generally of interest to know the rate of change between the response and the covariate, such as “how much does the expression of a particular gene increase/decrease per unit increase in age?”. We can use a straight line to model, or describe, this relationship, which takes the form of $\begin{equation*} \text{expression} = \beta_0 + \beta_1 \text{age} \end{equation*}$ where the line is defined by $$\beta_0$$ the y-intercept and $$\beta_1$$ the slope (Figure 1). In this model, the age covariate takes continuous, numerical values such as 0.8, 1.3, 2.0, 5.6, and so on. We refer to this model generally as a regression model, where the slope indicates the rate of change, or how much gene expression is expected to increase/decrease by per unit increase of the covariate. The y-intercept and slope of the line, or the $$\beta$$s ($$\beta_0$$ and $$\beta_1$$), are referred to as the model parameters. The true values of the parameters are unknown, but are estimated in the modelling process. A positive estimate of a model parameter indicates that an explanatory variable has a positive influence on gene expression (an increasing slope), whilst a negative value indicates that the explanatory variable has a negative influence on gene expression (a decreasing slope). In some cases, one may convert the age covariate into a factor by categorising the smaller values as “young” and larger values as “mature”, and instead use the models described below.

## 2.3 Means model for factors

Factors are categorical variables or classifiers associated with samples in the experiment. They are often separated into those that are of a biological nature (e.g. disease status, genotype, treatment, cell-type) and those that are of a technical nature (e.g. experiment time, sample batch, handling technician, sequencing lane). Unique values within a factor are referred to as levels. For example, genotype as a factor may contain two levels, “wildtype” and “mutant”. Here, it is generally of interest to determine the expected or mean gene expression for each state or level of the factor. The relationship between gene expression and the factor can be described, or modelled, in the form of $\begin{equation*} \text{expression} = \beta_1 \text{wildtype} + \beta_2 \text{mutant} \end{equation*}$ where $$\beta_1$$ represents the mean gene expression for wildtype, and $$\beta_2$$ represents the mean gene expression for mutant (Figure 1). Unlike that of the age covariate which can take on any non-negative numerical value in the model, the levels of genotype can only take on the values of 0 or 1. For example, wildtype is equal to 1 (and mutant is equal to 0) when determining the expected expression for the wildtype group, such that $\begin{equation*} \text{expression} = \beta_1 \qquad \text{for wildtype.} \end{equation*}$ Similarly, mutant is equal to 1 (and wildtype is equal to 0) when determining the expected expression for the mutant group, such that $\begin{equation*} \text{expression} = \beta_2 \qquad \text{for mutant.} \end{equation*}$ Notice that wildtype and mutant “take turns” in taking on the 0 and 1 values. This is because the categorisation of samples as wildtype or mutant are mutually exclusive, where a sample cannot be both wildtype and mutant, or neither wildtype nor mutant. The model estimates expected gene expression as $$\beta_1$$ or $$\beta_2$$, where $$\beta_1$$ is calculated as the mean of observed expression values in wildtype, and $$\beta_2$$ is calculated as the mean of observed expression values in mutant. In other words, the $$\beta$$s (or model parameters) are estimated as the mean of each level in the genotype factor, as depicted in Figure 1 as distinct solid lines.

Each of the horizontal lines in Figure 1 are defined by their y-intercept (and a slope of 0), and are themselves regression models. We, however, will refer specifically to models of this type as a means model since the model parameters represent the group means. This also allows us to differentiate these models from the general regression models applied to covariates where the y-intercept and slope can both be non-zero. As noted for covariates, the true values for the model parameters are unknown but estimable. Whilst the expected expression of each factor level is informative, it is often the difference in expression between levels that are of key interest, e.g. “what is the difference in expression between wildtype and mutant?”. These differences are calculated using linear combinations of the parameters (a fancy way to say that we multiply each parameter by a constant) which we refer to as contrasts. For example, a contrast of $$(1,-1)$$ calculates $$\beta_1-\beta_2$$, the difference in means between wildtype and mutant.

## 2.4 Mean-reference model for factors

An alternative parameterisation of the means model directly calculates the gene expression difference between mutant and wildtype. It does so by using one of the levels as a reference. Such a model is parameterised for the mean expression of the reference level (e.g. wildtype), and the rest of the levels are parameterised relative to the reference (e.g. the difference between mutant and wildtype). The relationship between gene expression and genotype is modelled in the form of $\begin{equation*} \text{expression} = \beta_1 + \beta_2 \text{mutant} \end{equation*}$ where $$\beta_1$$ represents the mean gene expression for wildtype, and $$\beta_2$$ is the difference between means of mutant and wildtype (Figure 1). Here, mutant in the equation takes the value of 0 when determining the expected expression for the wildtype group, such that $\begin{equation*} \text{expression} = \beta_1 \qquad \text{for wildtype.} \end{equation*}$ On the other hand, mutant is equal to 1 when determining the expected expression for the mutant group, such that $\begin{equation*} \text{expression} = \beta_1 + \beta_2 \qquad \text{for mutant.} \end{equation*}$ Expected gene expression for wildtype is represented directly by the first model parameter, $$\beta_1$$, and depicted as a solid line in Figure 1. Whilst the expected gene expression for mutant is calculated as the sum of both parameters, and represented by a dashed line in Figure 1. Like the means model, the model demonstrated here is a regression model in itself. We, however, refer to this model specifically as a mean-reference model to distinguish it from the general model form that we use for covariate explanatory variables. The means model and the mean-reference model are equivalent models that differ in parameterisation, such that the form of the model is different but one could obtain equivalent values for the expected gene expression of wildtype and mutant from both models.

## 2.5 Terminology

The terminology and concepts covered in this section are summarised in the table below, in the context of modelling gene expression data. The table also extends to some definitions and descriptions covered later in the article, and is a useful resource to refer to from time-to-time.