Common Workflows, with an RNA-seq example

useR! 2014
Author: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora
Date: 30 June, 2014

Overview

RNASeq

ChIPSeq – Chromatin immunopreciptation

Variants

Structural

In depth: RNASeq

Work flow

Statistical challenges

RNASeq in parathyroid adenomas

This work flow derives from a more comprehensive example developed in the parathyroidSE package and the CSAMA 2014 workshop. It uses DESeq2 to estimate gene-level differential expression. Analysis picks up after counting reads aligning to Ensembl gene regions; counting can be accomplished using summarizeOverlaps(), as described in the parathyroidSE vignette.

The data is from Haglund et al., Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas, J Clin Endocrin Metab, Sep 2012. The experiment investigated the role of the estrogen receptor in parathyroid tumors. The investigators derived primary cultures of parathyroid adenoma cells from 4 patients. These primary cultures were treated with diarylpropionitrile (DPN), an estrogen receptor β agonist, or with 4-hydroxytamoxifen (OHT). RNA was extracted at 24 hours and 48 hours from cultures under treatment and control.

Start by loading the count data, which has been carefully curatd as a SummarizedExperiment object. Explore the object with accessors such as rowData(se), colData(se), exptData(se)$MIAME, and head(assay(se))

require("DESeq2")
## Loading required package: DESeq2
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
require("parathyroidSE")
## Loading required package: parathyroidSE
data("parathyroidGenesSE")
se <- parathyroidGenesSE
colnames(se) <- se$run

The first step is to add an experimental design and perform additional preparatory steps. The experiment contains several technical replicates, and while these could be incorporated in the model, here we simply pool them (pooling technical replicates is often appropriate for RNASeq data).

ddsFull <- DESeqDataSet(se, design = ~ patient + treatment)
ddsCollapsed <-
    collapseReplicates(ddsFull, groupby = ddsFull$sample, 
                       run = ddsFull$run)
dds <- ddsCollapsed[, ddsCollapsed$time == "48h"]
dds$time <- droplevels(dds$time)
dds$treatment <- relevel(dds$treatment, "Control")

The entire work flow is executed with a call to the DESeq() function. This includes library size factor estimation, dispersion estimates, model fitting, independent filtering, and formulating test statistics based on the specified design. Here we generate the results for a comparison between DPN and control.

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds, contrast = c("treatment", "DPN", "Control"))

The remaining steps provide some basic checks and visualizations of the data. We identify genes for which the adjusted (for multiple comparison) p value is less than 0.1, and order these by their log2 fold change.

resSig <- res[which(res$padj < 0.1 ),]
head(resSig[order(resSig$log2FoldChange),])
## log2 fold change (MAP): treatment DPN vs Control 
## Wald test p-value: treatment DPN vs Control 
## DataFrame with 6 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000163631     233.3        -0.9307    0.2842    -3.275 1.057e-03
## ENSG00000119946     151.6        -0.6902    0.1564    -4.412 1.022e-05
## ENSG00000041982    1377.0        -0.6859    0.1845    -3.717 2.018e-04
## ENSG00000155111     530.9        -0.6756    0.2109    -3.203 1.359e-03
## ENSG00000233705     198.6        -0.6727    0.1446    -4.651 3.301e-06
## ENSG00000091137    1143.5        -0.6544    0.1046    -6.254 3.991e-10
##                      padj
##                 <numeric>
## ENSG00000163631 5.685e-02
## ENSG00000119946 2.448e-03
## ENSG00000041982 1.986e-02
## ENSG00000155111 6.687e-02
## ENSG00000233705 1.053e-03
## ENSG00000091137 9.228e-07

An 'MA' plot represents each gene as a point, with average expression over all samples on the x-axis, and log2 fold change between treatment groups on the y-axis; highlighted values are genes with adjusted p values less than 0.1.

plotMA(res, ylim = c(-1, 1))

plot of chunk deseq-MA

The strategy taken by DESeq2 for dispersion estimation is summarized by the plotDispEsts() function. It shows (a) black per-gene dispersion estimates, (b) a red trend line representing the global relationship between dispersion and normalized count, © blue 'shrunken' values moderating individual dispersion estimates by the global relationship, and (d) blue-circled dispersion outliers with high gene-wise dispersion that were not adjusted.

plotDispEsts(dds, ylim = c(1e-6, 1e1))

plot of chunk deseq-dispest

A final diagnostic is a plot of the histogram of p values, which should be uniform under the null hypothesis; skew to the right may indicate batch or other effects not accommodated in the model

hist(res$pvalue, breaks=40, col="grey")

plot of chunk deseq-hist