Common Workflows, with an RNA-seq example

useR! 2014
Author: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora
Date: 30 June, 2014

Overview

RNASeq

Counting – overlapping alignments
Known genes
Known transcripts
(Novel transcripts)

ChIPSeq – Chromatin immunopreciptation

(Peak calling)
Peak identification
Differential binding
Motif and other down-stream analysis

Variants

(Calling)
Annotating
Effect prediction
GWAS
Integration

Structural

Copy number
Rearrangements
…

In depth: RNASeq

Work flow

Experimental design
- Replication
- Batch effects
Sequencing & alignment
Differential expression
- Counting
- Statistical evaluation
Annotation & down-stream analysis

Statistical challenges

Designed experiments
Library size normalization
Counts (vs. RPKM)
- Negative binomial distribution
- Dispersion
Borrowing information – shrinkage
Independent filtering

RNASeq in parathyroid adenomas

This work flow derives from a more comprehensive example developed in the parathyroidSE package and the CSAMA 2014 workshop. It uses DESeq2 to estimate gene-level differential expression. Analysis picks up after counting reads aligning to Ensembl gene regions; counting can be accomplished using summarizeOverlaps(), as described in the parathyroidSE vignette.

The data is from Haglund et al., Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas, J Clin Endocrin Metab, Sep 2012. The experiment investigated the role of the estrogen receptor in parathyroid tumors. The investigators derived primary cultures of parathyroid adenoma cells from 4 patients. These primary cultures were treated with diarylpropionitrile (DPN), an estrogen receptor β agonist, or with 4-hydroxytamoxifen (OHT). RNA was extracted at 24 hours and 48 hours from cultures under treatment and control.

Start by loading the count data, which has been carefully curatd as a SummarizedExperiment object. Explore the object with accessors such as rowData(se), colData(se), exptData(se)$MIAME, and head(assay(se))

require("DESeq2")

## Loading required package: DESeq2
## Loading required package: Rcpp
## Loading required package: RcppArmadillo

require("parathyroidSE")

## Loading required package: parathyroidSE

data("parathyroidGenesSE")
se <- parathyroidGenesSE
colnames(se) <- se$run

The first step is to add an experimental design and perform additional preparatory steps. The experiment contains several technical replicates, and while these could be incorporated in the model, here we simply pool them (pooling technical replicates is often appropriate for RNASeq data).

ddsFull <- DESeqDataSet(se, design = ~ patient + treatment)
ddsCollapsed <-
    collapseReplicates(ddsFull, groupby = ddsFull$sample, 
                       run = ddsFull$run)
dds <- ddsCollapsed[, ddsCollapsed$time == "48h"]
dds$time <- droplevels(dds$time)
dds$treatment <- relevel(dds$treatment, "Control")

The entire work flow is executed with a call to the DESeq() function. This includes library size factor estimation, dispersion estimates, model fitting, independent filtering, and formulating test statistics based on the specified design. Here we generate the results for a comparison between DPN and control.

dds <- DESeq(dds)

## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

res <- results(dds, contrast = c("treatment", "DPN", "Control"))

The remaining steps provide some basic checks and visualizations of the data. We identify genes for which the adjusted (for multiple comparison) p value is less than 0.1, and order these by their log2 fold change.

resSig <- res[which(res$padj < 0.1 ),]
head(resSig[order(resSig$log2FoldChange),])

## log2 fold change (MAP): treatment DPN vs Control 
## Wald test p-value: treatment DPN vs Control 
## DataFrame with 6 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000163631     233.3        -0.9307    0.2842    -3.275 1.057e-03
## ENSG00000119946     151.6        -0.6902    0.1564    -4.412 1.022e-05
## ENSG00000041982    1377.0        -0.6859    0.1845    -3.717 2.018e-04
## ENSG00000155111     530.9        -0.6756    0.2109    -3.203 1.359e-03
## ENSG00000233705     198.6        -0.6727    0.1446    -4.651 3.301e-06
## ENSG00000091137    1143.5        -0.6544    0.1046    -6.254 3.991e-10
##                      padj
##                 <numeric>
## ENSG00000163631 5.685e-02
## ENSG00000119946 2.448e-03
## ENSG00000041982 1.986e-02
## ENSG00000155111 6.687e-02
## ENSG00000233705 1.053e-03
## ENSG00000091137 9.228e-07

An 'MA' plot represents each gene as a point, with average expression over all samples on the x-axis, and log2 fold change between treatment groups on the y-axis; highlighted values are genes with adjusted p values less than 0.1.

plotMA(res, ylim = c(-1, 1))

plot of chunk deseq-MA

The strategy taken by DESeq2 for dispersion estimation is summarized by the plotDispEsts() function. It shows (a) black per-gene dispersion estimates, (b) a red trend line representing the global relationship between dispersion and normalized count, © blue 'shrunken' values moderating individual dispersion estimates by the global relationship, and (d) blue-circled dispersion outliers with high gene-wise dispersion that were not adjusted.

plotDispEsts(dds, ylim = c(1e-6, 1e1))

plot of chunk deseq-dispest

A final diagnostic is a plot of the histogram of p values, which should be uniform under the null hypothesis; skew to the right may indicate batch or other effects not accommodated in the model

hist(res$pvalue, breaks=40, col="grey")

$plot of chunk deseq-hist$