--- title: "6. Advanced Lab for R & Bioconductor - RNA-Seq Analysis" author: "Sonali Arora" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{6. Advanced Lab for R & Bioconductor - RNA-Seq Analysis} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), error=FALSE) ``` Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2 ## Advanced lab for Bioconductor - RNA-Seq Analysis This lab will walk you through an end-to-end RNA-Seq differential expression workflow, using `r Biocpkg("DESeq2")` along with other _Bioconductor_ packages. Note: a number of other _Bioconductor_ packages can also be used for statistical inference of differential expression at the gene level including `r Biocpkg("edgeR")`, `r Biocpkg("BaySeq")`, `r Biocpkg("DSS")` and `r Biocpkg("limma")`. ## Exercise Using the data from `r Biocpkg("airway")`, design and implement an end-to-end RNA-Seq differential expression analysis, using `r Biocpkg("DESeq2")` Steps include - - Load the data package - Create the *DESeqDataSet* from *SummarizedExperiment* - Run the Differential Expression Pipeline - Build the results table - Building some Diagnostic Plots/ Visualize Results ## Data for the analysis The data used in this Lab is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is: Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665). GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). For our analysis, we wil use data from the data package `r Biocpkg("airway")`. ```{r load-data} library("airway") data(airway) ``` ## Solutions ### Answer 1 : Load the Data The data stored inside `r Biocpkg("airway")` is a *SummarizedExperiment* object. ```{r play} library("airway") data(airway) se <- airway se ``` ### Answer 2 : Create the *DESeqDataSet* Once we have our fully annotated *SummarizedExperiment* object, we can construct a *DESeqDataSet* object from it, which will then form the starting point of the actual *DESeq2* package, described in the following sections. We add an appropriate design for the analysis. ```{r} library("DESeq2") dds <- DESeqDataSet(se, design = ~ cell + dex) ``` ### Answer 3 : Differential Expression Pipeline It will be convenient to make sure that `untrt` is the first level in the `dex` factor, so that the default log2 fold changes are calculated as treated over untreated (by default R will chose the first alphabetical level, remember: computers don't know what to do unless you tell them). The function *relevel* achieves this: ```{r} dds$dex <- relevel(dds$dex, "untrt") ``` Finally, we are ready to run the differential expression pipeline. With the data object prepared, the *DESeq2* analysis can now be run with a single call to the function *DESeq*: ```{r} dds <- DESeq(dds) ``` This function will print out a message for the various steps it performs. These are described in more detail in the manual page for *DESeq*, which can be accessed by typing `?DESeq`. Briefly these are: the estimation of size factors (which control for differences in the library size of the sequencing experiments), the estimation of dispersion for each gene, and fitting a generalized linear model. A *DESeqDataSet* is returned which contains all the fitted information within it, and the following section describes how to extract out results tables of interest from this object. ### Answer 4 : Build the results table Calling *results* without any arguments will extract the estimated log2 fold changes and *p* values for the last variable in the design formula. If there are more than 2 levels for this variable, *results* will extract the results table for a comparison of the last level over the first level. ```{r} (res <- results(dds)) ``` As `res` is a *DataFrame* object, it carries metadata with information on the meaning of the columns: ```{r} mcols(res, use.names=TRUE) ``` The first column, `baseMean`, is a just the average of the normalized count values, dividing by size factors, taken over all samples. The remaining four columns refer to a specific contrast, namely the comparison of the `trt` level over the `untrt` level for the factor variable `dex`. See the help page for *results* (by typing `?results`) for information on how to obtain other contrasts. The column `log2FoldChange` is the effect size estimate. It tells us how much the gene's expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene's expression is increased by a multiplicative factor of $2^{1.5} \approx 2.82$. We can also summarize the results with the following line of code, which reports some additional information ```{r} summary(res) ``` ### Answer 5 : Visualize Results A quick way to visualize the counts for a particular gene is to use the *plotCounts* function, which takes as arguments the *DESeqDataSet*, a gene name, and the group over which to plot the counts. ```{r plotcounts, fig.width=5, fig.height=5} topGene <- rownames(res)[which.min(res$padj)] plotCounts(dds, gene=topGene, intgroup=c("dex")) ``` ## References For a much detailed analysis see - [Case Study- How to build a SummarizedExperiment - airway dataset](http://bioconductor.org/packages/devel/data/experiment/vignettes/airway/inst/doc/airway.html) - [Differential Expression Lab](http://bioconductor.org/help/course-materials/2015/SeattleApr2015/C_DifferentialExpression.html#practical-rna-seq-gene-level-differential-expression) ## What to not miss at BioC2015 ! If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015 - _Differential expression, manipulation, and visualization of RNA-seq reads_ by Mike Love. (Session 1, Tuesday, 1:00pm -2:45pm) - _Automated NGS workflows with systemPipeR running on clusters or single machines, with a focus on VAR-seq_ by Thomas Girke. (Session 4, Tuesday, 3:15pm - 5:00pm) ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ```