--- title: "Reproducible Research in R / Bioconductor" author: "Martin Morgan" date: "October 31, 2016" output: html_document vignette: > %\VignetteIndexEntry{Reproducible Research in R / Bioconductor} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, echo=FALSE} suppressPackageStartupMessages({ library(DESeq2) library(org.Hs.eg.db) }) colDataFile <- system.file(package="Sydney2016", "extdata", "airway-colData.tab") assayFile <- system.file(package="Sydney2016", "extdata", "airway-assay.tab") ``` # Introduction This workshop uses data from a gene-level RNA-seq experiment involving airway smooth muscle cells; details are provided in the vignette accompanying the [airway][airway] package. The original data is from Himes et al., "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: [24926665][] GEO: [GSE52778][]. From the Abstract: "Using RNA-Seq [...] we characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18 hours)." We join the analysis after the sequencing, read aligment, and summary of aligned reads to a table of counts of reads overlapping regions of interest (genes) in each sample. We focus on a subset of the experiment, with 4 cell lines each treated with dexamethasone or a control. [airway]: https://bioconductor.org/packages/airway [24926665]: (http://www.ncbi.nlm.nih.gov/pubmed/24926665) [GSE52778]: (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778) ## Setup (not necessary during the workshop) We'll use two packages from _Bioconductor_. These depend in turn on several other packages. The packages and their dependencies are already installed on the Amazon Machine Instance (AMI) used in the course. For your own computers after the course, install packages with ```{r dependencies, eval=FALSE} source("https://bioconductor.org/biocLite.R") biocLite(c("DESeq2", "org.Hs.eg.db")) ``` Installation needs to be performed once per computer, not every time the packages are used. We use two data files in the analysis. The data files are in the [Sydney2016][1] github repository. Install the github repository with ```{r Sydney2016, eval=FALSE} biocLite("Bioconductor/Syndey2016") ``` Once the package is installed, the location of the files (to be used in `file.choose()`, below) is given by ```{r Sydney2016-filepaths, eval=FALSE} system.file(package="Sydney2016", "extdata") ``` [1]: https://github.com/Bioconductor/Sydney2016 # Data input The first challenge is to input the data. We start with the 'phenotypic' data, describing the samples used in the experiment. The data is a simple table of 8 rows and several columns; it could be created in Excel and exported as a tab-delimited file. Find the location of the file on your Amazon machine instance ```{r, eval=FALSE} colDataFile <- file.chooose() # find 'airway-colData.tab' ``` and read the data in to _R_ using the `read.table()` function. The data is small enough to be viewed in the _R_ session by typing the name of the variable) or in _RStudio_ (by using `View()` or double-clicking on the variable in the 'Environment' tab). ```{r} colData <- read.table(colDataFile) colData ``` This should go smoothly; with real data one often needs to spend considerable time adjusting arguments to `read.table()` to account for the presence of a header, row names, comments, etc. The next challenge is to input the expression estimates. This is a matrix of rows representing regions of interest (genes) and columns representing samples. Entries in the matrix are the number of reads overlapping each region in each sample. It is important that the values are raw counts, rather than scaled measures such as FPKM. Find the file ```{r, eval=FALSE} assayFile <- file.chooose() # find 'airway-assay.tab' ``` Input the data and use `head()` to view the first few rows of the data. ```{r} assay <- read.table(assayFile) head(assay) ``` # Exploration Calculate the 'library size' (total number of mapped reads) of each sample using `colSums()` ```{r} colSums(assay) ``` Create a density plot of the average asinh-transformed (asinh is log-like, expect near zero) read counts of each gene using the following series of commands. ```{r} plot(density(rowMeans(asinh(assay)))) ``` Multi-dimensional scaling (MDS) is a dimensionality reduction method that takes vectors in n-space and projects them into two (or more) dimensions. Use the `dist()` function to calculate the (Euclidean) distance bewteen each sample, and the base _R_ function `cmdscale()` to perform MDS on the distance matrix. We can use `plot()` to visualize the results and see the approximate location of each of the 8 samples. Use the argument `col` to color the points based on cell line (`colData$cell`) or experimental treatment `colData$dex`. ```{r} d <- dist(t(asinh(assay))) plot(cmdscale(d), pch=19, cex=2) plot(cmdscale(d), pch=19, cex=2, col=colData$cell) plot(cmdscale(d), pch=19, cex=2, col=colData$dex) ``` Note that cell lines are relatively similar to one another. This suggests that cell line should be used as a covariate in subsequent analysis. # Differential expression We will use the [DESeq2][] package for differential expression analysis; other choices are possible, notably [edgeR][] and [limma][]. [DESeq2]: https://bioconductor.org/packages/DESeq2 [edgeR]: https://bioconductor.org/packages/edgeR [limma]: https://bioconductor.org/packages/limma The analysis starts by providing the expression count data, a description of the experiment, and a 'model' that describes the statistical relationship we'd like to estimate. For our model and based in part on the exploratory analysis of the previous section, we suppose that count is determined cell line and dexamethasone treatment. We include cell line primarily as a covariate; our primary interest is in dexamethasone. ```{r} library(DESeq2) dds <- DESeqDataSetFromMatrix(assay, colData, ~ cell + dex) ``` The analysis is extremely straight-forward to invoke, but the calculations involve a number of sophisticated statistical issues, including: - Robust estimate of library size - Use of a negative binomial model to describe the relationship between experimental design and counts - Use of moderated test statistics that balance gene-level parameter estimates with experiment wide estimates - Data-driven approaches to outlier detection - Filtering and control for multiple testing The code is invoked as: ```{r} dds <- DESeq(dds) dds ``` The `DESeq()` function returns an object that can be used as a starting point for further analysis, for instance generating a 'top table' of differentially expressed genes, orderd by adjusted (for multiple comparison) _P_ values. ```{r} result <- results(dds) result ridx <- head(order(result$padj), 10) top = result[ridx,] top ``` # Comprehension There are many opportunities to place the statistical results into biological context. An initial step is to map the cryptic Ensembl gene identifiers used to label regions of interest to more famliar HGNC gene symbols. For this we use the [org.Hs.eg.db][] package, an example of a _Bioconductor_ 'annotation' package containing curated data derived from public-domain resources and updated semi-annually. The `mapId()` function maps between identifier types, in our case to SYMBOL gene ids from ENSEMBL ids. We add these to the top table results so that they can be processed together with the statistical results. ```{r} library(org.Hs.eg.db) top$Symbol <- mapIds(org.Hs.eg.db, rownames(top), "SYMBOL", "ENSEMBL") top ``` [org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db # Reproducibility The calculations here are made more reproducible by reporting the version of software used in the analysis, as follows: ```{r} sessionInfo() ```