--- title: "2. Working With Data: `SummarizedExperiment`" author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{2. Working With Data: SummarizedExperiment} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(ALL) library(airway) }) ``` The material in this course requires R version 3.2 and Bioconductor version 3.2 ```{r configure-test} stopifnot( getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() == "3.2" ) ``` Your boss has been working on Acute Lymphocytic Lukemia (ALL) for many years. One data set consists of microarray gene expression values for 12625 genes in 128 different samples. Your boss would like to analyze different subsets of the data, and has given you a couple of tab-delimited files. One file (_ALLphenoData.tsv_) describes the samples, the other (_ALLassay.tsv_) contains pre-processed gene expression data. You are supposed to come up with a way to create the subsets your boss asks you for. You realize that you could read the data in to Excel and manipulate it there, but you're concerned about being able to do reproducible research and you're nervous about the bookkeeping errors that always seem to come up. So you think you'll give _Bioconductor_ a try... # Read the data in to _R_ Download the [ALLphenoData.tsv][] and [ALLassay.tsv][] files to the current workign directory, `getwd()`. ## Use `read.table()` to read _ALLphenoData.tsv_ ```{r read.table} fname = "ALLphenoData.tsv" ## use file.choose() to find the file pdata = read.table(fname) ``` Check out the help page `?read.delim` for input options, and explore basic properties of the object you've created, for instance... ```{r ALL-properties} class(pdata) colnames(pdata) dim(pdata) head(pdata) summary(pdata$sex) summary(pdata$cyto.normal) ``` Remind yourselves about various ways to subset and access columns of a data.frame ```{r ALL-subset} pdata[1:5, 3:4] pdata[1:5, ] head(pdata[, 3:5]) tail(pdata[, 3:5], 3) head(pdata$age) head(pdata$sex) head(pdata[pdata$age > 21,]) ``` ## Use `read.table()` to read the expression values ```{r exprs} fname <- "ALLassay.tsv" exprs <- as.matrix(read.table(fname, check.names=FALSE)) ``` Use `dim()` to figure out the number of rows and columns in the expression data. Use subscripts to look at the first few rows and columns `exprs[1:5, 1:5]`. What are the row names? Do the column names agree with the row names of the `pdata` object? What is the `range()` of the expression data? Can you create a histogram (hint: `hist()`) of the data? What is `plot(density(exprs))`? Can you use `plot()` and `lines()` to plot the density of each sample, in a single figure? # Make a _SummarizedExperiment_ object You could work with the matrix and data frame directly, but it is better to put these related parts of the data into a single object, a _SummarizedExperiment_. Load the appropriate _Bioconductor_ package ```{r SummarizedExperiment} if (BiocInstaller::biocVersion() >= "3.2") { library(SummarizedExperiment) } else { library(GenomicRanges) } ``` and create a single _SummarizedExperiment_ object from the two parts of the data. Some _Bioconductor_ objects enhance the behavior of base _R_ objects; an example of this is `DataFrame()` ```{r make-SE} se <- SummarizedExperiment(exprs, colData=DataFrame(pdata)) ``` Explore the object, noting that you can retrieve the original elements, and can subset in a coordinated fashion. ```{r se-ops} head(colData(se)) assay(se)[1:5, 1:5] se$sex %in% "M" males <- se[,se$sex %in% "M"] males assay(males)[1:5, 1:5] ``` Use `vignette("SummarizedExperiment")` to read about other operations on _SummarizedExperiment_. # Show off your skills Quickly create the following subsets of data for your boss: 1. All women in the study. 2. All women over 40 3. An object `bcrabl` containing individuals with `mol.biol` belonging either to "BCR/ABL" or "NEG". Can you...? 1. Create a new column that simplifies the `BT` column (which lists different B- and T-cell subtypes) to contain just `B` or `T`, e.g., re-coding B, B1, B2, B3 and B4 to simply `B`, and likewise for `T`? 2. Use `aggregate()` to calculate the average age of males and females in the BCR/ABL and NEG treatment groups? 3. Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change? # Document your work Summarize the exercises above in a simple script. Can you figure out how to write a 'markdown' document that includes R code chunks, as well as text describing what you did, and figures and tables showing the results? # Resources Acknowledgements - Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum. - The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ``` [ALLphenoData.tsv]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/ALLphenoData.tsv [ALLassay.tsv]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/ALLassay.tsv