---
title: "04: Practical: Organizing data with _SummarizedExperiment_"
author:
- name: Martin Morgan
affiliation: Roswell Park Comprehensive Cancer Center
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
%\VignetteIndexEntry{04: Practical: Organizing data with SummarizedExperiment}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
output:
BiocStyle::html_document:
number_sections: yes
toc: true
---
```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
options(width = 75)
```
# What is a `SummarizedExperiment`?
![Alt SummarizedExperiment](our_figures/SummarizedExperiment.png)
- `assay()`, `assays()`: A `matrix`-like or list of `matrix`-like
objects of identical dimension
- `matrix`-like: implements `dim()`, `dimnames()`, and 2-dimensional
`[`, `[<-` methods.
- rows: genes, genomic coordinates, etc.
- columns: samples, cells, etc.
- `colData()`: Annotations on each column, as a `DataFrame`.
- E.g., description of each sample
- `rowData()` and / or `rowRanges()`: Annotations on each row.
- E.g., `rowRanges()`: coordinates of gene / exons in transcripts /
etc.
- E.g., `rowData()`:_P_-values and log-fold change of each gene
after differential expresssion analysis.
- `metadata()`: List of unstructured metadata describing the overall
content of the object.
# Working with an existing `SummarizedExperiment` object
```{r, message = FALSE}
library(SummarizedExperiment)
```
The [airway][] experiment data package summarizes an RNA-seq
experiment investigating human smooth-muscle airway cell lines treated
with dexamethasone. Load the library and data set.
```{r}
library(airway)
data(airway)
airway
```
## Three main parts of a `SummarizedExperiment`
**Exercise** What are the `dim()` and `dimnames()` of the airway
object? What do the rows and columns correspond to in terms of the
original experiment? Hint: use `dplyr::glimpse()` to look at the first
few elements of each `dimnames`.
**Exercise** Use the functions `assay()`, `colData()`, `rowRanges()`
to extract the three major components of the `airway` data
set. `assay()` is large, so interogate it using `class()`, `dim()` and
`head()` rather than simply printing it to the screen. What do each of
these parts corresspond to in terms of the RNASeq experiment
summarized in this object?
## The matrix-like behavior of `SummarizedExperiment`
As a quick refresher, an _R_ `matrix`
```{r}
set.seed(123)
m <- matrix(
rnorm(12), nrow = 4, ncol = 3,
dimnames = list(letters[1:4], LETTERS[1:3])
)
m
```
has three main components to it's interface, `dim()`, `dimnames()`,
and two-dimensional `[`-subsetting.
```{r}
dim(m)
dimnames(m)
m[1:2, 2:1]
m[1:2,]
```
A tricky, and unfortunate, behavior is that subsetting a single row or
column of a matrix results, by default in the matrix being coerced to
a vector
```{r}
m[1,]
m[,1]
```
Avoid this behavior using the `drop = FALSE` argument.
```{r}
m[ 1, , drop = FALSE]
```
`SummarizedExperiment` implements the 'matrix-like' interface. This
means that it supports the functions `dim()`, `dimnames()`, and
two-dimensional `[`-subsetting.
**Exercise** Subset `airway` to contain the first 5 rows and first 4
columns. Check out the `colData()`, `rowRanges()`, and `assay()` of
the resulting object.
## Basic summaries of `assay()` values
Returning to our simple matrix
```{r}
m
```
It is straight-forward to transform a matrix, e.g., by adding 1
```{r}
m + 1
```
or applying a transformation
```{r}
abs(m + 1)
```
There are special functions for row- and column-wise operations that
are often performed on matrices (check out the [matrixStats][] package
for further mathematical operations), e.g.,
```{r}
colSums(m)
rowMeans(m)
```
[matrixStats]: https://cran.r-project.org/package=matrixStats
**Exercise** Extract the `assay()` component of `airway` and calculate
the column sums. What do these numbers represent?
**Exercise** Transform the `assay()` matrix by adding 1 and taking the
natural `log()`, then create a vector representing the average of each
row of this transformed matrix. What does this vector represent?
Visualize it in a histogram, density plot, or other representation.
## Subsetting `SummarizedExperiment`
We mentioned that the summarized experiment implements the matrix-like
two-dimensional `[` subsetting interface.
The `colData()` of `airway` contains information about experimental
design, for instance the cell line and dexamethasone treatment level
of each sample.
```{r}
colData(airway)
```
There's a short-cut to accessing each column, e.g.,
```{r}
airway$dex
```
**Exercies** Use this shortcut to quickly subset `airway` so that it
has only the `untrt` samples present. Verify that the `assay()` data
have also been subset correctly.
When working with a matrix, common scenario is to manipulate groups
of rows based on some criterion, e.g., if one wished to keep rows with
mean above 0, one might
```{r}
ridx <- rowMeans(m) > 0
m[ridx, , drop = FALSE]
```
**Exercise** Remove all the rows from `airway` that had no gene
expression across all samples. How many rows had no expression?
Re-calculate the histogram of log-transoformed average expression with
these rows excluded.
## The list-like interface of `GRangesList`
An _R_ `list()` is an object that contain different types of vectors,
including other lists. Here's a simple example
```{r}
l <- list(a = 1:5, b = month.abb)
```
The interface to a list allows for `length()`, `[` (to return a subset
of the original list) and `[[` (to extract a single element of the
list, either by name or by position). The elements of the list can,
but do not have to be, named.
```{r}
names(l)
length(l)
l[c(2, 1)]
l[2] # list of length 1, containing element 2 of original list
l[[2]] # element 2 of original list
```
One useful function is `lengths()`, which returns a vector of the
length of each element in the lsit
```{r}
lengths(l)
```
The [IRanges][], [S4Vectors][], and [GenomicRanges][] packages
implement several classes that implement this list-like interface; the
classes all end with name `*List`, e.g., `GRangesList`.
**Exercise** Extract the row ranges from `airway`, use `class()` to
discover the type of object. What does each element of the list
represent? What does the entire list represent?
```{r}
r <- rowRanges(airway)
```
**Exercise** Prove to yourself that the object implements a list-like
interface, supporting `length()`, `[` and `[[` subsetting.
**Exercise** Use `lengths()` to determine the number of exons in each
gene. How many genes are there with a single exon? What gene has the
largest number of exons?
## Provenance in `GRangesList`
`GRanges` / `GRangesList` objects contain provenance information about
the genome to which the ranges refer, accessible using `seqinfo()`.
**Exercise** Use `seqinfo(r)` to extract the information about the
sequences the genomic ranges are based on.
**Exercise** Note that the `genome` is always `NA`. Take a quick look
at the vignette `browseVignettes("airway")`, scaning down to the
'Aligning reads' section. We're told that the reference genome used
for alignments is "GRCh37". Update the row range to contain this
information
```{r}
genome(r) <- "GRCh37"
seqinfo(r)
```
Update the `rowRanges()` of `airway` with this more complete
information.
```{r}
rowRanges(airway) <- r
```
Note that `SummarizedExperiment` has convenience functions that
allow direct access to provenance
```{r}
seqinfo(airway)
```
**Exercise** Use `as()` to coerce the sequence info to a `GRanges`
object, and select the genomic range corressponding to chromosome 14.
```{r, message = FALSE}
library(dplyr) # %>%
```
```{r}
chr14 <-
seqinfo(r) %>%
as("GRanges") %>%
subset(seqnames == "14")
```
## Subsetting by overlaps
The `%over%` function returns TRUE for each range on the left of the
operator that overlaps a range on the right of the operator. Thus the
row ranges overlapping chromosome 14 are
```{r}
idx <- r %over% chr14
r[idx]
```
**Exercise** How many ranges of `r` overlap chromosome 14?
**Exercise** Use `idx` to subset `airway` to contain only genes on
chromosome 14.
**Exercise** We took a long route to get to this subset-by-overlap;
summarize our story with two lines of code that allow us to subset
`airway` by overlap.
# Constructing a `SummarizedExperiment` object 'by hand'
```{r, message = FALSE}
library(readr)
library(tibble)
```
```{r, eval = FALSE}
pdata_file <- file.choos() # airway-sample-sheet.csv
counts_file <- file.choose() # airway-read-counts.csv
```
```{r, echo = FALSE}
pdata_file <-
system.file(package="BiocIntro", "extdata", "airway-sample-sheet.csv")
counts_file <-
system.file(package="BiocIntro", "extdata", "airway-read-counts.csv")
```
**Exercise** Use `read_csv()` to import the sample sheet and
read counts.
```{r}
pdata <- read_csv(pdata_file)
counts <- read_csv(counts_file)
```
**Exercise** Assemble `SummarizedExperiment` from transposed `counts`
and `pdata`. The `"Run"` column needs to be made into rownames on both
the `pdata` and `counts` object.
```{r}
pdata <- column_to_rownames(pdata, "Run")
counts <- column_to_rownames(counts, "Run")
se <- SummarizedExperiment(t(counts), colData = pdata)
se
```
**Exercise** Following examples from the lectures, calculate total
library size and plot average log gene expression from you
newly-created `SummarizedExperiment`.
# Provenance
```{r}
sessionInfo()
```
[IRanges]: https://bioconductor.org/packages/IRanges
[S4Vectors]: https://bioconductor.org/packages/S4Vectors
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[MultiAssayExperiment]: https://bioconductor.org/packages/MultiAssayExperiment
[airway]: https://bioconductor.org/packages/airway