--- title: "A.3 -- Statistics and Graphics" author: Martin Morgan
date: "11 - 12 September 2017" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{A.3 -- Statistics and Graphics} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) suppressPackageStartupMessages({ library(tidyverse) }) ``` # Exploration, univariate, and bivariate statistics and visualization Input clean data, with `Sex` and `Year` as factors. ```{r ALL-choose, eval=FALSE} path <- file.choose() # look for BRFSS-subset.csv ``` ```{r ALL-input} stopifnot(file.exists(path)) library(tidyverse) col_types <- cols( Age = col_integer(), Weight = col_double(), Sex = col_factor(c("Female", "Male")), Height = col_double(), Year = col_factor(c("1990", "2010")) ) brfss <- read_csv(path, col_types = col_types) brfss ``` ## Univariate: `t.test()` for Weight in 1990 vs. 2010 Females Filter the data to include females only, and use base `plot()` function and the formula interface to visualize the relationship between `Weight` and `Year`. ```{r brfss-female-plot} brfss %>% filter(Sex %in% "Female") %>% plot(Weight ~ Year, data = .) ``` Use a `t.test()` to test the hypothesis that female weight is the same in 2010 as in 1990 ```{r brfss-female-t-test} brfss %>% filter(Sex %in% "Female") %>% t.test(Weight ~ Year, data = .) ``` ## Bivariate: Weight and height in 2010 Males Filter the data to contain 2010 Males. Use `plot()` to visualize the relationship, and `lm()` to model it. ```{r brfss-male} male2010 <- brfss %>% filter(Year %in% "2010", Sex %in% "Male") male2010 %>% plot( Weight ~ Height, data = .) fit <- male2010 %>% lm( Weight ~ Height, data = .) fit summary(fit) ``` Multiple regression: Weight and Height, accounting for difference between years ```{r brfss-male-year-and-height} male <- brfss %>% filter(Sex %in% "Male") male %>% lm(Weight ~ Year + Height, data = .) %>% summary() ``` Is there an interaction between `Year` and `Height`? ```{r brfss-male-interaction} male %>% lm(Weight ~ Year * Height, data = .) %>% summary() ``` Check out other things to do with fitted model: - `broom::tidy()`: P-value, etc., as data.frame - `broom::augment()`: fitted values, residuals, etc ```{r brfss-male-augment, warning=FALSE} library(broom) male %>% lm(Weight ~ Year + Height, data = .) %>% augment() %>% as.tibble() ``` ## Visualization: [ggplot2][] *gg*plot: "Grammar of Graphics" - data: `ggplot2()` - *aes*thetics: `aes()`, 'x' and 'y' values, point colors, etc. - *geom*metric summaries, layered - `geom_point()`: points - `geom_smooth()`: fitted line - `geom_*`: ... - *facet* plots (e.g., `facet_grid()`) to create 'panels' based on factor levels, with shared axes. Create a plot with data points ```{r male-geom_point, warning = FALSE} ggplot(male, aes(x=Height, y = Weight)) + geom_point() ``` Capture the base plot and points, and explore different smoothed relationships, e.g., linear model, non-parameteric smoother ```{r male-ggplot, warning = FALSE} plt <- ggplot(male, aes(x=Height, y = Weight)) + geom_point() plt + geom_smooth(method = "lm") plt + geom_smooth() # default: generalized additive model ``` Use an `aes()`thetic to color smoothed lines based on `Year`, or `facet_grid()` to separate years. ```{r male-facet, warning = FALSE} ggplot(male, aes(x = Weight)) + geom_density(aes(fill = Year), alpha = .2) plt + geom_smooth(method = "lm", aes(color = Year)) plt + facet_grid( ~ Year ) + geom_smooth(method = "lm") ``` [ggplot2]: https://cran.r-project.org/package=ggplot2 # Multivariate analysis This is a classic microarray experiment. Microarrays consist of 'probesets' that interogate genes for their level of expression. In the experiment we're looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing, the data we'll work with has been pre-processed. ## Input and setup Start by finding the expression data file on disk. ```{r ALL-choose-again, eval=FALSE} path <- file.choose() # look for ALL-expression.csv stopifnot(file.exists(path)) ``` The data is stored in 'comma-separate value' format, with each probeset occupying a line, and the expression value for each sample in that probeset separated by a comma. Input the data using `read_csv()`. The sample identifiers are present in the first column. ```{r ALL-input-exprs} exprs <- read_csv(path) ``` We'll also input the data that describes each column ```{r ALL-phenoData.csv-clustering-student, eval=FALSE} path <- file.choose() # look for ALL-phenoData.csv stopifnot(file.exists(path)) ``` ```{r} pdata <- read_csv(path) pdata ``` ## Cleaning and Exploration The expression data is presented in what is sometimes called 'wide' format; a different format is 'tall', where Sample and Gene group the single observation Expression. Use `tidyr::gather()` to gather the columns of the wide format into two columns representing the tall format, excluding the `Gene` column from the gather operation. ```{r ALL-gather} exprs <- exprs %>% gather("Sample", "Expression", -Gene) ``` Explore the data a little, e.g., a summary and histogram of the expression values, and a histogram of average expression values of each gene. ```{r} exprs %>% select(Expression) %>% summary() exprs $ Expression %>% hist() exprs %>% group_by(Gene) %>% summarize(AveExprs = mean(Expression)) %$% AveExprs %>% hist(breaks=50) ``` For subsequent analysis, we also want to simplify the 'B or T' cell type classification ```{r B_or_T} pdata <- pdata %>% mutate(B_or_T = factor(substr(BT, 1, 1))) ``` ## Unsupervised machine learning -- multi-dimensional scaling We'd like to reduce high-dimensional data to lower dimension for visualization. To do so, we need the `dist()`ance between samples. From `?dist`, the input can be a data.frame where rows represent `Sample` and columns represent `Expression` values. Use `spread()` to create appropriate data from `exprs`, and pipe the result to `dist()`ance.x ```{r spread} input <- exprs %>% spread(Gene, Expression) samples <- input $ Sample input <- input %>% select(-Sample) %>% as.matrix rownames(input) <- samples ``` Calculate distance between samples, and use that for MDS scaling ```{r cmdscale} mds <- dist(input) %>% cmdscale() ``` The result is a matrix; make it 'tidy' by coercing to a tibble; add the Sample identifiers as a distinct column. ```{r mds-to-tibble} mds <- mds %>% as.tibble() %>% mutate(Sample = rownames(mds)) ``` Visualize the result ```{r} ggplot(mds, aes(x=V1, y = V2)) + geom_point() ``` With the 'eye of faith', it seems like there are two groups of points. To explore this, join the MDS scaling with the phenotypic data ```{r join} joined <- inner_join(mds, pdata) ``` and use the `B_or_T` column as an aesthetic to color points ```{r mds-color} ggplot(joined, aes(x = V1, y = V2)) + geom_point(aes(color = B_or_T)) ```