---
title: "A.2 -- Data Input and Manipulation"
author: Martin Morgan
date: "16 - 17 May, 2016"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{A.2 -- Data Input and Manipulation}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
options(width=100)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
# Extended Exercise 1: BRFSS Survey Data
We will explore a subset of data collected by the CDC through its
extensive Behavioral Risk Factor Surveillance System ([BRFSS][])
telephone survey. Check out the link for more information. We'll look
at a subset of the data.
1. Use `file.choose()` to find the path to the file 'BRFSS-subset.csv'
```{r file.choose, eval=FALSE}
path <- file.choose()
```
```{r system.file, echo=FALSE}
path <- system.file(package="BiocIntroRPCI", "extdata", "BRFSS-subset.csv")
```
2. Input the data using `read.csv()`, assigning to a variable `brfss`
```{r read.csv}
brfss <- read.csv(path)
```
3. Use command like `class()`, `head()`, `dim()`, `colnames()`,
`summary()` to explore the data.
- What variables have been measured?
- Can you guess at the units used for, e.g., Weight and Height?
4. Use the `$` operator to extract the 'Sex' column, and summarize the
number of males and females in the survey using `table()`. Do the
same for 'Year'.
```{r brfss-sex}
table(brfss$Sex)
```
5. The `xtabs()` function performs cross-tabulation using a
formula-like interface; summarize the number of males and female in
each year of the study.
```{r brfss-xtabs}
xtabs(~ Year + Sex, brfss)
```
6. Use `aggregate()` to summarize the mean weight of each group. What
about the median weight of each group?
```{r brfss-aggregate}
aggregate(Weight ~ Year + Sex, brfss, mean)
```
7. Create a subset of the data consisting of only the 1990
observations. Perform a t-test comparing the weight of males and
females ("'Weight' as a function of 'Sex'", `Weight ~ Sex`)
```{r t-test-1990}
brfss_1990 = brfss[brfss$Year == 1990,]
t.test(Weight ~ Sex, brfss_1990)
```
What about differences between weights of males (or females) in
1990 versus 2010? Check out the help page `?t.test.formula`. Is
there a way of performing a t-test on `brfss` without explicitly
creating the object `brfss_1990`?
8. Use `boxplot()` to plot the weights of the Male individuals.
Can you transform weight, e.g., `sqrt(Weight) ~ Year`?
Interpret the results. Do similar boxplots for the t-tests of the
previous question.
```{r brfss-boxplot, fig.width=5, fig.height=5}
boxplot(Weight ~ Year, brfss, subset = (Sex == "Male"),
main="Males")
```
9. Use `hist()` to plot a histogram of weights of the 1990 Female individuals.
```{r brfss-hist, fig.width=5, fig.height=5}
hist(brfss_1990[brfss_1990$Sex == "Female", "Weight"],
main="Females, 1990", xlab="Weight" )
```
[BRFSS]: http://www.cdc.gov/brfss/about/index.htm
# Extended Exercise 2: ALL Phenotypic Data
This data comes from an (old) Acute Lymphoid Leukemia microarray data
set.
Choose the file that contains ALL (acute lymphoblastic leukemia)
patient information
```{r echo=FALSE}
path <- system.file(package="BiocIntroRPCI", "extdata", "ALL-phenoData.csv")
```
```{r ALL-choose, eval=FALSE}
path <- file.choose() # look for ALL-phenoData.csv
```
```{r ALL-input}
stopifnot(file.exists(path))
pdata <- read.csv(path)
```
Check out the help page `?read.delim` for input options. The exercises
use `?read.csv`; Can you guess why? Explore basic properties of the
object you've created, for instance...
```{r ALL-properties}
class(pdata)
colnames(pdata)
dim(pdata)
head(pdata)
summary(pdata$sex)
summary(pdata$cyto.normal)
```
Remind yourselves about various ways to subset and access columns of a
data.frame
```{r ALL-subset}
pdata[1:5, 3:4]
pdata[1:5, ]
head(pdata[, 3:5])
tail(pdata[, 3:5], 3)
head(pdata$age)
head(pdata$sex)
head(pdata[pdata$age > 21,])
```
It seems from below that there are 17 females over 40 in the data
set. However, some individuals have `NA` for the age and / or sex, and
these `NA` values propagate through some computations. Use `table()`
to summarize the number of females over 40, and the number of samples
for which this classification cannot be determined. When _R_
encounters an `NA` value in a subscript index, it introduces an `NA`
into the result. Observe this (rows of `NA` values introduced into the
result) when subsetting using `[` versus using the `subset()`
function.
```{r ALL-subset-NA}
idx <- pdata$sex == "F" & pdata$age > 40
table(idx, useNA="ifany")
dim(pdata[idx,]) # WARNING: 'NA' rows introduced
tail(pdata[idx,])
dim(subset(pdata, idx)) # BETTER: no NA rows
tail(subset(pdata,idx))
## work-around for `[`: set NA values to FALSE
idx[is.na(idx)] <- FALSE
dim(pdata[idx,])
```
Use the `mol.biol` column to subset the data to contain just
individuals with 'BCR/ABL' or 'NEG', e.g.,
```{r ALL-BCR/ABL-subset}
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
```
The `mol.biol` column is a factor, and retains all levels even after
subsetting. It is sometimes convenient to retain factor levels, but in
our case we use `droplevels()` to removed unused levels
```{r ALL-BCR/ABL-drop-unused}
bcrabl$mol.biol <- droplevels(bcrabl$mol.biol)
```
The `BT` column is a factor describing B- and T-cell subtypes
```{r ALL-BT}
levels(bcrabl$BT)
```
How might one collapse B1, B2, ... to a single type B, and likewise
for T1, T2, ..., so there are only two subtypes, B and T? One strategy
is to replace two-letter level (e.g., `B1`) with the single-letter
level (e.g., `B`). Do this using `substring()` to select the first
letter of level, and update the previous levels with the new value
using `levels<-`.
```{r ALL-BT-recode}
table(bcrabl$BT)
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
```
Use `xtabs()` (cross-tabulation) to count the number of samples with
B- and T-cell types in each of the BCR/ABL and NEG groups
```{r ALL-BCR/ABL-BT}
xtabs(~ BT + mol.biol, bcrabl)
```
Use `aggregate()` to calculate the average age of males and females in
the BCR/ABL and NEG treatment groups.
```{r ALL-aggregate}
aggregate(age ~ mol.biol + sex, bcrabl, mean)
```
Use `t.test()` to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using `boxplot()`. In both cases,
use the `formula` interface. Consult the help page `?t.test` and re-do
the test assuming that variance of ages in the two groups is
identical. What parts of the test output change?
```{r ALL-age}
t.test(age ~ mol.biol, bcrabl)
boxplot(age ~ mol.biol, bcrabl)
```