--- title: "A.2 -- Data Input and Manipulation" author: Martin Morgan date: "16 - 17 May, 2016" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{A.2 -- Data Input and Manipulation} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} options(width=100) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` # Extended Exercise 1: BRFSS Survey Data We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System ([BRFSS][]) telephone survey. Check out the link for more information. We'll look at a subset of the data. 1. Use `file.choose()` to find the path to the file 'BRFSS-subset.csv' ```{r file.choose, eval=FALSE} path <- file.choose() ``` ```{r system.file, echo=FALSE} path <- system.file(package="BiocIntroRPCI", "extdata", "BRFSS-subset.csv") ``` 2. Input the data using `read.csv()`, assigning to a variable `brfss` ```{r read.csv} brfss <- read.csv(path) ``` 3. Use command like `class()`, `head()`, `dim()`, `colnames()`, `summary()` to explore the data. - What variables have been measured? - Can you guess at the units used for, e.g., Weight and Height? 4. Use the `$` operator to extract the 'Sex' column, and summarize the number of males and females in the survey using `table()`. Do the same for 'Year'. ```{r brfss-sex} table(brfss$Sex) ``` 5. The `xtabs()` function performs cross-tabulation using a formula-like interface; summarize the number of males and female in each year of the study. ```{r brfss-xtabs} xtabs(~ Year + Sex, brfss) ``` 6. Use `aggregate()` to summarize the mean weight of each group. What about the median weight of each group? ```{r brfss-aggregate} aggregate(Weight ~ Year + Sex, brfss, mean) ``` 7. Create a subset of the data consisting of only the 1990 observations. Perform a t-test comparing the weight of males and females ("'Weight' as a function of 'Sex'", `Weight ~ Sex`) ```{r t-test-1990} brfss_1990 = brfss[brfss$Year == 1990,] t.test(Weight ~ Sex, brfss_1990) ``` What about differences between weights of males (or females) in 1990 versus 2010? Check out the help page `?t.test.formula`. Is there a way of performing a t-test on `brfss` without explicitly creating the object `brfss_1990`? 8. Use `boxplot()` to plot the weights of the Male individuals. Can you transform weight, e.g., `sqrt(Weight) ~ Year`? Interpret the results. Do similar boxplots for the t-tests of the previous question. ```{r brfss-boxplot, fig.width=5, fig.height=5} boxplot(Weight ~ Year, brfss, subset = (Sex == "Male"), main="Males") ``` 9. Use `hist()` to plot a histogram of weights of the 1990 Female individuals. ```{r brfss-hist, fig.width=5, fig.height=5} hist(brfss_1990[brfss_1990$Sex == "Female", "Weight"], main="Females, 1990", xlab="Weight" ) ``` [BRFSS]: http://www.cdc.gov/brfss/about/index.htm # Extended Exercise 2: ALL Phenotypic Data This data comes from an (old) Acute Lymphoid Leukemia microarray data set. Choose the file that contains ALL (acute lymphoblastic leukemia) patient information ```{r echo=FALSE} path <- system.file(package="BiocIntroRPCI", "extdata", "ALL-phenoData.csv") ``` ```{r ALL-choose, eval=FALSE} path <- file.choose() # look for ALL-phenoData.csv ``` ```{r ALL-input} stopifnot(file.exists(path)) pdata <- read.csv(path) ``` Check out the help page `?read.delim` for input options. The exercises use `?read.csv`; Can you guess why? Explore basic properties of the object you've created, for instance... ```{r ALL-properties} class(pdata) colnames(pdata) dim(pdata) head(pdata) summary(pdata$sex) summary(pdata$cyto.normal) ``` Remind yourselves about various ways to subset and access columns of a data.frame ```{r ALL-subset} pdata[1:5, 3:4] pdata[1:5, ] head(pdata[, 3:5]) tail(pdata[, 3:5], 3) head(pdata$age) head(pdata$sex) head(pdata[pdata$age > 21,]) ``` It seems from below that there are 17 females over 40 in the data set. However, some individuals have `NA` for the age and / or sex, and these `NA` values propagate through some computations. Use `table()` to summarize the number of females over 40, and the number of samples for which this classification cannot be determined. When _R_ encounters an `NA` value in a subscript index, it introduces an `NA` into the result. Observe this (rows of `NA` values introduced into the result) when subsetting using `[` versus using the `subset()` function. ```{r ALL-subset-NA} idx <- pdata$sex == "F" & pdata$age > 40 table(idx, useNA="ifany") dim(pdata[idx,]) # WARNING: 'NA' rows introduced tail(pdata[idx,]) dim(subset(pdata, idx)) # BETTER: no NA rows tail(subset(pdata,idx)) ## work-around for `[`: set NA values to FALSE idx[is.na(idx)] <- FALSE dim(pdata[idx,]) ``` Use the `mol.biol` column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g., ```{r ALL-BCR/ABL-subset} bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),] ``` The `mol.biol` column is a factor, and retains all levels even after subsetting. It is sometimes convenient to retain factor levels, but in our case we use `droplevels()` to removed unused levels ```{r ALL-BCR/ABL-drop-unused} bcrabl$mol.biol <- droplevels(bcrabl$mol.biol) ``` The `BT` column is a factor describing B- and T-cell subtypes ```{r ALL-BT} levels(bcrabl$BT) ``` How might one collapse B1, B2, ... to a single type B, and likewise for T1, T2, ..., so there are only two subtypes, B and T? One strategy is to replace two-letter level (e.g., `B1`) with the single-letter level (e.g., `B`). Do this using `substring()` to select the first letter of level, and update the previous levels with the new value using `levels<-`. ```{r ALL-BT-recode} table(bcrabl$BT) levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1) table(bcrabl$BT) ``` Use `xtabs()` (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups ```{r ALL-BCR/ABL-BT} xtabs(~ BT + mol.biol, bcrabl) ``` Use `aggregate()` to calculate the average age of males and females in the BCR/ABL and NEG treatment groups. ```{r ALL-aggregate} aggregate(age ~ mol.biol + sex, bcrabl, mean) ``` Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change? ```{r ALL-age} t.test(age ~ mol.biol, bcrabl) boxplot(age ~ mol.biol, bcrabl) ```