Original Authors: Martin Morgan, Sonali Arora, Lori Shepherd
Presenting Author: Martin Morgan
Date: 20 June, 2022
Back: Monday labs
Objective: Gain confidence working with ‘tidy’ R commands and data structures.
Lessons learned:
pivot_longer() to ‘tidy’ data.The ‘tidyverse’ is a a recent approach to working with data in R. The basic principle is that data is often best represented in ‘long-form’ data.frames, with data transformations implemented with a few central functions.
Start by loading some important packages in the tidyverse, readr for data input, tibble for data representation, dplyr for manipulation, and ggplot2 for visualization.
library(readr)
library(tibble)
library(dplyr)
library(ggplot2)Read the BRFSS data into a tibble (data.frame) using read_csv();
this function is similar to read.csv() but standardized the argument
naming convention.
fname <- file.choose()   ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read_csv(fname)Note that by default character values were not interpreted as factors, that the display is more informative (the class of each column is indicated under the title), and only a ‘preview’ of all the data is displayed.
brfss## # A tibble: 20,000 × 5
##      Age Weight Sex    Height  Year
##    <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1    31   49.0 Female   157.  1990
##  2    57   81.6 Female   157.  1990
##  3    43   80.3 Male     178.  1990
##  4    72   70.3 Male     170.  1990
##  5    31   49.9 Female   155.  1990
##  6    58   54.4 Female   155.  1990
##  7    45   69.9 Male     173.  1990
##  8    37   68.0 Male     180.  1990
##  9    33   65.8 Female   170.  1990
## 10    75   70.8 Female   152.  1990
## # … with 19,990 more rowsA tibble() can be manipulated like a data.frame, but usually
operations in the tidyverse are ‘piped’ using the |> symbol from
one representation to another. We start with some clean-up using the
mutate() function to update or add individual columns. Start with an
interactive exploration…
brfss |>
    mutate( Sex = factor(Sex), Year = factor(Year) )## # A tibble: 20,000 × 5
##      Age Weight Sex    Height Year 
##    <dbl>  <dbl> <fct>   <dbl> <fct>
##  1    31   49.0 Female   157. 1990 
##  2    57   81.6 Female   157. 1990 
##  3    43   80.3 Male     178. 1990 
##  4    72   70.3 Male     170. 1990 
##  5    31   49.9 Female   155. 1990 
##  6    58   54.4 Female   155. 1990 
##  7    45   69.9 Male     173. 1990 
##  8    37   68.0 Male     180. 1990 
##  9    33   65.8 Female   170. 1990 
## 10    75   70.8 Female   152. 1990 
## # … with 19,990 more rows…and when the pipeline looks good re-assign the updated tibble.
brfss <-
    brfss |>
    mutate( Sex = factor(Sex), Year = factor(Year) )Common operations are to filter() rows to those containing only
certain values, and to select() a subset of columns.
brfss |>
    filter(Sex == "Female", Year == "1990") |> 
    select(Age, Weight, Height)## # A tibble: 5,718 × 3
##      Age Weight Height
##    <dbl>  <dbl>  <dbl>
##  1    31   49.0   157.
##  2    57   81.6   157.
##  3    31   49.9   155.
##  4    58   54.4   155.
##  5    33   65.8   170.
##  6    75   70.8   152.
##  7    68   66.2   160.
##  8    87   49.9   152.
##  9    44  115.    160.
## 10    57   63.5   157.
## # … with 5,708 more rowsAnother common operation is to group_by() one or more columns, and
summarize() data in other columns, e.g.,
brfss |>
    group_by(Sex, Year) |> 
    summarize(
        AveAge = mean(Age, na.rm=TRUE),
        AveWeight = mean(Weight, na.rm=TRUE)
    )## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.## # A tibble: 4 × 4
## # Groups:   Sex [2]
##   Sex    Year  AveAge AveWeight
##   <fct>  <fct>  <dbl>     <dbl>
## 1 Female 1990    46.2      64.8
## 2 Female 2010    57.1      73.0
## 3 Male   1990    43.9      81.2
## 4 Male   2010    56.2      88.8Note that the output of each pipe is itself a tibble, so that the output can be further transformed using tidyverse functions.
The main features of ‘tidy’ data are
Standard representation as a long-format data.frame-like tibble()
Restricted vocabulary of core functions – mutate(), filter(),
select(), group_by(), summarize(), … These functions are
‘isomorphic’, meaning that the return value is the same type (a
tibble!) as the first argument of the function.
Short ‘pipes’ summarizing data transformation steps.
Revisit case study 2.1, ALL Phenotype Data in Lab 1.1: Introduction to R using a tidy approach to data exploration.
Input the data “ALLphenoData.tsv” using read_tsv() (note: we use
_tsv() because the columns are _t_ab _s_eparated). Compare column
types, display, etc with base R functions. Note that columns are
never factor() by default.
Use filter() to create a subset of the data consisting only of
female individuals over 40. Compare this approach with that used in
base R. Likewise, create a subsect with only “BCR/ABL” and “NEG”
in the mol.biol column.
Use mutate() to further transform the bcrabl subset by recoding
the BT column to be either B or T, based on the first letter of
the column.
Use group_by() and summarize() to deterimine the number of
individuals in each combination of BT and mol.biol. Can you
perform this same computation using only count()?
t.test()We’d like to compare the average age of males and females in the study
using t.test(). In base R , we can write
pdata <- read_tsv("ALLphenoData.tsv")## Rows: 127 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (14): id, diagnosis, sex, BT, remission, CR, date.cr, citog, mol.biol, f...
## dbl  (1): age
## lgl  (6): t(4;11), t(9;22), cyto.normal, ccr, relapse, transplant
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.t.test(age ~ sex, pdata)## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -1.022660  9.504142
## sample estimates:
## mean in group F mean in group M 
##        35.16667        30.92593t.test() takes a formula age ~ sex (age as a
function of sex) describing the comparison that we’d like to
make. It also takes an argument data= that contains the data we’d
like to perform the t-test on. Unlike functions we’ve encountered so
far where the data to be processed is the first argument, t.test()
expects the data as the second argument. To adapt t.test() for use,
we need to explicitly indicate that the data should be the second
argument. One way of doing this is to use the special symbol . to
represent the location of the incoming data, invoking t.test(age ~ sex, data = .):
pdata |>
    t.test(age ~ sex, data = _)Exercise Perform a t-test to ask whether there is evidence of
differences in ages between the sexes. How can we change the default
value of var.equal to TRUE? Is this appropriate?
Exercise Write a function that makes it easier to use t.test()
in a ‘tidy’ work flow. Do this by arranging it so that the first
argument of your function is the data set, the second argument the
formula, and then allow for any number of additional arguments. Pass
these to t.test(), along the lines of
t_test <- function(data, formula, ...) {
    t.test(formula, data, ...)
}Verify that you can use your t_test() function in a straight-forward way
pdata |>
    t_test(age ~ sex)## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -1.022660  9.504142
## sample estimates:
## mean in group F mean in group M 
##        35.16667        30.92593Exercise (advanced) The return value of t.test() doesn’t fit
well with tidy data analysis, because it is a complicated object
that is not represented as a tibble and hence cannot be computed
upon using the common tidy verbs. Update t_test() so that it is more
tidy-friendly, accepting data = as it’s first argument, using
t.test() internally to compute results, and returning a tibble
containing results formatted for subsequent computation. One way to
accomplish the last task is through use of broom::tidy()
function to transform many base R objects into tidy-friendly data
structures.
We’ll encounter the ‘airway’ data set extensively later in the
course. Here we read in the description of 8 samples used in a RNAseq
analysis, usnig select() to choose specific columns to work with.
pdata_file <- file.choose()    # airway-sample-sheet.csv
count_file <- file.choose()    # airway-read-counts.csvpdata <- read_csv(pdata_file)## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.pdata <- 
    pdata |> 
    select(Run, cell, dex)
pdata## # A tibble: 8 × 3
##   Run        cell    dex  
##   <chr>      <chr>   <chr>
## 1 SRR1039508 N61311  untrt
## 2 SRR1039509 N61311  trt  
## 3 SRR1039512 N052611 untrt
## 4 SRR1039513 N052611 trt  
## 5 SRR1039516 N080611 untrt
## 6 SRR1039517 N080611 trt  
## 7 SRR1039520 N061011 untrt
## 8 SRR1039521 N061011 trtNow read in the ‘airway-read_counts.csv’ file. Each row is a sample, each column (other than the first, the gene identifier) is a gene, and each entry the number of RNA-seq reads overlapping each gene and sample – a measure of gene expression.
count_file <- "airway-read-counts.csv"
counts <- read_csv(count_file)## Rows: 8 Columns: 33470
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr     (1): Run
## dbl (33469): ENSG00000000003, ENSG00000000419, ENSG00000000457, ENSG00000000...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.eg <- counts[, 1:6]    # make it easy to work with
eg## # A tibble: 8 × 6
##   Run        ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460
##   <chr>                <dbl>           <dbl>           <dbl>           <dbl>
## 1 SRR1039508             679             467             260              60
## 2 SRR1039509             448             515             211              55
## 3 SRR1039512             873             621             263              40
## 4 SRR1039513             408             365             164              35
## 5 SRR1039516            1138             587             245              78
## 6 SRR1039517            1047             799             331              63
## 7 SRR1039520             770             417             233              76
## 8 SRR1039521             572             508             229              60
## # … with 1 more variable: ENSG00000000938 <dbl>An advanced tidy concept is to join to data sets, as described on
the help page ?left_join. Use left_join() to combine the
phenotypic and expression data.
data <- left_join(pdata, eg, by = "Run")
data## # A tibble: 8 × 8
##   Run        cell    dex   ENSG00000000003 ENSG00000000419 ENSG00000000457
##   <chr>      <chr>   <chr>           <dbl>           <dbl>           <dbl>
## 1 SRR1039508 N61311  untrt             679             467             260
## 2 SRR1039509 N61311  trt               448             515             211
## 3 SRR1039512 N052611 untrt             873             621             263
## 4 SRR1039513 N052611 trt               408             365             164
## 5 SRR1039516 N080611 untrt            1138             587             245
## 6 SRR1039517 N080611 trt              1047             799             331
## 7 SRR1039520 N061011 untrt             770             417             233
## 8 SRR1039521 N061011 trt               572             508             229
## # … with 2 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>This is how a statistician might organize their data – in a ‘wide’ format, with each line representing a sample and each column a variable measured on that sample.
Tidy data is usually transformed into ‘long’ format, where ‘Count’ is interpreted as a measurement, with ‘Gene’ and ‘Run’ as indicator variables describing which experimental entity the count is associated with.
Use pivot_longer() to transform the wide format data into
long-format ‘tidy’ data. pivot_longer() is a tricky function to work
with, so study the help page carefully.
library(tidyr)tbl <- pivot_longer(data, -(1:3), names_to = "Gene", values_to = "Count")
tbl## # A tibble: 40 × 5
##    Run        cell   dex   Gene            Count
##    <chr>      <chr>  <chr> <chr>           <dbl>
##  1 SRR1039508 N61311 untrt ENSG00000000003   679
##  2 SRR1039508 N61311 untrt ENSG00000000419   467
##  3 SRR1039508 N61311 untrt ENSG00000000457   260
##  4 SRR1039508 N61311 untrt ENSG00000000460    60
##  5 SRR1039508 N61311 untrt ENSG00000000938     0
##  6 SRR1039509 N61311 trt   ENSG00000000003   448
##  7 SRR1039509 N61311 trt   ENSG00000000419   515
##  8 SRR1039509 N61311 trt   ENSG00000000457   211
##  9 SRR1039509 N61311 trt   ENSG00000000460    55
## 10 SRR1039509 N61311 trt   ENSG00000000938     0
## # … with 30 more rowsUse group_by() and summarize() to calculate the library size,
i.e., the total number of reads mapped per run. Likewise use
group_by() and summarize() to descrbe average and log-transformed
counts.
tbl |>
    group_by(Run) |>
    summarize(lib_size = sum(Count))## # A tibble: 8 × 2
##   Run        lib_size
##   <chr>         <dbl>
## 1 SRR1039508     1466
## 2 SRR1039509     1229
## 3 SRR1039512     1799
## 4 SRR1039513      972
## 5 SRR1039516     2049
## 6 SRR1039517     2240
## 7 SRR1039520     1496
## 8 SRR1039521     1369tbl |>
    group_by(Gene) |>
    summarize(
        ave_count = mean(Count),
        ave_log_count = mean(log(1 + Count))
    )## # A tibble: 5 × 3
##   Gene            ave_count ave_log_count
##   <chr>               <dbl>         <dbl>
## 1 ENSG00000000003   742.            6.55 
## 2 ENSG00000000419   535.            6.26 
## 3 ENSG00000000457   242             5.48 
## 4 ENSG00000000460    58.4           4.05 
## 5 ENSG00000000938     0.375         0.224Now tidy all the data.
counts_tbl <- pivot_longer(counts, -Run, names_to = "Gene", values_to = "Count")and join to the pdata
data_tbl <- left_join(pdata, counts_tbl)## Joining, by = "Run"data_tbl## # A tibble: 267,752 × 5
##    Run        cell   dex   Gene            Count
##    <chr>      <chr>  <chr> <chr>           <dbl>
##  1 SRR1039508 N61311 untrt ENSG00000000003   679
##  2 SRR1039508 N61311 untrt ENSG00000000419   467
##  3 SRR1039508 N61311 untrt ENSG00000000457   260
##  4 SRR1039508 N61311 untrt ENSG00000000460    60
##  5 SRR1039508 N61311 untrt ENSG00000000938     0
##  6 SRR1039508 N61311 untrt ENSG00000000971  3251
##  7 SRR1039508 N61311 untrt ENSG00000001036  1433
##  8 SRR1039508 N61311 untrt ENSG00000001084   519
##  9 SRR1039508 N61311 untrt ENSG00000001167   394
## 10 SRR1039508 N61311 untrt ENSG00000001460   172
## # … with 267,742 more rowsSummarize library size (what are the maximum and minimum library sizes?)
data_tbl |>
    group_by(Run) |>
    summarize(lib_size = sum(Count))## # A tibble: 8 × 2
##   Run        lib_size
##   <chr>         <dbl>
## 1 SRR1039508 20637971
## 2 SRR1039509 18809481
## 3 SRR1039512 25348649
## 4 SRR1039513 15163415
## 5 SRR1039516 24448408
## 6 SRR1039517 30818215
## 7 SRR1039520 19126151
## 8 SRR1039521 21164133and average ‘expression’ of each gene.
gene_summaries <-
    data_tbl |>
    group_by(Gene) |>
    summarize(
        ave_count = mean(Count),
        ave_log_count = mean(log(1 + Count))
    )
gene_summaries## # A tibble: 33,469 × 3
##    Gene            ave_count ave_log_count
##    <chr>               <dbl>         <dbl>
##  1 ENSG00000000003   742.            6.55 
##  2 ENSG00000000419   535.            6.26 
##  3 ENSG00000000457   242             5.48 
##  4 ENSG00000000460    58.4           4.05 
##  5 ENSG00000000938     0.375         0.224
##  6 ENSG00000000971  6035.            8.63 
##  7 ENSG00000001036  1305             7.15 
##  8 ENSG00000001084   615.            6.40 
##  9 ENSG00000001167   392.            5.89 
## 10 ENSG00000001460   188.            5.21 
## # … with 33,459 more rowsAnd visualize using ggplot2
library(ggplot2)ggplot(gene_summaries, aes(ave_log_count)) +
    geom_density()sessionInfo()## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.2.0      ggplot2_3.3.6    dplyr_1.0.9      tibble_3.1.7    
## [5] readr_2.1.2      BiocStyle_2.24.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2    xfun_0.31           bslib_0.3.1        
##  [4] purrr_0.3.4         colorspace_2.0-3    vctrs_0.4.1        
##  [7] generics_0.1.2      htmltools_0.5.2     yaml_2.3.5         
## [10] utf8_1.2.2          rlang_1.0.2         jquerylib_0.1.4    
## [13] pillar_1.7.0        glue_1.6.2          withr_2.5.0        
## [16] DBI_1.1.2           bit64_4.0.5         lifecycle_1.0.1    
## [19] stringr_1.4.0       munsell_0.5.0       gtable_0.3.0       
## [22] codetools_0.2-18    evaluate_0.15       labeling_0.4.2     
## [25] knitr_1.39          tzdb_0.3.0          fastmap_1.1.0      
## [28] parallel_4.2.0      fansi_1.0.3         highr_0.9          
## [31] Rcpp_1.0.8.3        scales_1.2.0        BiocManager_1.30.18
## [34] vroom_1.5.7         magick_2.7.3        jsonlite_1.8.0     
## [37] farver_2.1.0        bit_4.0.4           hms_1.1.1          
## [40] digest_0.6.29       stringi_1.7.6       bookdown_0.27      
## [43] grid_4.2.0          cli_3.3.0           tools_4.2.0        
## [46] magrittr_2.0.3      sass_0.4.1          crayon_1.5.1       
## [49] pkgconfig_2.0.3     ellipsis_0.3.2      assertthat_0.2.1   
## [52] rmarkdown_2.14      R6_2.5.1            compiler_4.2.0Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U24HG004059 (Bioconductor), U24HG010263 (AnVIL) and U24CA180996 (ITCR).