--- title: "Lecture 1 -- Introduction to R / Bioconductor" author: "Martin Morgan " output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{Lecture 1 -- Introduction to R / Bioconductor} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` Author: [Martin Morgan][]
Date: 22 July, 2019
[Martin Morgan]: mailto: Martin.Morgan@RoswellPark.org # _R_ ## History of _R_ and _CRAN_ - Statistical programming language. Concieved 1992, initial version 1996, stable beta version in 2000; an implementation of _S_. CRAN started in 1997. - 'Free' software: no cost, open source, broad use. - Extensible: packages (15,000 on [CRAN][], 1750 on [Bioconductor][]) - Key features - Intrinsic _statistical_ concepts - _Vectorized_ computation - 'Old-school' scripts rather than graphical user interface -- great for reproducibility! - (Advanced) _copy-on-change_ semanatics ## Vectors and data frames ```{r} 1 + 2 x = c(1, 2, 3) 1:3 # sequence of integers from 1 to 3 x + c(4, 5, 6) # vectorized x + 4 # recycling ``` Vectors - `numeric()`, `character()`, `logical()`, `integer()`, `complex()`, ... - `NA`: 'not available' - `factor()`: values from restricted set of 'levels'. Operations - numeric: `==`, `<`, `<=`, `>`, `>=`, ... - logical: `|` (or), `&` (and), `!` (not) - subset: `[`, e.g., `x[c(2, 3)]` - assignment: `[<-`, e.g., `x[c(1, 3)] = x[c(1, 3)]` - other: `is.na()` Functions ```{r} x = rnorm(100) y = x + rnorm(100) plot(x, y) ``` - Many! `data.frame` ```{r} df <- data.frame(Independent = x, Dependent = y) head(df) df[1:5, 1:2] df[1:5, ] plot(Dependent ~ Independent, df) # 'formula' interface ``` - List of equal-length vectors - Vectors can be of different type - Two-dimensional subset and assignment - Column access: `df[, 1]`, `df[, "Indep"]`, `df[[1]]`, `df[["Indep"]]`, `df$Indep` Exercise: plot only values with `Dependent > 0`, `Independent > 0` 1. Select rows ```{r} ridx <- (df$Dependent > 0) & (df$Independent > 0) ``` 2. Plot subset ```{r} plot(Dependent ~ Independent, df[ridx, ]) ``` 3. Skin the cat another way ```{r} plot( Dependent ~ Independent, df, subset = (Dependent > 0) & (Independent > 0) ) ``` ## Analysis: functions, classes, methods ```{r} fit <- lm(Dependent ~ Independent, df) # linear model -- regression anova(fit) # summary table plot(Dependent ~ Independent, df) abline(fit) ``` - `lm()`: plain-old function - `fit`: an _object_ of class "lm" - `anova()`: a _generic_ with a specific _method_ for class "lm" ```{r} class(fit) methods(class="lm") ``` ## Help! ```{r, eval=FALSE} ?"plot" # plain-old-function or generic ?"plot.formula" # method ?"plot.lm" # method for object of class 'lm', plot(fit) ``` ## Packages ```{r} library(ggplot2) ggplot(df, aes(x = Independent, y = Dependent)) + geom_point() + geom_smooth(method = "lm") ``` - General purpose: >15,000 packages on [CRAN][] - Gain contributor's domain expertise _and_ weird (or other) idiosyncracies - _Installation_ (once only per computer) versus _load_ (via `library(ggplot2)`, once per session) [CRAN]: https://cran.r-project.org [Bioconductor]: https://bioconductor.org # _Bioconductor_ Started 2002 as a platform for understanding analysis of microarray data ## Packages 1,750 packages. Domains of expertise: - Sequencing (RNASeq, ChIPSeq, single-cell, called variants, ...) - Microarrays (methylation, expression, copy number, ...) - flow cytometry - proteomics - ... Important themes - Reproducible research - Interoperability between packages & work kflows - Usability Resources - https://bioconductor.org - https://bioconductor.org/packages -- software, annotation, experiment, workflow - https://support.bioconductor.org - Community [slack][] ([sign-up][slack-sign-up]) [slack]: https://community-bioc.slack.com [slack-sign-up]: https://bioc-community.herokuapp.com/ ## Objects A distinctive feature of _Bioconductor_ -- use of _objects_ for representing data ```{r, message = FALSE} library(Biostrings) dna <- DNAStringSet(c("AACTCC", "CTGCA")) dna reverseComplement(dna) ``` - [Biostrings][]: DNA, RNA, AA representation and manipulation - [GenomicRanges][]: Coordinates in genome space - [SummarizedExperiment][]: coordinating 'assay' data (e.g., counts from an RNASeq experiment) with row and column annotations (e.g., information about samples and experimental treatments). [Biostrings]: https://bioconductor.org/packages/Biostrings [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment ## High-throughput sequence work flow ![](lecture-01-figures/SequencingEcosystem.png) Web site, https://bioconductor.org 1750 'software' packages, https://bioconductor.org/packages - Sequence analysis (RNASeq, ChIPSeq, called variants, copy number, single cell) - Microarrays (methylation, copy number, classical expression, ...) - Annotation (more about annotations later this morning...) - Flow cytometry - Proteomics, image analysis, ... Discovery and use, e.g., [DESeq2][] - Landing pages: title, description (abstract), installation instructions, badges - Vignettes! Also: - 'Annotation' packages - 'Experiment data' packages - Workflows - [Course material][], ... [DESeq2]: https://bioconductor.org/packages/DESeq2 [Course material]: https://bioconductor.org/help/course-materials/ # End matter ## Session Info ```{r} sessionInfo() ``` ## Acknowledgements Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 633974)