--- title: "Bioconductor Essentials" author: - name: Martin Morgan affiliation: Roswell Park Cancer Institute, Buffalo, NY output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default package: BiocIntro vignette: | %\VignetteIndexEntry{A01 -- Bioconductor Essentials} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` # _R_ Extensible statistical programming language - https://r-project.org ## _R_ Langauge ```{r} x <- rnorm(50) y <- x + rnorm(50) df <- data.frame(Indep = x, Dep = y) fit <- lm(Dep ~ Indep, df) summary(fit) ``` Vectors - `numeric()`, `character()`, `integer()`, `logical()`, `list()`, ... - Statistical concepts: `NA`, `factor()` Objects - Class: `data.frame`, `lm`, `matrix`, ... Function, generic, method - `rnorm()`, `lm()`; `summary()` generic, `summary.lm()` method. Programming constructs - `apply()` (array), `lapply()` vector or list --> list, `sapply()`; `if () {} else {}`, `for () {}` / `repeat {}` - `function() {}` - _Garbage collection_ ## Packages ```{r, warning=FALSE} library(ggplot2) ggplot(df, aes(x = Dep, y = Indep)) + geom_point() + geom_smooth(method="lm") ``` - Base, recommended, contributed - [CRAN][] - Domain expertise + author 'opinion' about statistics, programming, ... ## Help! - `?"summary"`, `?"summary.lm"` - CRAN [Task views][] - [StackOverflow][] [CRAN]: https://cran.r-project.org [Task views]: https://cran.r-project.org [StackOverflow]: https://stackoverflow.com # _Bioconductor_ Statistical analysis and comprehension of high-throughput genomic data - https://bioconductor.org - https://support.bioconductor.org ## A First Workflow ```{r} suppressPackageStartupMessages({ library(Biostrings) }) dna <- DNAStringSet( c("AAACTG", "CCCAACCA") ) dna reverseComplement(dna) ``` - Biological context - 'S4' classes and methods - Inter-operable packages -- learning to work with [Biostrings][] and `DNAStringSet` pays off in other packages. Help! - `class(dna)` - `methods(class = "DNAStringSet")` - `?"DNAStringSet"`, `?"reverseComplement,DNAStringSet-method"` (tab completion!) - `browseVignettes(package="Biostrings")` - https://bioconductor.org/packages - https://support.bioconductor.org [Biostrings]: https://bioconductor.org/packages/Biostrings ## [GenomicRanges][] ```{r} suppressPackageStartupMessages({ library(GenomicRanges) }) gr <- GRanges( c("chr1:10-19", "chr1:15-24", "chr1:30-39") ) gr ``` - Closed-interval (start and end coordinates include in range, like Ensembl; UCSC uses 1/2-open) - 1-based (like Ensembl; UCSC uses 0-based) Operations - Accessors: `seqnames()`, `start()`, `end()`, `width()`, `strand()` ```{r} width(gr) ``` - Intra-range: `shift()`, `narrow()`, `resize()`, `flank()`, `restrict()`, `trim()`... See `?"intra-range-methods"`. ```{r} shift(gr, 1) shift(gr, c(1, 2, 3)) ``` - Inter-range: `range()`, `gaps()`, `reduce()`, `disjoin()`, `coverage()`, ... See `?"inter-range-methods"`. ```{r} gaps(gr) reduce(gr) disjoin(gr) as(coverage(gr), "GRanges") ``` - Between-ranges: `findOverlaps()` / `countOverlaps()`; set operations (e.g., `union()`, `punion()`); `subsetByOverlaps()`; ... ```{r} snps <- GRanges("chr1", IRanges(c(7, 12, 17, 22), width = 1)) snps countOverlaps(gr, snps) subsetByOverlaps(snps, gr) ``` Data associated with ranges - `mcols()` or `$` ```{r} gr$p.value <- runif(3) gr ``` `GRangesList` - List of `GRanges`, e.g., exons ```{r} gene <- c("A", "A", "B") grl <- splitAsList(gr, gene) grl ``` - A common paradigm -- `unlist()`, transform, `relist()` ```{r} gr1 <- unlist(grl, use.names=FALSE) gr1$neg.log10.pvalue <- -log10(gr1$p.value) relist(gr1, grl) ``` - Other `*List`, e.g., `NumericList()` `seqinfo()` - Sequences associated with `GRanges` or `GRangesList`, analogous to factor levels. [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges ## [SummarizedExperiment][] ![](our_figures/SummarizedExperiment.png) - Coordinate `assay()` data (e.g., RNAseq expression counts) with information about rows (`rowData()`, e.g., genes) and columns (`colData()`, e.g., samples). Example: 'airway' RNAseq data ```{r} suppressPackageStartupMessages({ library(airway) }) data(airway) airway ``` - Information about samples, and coordinated manipulation ```{r} colData(airway) airway[, airway$dex == "trt"] ``` - Information about regions of interest (genes, including exon coordinates) ```{r} rowRanges(airway) rowData(airway)$p.value <- runif(nrow(airway)) ``` - Access to assay information (matrix of counts of RNAseq reads in each region of interest and sample). ```{r} libSize <- colSums(assay(airway)) libSize airway$libSize <- libSize table(rowSums(assay(airway)) != 0) ``` [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment ## Additional 'Core' Infrastructure BED, GFF, GTF, ... import - [rtracklayer][] `import()` - Also [Biostrings][] (Fasta), [VariantAnnotation][] (VCF), [GenomicAlignments][] (bam), [ShortRead][] (Fastq), ... Gene sets - [GSEABase][] File management - [BiocFileCache][] -- e.g., retrieve remote files to disk, subsequent references read from disk. - [GenomicFiles][] -- iteration and other operations on large files or collections of files (e.g., by-chromosome VCF files). Parallel evaluation - [BiocParallel][] -- cores, clusters, clouds Annotation resources - [biomaRt][], [KEGGREST][], ...: - `org.*` (e.g., [org.Hs.eg.db][]) packages: 6-month snapshots for symbol mapping ```{r} suppressPackageStartupMessages({ library(org.Hs.eg.db) }) rowData(airway)$Symbol <- mapIds( org.Hs.eg.db, rownames(airway), "SYMBOL", "ENSEMBL" ) ``` - `TxDb.*` (e.g., [TxDb.Hsapiens.UCSC.hg38.knownGene][]) packages: gene models; [GenomicFeatures][]: `makeTxDbFrom...` - `BSgenome.*` (e.g., [BSgenome.Hsapiens.UCSC.hg38][] - [AnnotationHub][], [ExperimentHub][] - Ready access to lightly or heavily curated resources ```{r, warning=FALSE, message=FALSE} suppressPackageStartupMessages({ library(AnnotationHub) library(ExperimentHub) }) query(AnnotationHub(), c("Homo sapiens", "gtf", "release-90")) query(AnnotationHub(), "EnsDb") query(ExperimentHub(), "curatedMetagenomic") ``` [rtracklayer]: https://bioconductor.org/packages/rtracklayer [VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation [GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments [ShortRead]: https://bioconductor.org/packages/ShortRead [GSEABase]: https://bioconductor.org/packages/GSEABase [BiocFileCache]: https://bioconductor.org/packages/BiocFileCache [GenomicFiles]: https://bioconductor.org/packages/GenomicFiles [BiocParallel]: https://bioconductor.org/packages/BiocParallel [biomaRt]: https://bioconductor.org/packages/biomaRt [KEGGREST]: https://bioconductor.org/packages/KEGGREST [BSgenome.Hsapiens.UCSC.hg38]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg38 [org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg38.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene [GenomicFeatures]: https://bioconductor.org/packages/GenomicFeatures [AnnotationHub]: https://bioconductor.org/packages/AnnotationHub [ExperimentHub]: https://bioconductor.org/packages/ExperimentHub ## Emerging Area: Very Large Data - [DelayedArray][], [HDF5Array][], [DelayedMatrixStats][] - [SingleCellExperiment][] NB: requires R-devel / Bioc-devel - Library accessing 10x genomics 'million neuron' data set ``` if (!"TENxBrainData" %in% rownames(installed.packages())) biocLite("LTLA/TENxBrainData") # install from github library(TENxBrainData) ``` - Retrieve data from ExperimentHub - Represent 'big' data as DelayedArray (HDF5Array) - Incorporate into SingleCellExperiment ``` se <- TENxBrainData() se ``` ``` ## class: SingleCellExperiment ## dim: 27998 1306127 ## metadata(0): ## assays(1): counts ## rownames: NULL ## rowData names(2): Ensembl Symbol ## colnames(1306127): AAACCTGAGATAGGAG-1 AAACCTGAGCGGCTTC-1 ... ## TTTGTCAGTTAAAGTG-133 TTTGTCATCTGAAAGA-133 ## colData names(4): Barcode Sequence Library Mouse ## reducedDimNames(0): ## spikeNames(0): ``` - Familiar operation on very large data, e.g., subset & calculate ``` libSize <- colSums(assay(se)[, 1:1000]) range(libSize) ``` ``` ## [1] 1453 34233 ``` [DelayedArray]: https://bioconductor.org/packages/DelayedArray [HDF5Array]: https://bioconductor.org/packages/HDF5Array [DelayedMatrixStats]: https://bioconductor.org/packages/DelayedMatrixStats [SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment