Contents

The material in this course requires R version 3.2 and Bioconductor version 3.2

stopifnot(
    getRversion() >= '3.2' && getRversion() < '3.3',
    BiocInstaller::biocVersion() == "3.2"
)

1 What is Bioconductor?

Physically

Conceptually

2 Core principles

2.1 High-throughput analysis needs statistics!

Volume of data

Type of research question

Technological artifacts

2.2 Scientific research needs to be reproducible

2.2.1 A motivating case study

  • Cisplatin-resistant non-small-cell lung cancer gene sets

  • Hsu et al. 2007 J Clin Oncol 25: 4350-4357 retracted

Lessons

  • Record each step of the analysis
  • Coordinated manipulation of feature, sample, and assay data
  • Informative labels on visualizations

2.2.2 How to be reproducible?

  • Use software ‘objects’ that take care of some of the tedious book-keeping
  • Document our analysis in scripts and ‘markdown’ documents

2.2.3 Example: SummarizedExperiment

Underlying data is a matrix

  • Regions of interest (e.g., genes) x samples
  • assay() – e.g., matrix of counts of reads overlapping genes

Include information about rows

  • rowRanges() – gene identifiers, or genomic ranges describing the coordinates of each gene

Include information about columns

  • colData() – describing samples, experimental design, …
library(airway)         # An 'ExperimentData' package...
data(airway)            # ...with a sample data set...
airway                  # ...that is a SummarizedExperiment
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
head(assay(airway))     # contains a matrix of counts
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## ENSG00000000003        679        448        873        408       1138       1047        770
## ENSG00000000005          0          0          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587        799        417
## ENSG00000000457        260        211        263        164        245        331        233
## ENSG00000000460         60         55         40         35         78         63         76
## ENSG00000000938          0          0          2          0          1          0          0
##                 SRR1039521
## ENSG00000000003        572
## ENSG00000000005          0
## ENSG00000000419        508
## ENSG00000000457        229
## ENSG00000000460         60
## ENSG00000000938          0
head(rowRanges(airway)) # information about the genes...
## GRangesList object of length 6:
## $ENSG00000000003 
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames               ranges strand   |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle>   | <integer>     <character>
##    [1]        X [99883667, 99884983]      -   |    667145 ENSE00001459322
##    [2]        X [99885756, 99885863]      -   |    667146 ENSE00000868868
##    [3]        X [99887482, 99887565]      -   |    667147 ENSE00000401072
##    [4]        X [99887538, 99887565]      -   |    667148 ENSE00001849132
##    [5]        X [99888402, 99888536]      -   |    667149 ENSE00003554016
##    ...      ...                  ...    ... ...       ...             ...
##   [13]        X [99890555, 99890743]      -   |    667156 ENSE00003512331
##   [14]        X [99891188, 99891686]      -   |    667158 ENSE00001886883
##   [15]        X [99891605, 99891803]      -   |    667159 ENSE00001855382
##   [16]        X [99891790, 99892101]      -   |    667160 ENSE00001863395
##   [17]        X [99894942, 99894988]      -   |    667161 ENSE00001828996
## 
## ...
## <5 more elements>
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
colData(airway)[, 1:3]  # ...and samples
## DataFrame with 8 rows and 3 columns
##            SampleName     cell      dex
##              <factor> <factor> <factor>
## SRR1039508 GSM1275862   N61311    untrt
## SRR1039509 GSM1275863   N61311      trt
## SRR1039512 GSM1275866  N052611    untrt
## SRR1039513 GSM1275867  N052611      trt
## SRR1039516 GSM1275870  N080611    untrt
## SRR1039517 GSM1275871  N080611      trt
## SRR1039520 GSM1275874  N061011    untrt
## SRR1039521 GSM1275875  N061011      trt
## coordinated subsetting
untrt <- airway[, airway$dex == 'untrt']
head(assay(untrt))
##                 SRR1039508 SRR1039512 SRR1039516 SRR1039520
## ENSG00000000003        679        873       1138        770
## ENSG00000000005          0          0          0          0
## ENSG00000000419        467        621        587        417
## ENSG00000000457        260        263        245        233
## ENSG00000000460         60         40         78         76
## ENSG00000000938          0          2          1          0
colData(untrt)[, 1:3]
## DataFrame with 4 rows and 3 columns
##            SampleName     cell      dex
##              <factor> <factor> <factor>
## SRR1039508 GSM1275862   N61311    untrt
## SRR1039512 GSM1275866  N052611    untrt
## SRR1039516 GSM1275870  N080611    untrt
## SRR1039520 GSM1275874  N061011    untrt

2.3 We can ‘stand on the shoulders of giants’

Packages!

2.4 We should explore our data

Visualization

Inter-operability between packages

Examples (details later)

2.5 Comprehension is more than statistical analysis

Annotation

Case studies

3 Bioconductor’s role in sequence analysis

3.1 Overall work flow

3.1.1 General steps

  1. Experimental design
  • Keep it simple!
  • Replication!
  • Avoid or track batch effects
  1. Wet-lab preparation

  2. High-throughput sequencing

  • Output: FASTQ files of reads and their quality scores

    @ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
    CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
    +
    HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
    @ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
    GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
    +
    DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################
  1. Alignment
  • Many different aligners, some specialized for different purposes
  • Output: BAM files of aligned reads

    ERR127306.7941162       403     chr14   19653689        3       72M             =       19652348        -1413  ...
    ERR127306.22648137      145     chr14   19653692        1       72M             =       19650044        -3720  ...
    
    ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC        *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
    ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG        '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****
    
    ... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:2  CC:Z:chr22      CP:i:16189276   HI:i:0
    ... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:3  CC:Z:=  CP:i:19921600   HI:i:0
  1. Summary
  • e.g., RNA-Seq: count of reads overlapping regions of interest (e.g., genes)
  • e.g., ChIP-Seq: ranges where regulatory elements bind
  • Output: ‘.csv’, BED, or WIG files
  1. Statistical analysis

  2. Comprehension

3.1.2 An example: RNA-Seq differential expression of known genes

More detail later!

Example: ‘airway’ data set used in a later lab

  • Airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used in asthma patients to prevent or reduce inflammation of the airways.
  • Four primary human airway smooth muscle cell lines
  • Each cell line: a control sample and a treated sample. Treatment: 1 micromolar dexamethasone for 18 hours.
  • Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

Steps

  1. Experimental design
  • One covariate: cell line
  • One experimental factor with two levels: control, and treated with dexamethasone

    library(airway)         # An 'ExperimentData' package...
    data(airway)            # ...with a sample data set...
    colData(airway)[, 1:3]  # ...represented as a SummarizedExperiment
    ## DataFrame with 8 rows and 3 columns
    ##            SampleName     cell      dex
    ##              <factor> <factor> <factor>
    ## SRR1039508 GSM1275862   N61311    untrt
    ## SRR1039509 GSM1275863   N61311      trt
    ## SRR1039512 GSM1275866  N052611    untrt
    ## SRR1039513 GSM1275867  N052611      trt
    ## SRR1039516 GSM1275870  N080611    untrt
    ## SRR1039517 GSM1275871  N080611      trt
    ## SRR1039520 GSM1275874  N061011    untrt
    ## SRR1039521 GSM1275875  N061011      trt
  1. Wet-lab preparation

  2. High-throughput sequencing

  • Paired-end reads
  • Output: FASTQ files
  1. Alignment
  • STAR aligner
  • Aligned to Ensembl release 75 of the human reference genome
  • Output: BAM files
  1. Summarization
  • GenomicRanges::summarizeOverlaps()
  • Output: matrix of the count of reads overlapping regions of interest. Each row is a gene. Each column is a sample.

    head(assay(airway))
    ##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
    ## ENSG00000000003        679        448        873        408       1138       1047        770
    ## ENSG00000000005          0          0          0          0          0          0          0
    ## ENSG00000000419        467        515        621        365        587        799        417
    ## ENSG00000000457        260        211        263        164        245        331        233
    ## ENSG00000000460         60         55         40         35         78         63         76
    ## ENSG00000000938          0          0          2          0          1          0          0
    ##                 SRR1039521
    ## ENSG00000000003        572
    ## ENSG00000000005          0
    ## ENSG00000000419        508
    ## ENSG00000000457        229
    ## ENSG00000000460         60
    ## ENSG00000000938          0
  1. Statistical analysis
  • Test each gene for statistical difference between control and treatement group
  • Output: top table of differentially expressed genes. For each gene: ‘log fold change’ describing how large a change occurred, and a test statistic (e.g., adjusted p-value) summarizing statistical evidence for the change

    library(DESeq2)     # package implementing statistical methods
    dds <-              # data and experimental design
        DESeqDataSet(airway, design = ~ cell + dex)
    dds <- DESeq(dds)   # initial analysis
    ## estimating size factors
    ## estimating dispersions
    ## gene-wise dispersion estimates
    ## mean-dispersion relationship
    ## final dispersion estimates
    ## fitting model and testing
    res <- results(dds) # summary results
    ridx <-             # order from largest to smallest absolute log fold change
        order(abs(res$log2FoldChange), decreasing=TRUE)
    res <- res[ridx,]
    head(res)           # top-table
    ## log2 fold change (MAP): dex untrt vs trt 
    ## Wald test p-value: dex untrt vs trt 
    ## DataFrame with 6 rows and 6 columns
    ##                  baseMean log2FoldChange     lfcSE      stat        pvalue          padj
    ##                 <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
    ## ENSG00000179593  67.24305      -4.884729 0.3312024 -14.74847  3.147170e-49  1.031585e-46
    ## ENSG00000109906 385.07103      -4.865899 0.3324555 -14.63624  1.649293e-48  5.126459e-46
    ## ENSG00000152583 997.43977      -4.316100 0.1724125 -25.03357 2.636198e-138 4.752538e-134
    ## ENSG00000250978  56.31819      -4.093661 0.3291518 -12.43700  1.645709e-35  2.798948e-33
    ## ENSG00000163884 561.10717      -4.079127 0.2103817 -19.38917  9.525449e-84  1.073280e-80
    ## ENSG00000168309 159.52692      -3.992793 0.2549089 -15.66361  2.682234e-55  1.239880e-52
  1. Comprehension
  • Visualization

    library(ggplot2)
    ggplot(as.data.frame(res), 
           aes(x=log2FoldChange, y=-10 * log10(pvalue))) +
        geom_point()
    ## Warning: Removed 30633 rows containing missing values (geom_point).

  • From Ensembl gene identifiers to gene symbols, pathways, …

    library(org.Hs.eg.db)
    ensid <- head(rownames(res))
    select(org.Hs.eg.db, ensid, c("SYMBOL", "GENENAME"), "ENSEMBL")
    ## 'select()' returned 1:1 mapping between keys and columns
    ##           ENSEMBL  SYMBOL                                      GENENAME
    ## 1 ENSG00000179593 ALOX15B          arachidonate 15-lipoxygenase, type B
    ## 2 ENSG00000109906  ZBTB16      zinc finger and BTB domain containing 16
    ## 3 ENSG00000152583 SPARCL1                          SPARC-like 1 (hevin)
    ## 4 ENSG00000250978    <NA>                                          <NA>
    ## 5 ENSG00000163884   KLF15                        Kruppel-like factor 15
    ## 6 ENSG00000168309 FAM107A family with sequence similarity 107, member A

3.2 Bioinformatic steps and Bioconductor packages

Alt Sequencing Ecosystem

3.3 Two shiny examples

BAMSpector – display gene models and underlying support across BAM (aligned read) files

app <- system.file(package="BiocUruguay2015", "BAMSpector")
shiny::runApp(app)

MAPlotExplorer – summarize two-group differential expression, including drill-down of individual genes. Based on CSAMA 2015 lab by Andrzej Oles.

app <- system.file(package="BiocUruguay2015", "MAPlotExplorer")
shiny::runApp(app)

Some uses illustrated by these applications

4 Resources

Acknowledgements

4.1 Key references

4.2 sessionInfo()

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux stretch/sid
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.2.3                      RSQLite_1.0.0                          
##  [3] DBI_0.3.1                               ggplot2_1.0.1                          
##  [5] airway_0.103.1                          limma_3.25.18                          
##  [7] DESeq2_1.9.51                           RcppArmadillo_0.6.100.0.0              
##  [9] Rcpp_0.12.1                             BSgenome.Hsapiens.UCSC.hg19_1.4.0      
## [11] BSgenome_1.37.6                         rtracklayer_1.29.28                    
## [13] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.21.33                
## [15] AnnotationDbi_1.31.19                   SummarizedExperiment_0.3.11            
## [17] Biobase_2.29.1                          GenomicRanges_1.21.32                  
## [19] GenomeInfoDb_1.5.16                     microbenchmark_1.4-2                   
## [21] Biostrings_2.37.8                       XVector_0.9.4                          
## [23] IRanges_2.3.26                          S4Vectors_0.7.23                       
## [25] BiocGenerics_0.15.11                    BiocStyle_1.7.9                        
## 
## loaded via a namespace (and not attached):
##  [1] splines_3.2.2            Formula_1.2-1            latticeExtra_0.6-26     
##  [4] Rsamtools_1.21.21        yaml_2.1.13              lattice_0.20-33         
##  [7] digest_0.6.8             RColorBrewer_1.1-2       colorspace_1.2-6        
## [10] sandwich_2.3-4           htmltools_0.2.6          plyr_1.8.3              
## [13] XML_3.98-1.3             biomaRt_2.25.3           genefilter_1.51.1       
## [16] zlibbioc_1.15.0          xtable_1.7-4             mvtnorm_1.0-3           
## [19] scales_0.3.0             BiocParallel_1.3.54      annotate_1.47.4         
## [22] TH.data_1.0-6            nnet_7.3-11              proto_0.3-10            
## [25] survival_2.38-3          magrittr_1.5             evaluate_0.8            
## [28] MASS_7.3-44              foreign_0.8-66           BiocInstaller_1.19.14   
## [31] tools_3.2.2              formatR_1.2.1            multcomp_1.4-1          
## [34] stringr_1.0.0            munsell_0.4.2            locfit_1.5-9.1          
## [37] cluster_2.0.3            lambda.r_1.1.7           futile.logger_1.4.1     
## [40] grid_3.2.2               RCurl_1.95-4.7           labeling_0.3            
## [43] bitops_1.0-6             rmarkdown_0.8.1          gtable_0.1.2            
## [46] codetools_0.2-14         reshape2_1.4.1           GenomicAlignments_1.5.18
## [49] gridExtra_2.0.0          zoo_1.7-12               knitr_1.11              
## [52] Hmisc_3.17-0             futile.options_1.0.0     stringi_0.5-5           
## [55] geneplotter_1.47.0       rpart_4.1-10             acepack_1.3-3.3