Supplement 1: RNA-Seq Workflow

Roswell Park Cancer Institute, Buffalo, NY

5 - 9 October, 2015

The material in this course requires R version 3.2 and Bioconductor version 3.2

```
## Verify the R and Bioconductor versions required by this course
stopifnot(
    getRversion() >= '3.2' && getRversion() < '3.3',
    BiocInstaller::biocVersion() == "3.2"
)
```

Keep it simple

- Classical experimental designs
- Time series
- Without missing values, where possible
- Intended analysis must be feasible – can the available samples and hypothesis of interest be combined to formulate a testable statistical hypothesis?

Replicate

- Extent of replication determines nuance of biological question.
- No replication (1 sample per treatment): qualitative description with limited statistical options.
- 3-5 replicates per treatment: designed experimental manipulation with cell lines or other well-defined entities; 2-fold (?) change in average expression between groups.
- 10-50 replicates per treatment: population studies, e.g., cancer cell lines.
- 1000’s of replicates: prospective studies, e.g., SNP discovery
- One resource: the *RNASeqPower* package (see the sketch below)
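
A minimal power-calculation sketch with *RNASeqPower*; the depth, coefficient of variation, effect sizes, alpha, and power below are illustrative values, not recommendations:

```
## Solve for the per-group sample size n at a given sequencing depth,
## within-group coefficient of variation (cv), fold change (effect),
## alpha, and power; all values are illustrative only
library(RNASeqPower)
rnapower(depth = 20, cv = 0.4, effect = c(1.5, 2),
         alpha = 0.05, power = 0.8)
```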

Avoid confounding experimental factors with other factors

- Common problems: samples from one treatment all on the same flow cell; samples from treatment 1 processed first, treatment 2 processed second, etc.

Record covariates

Be aware of *batch effects*

Known

- Phenotypic covariates, e.g., age, gender
- Experimental covariates, e.g., lab or date of processing
- Incorporate into linear model, at least approximately (see the sketch below)
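
A minimal sketch of the 'incorporate into linear model' step; `pheno`, with columns `treatment` and `batch`, is an assumed placeholder data.frame:

```
## Known covariates enter the design matrix alongside the factor of
## interest; `pheno` is an assumed data.frame with these columns
design <- model.matrix(~ treatment + batch, data = pheno)
```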

Unknown

- Or just unexpected / undetected
- Characterize using, e.g., the *sva* package.

Surrogate variable analysis

- Leek et al., 2010, Nature Reviews Genetics 11:733-739; Leek & Storey, 2007, PLoS Genetics 3(9): e161.
- Scientific finding: pervasive batch effects
- Statistical insights: surrogate variable analysis: identify and build surrogate variables; remove known batch effects
- Benefits: reduce dependence, stabilize error rate estimates, and improve reproducibility
- *ComBat* software / *sva* *Bioconductor* package (sketch below)
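
A minimal surrogate variable analysis sketch with *sva*; `expr` (a genes x samples expression matrix) and `pheno` (phenotype data) are assumed placeholders:

```
## Identify surrogate variables and carry them into the design;
## `expr` and `pheno` are assumed objects
library(sva)
mod  <- model.matrix(~ treatment, data = pheno)  # full model
mod0 <- model.matrix(~ 1, data = pheno)          # null model
svobj <- sva(expr, mod, mod0)                    # estimate surrogate variables
modSv <- cbind(mod, svobj$sv)                    # augmented design for downstream tests
```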

HapMap samples from one facility, ordered by date of processing.

Confounding factors

- Record or avoid

Artifacts of your *particular* protocols

- Sequence contaminants
- Enrichment bias, e.g., non-uniform transcript representation.
- PCR artifacts – adapter contaminants, sequence-specific amplification bias, …

Axes of variation

- Single- versus paired-end
- Length: 50-200nt
- Number of reads per sample

Application-specific, e.g.,

- ChIP-seq: short, single-end reads are usually sufficient
- RNA-seq, known genes: single- or paired-end reads
- RNA-seq, transcripts or novel variants: paired-end reads
- Copy number: single- or paired-end reads
- Structural variants: paired-end reads
- Variants: depth via longer, paired-end reads
- Microbiome: long paired-end reads (overlapping ends)

Alignment strategies

- *de novo*
  - No reference genome; considerable sequencing and computational resources
- Genome
  - Established reference genome
  - Splice-aware aligners
  - Novel transcript discovery
- Transcriptome
  - Established reference genome; reliable gene model
  - Simple aligners
  - Known gene / transcript expression

Splice-aware aligners (and *Bioconductor* wrappers)

- Use known gene model to count aligned reads overlapping regions of interest / gene models
- Gene model can be public (e.g., UCSC, NCBI, ENSEMBL) or *ad hoc* (GFF file)
- `GenomicAlignments::summarizeOverlaps()` (sketch below)
- `Rsubread::featureCounts()`
- HTSeq, htseq-count
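
A minimal counting sketch with `summarizeOverlaps()`; the TxDb annotation package and BAM file names are assumed placeholders:

```
## Count reads overlapping exons, grouped by gene; the TxDb package and
## BAM file names are placeholders
library(GenomicAlignments)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")
fls <- BamFileList(c("sample1.bam", "sample2.bam"), yieldSize = 1000000)
olaps <- summarizeOverlaps(exByGn, fls, mode = "Union", ignore.strand = TRUE)
head(assay(olaps))    # gene x sample count table
```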

- TopHat uses Bowtie2 to perform basic single- and paired-end alignments, then uses algorithms to place difficult-to-align reads near their well-aligned mates.
- Cufflinks takes *TopHat* output and estimates existing and novel transcript abundance ("How Cufflinks Works").
- Cuffdiff assesses statistical significance of estimated abundances between experimental groups.
- RSEM includes de novo assembly and quantification.

- ‘Next generation’ differential expression tools; transcriptome alignment
- E.g., kallisto takes a radically different approach: from FASTQ to count table without BAM files.
- Very fast, almost as accurate.

Unique statistical aspects

- Large data, few samples
- Comparison of each gene, across samples; *univariate* measures
- Each gene is analyzed by the *same* experimental design, under the *same* null hypothesis

Summarization

- Counts *per se*, rather than a summary (RPKM, FPKM, …), are relevant for analysis
- For a given gene, larger counts imply more information; RPKM etc. treat all estimates as equally informative
- Comparison is across samples at *each* region of interest; all samples have the same region of interest, so modulo library size there is no need to correct for, e.g., gene length or mappability

Normalization

- Libraries differ in size (total counted reads per sample) for uninteresting reasons; we need to account for differences in library size in statistical analysis.
- Total number of counted reads per sample is *not* a good estimate of library size. It is unnecessarily influenced by regions with large counts, and can introduce bias and correlation across genes. Instead, use a robust measure of library size that takes account of skew in the distribution of counts (simplest: trimmed geometric mean; more advanced / appropriate measures are encountered in the lab).
- Library size (total number of counted reads) differs between samples, and should be included *as a statistical offset* in the analysis of differential expression (see the sketch below), rather than 'dividing by' the library size early in an analysis.
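
A minimal sketch of the offset idea in a negative binomial GLM, which is essentially what *edgeR* and *DESeq2* do internally; `y`, `treatment`, and `libsize` are assumed placeholder vectors:

```
## Library size enters as a log-offset rather than by dividing counts;
## `y` (counts for one gene), `treatment`, `libsize` are assumed vectors
library(MASS)
fit <- glm.nb(y ~ treatment + offset(log(libsize)))
summary(fit)
```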

Appropriate error model

- Count data is *not* distributed normally or as a Poisson process, but rather as negative binomial (simulated below).
- Result of combining Poisson 'shot' noise (i.e., within-sample technical and sampling variation in read counts) with variation between biological samples.
- A negative binomial model requires estimation of an additional parameter ('dispersion'), which is estimated poorly in small samples.
- Basic strategy is to moderate per-gene estimates with more robust local estimates derived from genes with similar expression values (a little more on borrowing information is provided below).
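
A quick simulation of the negative binomial variance, mu + phi * mu^2, versus the Poisson variance mu; the values of `mu` and `phi` are arbitrary:

```
## Negative binomial variance is mu + phi * mu^2 (phi = dispersion),
## larger than the Poisson variance mu; values are arbitrary
mu <- 100; phi <- 0.1
pois <- rpois(10000, lambda = mu)
nbin <- rnbinom(10000, mu = mu, size = 1 / phi)
c(poisson = var(pois), negbinom = var(nbin), expected = mu + phi * mu^2)
```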

Pre-filtering

- Naively, a statistical test (e.g., t-test) could be applied to each row of a counts table. However, we have relatively few samples (10's) and very many comparisons (10,000's), so a naive approach is likely to be very underpowered, resulting in a very high *false discovery rate*.
- A simple approach is to perform fewer tests by removing regions that could not possibly result in statistical significance, regardless of the hypothesis under consideration.
- Example: a region with 0 counts in all samples could not possibly be significant regardless of hypothesis, so exclude it from further analysis.
- Basic approaches (sketched below): 'K over A'-style filter – require a minimum of A (normalized) read counts in at least K samples. Variance filter, e.g., IQR (inter-quartile range) provides a robust estimate of variability; can be used to rank and discard least-varying regions.
- More nuanced approaches: *edgeR* vignette; workflow today.
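
A minimal 'K over A' filter sketch; `counts` is an assumed count matrix, and A = 1 count-per-million in K = 3 samples are illustrative thresholds:

```
## Keep rows with at least A counts-per-million in at least K samples;
## thresholds are illustrative, `counts` is an assumed matrix
library(edgeR)
keep <- rowSums(cpm(counts) >= 1) >= 3
counts <- counts[keep, ]
```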

Borrowing information

- Why does low statistical power elevate false discovery rate?
- One way of developing intuition is to recognize a t-test (for example) as a ratio of variances. The numerator is treatment-specific, but the denominator is a measure of overall variability.
- Variances are measured with uncertainty; over- or under-estimating the denominator variance has an asymmetric effect on a t-statistic or similar ratio, with an underestimate *inflating* the statistic more dramatically than an overestimate deflates it; hence the elevated false discovery rate (a numerical illustration follows this list).
- Under the typical null hypothesis used in microarray or RNA-seq experiments, each gene may respond differently to the treatment (numerator variance), but the overall variability of a gene is the same, at least for genes with similar average expression.
- The strategy is to estimate the denominator variance as the within-group variance for the gene, *moderated* by the average within-group variance across all genes.
- This strategy exploits the fact that the same experimental design has been applied to all genes assayed, and is effective at moderating the false discovery rate.
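
The numerical illustration of the asymmetry: a t-like ratio is convex in its denominator, so an underestimate inflates the statistic more than an overestimate deflates it:

```
## Same numerator; denominator under- and over-estimated by 50%:
## inflation (4 vs 2) exceeds deflation (1.33 vs 2)
num <- 2; denom <- 1
c(under = num / (0.5 * denom), true = num / denom, over = num / (1.5 * denom))
```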

*DESeq2* `estimateSizeFactors()`, Anders and Huber, 2010

- For each gene: compute the geometric mean of counts across all samples.
- For each sample: the size factor is the median, across genes, of the ratio of the sample's count to that geometric mean.
- Functions other than the median can be used; control genes can be used instead (a by-hand sketch below).
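
A by-hand sketch of the median-of-ratios calculation (in practice, use `estimateSizeFactors()`); `counts` is an assumed gene x sample matrix of raw counts:

```
## Median-of-ratios size factors by hand; `counts` is assumed
geoMeans <- exp(rowMeans(log(counts)))  # per-gene geometric mean (0 if any zero)
use <- geoMeans > 0                     # skip genes with any zero count
sizeFactors <- apply(counts[use, ], 2,
                     function(x) median(x / geoMeans[use]))
```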

*edgeR* `calcNormFactors()`, TMM method of Robinson and Oshlack, 2010

- Identify reference sample: library with upper quartile closest to the mean upper quartile of all libraries
- Calculate M-value of each gene (log-fold change relative to reference)
- Summarize library size as weighted trimmed mean of M-values (sketch below)
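
A minimal TMM sketch; `counts` (matrix) and `group` (factor) are assumed placeholders:

```
## TMM normalization; `counts` and `group` are assumed objects
library(edgeR)
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)   # TMM is the default method
dge$samples                   # library sizes and normalization factors
```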

*DESeq2* `estimateDispersions()`

- Estimate per-gene dispersion
- Fit a smoothed relationship between dispersion and abundance (sketch below)
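
A typical *DESeq2* call sequence, sketched; `counts` (integer matrix), `coldata`, and its `treatment` column are assumed placeholders:

```
## Size factors, then dispersions; `counts` and `coldata` are assumed
library(DESeq2)
dds <- DESeqDataSetFromMatrix(counts, colData = coldata,
                              design = ~ treatment)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
plotDispEsts(dds)    # per-gene estimates and fitted trend
```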

*edgeR* `estimateDisp()`

- Common: single dispersion for all genes; appropriate for small experiments (<10? samples)
- Tagwise: a separate dispersion for each gene; appropriate for larger / well-behaved experiments
- Trended: bin genes by abundance, estimate common dispersion within each bin, fit a loess-smoothed relationship between binned dispersion and abundance (sketch below)
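
A minimal *edgeR* dispersion sketch, continuing from the `DGEList` above; `group` is an assumed factor:

```
## estimateDisp() returns common, trended, and tagwise estimates in one call
design <- model.matrix(~ group)
dge <- estimateDisp(dge, design)
plotBCV(dge)    # biological coefficient of variation vs abundance
```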

Placing differentially expressed regions in context

- Gene names associated with genomic ranges
- Gene set enrichment and similar analysis
- Proximity to regulatory marks
- Integrate with other analyses, e.g., methylation, copy number, variants, …

Correlation between genomic copy number and mRNA expression identified 38 mis-labeled samples in the TCGA ovarian cancer Affymetrix microarray dataset.