Common Sequence Analysis Work Flows
Martin Morgan, Sonali Arora
February 3, 2015
See the lecture notes and lab.
RNA-seq differential expression of known genes
- Simplest scenario
- Experimental design: simple, replicated; track covariates and be
aware of batch effects
- Sequencing: moderate length and number of reads; single or
paired-end (though probably paired-end).
- Alignment: basic splice-aware aligner, e.g., Bowtie2,
STAR. Viable Bioconductor approaches: Rsubread,
Rbowtie (especially via the QuasR package).
- Reduction: GenomicAlignments::summarizeOverlaps() or external tools,
using gene models from
TxDb.* packages or GFF / GTF files. End
result: matrix of counts.
- Analysis: DESeq2, edgeR, and
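
To make the "matrix of counts" concrete, here is a minimal base-R sketch of the counting idea: each read is assigned to a gene only if it overlaps exactly one gene. The gene coordinates, reads, and the helper `count_reads()` are hypothetical and for illustration only; on real BAM files this is the job of summarizeOverlaps() or an external counting tool, and this toy is not their implementation.

```r
# Hypothetical gene models: 1-based, closed intervals on one chromosome
genes <- data.frame(
  gene  = c("geneA", "geneB"),
  start = c(100, 500),
  end   = c(300, 800),
  stringsAsFactors = FALSE
)

# Hypothetical aligned reads from two samples
reads <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2"),
  start  = c(120, 290, 600, 510, 900),
  end    = c(170, 340, 650, 560, 950),
  stringsAsFactors = FALSE
)

# Count a read for a gene only when it overlaps exactly one gene;
# reads overlapping no gene or several genes are not counted
count_reads <- function(reads, genes) {
  samples <- unique(reads$sample)
  counts <- matrix(0L, nrow = nrow(genes), ncol = length(samples),
                   dimnames = list(genes$gene, samples))
  for (i in seq_len(nrow(reads))) {
    hits <- which(reads$start[i] <= genes$end & reads$end[i] >= genes$start)
    if (length(hits) == 1L) {
      s <- reads$sample[i]
      counts[hits, s] <- counts[hits, s] + 1L
    }
  }
  counts
}

counts <- count_reads(reads, genes)
print(counts)
```

The result has one row per gene and one column per sample, which is the shape of input the count-based analysis packages expect.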
RNA-seq differential expression of known and novel transcripts
- Popular non-R work flow: Bowtie2, tophat, cufflinks, cuffdiff.
- DEXSeq: differential exon use.
- Rsubread::subjunc() for aligning without requiring known gene models.
- cummeRbund: working with cufflinks output.
See my recent
outlining ChIP-seq and relevant Bioconductor software.
- Experimental design / wet lab: important to effectively enrich
genomic DNA via ChIP, otherwise hard to distinguish signal peaks from background
- Sequencing: single-end reads of moderate length and number are adequate.
- Alignment: basic aligners are sufficient.
- External software; many tools depending on application, e.g., MACS.
- Product: BED and / or WIG files of called peaks
Analysis & comprehension
- ChIPQC for quality control.
- rtracklayer to input BED and WIG files to
standard Bioconductor data structures.
- ChIPpeakAnno, ChIPXpress for
annotating peaks in relation to genes.
- DiffBind to assess differential representation of
peaks in a designed experiment.
- AnnotationHub for accessing (some)
consortium-level summary data.
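
As a rough illustration of reading called peaks and relating them to genes, the base-R sketch below parses two BED lines and reports the nearest transcription start site for each peak. The peaks, the TSS table, and the helper `nearest_gene()` are hypothetical. In practice rtracklayer's import() and the annotation packages listed above do this on standard Bioconductor objects. One detail worth keeping in mind is that BED start coordinates are 0-based, whereas Bioconductor ranges are 1-based.

```r
# Two hypothetical called peaks in BED format (0-based, half-open)
bed_lines <- c("chr1\t999\t1200\tpeak1",
               "chr1\t4999\t5300\tpeak2")
peaks <- read.table(text = bed_lines, sep = "\t",
                    col.names = c("chrom", "start0", "end", "name"),
                    stringsAsFactors = FALSE)

# Convert the 0-based BED start to a 1-based start; the end is unchanged
peaks$start <- peaks$start0 + 1L
peaks$mid <- (peaks$start + peaks$end) %/% 2

# Hypothetical transcription start sites
tss <- data.frame(gene  = c("geneA", "geneB"),
                  chrom = c("chr1", "chr1"),
                  pos   = c(1500, 5200),
                  stringsAsFactors = FALSE)

# For each peak, find the closest TSS on the same chromosome
nearest_gene <- function(peaks, tss) {
  idx <- vapply(seq_len(nrow(peaks)), function(i) {
    same_chrom <- which(tss$chrom == peaks$chrom[i])
    if (length(same_chrom) == 0L) return(NA_integer_)
    d <- abs(tss$pos[same_chrom] - peaks$mid[i])
    same_chrom[which.min(d)]
  }, integer(1))
  data.frame(peak = peaks$name,
             gene = tss$gene[idx],
             distance = abs(tss$pos[idx] - peaks$mid),
             stringsAsFactors = FALSE)
}

annotated <- nearest_gene(peaks, tss)
print(annotated)
```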
- Duplications or deletions larger than 1 kb
- Germ line (primarily diploid genome, homogeneous sample, integer
copy numbers) or somatic variants?
- Tumor / normal pairs?
- Bin and count. GC and other (e.g., exon length) correction. Easily
and efficiently done with, e.g.,
- Segment – circular binary segmentation (often via
Analysis & comprehension
- 45 packages tagged with “CopyNumberVariation” in
biocViews; also terms “DNASeq”, “ExomeSeq”, “WholeGenome”
- Represent duplicated regions as genomic ranges; integrates very
easily in Bioconductor annotation work flows.
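
The bin-and-count idea, and the representation of duplicated regions as ranges, can be sketched in base R as follows. Everything here is a simplifying assumption: the read positions are hypothetical, GC correction is omitted, and a fixed log2-ratio threshold stands in for proper segmentation (this is not circular binary segmentation).

```r
bin_width <- 1000
n_bins <- 10

# Count read start positions (1-based) in fixed-width bins
bin_counts <- function(pos, bin_width, n_bins) {
  bin <- (pos - 1) %/% bin_width + 1
  tabulate(bin, nbins = n_bins)
}

# Hypothetical read starts: normal has 10 reads per bin,
# tumor has twice as many reads in bins 4 to 6
bin_mid <- (seq_len(n_bins) - 1) * bin_width + 500
normal_pos <- rep(bin_mid, each = 10)
tumor_pos  <- rep(bin_mid, times = c(10, 10, 10, 20, 20, 20, 10, 10, 10, 10))

normal <- bin_counts(normal_pos, bin_width, n_bins)
tumor  <- bin_counts(tumor_pos, bin_width, n_bins)

# Tumor / normal ratio, centered on the median bin, on the log2 scale
ratio <- tumor / normal
log2_ratio <- log2(ratio / median(ratio))

# Crude stand-in for segmentation: runs of bins above a fixed threshold
gain <- log2_ratio > 0.5
runs <- rle(gain)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Report gained runs as 1-based genomic ranges
gained <- data.frame(
  start = (run_start[runs$values] - 1) * bin_width + 1,
  end   = run_end[runs$values] * bin_width
)
print(gained)
```

The resulting start / end table is the kind of range-based summary that can then be overlapped with gene annotations.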
See Michael Lawrence's variant calling with
and Val Obenchain's manipulation and annotation of called variants with
- Sequencing: requires high-quality reads with high per-nucleotide
depth of coverage – longer, paired-end sequencing.
Alignment: requires effective aligners; BWA, GMAP, …
- gmapR wraps the GMAP aligner in R.
Reduction: typically to VCF files summarizing variants and / or
population-level variation. GATK and other non-R tools commonly
- VariantTools includes facilities for calling
- h5vc targets a different intermediate step:
summarize base counts at each position in the genome; use this
as a starting point for calling variants, and to evaluate false
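
The idea of summarizing base counts at each position can be illustrated with a small base-R tally. This is only a sketch under strong assumptions: the reference, the reads, and the helper `tally_bases()` are hypothetical, alignments are ungapped, and no h5vc functions or HDF5 storage are used.

```r
bases <- c("A", "C", "G", "T")
reference <- "ACGTACGTAC"

# Hypothetical ungapped alignments: 1-based start and read sequence
reads <- data.frame(
  start = c(1, 3, 4, 5),
  seq   = c("ACGTA", "GTTCG", "TTCGT", "ACGTAC"),
  stringsAsFactors = FALSE
)

# Build a 4 x length matrix of per-position base counts
tally_bases <- function(reads, ref_length) {
  tally <- matrix(0L, nrow = length(bases), ncol = ref_length,
                  dimnames = list(bases, NULL))
  for (i in seq_len(nrow(reads))) {
    read_bases <- strsplit(reads$seq[i], "")[[1]]
    pos <- reads$start[i] + seq_along(read_bases) - 1
    keep <- pos >= 1 & pos <= ref_length & read_bases %in% bases
    idx <- cbind(match(read_bases[keep], bases), pos[keep])
    tally[idx] <- tally[idx] + 1L
  }
  tally
}

tally <- tally_bases(reads, nchar(reference))

# Fraction of non-reference bases at each position
ref_bases <- strsplit(reference, "")[[1]]
depth <- colSums(tally)
ref_count <- tally[cbind(match(ref_bases, bases), seq_along(ref_bases))]
alt_fraction <- ifelse(depth > 0, 1 - ref_count / depth, NA)

# Positions worth a closer look (the threshold is arbitrary)
candidates <- which(alt_fraction > 0.2)
print(tally)
print(candidates)
```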
Analysis & comprehension
See the short
centered around Illumina 450k methylation arrays and the minfi package.
- Analysis & comprehension: bsseq, BiSeq
for processing and analysis; bumphunter as basic tool
for identifying CpG features.
- Experimental design: typically population-level surveys with
moderate numbers (10s to 100s) of samples.
- Wet lab & sequencing: often target phylogenetically-informative
genes, requiring longer (overlapping) paired-end reads. Many
existing studies used 454 technology, which has a different
sequencing error model than Illumina (e.g., errors are concentrated
in homopolymer runs, rather than in declining base quality toward
the end of the read).
Reduction: Pre-processing (e.g., knitting together overlapping
paired-end reads) and taxonomic classification / placement in
third-party software, e.g., QIIME, pplacer. End result: count
table summarizing representation of distinct taxa in each sample.
- rRDP provides an R / Bioconductor interface
to the RDP classifier.
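
Knitting together overlapping paired-end reads can be sketched in base R as below. The helpers `revcomp()` and `merge_pair()` and the example fragment are hypothetical, and the sketch requires an exact match in the overlap. Real pre-processing tools tolerate mismatches and use base qualities, so treat this as an illustration of the idea rather than a usable merger.

```r
# Reverse complement of a DNA string
revcomp <- function(x) {
  chartr("ACGT", "TGCA", paste(rev(strsplit(x, "")[[1]]), collapse = ""))
}

# Merge a forward read with its mate by finding the longest exact
# overlap between the end of the forward read and the start of the
# reverse-complemented mate; returns NA if no overlap is long enough
merge_pair <- function(fwd, rev_read, min_overlap = 5) {
  rc <- revcomp(rev_read)
  max_len <- min(nchar(fwd), nchar(rc))
  if (max_len < min_overlap) return(NA_character_)
  for (k in seq(max_len, min_overlap)) {   # longest overlap first
    fwd_tail <- substr(fwd, nchar(fwd) - k + 1, nchar(fwd))
    if (fwd_tail == substr(rc, 1, k)) {
      return(paste0(fwd, substr(rc, k + 1, nchar(rc))))
    }
  }
  NA_character_
}

# Hypothetical 20-nt fragment sequenced with two 13-nt reads
fragment <- "ACGTTGCAAGGCTTAACCGG"
fwd <- substr(fragment, 1, 13)
rev_read <- revcomp(substr(fragment, 8, 20))

merged <- merge_pair(fwd, rev_read)
print(merged)
```

Here the two reads share a 6-nt overlap, so the merged sequence recovers the full fragment.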
Analysis: R / Bioconductor and many insights from microarray /
RNA-seq analysis well suited to count table, but common pipelines
have re- or dis-invented the wheel.
- phyloseq provides very nice tools for general