Contents

1 Project Overview

Analysis and comprehension of high-throughput genomic data

Packages, vignettes, work flows

Package installation and use

2 High-throughput Sequence Analysis

2.1 Overall Work Flow

  1. Experimental design
    • Keep it simple, e.g., ‘control’ and ‘treatment’ groups
    • Replicate within treatments!
  2. Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)

    • Record covariates, including processing day – likely ‘batch effects’
  3. (Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517)

  4. Alignment
    • Choose an aligner to match the task, e.g., Rsubread and Bowtie2 are good for ChIP-seq and some forms of RNA-seq; BWA and GMAP are better for variant calling
    • Primary output: BAM files of aligned reads
    • More recently: kallisto and similar programs that produce tables of reads aligned to transcripts
  5. Reduction
    • e.g., RNA-seq ‘count table’ (simple spreadsheets), DNA-seq called variants (VCF files), ChIP-seq peaks (BED, WIG files)
  6. Analysis
    • Differential expression, peak identification, …
  7. Comprehension
    • Biological context

(Figure: sequencing ecosystem)

3 High-Throughput Sequence Data Types

3.1 Sequencing data types

Sequenced reads: FASTQ files

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################
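
Each FASTQ record spans four lines: an identifier starting with `@`, the base calls, a `+` separator, and one quality character per base. A minimal sketch of reading the first record shown above follows. It assumes the qualities use the Phred+33 encoding, which is consistent with the characters in the excerpt (a +64 offset would give negative scores for `#` and `*`). The function name `parse_fastq_record` is illustrative only, not part of any package.

```python
# First record from the excerpt above, one list element per line.
record = [
    "@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1",
    "CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT",
    "+",
    "HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:",
]


def parse_fastq_record(lines, offset=33):
    """Split a 4-line FASTQ record and decode its quality string.

    `offset` is the ASCII offset of the quality encoding
    (assumed here to be 33, i.e. Phred+33).
    """
    header, sequence, separator, quality = lines
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # One integer quality score per base call.
    scores = [ord(ch) - offset for ch in quality]
    return {
        "id": header[1:].split()[0],
        "sequence": sequence,
        "scores": scores,
    }


fastq = parse_fastq_record(record)
print(fastq["id"], len(fastq["sequence"]), min(fastq["scores"]), max(fastq["scores"]))
```

Scores tend to fall toward the 3' end of a read; the run of `#` characters at the end of the second record corresponds to the lowest score under this encoding.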

Aligned reads: BAM files
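
BAM is the compressed binary form of the text SAM format, in which each alignment is one tab-delimited line with 11 mandatory fields. The sketch below parses one such line and decodes two bits of the FLAG field. The record is hypothetical, made up for illustration, and `parse_sam_line` is not a library function; real BAM files need a dedicated reader.

```python
# The 11 mandatory SAM fields, in order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]

# Hypothetical alignment record (not taken from a real file).
sam_line = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"


def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    rec = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[key] = int(rec[key])
    # FLAG is a bit field: 0x4 = read unmapped, 0x10 = reverse strand.
    rec["is_unmapped"] = bool(rec["FLAG"] & 0x4)
    rec["is_reverse"] = bool(rec["FLAG"] & 0x10)
    return rec


aln = parse_sam_line(sam_line)
print(aln["RNAME"], aln["POS"], aln["CIGAR"], aln["is_reverse"])
```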

Called variants: VCF files
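
A VCF data line has eight fixed tab-delimited columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by per-sample genotype columns. The following sketch parses the fixed columns of one line. The variant record is hypothetical and `parse_vcf_line` is an illustrative helper, not an existing API.

```python
# The eight fixed VCF columns, in order.
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Hypothetical variant record (not taken from a real file).
vcf_line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=20;DB"


def parse_vcf_line(line):
    """Parse the fixed columns of one VCF data line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(VCF_FIELDS):
        raise ValueError("expected at least 8 tab-separated fields")
    rec = dict(zip(VCF_FIELDS, parts[:8]))
    rec["POS"] = int(rec["POS"])
    # Multiple alternate alleles are comma-separated.
    rec["ALT"] = rec["ALT"].split(",")
    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info = {}
    if rec["INFO"] != ".":
        for item in rec["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True
    rec["INFO"] = info
    return rec


variant = parse_vcf_line(vcf_line)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], variant["INFO"])
```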

Genome annotations: BED, WIG, GTF, etc. files. E.g., GTF:
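
A GTF record is a single tab-delimited line with nine columns; the last column holds `key "value";` attributes such as `gene_id` and `transcript_id`. The sketch below parses one record. The annotation line is hypothetical, and `parse_gtf_line` is an illustrative helper rather than a function from any package.

```python
# The nine GTF columns, in order.
GTF_FIELDS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]

# Hypothetical exon record (not taken from a real annotation).
gtf_line = (
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\t"
    'gene_id "geneA"; transcript_id "txA.1";'
)


def parse_gtf_line(line):
    """Parse one GTF line into a dict, expanding the attribute column."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(GTF_FIELDS):
        raise ValueError("expected 9 tab-separated fields")
    rec = dict(zip(GTF_FIELDS, parts))
    rec["start"] = int(rec["start"])
    rec["end"] = int(rec["end"])
    attrs = {}
    for item in rec["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    rec["attribute"] = attrs
    # GTF coordinates are 1-based and inclusive at both ends.
    rec["width"] = rec["end"] - rec["start"] + 1
    return rec


exon = parse_gtf_line(gtf_line)
print(exon["feature"], exon["attribute"]["gene_id"], exon["width"])
```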

Derived results, e.g., ‘count’ tables (.csv files) for RNA-seq differential expression.
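
A count table is typically laid out with one row per gene and one column per sample. As a rough sketch, the code below reads such a table from CSV text and sums each column to get the total reads counted per sample (the library size). The gene names, sample names, and counts are invented for illustration, and the layout is an assumption rather than a fixed standard.

```python
import csv
import io

# Hypothetical count table: rows are genes, columns are samples.
counts_csv = """gene,control_1,control_2,treatment_1,treatment_2
geneA,10,12,30,28
geneB,0,1,0,2
geneC,100,90,95,105
"""


def read_count_table(text):
    """Return (sample names, {gene: [count per sample]}) from CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    samples = header[1:]
    counts = {}
    for row in reader:
        if not row:
            continue
        counts[row[0]] = [int(value) for value in row[1:]]
    return samples, counts


samples, counts = read_count_table(counts_csv)

# Library size: total reads counted in each sample.
library_sizes = {
    sample: sum(row[i] for row in counts.values())
    for i, sample in enumerate(samples)
}
print(library_sizes)
```

Library sizes usually differ between samples, which is one reason raw counts are normalized before testing for differential expression.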

3.2 Major Bioconductor Packages

Bioconductor Objects