Fun and easy things to do with Bioconductor sequences, reads, ranges, and alignments. * Biostrings / BSgenome? ** Exome GC content and other simple metrics - Start with GRanges of all known genes - Use Views() or getSeq() to retrieve sequence - Use alphabetFrequency() or letterFrequency() to calculate GC content ** Binding motifs in 5' regions - Start with GRanges of 5' regions, 'position weight matrix' describing binding motif - Use getSeq() to get sequences of 5' regions - Use matchPWM to identify motif binding sites ** Consequences of variants for amino acid sequence - Use GenomicFeatures::extractTranscripts to assemble transcript sequences from exon coordinates - Use translate() to obtain amino acid sequences - Use subseq() <- ... to sequence to reflect variants - Use translate() to obtain variant amino acid sequences * GRanges? ** Intronic locations - Start with exons-by-gene GRangesList - Use gaps() to find intron coordinates ** Regions upstream of transcription start sites (TSS) - Start with exons-by-transcripts GRangesList - Use range() & start() to identify TSS - Use flank() to identify upstream (5') regions - Use BSgenome (or FaFile) & getSeq() to retrieve DNA sequence ** ChIP-seq nearest genes - Start with ChIP peaks GRanges, gene GRanges - Use nearest() (or precede() / follow()) ** Count ChIP peaks overlapping ENCODE regulatory region(s) - Start with ChIP GRanges 'chip', ENCODE track GRanges 'encode' - Use countOverlaps() or findOverlaps() e.g., how many peaks overlap ENCODE track? sum(countOverlaps(chip, encode) != 0) ** Mapping 'disjoint' exons to actual exons * GAlignments? ** Coverage - Use coverage(BamFile("my.bam")) ** Count reads overlapping genes - Use summarizeOverlaps(BamFileList(), ...) - memory efficient, parallel - Use count/findOverlaps for ad hoc schemes ** Identify reads whose mates map to different chromosomes - Use sortBam() to sort BAM file by qname - Use BamFile() and readGAlignments(..., obeyQname=TRUE) - ? and ScanBamParam(flag=scanBamFlag(isProperPair=TRUE)) - Split seqnames by rname, and determine unique length readnms <- split(seqnames, rname) mapto <- sapply(readnms, function(x) length(unique(x))) readnms[mapto != 1] - Clever way to do this operation without using split?