1 Splice event detection and quantification from RNA-seq data with SGSeq

Leonard D Goldstein [1, 2]

[1] Department of Bioinformatics and Computational Biology,

[2] Department of Molecular Biology,

Genentech, Inc., South San Francisco, CA, USA.

1.1 Abstract

The SGSeq package provides a framework for analyzing annotated and novel splice events from RNA-seq data. SGSeq predicts exons and splice junctions from reads aligned against a reference genome and assembles them into a genome-wide splice graph. Splice events are identified from the graph and quantified using reads spanning event boundaries. This workshop provides an introduction to SGSeq functionality, including splice event detection, quantification, annotation and visualization. The first part of the workshop illustrates a complete SGSeq workflow for analyzing a gene of interest starting from BAM files. The second part of the workshop covers exercises based on a genomewide data set previously processed with SGSeq.

1.2 Preliminaries

library(SGSeq)

The first part of this workshop illustrates an analysis of paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in [Seshagiri et al. 2012] (#seshagiri). For the purpose of this vignette, we created BAM files that only include reads mapping to a single gene of interest (FBXO31).

When starting a new project, SGSeq requires information on the samples to be analyzed. This information can be provided as a data.frame, which must include columns sample_name (specificying a unique name for each sample) and file_bam (specifying the location of the BAM file). Function getBamInfo can be used to extract additional library information from BAM files, including paired-end status, median read length, median insert size and the total number of read alignments. This information must be obtained once initially and can then be used for all subsequent analyses. It is essential that BAM files are generated using a splice-aware alignment program that generates the custom tag ‘XS’ indicating the direction of transcription for spliced reads. In the following, we work with a data.frame si generated from the original (complete) BAM files with function getBamInfo.

si

##   sample_name file_bam paired_end read_length frag_length lib_size
## 1          N1   N1.bam       TRUE          75         293 12405197
## 2          N2   N2.bam       TRUE          75         197 13090179
## 3          N3   N3.bam       TRUE          75         206 14983084
## 4          N4   N4.bam       TRUE          75         207 15794088
## 5          T1   T1.bam       TRUE          75         284 14345976
## 6          T2   T2.bam       TRUE          75         235 15464168
## 7          T3   T3.bam       TRUE          75         259 15485954
## 8          T4   T4.bam       TRUE          75         247 15808356

The following code block sets the correct BAM file paths in the sample information for this vignette.

path <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(path, "bams", si$file_bam)

1.3 Transcript features and the TxFeatures class

We use the UCSC knownGene table as reference annotation, which is available as a Bioconductor annotation package TxDb.Hsapiens.UCSC.hg19.knownGene. We retain transcripts on chromosome 16, where the FBXO31 gene is located, and change chromosome names in the annotation to match chromosome names in the BAM files.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"

To work with transcript annotation in the SGSeq framework, we first extract exons and splice junctions from the TxDb object using function convertToTxFeatures. We only retain features overlapping the FBXO31 gene (genomic coordinates of the FBXO31 gene are stored in the GRanges object gr).

txf_ucsc <- convertToTxFeatures(txdb)

## Warning in convertToTxFeatures(txdb): Merged adjacent exons

txf_ucsc <- txf_ucsc[txf_ucsc %over% gr]
txf_ucsc

## TxFeatures object with 23 ranges and 0 metadata columns:
##        seqnames               ranges strand     type
##           <Rle>            <IRanges>  <Rle> <factor>
##    [1]       16 [87362942, 87365116]      -        L
##    [2]       16 [87365116, 87367492]      -        J
##    [3]       16 [87367492, 87367892]      -        I
##    [4]       16 [87367892, 87368910]      -        J
##    [5]       16 [87368910, 87369063]      -        I
##    ...      ...                  ...    ...      ...
##   [19]       16 [87417011, 87417394]      -        F
##   [20]       16 [87417628, 87417700]      -        U
##   [21]       16 [87423343, 87423454]      -        I
##   [22]       16 [87423454, 87425689]      -        J
##   [23]       16 [87425689, 87425708]      -        F
##                                  txName        geneName
##                         <CharacterList> <CharacterList>
##    [1] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##    [2] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##    [3] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##    [4] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##    [5] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##    ...                              ...             ...
##   [19]                       uc002fjw.3           79791
##   [20]                       uc021tmi.1                
##   [21]                       uc010vot.2           79791
##   [22]                       uc010vot.2           79791
##   [23]                       uc010vot.2           79791
##   -------
##   seqinfo: 1 sequence from hg19 genome

SGSeq makes extensive use of the Bioconductor infrastructure for genomic ranges ([Lawrence et al. 2013] (#lawrence)). The TxFeatures class shown above extends the GRanges class with additional metadata columns. Column type can take values

J (splice junction)
I (internal exon)
F (first/5\(^\prime\)-terminal exon)
L (last/3\(^\prime\)-terminal exon)
U (unspliced).

Columns txName and geneName indicate the transcript and gene each feature derives from. Note that a feature can belong to more than one transcript. Accordingly these columns can store multiple values for each feature.

Metadata columns can be accessed using accessor functions named after the columns they access (e.g. use function type to obtain feature type).

If transcript annotation is not available as a TxDb object, function convertToTxFeatures can construct TxFeatures from a GRangesList of exons grouped by transcript (see Exercises (#exercises) below).

1.4 Splice graph features and the SGFeatures class

Exons stored as TxFeatures can be overlapping (e.g. due to alternative splice sites) resulting in ambiguities (e.g. when when trying to assign reads to individual exons). We therefore partition exonic regions into disjoint exon bins. Splice junctions and disjoint exon bins uniquely determine a genome-wide splice graph ([Heber et al. 2002] (#heber)). To store splice graph features, SGSeq implements the SGFeatures class.

sgf_ucsc <- convertToSGFeatures(txf_ucsc)
sgf_ucsc

## SGFeatures object with 42 ranges and 0 metadata columns:
##        seqnames               ranges strand     type  splice5p  splice3p
##           <Rle>            <IRanges>  <Rle> <factor> <logical> <logical>
##    [1]       16 [87362942, 87365116]      -        E      TRUE     FALSE
##    [2]       16 [87365116, 87365116]      -        A      <NA>      <NA>
##    [3]       16 [87365116, 87367492]      -        J      <NA>      <NA>
##    [4]       16 [87367492, 87367492]      -        D      <NA>      <NA>
##    [5]       16 [87367492, 87367892]      -        E      TRUE      TRUE
##    ...      ...                  ...    ...      ...       ...       ...
##   [38]       16 [87423343, 87423454]      -        E      TRUE      TRUE
##   [39]       16 [87423454, 87423454]      -        A      <NA>      <NA>
##   [40]       16 [87423454, 87425689]      -        J      <NA>      <NA>
##   [41]       16 [87425689, 87425689]      -        D      <NA>      <NA>
##   [42]       16 [87425689, 87425708]      -        E     FALSE      TRUE
##        featureID    geneID                           txName
##        <integer> <integer>                  <CharacterList>
##    [1]         1         1 uc002fjv.3,uc002fjw.3,uc010vot.2
##    [2]         2         1 uc002fjv.3,uc002fjw.3,uc010vot.2
##    [3]         3         1 uc002fjv.3,uc002fjw.3,uc010vot.2
##    [4]         4         1 uc002fjv.3,uc002fjw.3,uc010vot.2
##    [5]         5         1 uc002fjv.3,uc002fjw.3,uc010vot.2
##    ...       ...       ...                              ...
##   [38]        38         1                       uc010vot.2
##   [39]        39         1                       uc010vot.2
##   [40]        40         1                       uc010vot.2
##   [41]        41         1                       uc010vot.2
##   [42]        42         1                       uc010vot.2
##               geneName
##        <CharacterList>
##    [1]           79791
##    [2]           79791
##    [3]           79791
##    [4]           79791
##    [5]           79791
##    ...             ...
##   [38]           79791
##   [39]           79791
##   [40]           79791
##   [41]           79791
##   [42]           79791
##   -------
##   seqinfo: 1 sequence from hg19 genome

Similar to TxFeatures, SGFeatures extends the GRanges class with additional metadata columns. Column type for an SGFeatures object takes values

J (splice junction)
E (disjoint exon bin)
D (splice donor site)
A (splice acceptor site).

By convention, splice donor and acceptor sites correspond to exonic positions immediately upstream and downstream of the intron, respectively. Note that splice sites are redundant in the sense that they are determined by the splice junctions included in the SGFeatures object. When assigning read counts to each feature (see below), counts for exons and splice junctions are based on structurally compatible reads. In the case of splice donor and acceptor sites, counts indicate the number of reads that extend across the spliced boundary (i.e. overlapping the splice site, as well as the flanking intronic position). Splice sites are included in the SGFeatures object since splice site counts are subsequently used for splice variant quantification.

SGFeatures includes additional metadata columns not included in TxFeatures. spliced5p and spliced3p indicate whether exon bins have a mandatory splice at the 5\(^\prime\) and 3\(^\prime\) boundary, respectively. This information is used to determine whether a read is structurally compatible with an exon bin, as well as to determine whether an exon bin is consistent with an annotated transcript.

Column featureID provides a unique identifier for each feature, while columnn geneID indicates the unique connected component of the splice graph a feature belongs to.

Both TxFeatures and SGFeatures objects can be exported to BED files using function exportFeatures.

1.5 Analysis based on annotated transcripts

We can now start analyzing the RNA-seq data at the FBXO31 gene locus. We first perform an analysis based on annotated transcripts. The following example converts the transcript features into splice graph features and obtains counts of compatible RNA-seq reads for each feature and each sample.

sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)

## Process features...
## Obtain counts...

sgfc_ucsc

## class: SGFeatureCounts 
## dim: 42 8 
## metadata(0):
## assays(2): counts FPKM
## rownames: NULL
## rowRanges metadata column names(0):
## colnames(8): N1 N2 ... T3 T4
## colData names(6): sample_name file_bam ... frag_length lib_size

analyzeFeatures returns an object of class SGFeatureCounts, which extends the RangedSummarizedExperiment class from the SummarizedExperiment package. SGFeatureCounts contains the sample information as colData, splice graph features as rowRanges and assays counts and FPKM, which store structurally compatible counts and FPKMs, respectively. Accessor functions colData, rowRanges, counts and FPKM can be used to access the data.

Compatible FPKMs for splice graph features can be visualized with function plotFeatures. plotFeatures generates a two-panel figure with a splice graph shown in the top panel and a heatmap illustrating expression levels of individual features in the bottom panel. For customization of plotFeatures output, see section Visualization (#visualization). The plotting function invisibly returns a data.frame with information on splice graph features, including genomic coordinates.

df <- plotFeatures(sgfc_ucsc, geneID = 1)

df

##     id                     name type featureID color
## 1   E1 E:16:87425689-87425708:-    E        42 black
## 2   E2 E:16:87423343-87423454:-    E        38 black
## 3   E3 E:16:87417011-87417394:-    E        35 black
## 4   E4 E:16:87393973-87394561:-    E        33 black
## 5   E5 E:16:87393901-87393972:-    E        29 black
## 6   E6 E:16:87380780-87380856:-    E        25 black
## 7   E7 E:16:87377204-87377371:-    E        21 black
## 8   E8 E:16:87376483-87376557:-    E        17 black
## 9   E9 E:16:87369761-87369870:-    E        13 black
## 10 E10 E:16:87368910-87369063:-    E         9 black
## 11 E11 E:16:87367492-87367892:-    E         5 black
## 12 E12 E:16:87362942-87365116:-    E         1 black
## 13  J1 J:16:87423454-87425689:-    J        40 black
## 14  J2 J:16:87393972-87423343:-    J        32 black
## 15  J3 J:16:87393972-87417011:-    J        31 black
## 16  J4 J:16:87380856-87393901:-    J        27 black
## 17  J5 J:16:87377371-87380780:-    J        23 black
## 18  J6 J:16:87376557-87377204:-    J        19 black
## 19  J7 J:16:87369870-87376483:-    J        15 black
## 20  J8 J:16:87369063-87369761:-    J        11 black
## 21  J9 J:16:87367892-87368910:-    J         7 black
## 22 J10 J:16:87365116-87367492:-    J         3 black

Note that the splice graph derived from annotated transcripts includes three alternative transcript start sites (TSSs). However, the heatmap indicates that the first TSS is not used in the samples in our data set.

1.6 Analysis based on de novo prediction

Instead of relying on existing annotation, SGSeq can predict features from BAM files directly. The following code block predicts splice graph features with read evidence in our data set.

sgfc_pred <- analyzeFeatures(si, which = gr)

## Predict features...
## Process features...
## Obtain counts...

For interpretability, we annotate predicted features with respect to transcripts included in the UCSC knownGene table. The annotate function assigns compatible transcripts to each feature and stores them in metadata column txName. Metadata column geneName behaves transitively, meaning all features belonging to the same connected component of the splice graph (with identical geneID) have the same value for geneName. This behavior makes it easy to identify unannotated features (with empty txName) that belong to an annotated gene (non-empty geneName).

sgfc_pred <- annotate(sgfc_pred, txf_ucsc)

Predicted splice graph features and compatible FPKMs can be visualized as previously. Splice graph features with missing annotation can be highlighted using argument color_novel.

df <- plotFeatures(sgfc_pred, geneID = 1, color_novel = "red")

df

##     id                     name type featureID color
## 1   E1 E:16:87417011-87417348:-    E        38 black
## 2   E2 E:16:87393901-87393972:-    E        34 black
## 3   E3 E:16:87392017-87392103:-    E        30 black
## 4   E4 E:16:87380780-87380856:-    E        25 black
## 5   E5 E:16:87377204-87377371:-    E        21 black
## 6   E6 E:16:87376483-87376557:-    E        17 black
## 7   E7 E:16:87369761-87369870:-    E        13 black
## 8   E8 E:16:87368910-87369063:-    E         9 black
## 9   E9 E:16:87367492-87367892:-    E         5 black
## 10 E10 E:16:87362930-87365116:-    E         1 black
## 11  J1 J:16:87393972-87417011:-    J        36 black
## 12  J2 J:16:87392103-87393901:-    J        32 black
## 13  J3 J:16:87380856-87393901:-    J        28 black
## 14  J4 J:16:87380856-87392017:-    J        27 black
## 15  J5 J:16:87377371-87380780:-    J        23 black
## 16  J6 J:16:87376557-87377204:-    J        19 black
## 17  J7 J:16:87369870-87376483:-    J        15 black
## 18  J8 J:16:87369063-87369761:-    J        11 black
## 19  J9 J:16:87367892-87368910:-    J         7 black
## 20 J10 J:16:87365116-87367492:-    J         3 black

Note that most exons and splice junctions predicted from the RNA-seq data are consistent with transcripts in the UCSC knownGene table (shown in gray). However, in contrast to the previous figure, the predicted gene model does not include parts of the splice graph that are not expressed in our data set. Also, an unannotated exon (E3, shown in red) was discovered from the RNA-seq data, which is expressed in three of the four normal colorectal samples (N2, N3, N4).

1.7 Analysis of predicted splice variants

Instead of considering the complete splice graph of a gene, we can focus our analysis on individual splice events. In the SGSeq framework, the splice graph is a directed acyclic graph with nodes corresponding to transcript starts, ends and splice sites, and edges corresponding to disjoint exon bins and splice junctions, directed from 5\(^\prime\) to the 3\(^\prime\) end. A splice event is defined by a start node and an end node connected by two or more paths and no intervening nodes with all paths intersecting. SGSeq identifies splice events recursively from the graph, and estimates relative usage of splice variants based on compatible reads spanning the event boundaries. The following example identifies splice events from the splice graph and obtains representative counts for each splice variant.

sgvc_pred <- analyzeVariants(sgfc_pred)

## Find segments...
## Find variants...
## Annotate variants...

sgvc_pred

## class: SGVariantCounts 
## dim: 2 8 
## metadata(0):
## assays(5): countsVariant5p countsTotal5p countsVariant3p
##   countsTotal3p variantFreq
## rownames(2): 1 2
## rowRanges metadata column names(16): from to ... variantType
##   variantName
## colnames(8): N1 N2 ... T3 T4
## colData names(6): sample_name file_bam ... frag_length lib_size

analyzeVariants returns an SGVariantCounts object. Similar to SGFeatureCounts, SGVariantCounts extends the RangedSummarizedExperiment class. SGVariantCounts contains sample information as colData and SGVariants as rowRanges. Assay variantFreq stores estimates of relative usage for each splice variant and sample. Accessor functions colData, rowRanges and variantFreq can be used to access the data. Information on splice variants is stored in metadata columns in the SGVariants object and can be accessed as follows.

mcols(sgvc_pred)

## DataFrame with 2 rows and 16 columns
##              from              to        type   featureID   segmentID
##       <character>     <character> <character> <character> <character>
## 1 D:16:87393901:- A:16:87380856:-           J          28           4
## 2 D:16:87393901:- A:16:87380856:-         JEJ    32,30,27           2
##    closed3p  closed5p    geneID   eventID variantID   featureID5p
##   <logical> <logical> <integer> <integer> <integer> <IntegerList>
## 1      TRUE      TRUE         1         1         1            28
## 2      TRUE      TRUE         1         1         2            32
##     featureID3p                           txName        geneName
##   <IntegerList>                  <CharacterList> <CharacterList>
## 1            28 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
## 2            27                                            79791
##       variantType    variantName
##   <CharacterList>    <character>
## 1            SE:S 79791_1_1/2_SE
## 2            SE:I 79791_1_2/2_SE

from and to indicate the variant start and end node, respectively, from nodes are splice donor sites (D) or transcript starts (S), while to nodes are splice acceptor sites (A) or transcript ends (E)
metadata columns type and featureID describe the variant in terms of the splice graph features that make up the variant
segmentID are unique identifiers labelling unbranched segments of the splice graph (internal use only)
closed5p indicates whether the nodes belonging to a splice variant can be reached from nodes outside of the splice event exclusively through the from node
closed3p indicates whether the nodes belonging to a splice variant can reach nodes outside of the splice event exclusively through the to node
eventID and variantID are unique identifiers for each splice event and splice variant, respectively
featureID5p and featureID3p indicate representative features that are used for obtaining relative usage estimates at the variant start and end, respectively
variantType indicates whether a splice variant is consistent with a canonical splice event (for a list of possible values, see the manual page for annotateSGVariants)
variantName provides a unique identifier for each splice variant that is intended to be more human-readable than the numeric identifier stored in variantID (for details, see the manual page for makeVariantNames)

Splice variants and estimates of relative usage can be visualized with function plotVariants.

plotVariants(sgvc_pred, eventID = 1, color_novel = "red")

plotVariants generates a two-panel figure similar to plotFeatures. The splice graph in the top panel illustrates the selected splice event. In this example, the splice event consists of two splice variants that correspond to a skip or inclusion of the unannotated exon. The heatmap illustrates estimates of relative usage for each splice variant. We observe that samples N2, N3 and N4 show evidence for both transcripts that include the exon as well as transcripts that skip the exon. The remaining samples show little evidence for exon inclusion.

1.8 Visualization

Functions plotFeatures and plotVariants support many options for customizing figures. Note that the splice graph in the top figure panel is plotted by function plotSpliceGraph, which can be called directly.

plotFeatures includes multiple alternative arguments for selecting features to be displayed. The following code illustrates three different ways of selecting and plotting the splice graph and expression levels for FBXO31 (Entrez ID 79791).

plotFeatures(sgfc_pred, geneID = 1)
plotFeatures(sgfc_pred, geneName = "79791")
plotFeatures(sgfc_pred, which = gr)

By default, the heatmap generated by plotFeatures displays splice junctions. Alternatively, exon bins, or both exon bins and splice junctions can be displayed.

plotFeatures(sgfc_pred, geneID = 1, include = "junctions")
plotFeatures(sgfc_pred, geneID = 1, include = "exons")
plotFeatures(sgfc_pred, geneID = 1, include = "both")

Argument toscale controls which parts of the gene model are drawn to scale.

plotFeatures(sgfc_pred, geneID = 1, toscale = "gene")
plotFeatures(sgfc_pred, geneID = 1, toscale = "exon")
plotFeatures(sgfc_pred, geneID = 1, toscale = "none")

Heatmaps allow the visualization of expression values summarized for splice junctions and exon bins. Alternatively, per-base read coverages and splice junction counts can be visualized with function plotCoverage.

par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
  plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}

1.9 Advanced use

Functions analyzeFeatures and analyzeVariants wrap multiple analysis steps for convenience. Alternatively, the functions performing individual steps can be called directly. For example, the previous analysis using de novo prediction can be performed as follows.

txf <- predictTxFeatures(si, gr)
sgf <- convertToSGFeatures(txf)
sgf <- annotate(sgf, txf_ucsc)
sgfc <- getSGFeatureCounts(si, sgf)
sgv <- findSGVariants(sgf)

## Find segments...
## Find variants...
## Annotate variants...

sgvc <- getSGVariantCounts(sgv, sgfc)

predictTxFeatures and getSGFeatureCounts can be run on individual samples (e.g. for distribution across a high-performance computing cluster). predictTxFeatures predicts features for each sample, merges features across samples and finally performs filtering and processing of predicted terminal exons. When using predictTxFeatures for individual samples, with predictions intended to be merged at a later point in time, run predictTxFeatures with argument min_overhang = NULL to suppress processing of terminal exons. Then predictions can subsequently be merged and processed with functions mergeTxFeatures and processTerminalExons, respectively.

1.10 Exercises

Exercise 1 Construct a TxFeatures object for a transcript with three exons. What happens if you add a second transcript with exons that are shared or overlapping with exons in the first transcript? What happens if you convert the TxFeatures object to an SGFeatures object?

tx_1 <- GRangesList(tx_1 = GRanges("1", IRanges(c(101, 301, 501), c(200, 400, 600)), "+"))
tx_2 <- GRangesList(tx_2 = GRanges("1", IRanges(c(101, 351, 701), c(200, 400, 800)), "+"))
txf_1 <- convertToTxFeatures(tx_1)
txf_2 <- convertToTxFeatures(tx_2)
txf <- convertToTxFeatures(c(tx_1, tx_2))
sgf <- convertToSGFeatures(txf)
par(mfrow = c(1, 1))
plotSpliceGraph(sgf)

The following exercises are based on genome-wide predictions obtained from paired-end RNA-seq data generated as part of the Illumina Body Map 2.0 ([Farrell et al. 2014] (#farrell)). The data were processed as shown in the following code block (the code will not run as BAM files are not available).

sgfc_IBM <- analyzeFeatures(si_IBM, alpha = 5, psi = 0.2, beta = 0.2, gamma = 0.2)
sgvc_IBM <- analyzeVariants(sgfc_IBM, min_denominator = 20)
exclude <- eventID(sgvc_IBM)[!closed5p(sgvc_IBM) | !closed3p(sgvc_IBM)]
sgvc_IBM <- sgvc_IBM[!eventID(sgvc_IBM) %in% exclude, ]

We load the previously obtained SGSeq predictions sgfc_IBM and sgvc_IBM.

data(sgfc_IBM, sgvc_IBM, package = "SGSeqBioC2015")

Exercise 2 Annotate the SGFeatureCounts and SGVariantCounts objects with respect to transcripts included in the UCSC knownGene table (this may take a few minutes).

txdb <- restoreSeqlevels(txdb)
seqlevelsStyle(txdb) <- "NCBI"
txdb <- keepSeqlevels(txdb, c(1:22, "X", "Y"))
txf_ucsc <- convertToTxFeatures(txdb)
sgfc_IBM <- annotate(sgfc_IBM, txf_ucsc)
sgvc_IBM <- annotate(sgvc_IBM, txf_ucsc)

Exercise 3 Plot the gene model for gene KIFAP3 (Entrez ID 22920). Which parts of the predicted gene model are unannotated and what tissues are they expressed in? Inspect expression levels for both splice junctions and exons.

plotFeatures(sgfc_IBM, geneName = "22920", color_novel = "red")
plotFeatures(sgfc_IBM, geneName = "22920", color_novel = "red", include = "exons")

Exercise 4 Plot the unannotated splice event for gene KIFAP3.

mcols(sgvc_IBM)[any(geneName(sgvc_IBM) == "22920"), ]
plotVariants(sgvc_IBM, eventID = 552, color_novel = "red")

Exercise 5 (Difficult) Find genes with the most highly expressed unannotated cassette exons. Can you interpret the predicted splice graph for the top genes?

selected <- which(elementLengths(txName(sgvc_IBM)) == 0 & any(variantType(sgvc_IBM) == "SE:I"))
variants <- rowRanges(sgvc_IBM)[selected]
features <- unlist(variants)
exons <- features[type(features) == "E"]
exons_FPKM <- FPKM(sgfc_IBM)[match(featureID(exons), featureID(sgfc_IBM)), ]
exons_FPKM_max <- apply(exons_FPKM, 1, max)
geneIDs <- geneID(exons)[order(exons_FPKM_max, decreasing = TRUE)]
plotFeatures(sgfc_IBM, geneID = geneIDs[1], color_novel = "red")

1.11 References

Seshagiri S, Stawiski EW, Durinck S, Modrusan Z, Storm EE, Conboy CB, Chaudhuri S, Guan Y, Janakiraman V, Jaiswal BS, Guillory J, Ha C, Dijkgraaf GJP, Stinson J, Gnad F, Huntley MA, Degenhardt JD, Haverty PM, Bourgon R, Wang W, Koeppen H, Gentleman R, Starr TK, Zhang Z, Largaespada DA, Wu TD, de Sauvage FJ. “Recurrent R-spondin fusions in colon cancer.” Nature 488, 660–664, 2012.

Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. “Software for Computing and Annotating Genomic Ranges.” PLOS Computational Biology 9, e1003118, 2013.

Heber S, Alekseyev M, Sze S, Tang H, Pevzner PA. “Splicing graphs and EST assembly problem.” Bioinformatics 18 Suppl 1, S181–188, 2002.

Farrell CM, O’Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JG, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD. “Current status and new features of the Consensus Coding Sequence database”. Nucleic Acids Research 42(Database issue):D865-72, 2014.

1.12 Session information

sessionInfo()

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] XVector_0.9.1                          
##  [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
##  [3] GenomicFeatures_1.21.13                
##  [4] AnnotationDbi_1.31.17                  
##  [5] Biobase_2.29.1                         
##  [6] SGSeq_1.3.14                           
##  [7] GenomicRanges_1.21.16                  
##  [8] GenomeInfoDb_1.5.8                     
##  [9] IRanges_2.3.14                         
## [10] S4Vectors_0.7.10                       
## [11] BiocGenerics_0.15.3                    
## [12] knitr_1.10.5                           
## [13] BiocStyle_1.7.4                        
## 
## loaded via a namespace (and not attached):
##  [1] igraph_1.0.1               magrittr_1.5              
##  [3] zlibbioc_1.15.0            GenomicAlignments_1.5.11  
##  [5] BiocParallel_1.3.34        stringr_1.0.0             
##  [7] tools_3.2.1                SummarizedExperiment_0.3.2
##  [9] DBI_0.3.1                  lambda.r_1.1.7            
## [11] futile.logger_1.4.1        htmltools_0.2.6           
## [13] yaml_2.1.13                digest_0.6.8              
## [15] rtracklayer_1.29.12        formatR_1.2               
## [17] futile.options_1.0.0       bitops_1.0-6              
## [19] biomaRt_2.25.1             RCurl_1.95-4.7            
## [21] RSQLite_1.0.0              evaluate_0.7              
## [23] rmarkdown_0.7              stringi_0.5-5             
## [25] Biostrings_2.37.2          Rsamtools_1.21.14         
## [27] XML_3.98-1.3

SGSeq

Leonard D Goldstein

17 July 2015

Abstract

Package version: SGSeqBioC2015 0.0.1

Contents