Introduction to epivizr: interactive visualization for functional genomics

Epiviz is an interactive visualization tool for functional genomics data. It supports genome navigation like other genome browsers, but allows multiple visualizations of data within genomic regions using scatterplots, heatmaps and other user-supplied visualizations. It also includes data from the Gene Expression Barcode project for transcriptome visualization. It has a flexible plugin framework so users can add d3 visualizations. You can find more information about Epiviz at http://epiviz.cbcb.umd.edu/help and see a video tour here.

The epivizr package implements two-way communication between the R/Bioconductor computational genomics environment and Epiviz. Objects in an R session can be displayed as tracks or plots on Epiviz. Epivizr uses Websockets for communication between the browser Javascript client and the R environmen, the same technology underlying the popular Shiny system for authoring interactive web-based reports in R.

Preliminaries: the data

In this vignette we will look at colon cancer methylation data from the TCGA project and expression data from the gene expression barcode project. The epivizrData package contains human chromosome 11 methylation data from the Illumina 450kHumanMethylation beadarray processed with the minfi package. We use expression data from the antiProfilesData bioconductor package.

require(epivizr)
## Loading required package: epivizr
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, as.vector, cbind, colnames, do.call,
##     duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unlist
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: GenomicRanges
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
require(antiProfilesData)
## Loading required package: antiProfilesData
data(tcga_colon_example)
data(apColonData)

The colon_blocks object is a GRanges object containing chromosome 11 regions of hypo or hyper methylation in colon cancer identified using the blockFinder function in the minfi package.

show(colon_blocks)
## GRanges with 129 ranges and 11 metadata columns:
##         seqnames                 ranges strand   |               value
##            <Rle>              <IRanges>  <Rle>   |           <numeric>
##     [1]    chr11 [  4407026,   6435089]      *   |  -0.142954896408814
##     [2]    chr11 [131239366, 133716186]      *   |  -0.135137323912239
##     [3]    chr11 [ 55041873,  57022542]      *   |   -0.17334711677526
##     [4]    chr11 [114645223, 116602403]      *   |  -0.140934365473709
##     [5]    chr11 [ 78357700,  80184550]      *   |   -0.15603691753532
##     ...      ...                    ...    ... ...                 ...
##   [125]    chr11   [29644815, 29650449]      *   | -0.0940450729648221
##   [126]    chr11   [41943963, 41956273]      *   | -0.0760193132782017
##   [127]    chr11   [16298618, 16314417]      *   | -0.0748443876820716
##   [128]    chr11   [38740154, 38740154]      *   |  -0.117248761155248
##   [129]    chr11   [81757203, 81757203]      *   | -0.0710557720226853
##                       area   cluster indexStart  indexEnd         L
##                  <numeric> <numeric>  <integer> <integer> <numeric>
##     [1]   30.3064380386685       495     130755    131000       212
##     [2]   23.9193063324663       520     141959    142173       177
##     [3]   19.5882241956043       507     134251    134374       113
##     [4]   14.0934365473709       520     140000    140138       100
##     [5]   14.0433225781788       507     138132    138249        90
##     ...                ...       ...        ...       ...       ...
##   [125]  0.188090145929644       497     132733    132734         2
##   [126]  0.152038626556403       502     133379    133380         2
##   [127]  0.149688775364143       495     131986    131987         2
##   [128]  0.117248761155248       499     133331    133331         1
##   [129] 0.0710557720226853       508     138263    138263         1
##          clusterL            p.value      fwer       p.valueArea  fwerArea
##         <integer>          <numeric> <numeric>         <numeric> <numeric>
##     [1]      1629                  0         0                 0         0
##     [2]      1759                  0         0                 0         0
##     [3]      1816                  0         0                 0         0
##     [4]      1759                  0         0                 0         0
##     [5]      1816                  0         0                 0         0
##     ...       ...                ...       ...               ...       ...
##   [125]       367 0.0949029198604193         1 0.416952488890214         1
##   [126]        45  0.316487220018491         1 0.485758597035402         1
##   [127]      1629  0.342091920427093         1 0.494228876494974         1
##   [128]         7 0.0564138506964121         1 0.592024814339825         1
##   [129]        22  0.729905454979272         1 0.860076351814847         1
##   ---
##   seqlengths:
##     chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9  chrX
##       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA

The columns value and p.value can be used to determine which of these regions, or blocks, are interesting by looking at a volcano plot for instance.

plot(colon_blocks$value, -log10(colon_blocks$p.value), main = "Volcano plot", 
    xlab = "Avg. methylation difference", ylab = "-log10 p-value", xlim = c(-0.5, 
        0.5))

plot of chunk unnamed-chunk-4

The colon_curves object is another GRanges object which contains the basepair resolution methylation data used to define these regions.

show(colon_curves)
## GRanges with 7135 ranges and 7 metadata columns:
##          seqnames                 ranges strand   |        id    type
##             <Rle>              <IRanges>  <Rle>   | <numeric> <array>
##      [1]    chr11       [131996, 132411]      *   |    129466 OpenSea
##      [2]    chr11       [189654, 189654]      *   |    129467 OpenSea
##      [3]    chr11       [190242, 190242]      *   |    129468 OpenSea
##      [4]    chr11       [192096, 192141]      *   |    129469 OpenSea
##      [5]    chr11       [192763, 193112]      *   |    129470 OpenSea
##      ...      ...                    ...    ... ...       ...     ...
##   [7131]    chr11 [134892703, 134892703]      *   |    142360 OpenSea
##   [7132]    chr11 [134903175, 134903175]      *   |    142361 OpenSea
##   [7133]    chr11 [134910774, 134910774]      *   |    142362 OpenSea
##   [7134]    chr11 [134911302, 134911302]      *   |    142363 OpenSea
##   [7135]    chr11 [134945848, 134945848]      *   |    142364 OpenSea
##          blockgroup      diff    smooth normalMean cancerMean
##           <numeric>  <matrix> <numeric>  <numeric>  <numeric>
##      [1]        495 -0.088871  -0.12367    0.75449    0.66562
##      [2]        495 -0.001245  -0.08656    0.95953    0.95828
##      [3]        495 -0.164759  -0.08625    0.68007    0.51531
##      [4]        495  0.003095  -0.08527    0.05134    0.05444
##      [5]        495 -0.102464  -0.08484    0.55236    0.44989
##      ...        ...       ...       ...        ...        ...
##   [7131]        520  -0.11395   -0.1421     0.4014     0.2874
##   [7132]        520  -0.01149   -0.1530     0.8482     0.8367
##   [7133]        520  -0.20066   -0.1628     0.6365     0.4359
##   [7134]        520  -0.14014   -0.1635     0.5832     0.4430
##   [7135]        520  -0.24670   -0.2309     0.6757     0.4290
##   ---
##   seqlengths:
##     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX  chrY
##       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA

This basepair resolution data includes mean methylation levels for normal and cancer and a smoothed estimate of methylation difference. This smoothed difference estimate is used to define regions in the colon_blocks object.

Finally, the apColonData object is an ExpressionSet containing gene expression data for colon normal and tumor samples for genes within regions of methylation loss identified this paper. Our goal in this vignette is to visualize this data as we browse the genome.

The epivizr session manager

The connection to Epiviz is managed through a session manager object of class EpivizDeviceMgr. We can create this object and open Epiviz using the startEpiviz function.

mgr = startEpiviz(workspace = "C60FA3168F34DBC763F579C1EADA8AF0")
## [epivizr] Starting websocket server...

This opens a websocket connection between the interactive R session and the browser client. This will allow us to visualize data stored in the Epiviz server along with data in the interactive R session.

Adding block region devices

Once the browser is open we can visualize the colon_blocks object containing blocks of methylation modifications in colon cancer. We use the addDevice method to do so. We then use the service function to let the interactive R session serve data requests from the browser client. We need to escape (using ctl-c or esc depending on your platform) to continue with the interactive R session.

blocks_dev <- mgr$addDevice(colon_blocks, "450k colon_blocks")
mgr$service()

You should now see that a new track is added to the Epiviz web app. You can think of this track as an interactive display device in R. As you navigate on the browser, data is requested from the R session through the websocket connection. Remember to escape to continue with your interactive R session. The blocks_dev object inherits from class EpivizDevice, which contains information about the data being displayed and the chart used to display it. Note that the “brushing” effect we implement in Epiviz works for epivizr tracks as well.

The point of having interactive data connections to Epiviz is that we may want to iteratively compute and visualize our data. For example, we may want to only display methylation blocks inferred at a certain statistical significance level. In this case, we will filter by block size.

# subset to those with length > 250Kbp
keep <- width(colon_blocks) > 250000
mgr$updateDevice(blocks_dev, colon_blocks[keep, ])

Now, only this subset of blocks will be displayed in the already existing track.

Adding line plots along the genome

We have a number of available data types in epivizr. In the previous section, we used the block type. For the methylation data at base-pair resolution in the colon_curves object, we will use the bp type.

# add low-filter smoothed methylation estimates
means_dev <- mgr$addDevice(colon_curves, "450kMeth", type = "bp", columns = c("cancerMean", 
    "normalMean"))
mgr$service()

Notice that we added two lines in this plot, one for mean methylation in cancer and another for mean methylation in normal. The columns argument specifies which columns in mcols(colon_curves) will be displayed.

We can also add a track containing the smooth methylation difference estimate used to define methylation blocks.

diff_dev <- mgr$addDevice(colon_curves, "450kMethDiff", type = "bp", columns = c("smooth"), 
    ylim = matrix(c(-0.5, 0.5), nc = 1))
mgr$service()

We pass limits for the y axis in this case. To see other arguments supported, you can use the help framework in R ?"EpivizDeviceMgr-class".

The session manager again

We can use the session manager to list devices we have added so far, or to remove devices.

mgr$listDevices()
##               id        type
## 1 epivizDevice_1 blocksTrack
## 2 epivizDevice_2   lineTrack
## 3 epivizDevice_3   lineTrack
##                                          measurements connected
## 1                                    epivizMs_block_1          
## 2 epivizMs_bp_2__cancerMean,epivizMs_bp_2__normalMean          
## 3                               epivizMs_bp_3__smooth
mgr$rmDevice(means_dev)
mgr$listDevices()
##               id        type          measurements connected
## 1 epivizDevice_1 blocksTrack      epivizMs_block_1          
## 2 epivizDevice_3   lineTrack epivizMs_bp_3__smooth

Adding a scatterplot

Now we want to visualize the colon expression data in apColonData object as an MA plot in Epiviz. First, we add an "MA" assay to the ExpressionSet:

keep <- pData(apColonData)$SubType != "adenoma"
apColonData <- apColonData[, keep]
status <- pData(apColonData)$Status
Indexes <- split(seq(along = status), status)

exprMat <- exprs(apColonData)
mns <- sapply(Indexes, function(ind) rowMeans(exprMat[, ind]))
mat <- cbind(colonM = mns[, "1"] - mns[, "0"], colonA = 0.5 * (mns[, "1"] + 
    mns[, "0"]))
assayDataElement(apColonData, "MA") <- mat
show(apColonData)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 5339 features, 2 samples 
##   element names: MA, exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM95473 GSM95474 ... GSM215114 (53 total)
##   varLabels: filename DB_ID ... Status (7 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu133plus2

epivizr will use the annotation(apColonData) annotation to determine genomic locations using the AnnotationDbi package so that only probesets inside the current browser window are displayed.

eset_dev <- mgr$addDevice(apColonData, "MAPlot", columns = c("colonA", "colonM"), 
    assay = "MA")
## Loading required package: hgu133plus2.db
## Loading required package: AnnotationDbi
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
mgr$service()

In this case, we specify which data is displayed in each axis of the scatter plot using the columns argument. The assay arguments indicates where data is obtained.

The SummarizedExperiment Object

Epiviz is also able to display plots of data in the form of a SummarizedExperiment object. After loading the tcga_colon_expression dataset in the epivizrData package, we can see that the colonSE object contains information on 239322 exons in 40 samples.

data(tcga_colon_expression)
show(colonSE)
## class: SummarizedExperiment 
## dim: 12800 40 
## exptData(0):
## assays(1): ''
## rownames(12800): 143686 149186 ... 149184 149185
## rowData metadata column names(2): exon_id gene_id
## colnames(40): TCGA-A6-5659-01A-01R-1653-07
##   TCGA-A6-5659-11A-01R-1653-07 ... TCGA-D5-5540-01A-01R-1653-07
##   TCGA-D5-5541-01A-01R-1653-07
## colData names(110): bcr_patient_barcode bcr_sample_uuid ...
##   Basename fullID

The assay slot holds a matrix of raw sequencing counts, so before we can plot a scatterplot showing expression, we must first normalize the count data. We use the geometric mean of each row as a reference sample to divide each column (sample) by, then use the median of each column as a scaling factor to divide each row (exon) by.

ref_sample <- 2^rowMeans(log2(assay(colonSE) + 1))
scaled <- (assay(colonSE) + 1)/ref_sample
scaleFactor <- Biobase::rowMedians(t(scaled))
assay_normalized <- sweep(assay(colonSE), 2, scaleFactor, "/")
assay(colonSE) <- assay_normalized

Now, using the expression data in the assay slot and the sample data in the colData slot, we can compute mean exon expression by sample type.

status <- colData(colonSE)$sample_type
index <- split(seq(along = status), status)
logCounts <- log2(assay(colonSE) + 1)
means <- sapply(index, function(ind) rowMeans(logCounts[, ind]))
mat <- cbind(cancer = means[, "Primary Tumor"], normal = means[, "Solid Tissue Normal"])

Now, create a new SummarizedExperiment object with the two column matrix, and all the information about the features of interest, in this case exons, are stored in the rowData slot to be queried by Epiviz.

sumexp <- SummarizedExperiment(mat, rowData = rowData(colonSE))
se_dev <- mgr$addDevice(sumexp, "Mean by Sample Type", columns = c("normal", 
    "cancer"))
mgr$service()

Again, the columns argument specifies what data will be displayed along which axis.

Slideshow

We can navigate to a location on the genome using the navigate method of the session manager:

mgr$navigate("chr11", 1.1e+08, 1.2e+08)

There is a convenience function to quickly navigate to a series of locations in succession. We can use that to browse the genome along a ranked list of regions. Let's navigate to the 5 most up-regulated exons in the colon exon expression data.

foldChange = mat[, "cancer"] - mat[, "normal"]
ind = order(foldChange, decreasing = TRUE)

# bounding 1Mb around each exon
slideshowRegions <- rowData(sumexp)[ind] + 1000000L
## Warning: 'ranges' contains values outside of sequence bounds. See ?trim to
## subset ranges.
mgr$slideshow(slideshowRegions, n = 5)
## Region 1 of 5 . Press key to continue (ESC to stop)...
## Region 2 of 5 . Press key to continue (ESC to stop)...
## Region 3 of 5 . Press key to continue (ESC to stop)...
## Region 4 of 5 . Press key to continue (ESC to stop)...
## Region 5 of 5 . Press key to continue (ESC to stop)...

Closing the session

To close the connection to Epiviz and remove all tracks added during the interactive session, we use the stopServer function.

mgr$stopServer()

SessionInfo

sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] hgu133plus2.db_2.14.0   org.Hs.eg.db_2.14.0    
##  [3] RSQLite_0.11.4          DBI_0.2-7              
##  [5] AnnotationDbi_1.27.0    antiProfilesData_0.99.3
##  [7] epivizr_1.3.0           GenomicRanges_1.17.0   
##  [9] GenomeInfoDb_1.1.0      IRanges_1.21.45        
## [11] Biobase_2.25.0          BiocGenerics_0.11.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.1    XVector_0.5.0  evaluate_0.5.3 formatR_0.10  
##  [5] httpuv_1.3.0   knitr_1.5      rjson_0.2.13   stats4_3.1.0  
##  [9] stringr_0.6.2  tools_3.1.0