seq - set - vis
seqsetvis enables the visualization and analysis of multiple genomic datasets. Although seqsetvis was designed for the comparison of mulitple ChIP-seq datasets, this package is domain-agnostic and allows the processing of multiple genomic coordinate files (bed-like files) and signal files (bigwig files).
seqsetvis is ideal for the following tasks:
seqsetvis accepts and outputs the Bioconductor standard GRanges object while leveraging the power and speed of the data.table package (an enhancement of data.frame) with the flexibility of ggplot2 to provide users with visualizations that are simultaneously publication ready while still being highly customizable. By default, data.table is used entirely internally so users unfamiliar with its use can safely ignore it. That said, fans of data.table can use seqsetvis in a data.table exclusive mode if they wish.
Although initially designed to focus on peak call sets, seqsetvis accepts any data in GRanges format and is therefore flexible enough to work dowstream of many genomics tools and packages. Indeed, set comparison visualizations are genomics agnostic and also accept generic data types defining sets.
ssv*- most primary user facing functions are prefixed by ssv
ssvOverlapIntervalSets- derives sets from a list of GRanges
ssvMakeMembTable- coerces various ways of defining sets to a membership table
ssvFactorizeMembTable- converts a multi column membership table into a single factor.
ssvFetch*- retrieves properly formatted data from .bigwig and .bam files
ssvFeature*- visualizations of set counts and overlaps
ssvSignal*- visualizations of signal mapped to genomic coordinates in sets context
easyLoad_*- loads files into a list of GRanges for input into
## try http:// if https:// URLs are not supported if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("seqsetvis")
library(GenomicRanges) library(data.table) library(cowplot) theme_set(cowplot::theme_cowplot())
seqsetvis has several ways of accepting set information. Since seqsetvis was originally intended to focus on comparing sets of peak calls, a list of GRanges or a GRangesList is directly supported but there are more generic alternatives.
Many functions in seqsetvis require some kind of sets definition
as input. Sets can be defined in one of several ways.
ssvOverlapIntervalSets() creates a single GRanges from a list of GRanges
with a logical membership table as metadata columns defining sets.
olaps = ssvOverlapIntervalSets(CTCF_in_10a_narrowPeak_grs) head(olaps) ## GRanges object with 6 ranges and 3 metadata columns: ## seqnames ranges strand | MCF10A_CTCF MCF10AT1_CTCF MCF10CA1_CTCF ## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> ## 1 chr1 17507898-17508319 * | FALSE TRUE FALSE ## 2 chr1 40855546-40855866 * | TRUE FALSE FALSE ## 3 chr1 53444333-53445008 * | TRUE TRUE TRUE ## 4 chr1 64137020-64137657 * | TRUE TRUE TRUE ## 5 chr1 65075339-65076014 * | TRUE TRUE FALSE ## 6 chr1 92818291-92818802 * | TRUE TRUE TRUE ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths
ssvOverlapIntervalSets() also accepts a GRangesList if that’s more natural.
olaps_fromGRangesList = ssvOverlapIntervalSets( GenomicRanges::GRangesList(CTCF_in_10a_narrowPeak_grs))
The membership data contained in the metadata columns of
the data structure seqsetvis set overlap functions use internally.
The S4 method
ssvMakeMembTable() handles the necessary conversions and
can be extended
to accept a new class of data as long as it converts to an already accepted
data type and calls ssvMakeMembTable again.
ssvMakeMembTable(olaps) extracts the membership table from
the mcols of a single GRanges object.
head(ssvMakeMembTable(olaps)) ## MCF10A_CTCF MCF10AT1_CTCF MCF10CA1_CTCF ## 1 FALSE TRUE FALSE ## 2 TRUE FALSE FALSE ## 3 TRUE TRUE TRUE ## 4 TRUE TRUE TRUE ## 5 TRUE TRUE FALSE ## 6 TRUE TRUE TRUE
ssvMakeMembTable() can handle sets defined by a list of vectors.
These happen to be numeric but nearly anything that
as.character() works on
should be acceptable.
my_set_list = list(1:3, 2:3, 3:6) ssvMakeMembTable(my_set_list) ## set_A set_B set_C ## 1 TRUE FALSE FALSE ## 2 TRUE TRUE FALSE ## 3 TRUE TRUE TRUE ## 4 FALSE FALSE TRUE ## 5 FALSE FALSE TRUE ## 6 FALSE FALSE TRUE
Supplying a named list is recommended
names(my_set_list) = c("first", "second", "third") ssvMakeMembTable(my_set_list) ## first second third ## 1 TRUE FALSE FALSE ## 2 TRUE TRUE FALSE ## 3 TRUE TRUE TRUE ## 4 FALSE FALSE TRUE ## 5 FALSE FALSE TRUE ## 6 FALSE FALSE TRUE
A list of character vectors works just as well.
my_set_list_char = lapply(my_set_list, function(x)letters[x]) ssvMakeMembTable(my_set_list_char) ## first second third ## a TRUE FALSE FALSE ## b TRUE TRUE FALSE ## c TRUE TRUE TRUE ## d FALSE FALSE TRUE ## e FALSE FALSE TRUE ## f FALSE FALSE TRUE
Notice the rownames changed as the items changed.
The ssvFeature* family of functions all accept a valid set definition and output a ggplot.
Barplots are useful for for conveying the size of each set/group.
Pie charts are an alternative to barplots but are generally more difficult to interpret.
Venn diagrams show the number of items falling into each overlap category. At release, seqsetvis is limited to displaying 3 sets via venn diagram. This will be increased in a future update.
Often confused with Venn diagrams, Euler diagrams attempt to convey the size of each set and the degree to which the overlap with circles or ellipses of varying sizes. For data of any appreciable complexity, Euler diagrams are approximations by neccessity.
There is no hard limit to the number of sets seqsetvis can show in an Euler diagram but computation time increases rapidly along with the inaccuracy of approximation.
An alternative to Venn/Euler diagrams, this “heatmap” contains one column per set and one row per item with item membership per set mapped to color. In this example black indicates TRUE for membership hence the top 2/3 is black for all three sets.
The ssvSignal* family of functions all take as input continuous signal data
mapped to the genome. This type of data is generally stored in bedGraph,
wiggle, or bigWig format. At this time, seqsetvis directly supports the
bigwig format and creating pileups from indexed .bam files.
This is accomplished via the
If you don’t have .bigWig (or .bigwig or .bw) files you should be able to
easily generate some using rtracklayer
import.wig() as appropriate.
bigwig_files = c( system.file("extdata", "MCF10A_CTCF_FE_random100.bw", package = "seqsetvis"), system.file("extdata", "MCF10AT1_CTCF_FE_random100.bw", package = "seqsetvis"), system.file("extdata", "MCF10CA1_CTCF_FE_random100.bw", package = "seqsetvis") ) names(bigwig_files) = sub("_FE_random100.bw", "", basename(bigwig_files)) # names(bigwig_files) = letters[1:3] olap_gr = CTCF_in_10a_overlaps_gr target_size = quantile(width(olap_gr), .75) window_size = 50 target_size = round(target_size / window_size) * window_size olap_gr = resize(olap_gr, target_size, fix = "center") bw_gr = ssvFetchBigwig(bigwig_files, olap_gr, win_size = window_size) bw_gr
olap_gr = CTCF_in_10a_overlaps_gr bw_gr = CTCF_in_10a_profiles_gr
Overlap information must be transformed into a factor tied to id to be useful.
olap_groups = ssvFactorizeMembTable(mcols(olap_gr))
Line plots are perhaps the most natural and common type of visualization for this type of continuous genomic data. Many genome browsers and other BioC packages support this type of visualization. These other approaches are very good at show lot’s of different data of multiple types at single regions. In contrast, seqsetvis is focused on bringing in data from multiple regions simultaneously.
Basic line plot of a subset of regions
# facet labels will display better if split into multiple lines bw_gr$facet_label = sub("_", "\n", bw_gr$sample) ssvSignalLineplot(bw_data = subset(bw_gr, id %in% 1:12), facet_ = "facet_label")