Introduction

The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online focus on the Python language. The R package presented here provides access to the DeepBlue API directly within the R statistical environment and offers convenient R functions for triggering operations on the DeepBlue server and for retrieving data. In the following, we give a brief introduction to the package and then show how Python examples from the online documentation can be reproduced with it.

What is DeepBlue?

A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Even though most of these data are publicly available, identifying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata.

An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that is helpful in epigenomic analysis, such as CpG islands, promoter regions, and genes.

DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical workflow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue we refer to the DeepBlue publication in the 2016 NAR webserver issue. If you find DeepBlue useful and use it in your project, please consider citing this paper.

Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.

Getting started

Installation

Installation of DeepBlueR and its companion packages can be performed using the Bioconductor installer:

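# biocLite has been superseded by BiocManager in current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DeepBlueR")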

The package name is DeepBlueR and it can be loaded via:

library(DeepBlueR)

You can test your installation and connectivity by saying hello to the DeepBlue server:

deepblue_info("me")

Overview of DeepBlue commands

DeepBlue provides a comprehensive programmatic interface for finding, selecting, filtering, summarizing and downloading annotated genomic region sets. Downloaded region sets are stored as GRanges objects from the GenomicRanges R package, so they can be further processed, visualized and analyzed with existing R packages such as LOLA or Gviz.

A list of all commands provided by DeepBlue is available on its API page. The vast majority of these commands are also available through this R package and can be listed as follows:

help(package="DeepBlueR")

The following tables list the most frequently used DeepBlue commands. The full list of commands is available in the DeepBlue API documentation. Note that each command in the two tables carries the prefix deepblue_, e.g. deepblue_select_genes.

| Category        | Command               | Description                                                |
|-----------------|-----------------------|------------------------------------------------------------|
| Information     | info                  | Information about an entity                                |
| List and search | list_genomes          | List registered genomes                                    |
|                 | list_biosources       | List registered biosources                                 |
|                 | list_samples          | List registered samples                                    |
|                 | list_epigenetic_marks | List registered epigenetic marks                           |
|                 | list_experiments      | List available experiments                                 |
|                 | list_annotations      | List available annotations                                 |
|                 | search                | Perform a full-text search                                 |
| Selection       | select_regions        | Select regions from experiments                            |
|                 | select_experiments    | Select regions from experiments                            |
|                 | select_annotations    | Select regions from annotations                            |
|                 | select_genes          | Select genes as regions                                    |
|                 | select_expressions    | Select expression data                                     |
|                 | tiling_regions        | Generate tiling regions                                    |
|                 | input_regions         | Upload and use a small region set                          |
| Operation       | aggregate             | Aggregate and summarize regions                            |
|                 | filter_regions        | Filter regions by their attributes                         |
|                 | flank                 | Generate flanking regions                                  |
|                 | intersection          | Filter for intersecting regions                            |
|                 | overlap               | Filter for regions overlapping by at least a specific size |
|                 | merge_queries         | Merge two region sets                                      |
| Result          | count_regions         | Count selected regions                                     |
|                 | score_matrix          | Request a score matrix                                     |
|                 | get_regions           | Request the selected regions                               |
|                 | binning               | Bin results according to counts                            |
| Request         | get_request_data      | Obtain the requested data                                  |

In addition, this package provides a set of convenience functions not part of the DeepBlue API, such as:

| Category | Command               | Description                                        |
|----------|-----------------------|----------------------------------------------------|
| Request  | batch_export_results  | Download the results for a list of requests        |
|          | download_request_data | Download and convert the requested data (blocking) |
|          | export_meta_data      | Export metadata to a tab-delimited file            |
|          | export_tab            | Export any result as a tab-delimited file          |
|          | export_bed            | Export GenomicRanges results as a BED file         |

DeepBlue usage examples

In the following we give a number of increasingly complex examples illustrating what DeepBlue can achieve in your epigenomic data analysis workflow. We go beyond the online description of these examples by showing how the retrieved information can be further used in R.

One of the first tasks in DeepBlue is finding the data of interest. This can be achieved in three ways:

Listing experiments

We use the deepblue_list_experiments command to list all experiments whose metadata match the given values.

experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")

Accessing the extra-metadata

The extra-metadata is important because it contains information that is not stored in the mandatory metadata fields. We use the deepblue_info command to access an experiment's metadata and extra-metadata fields. The following example prints the file_url attribute that is contained in the data imported from the ENCODE project.

info = deepblue_info("e30000")
print(info$extra_metadata$file_url)
## [1] "https://www.encodeproject.org/files/ENCFF001YBB/"

Select epigenomic data

We use the deepblue_select_experiments command to select all genomic regions from the two specified experiments. We then pass the query_id value returned by deepblue_select_experiments to the deepblue_count_regions command.

The deepblue_count_regions command is executed asynchronously. This means that the user receives a request_id and should check the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name=c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))
# Count how many regions were selected
request_id = deepblue_count_regions(query_id=query_id)
# Download the request data as soon as processing is finished
requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
## [1] "The selected experiments have 115347 regions."
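The asynchronous pattern described above can be sketched as a simple polling loop. This is only an illustration in base R: the status function is a hypothetical stand-in rather than a DeepBlueR call, and the state names "running", "done" and "failed" are assumptions made for this sketch.

```r
# Hypothetical stand-in for a server-side status lookup (not part of
# DeepBlueR): reports "running" for the first calls and "done" afterwards.
make_mock_status <- function(calls_until_done = 2) {
  calls <- 0
  function(request_id) {
    calls <<- calls + 1
    if (calls > calls_until_done) "done" else "running"
  }
}

# Poll until the request reaches a terminal state or the attempts run out.
wait_for_request <- function(request_id, get_status, max_attempts = 10,
                             pause_seconds = 0) {
  for (attempt in seq_len(max_attempts)) {
    state <- get_status(request_id)
    if (state %in% c("done", "failed")) {
      return(list(state = state, attempts = attempt))
    }
    Sys.sleep(pause_seconds)  # wait before asking again
  }
  list(state = "timeout", attempts = max_attempts)
}

# "r12345" is a made-up request ID used only for this sketch.
result <- wait_for_request("r12345", make_mock_status())
print(result$state)
```

A blocking helper such as deepblue_download_request_data spares the user from writing this kind of loop by hand.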

Output with selected columns

We use the deepblue_select_experiments command to select the genomic regions of the two experiments that lie on chromosome 1 between positions 0 and 50,000,000.

We then pass the query_id value returned by deepblue_select_experiments to the deepblue_get_regions command to request the regions with the selected columns. The meta-columns @NAME and @BIOSOURCE represent the experiment name and the experiment biosource.

As in the previous example, deepblue_get_regions is executed asynchronously and returns a request_id. We again use the DeepBlueR-specific command deepblue_download_request_data, which waits for the processing to finish, downloads the data, and converts any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

# Retrieve the experiment data. The @NAME meta-column adds the experiment
# name and @BIOSOURCE the experiment's biosource.
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 3783 ranges and 4 metadata columns:
##          seqnames               ranges strand | SIGNAL_VALUE      PEAK
##             <Rle>            <IRanges>  <Rle> |  <character> <integer>
##      [1]     chr1     [270668, 270987]      * |       6.5758        39
##      [2]     chr1     [271277, 271468]      * |       6.2148       136
##      [3]     chr1     [273768, 274209]      * |      14.1567       164
##      [4]     chr1     [778377, 778676]      * |       8.0198       154
##      [5]     chr1     [778409, 778678]      * |       4.5767       123
##      ...      ...                  ...    ... .          ...       ...
##   [3779]     chr1 [47437420, 47437621]      * |       3.7686       147
##   [3780]     chr1 [47437751, 47438038]      * |       9.6553       149
##   [3781]     chr1 [48245368, 48245867]      * |       4.7708       346
##   [3782]     chr1 [48542755, 48543280]      * |       7.3002       152
##   [3783]     chr1 [48793649, 48793986]      * |       5.1974       108
##                                                       @NAME   @BIOSOURCE
##                                                 <character>  <character>
##      [1] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##      [2] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##      [3] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##      [4] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##      [5] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed         BL-2
##      ...                                                ...          ...
##   [3779] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed         BL-2
##   [3780] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [3781] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [3782] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [3783] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
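The output_format argument used above is a single comma-separated string in which meta-columns carry an @ prefix. As a small convenience sketch in base R (the helper variables are illustrative and not part of DeepBlueR), the string can be assembled from two character vectors:

```r
# Region columns are taken from the experiment file itself.
region_columns <- c("CHROMOSOME", "START", "END", "SIGNAL_VALUE", "PEAK")
# Meta-columns are derived from the experiment metadata and get an "@" prefix.
meta_columns <- c("NAME", "BIOSOURCE")

output_format <- paste(c(region_columns, paste0("@", meta_columns)),
                       collapse = ",")
print(output_format)
## [1] "CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE"
```

Keeping the two kinds of columns in separate vectors makes it easy to add or drop a meta-column without editing a long string by hand.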

Filter epigenomic data by metadata

We use the deepblue_list_samples command to obtain all samples with the biosource 'myeloid cell' from the BLUEPRINT project. The deepblue_list_samples command returns a list of samples with their IDs and content. We extract the sample IDs from this list and use them in the deepblue_select_regions command to select genomic regions that lie on chromosome 1 between positions 0 and 50,000.

Then, we use the deepblue_get_regions command with the query_id returned by the deepblue_select_regions command and the columns @NAME, @SAMPLE_ID, and @BIOSOURCE, representing the experiment name, the sample ID, and the experiment biosource.

As in the previous examples, deepblue_get_regions is executed asynchronously and returns a request_id. We again use the DeepBlueR-specific command deepblue_download_request_data, which waits for the processing to finish, downloads the data, and converts any regions to a GRanges object.

samples = deepblue_list_samples(
    biosource="myeloid cell",
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)
query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
    chromosome="chr1", start=0, end=50000)
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions,1)
## GRanges object with 1 range and 3 metadata columns:
##       seqnames     ranges strand |
##          <Rle>  <IRanges>  <Rle> |
##   [1]     chr1 [0, 10000]      * |
##                                                      @NAME  @SAMPLE_ID
##                                                <character> <character>
##   [1] S00Q7NH1_12_12_Blueprint_release_201608_segments.bed       s8797
##         @BIOSOURCE
##        <character>
##   [1] myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Filter epigenomic data by region attributes

We use the deepblue_select_experiments command to select the genomic regions of two specific experiments that lie on chromosome 1 between positions 0 and 50,000,000. Then, we filter these for regions with SIGNAL_VALUE > 10 and PEAK > 1000 using two chained deepblue_filter_regions calls.

Then, we use the deepblue_get_regions command with the query_id returned by the last deepblue_filter_regions call and the columns @NAME and @BIOSOURCE, representing the experiment name and the experiment biosource.

As in the previous examples, deepblue_get_regions is executed asynchronously and returns a request_id. We again use the DeepBlueR-specific command deepblue_download_request_data, which waits for the processing to finish, downloads the data, and converts any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
    query_id=query_id, field="SIGNAL_VALUE", operation=">",
    value="10", type="number")
query_id_filters = deepblue_filter_regions(
    query_id=query_id_filter_signal, field="PEAK", operation=">",
    value="1000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filters,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 161 ranges and 4 metadata columns:
##         seqnames               ranges strand | SIGNAL_VALUE      PEAK
##            <Rle>            <IRanges>  <Rle> |  <character> <integer>
##     [1]     chr1   [1142428, 1144001]      * |      10.9313      1275
##     [2]     chr1   [1573400, 1575582]      * |      17.8805      1094
##     [3]     chr1   [1612814, 1616174]      * |      32.2064      2802
##     [4]     chr1   [1668761, 1670450]      * |      20.2936      1017
##     [5]     chr1   [1778583, 1783797]      * |      35.4277      1293
##     ...      ...                  ...    ... .          ...       ...
##   [157]     chr1 [44774644, 44776655]      * |      16.3227      1160
##   [158]     chr1 [44806139, 44811000]      * |      22.8156      1381
##   [159]     chr1 [46301112, 46304262]      * |      19.8041      2397
##   [160]     chr1 [46579227, 46582046]      * |      15.9613      1824
##   [161]     chr1 [46593677, 46595181]      * |      11.8798      1304
##                                                      @NAME   @BIOSOURCE
##                                                <character>  <character>
##     [1] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed         BL-2
##     [2] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##     [3] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##     [4] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##     [5] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##     ...                                                ...          ...
##   [157] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [158] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [159] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [160] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   [161] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
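Chaining two filters as above keeps only the regions that satisfy both conditions. The following base R sketch mirrors that logic locally on a small data frame. The values are made up for illustration and are not DeepBlue output. Note that SIGNAL_VALUE appears as a character column in the output shown above, so the sketch converts it before comparing numerically.

```r
# Toy regions, invented for this sketch (not real DeepBlue data).
regions_df <- data.frame(
  START = c(100, 500, 900, 1500),
  END = c(300, 800, 1200, 1900),
  SIGNAL_VALUE = c("6.5", "17.9", "32.2", "11.0"),
  PEAK = c(39L, 1094L, 2802L, 150L),
  stringsAsFactors = FALSE
)

# Convert the character column to numeric before any numeric comparison.
regions_df$SIGNAL_VALUE <- as.numeric(regions_df$SIGNAL_VALUE)

# First filter: SIGNAL_VALUE > 10; second filter on that result: PEAK > 1000.
by_signal <- regions_df[regions_df$SIGNAL_VALUE > 10, ]
by_both <- by_signal[by_signal$PEAK > 1000, ]
print(by_both)
```

Filtering on the server, as in the DeepBlueR example, avoids downloading regions that would be discarded anyway. A local filter like this one is mainly useful for exploring thresholds on data that has already been retrieved.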

Find intersecting regions

We use the deepblue_select_experiments command to select the genomic regions of two specific experiments that lie on chromosome 1 between positions 0 and 50,000,000. Then, we use the deepblue_select_annotations command to select the promoters annotation on chromosome 1.

The command deepblue_intersection filters for all regions of the query_id that intersect with at least one region in promoters_id.
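The filtering semantics described above (keep a query region if it intersects at least one filter region) can be illustrated locally in base R. This is a sketch on made-up intervals. It treats intervals as closed, which is a simplifying assumption, so the server's handling of region boundaries may differ.

```r
# Toy intervals, invented for this sketch.
query <- data.frame(start = c(100, 400, 900), end = c(200, 500, 1000))
promoters <- data.frame(start = c(150, 950), end = c(300, 1200))

# Two closed intervals [s1, e1] and [s2, e2] overlap
# when s1 <= e2 and s2 <= e1.
has_overlap <- vapply(seq_len(nrow(query)), function(i) {
  any(query$start[i] <= promoters$end & promoters$start <= query$end[i])
}, logical(1))

# Keep only the query regions that hit at least one promoter region.
intersecting <- query[has_overlap, ]
print(intersecting)
```

The result contains the unmodified query regions, not the overlapping sub-intervals, which matches the description of the command as a filter.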

Then, we use the deepblue_get_regions command with the query_id returned by the deepblue_intersection command and the columns @NAME and @BIOSOURCE, representing the experiment name and the experiment biosource.

As in the previous examples, deepblue_get_regions is executed asynchronously and returns a request_id. We again use the DeepBlueR-specific command deepblue_download_request_data, which waits for the processing to finish, downloads the data, and converts any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
promoters_id = deepblue_select_annotations(annotation_name="promoters",
    genome="GRCh38", chromosome="chr1")
intersect_id = deepblue_intersection(
    query_data_id=query_id, query_filter_id=promoters_id)
request_id = deepblue_get_regions(
    query_id=intersect_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 608 ranges and 4 metadata columns:
##         seqnames               ranges strand | SIGNAL_VALUE      PEAK
##            <Rle>            <IRanges>  <Rle> |  <character> <integer>
##     [1]     chr1     [903997