1 Ranges revisited

In Bioconductor there are two classes, IRanges and GRanges, that are standard data structures for representing genomics data. Throughout this document I refer to either of these classes as Ranges if an operation can be performed on either class, otherwise I explicitly mention if a function is appropriate for an IRanges or GRanges.

Ranges objects can either represent sets of integers as IRanges (which have start, end and width attributes) or represent genomic intervals (which have additional attributes, sequence name, and strand) as GRanges. In addition, both types of Ranges can store information about their intervals as metadata columns (for example GC content over a genomic interval).

Ranges objects follow the tidy data principle: each row of a Ranges object corresponds to an interval, while each column will represent a variable about that interval, and generally each object will represent a single unit of observation (like gene annotations).

Consequently, Ranges objects provide a powerful representation for reasoning about genomic data. In this vignette, you will learn more about Ranges objects and how via grouping, restriction and summarisation you can perform common data tasks.

2 Constructing Ranges

To construct an IRanges we require that there are at least two columns that represent at either a starting coordinate, finishing coordinate or the width of the interval.

suppressPackageStartupMessages(library(plyranges))
set.seed(100)
df <- data.frame(start=c(2:-1, 13:15),
                 width=c(0:3, 2:0))

# produces IRanges
rng <- df %>% as_iranges()
rng
## IRanges object with 7 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         2         1         0
##   [2]         1         1         1
##   [3]         0         1         2
##   [4]        -1         1         3
##   [5]        13        14         2
##   [6]        14        14         1
##   [7]        15        14         0

To construct a GRanges we require a column that represents that sequence name ( contig or chromosome id), and an optional column to represent the strandedness of an interval.

# seqname is required for GRanges, metadata is automatically kept
grng <- df %>%
  transform(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
         strand = sample(c("+", "-"), 7, replace = TRUE),
         gc = runif(7)) %>%
  as_granges()

grng
## GRanges object with 7 ranges and 1 metadata column:
##       seqnames    ranges strand |        gc
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2       2-1      - |  0.762551
##   [2]     chr1         1      - |  0.669022
##   [3]     chr2       0-1      + |  0.204612
##   [4]     chr2      -1-1      - |  0.357525
##   [5]     chr1     13-14      - |  0.359475
##   [6]     chr1        14      - |  0.690291
##   [7]     chr2     15-14      - |  0.535811
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

3 Arithmetic on Ranges

Sometimes you want to modify a genomic interval by altering the width of the interval while leaving the start, end or midpoint of the coordinates unaltered. This is achieved with the mutate verb along with anchor_* adverbs.

The act of anchoring fixes either the start, end, center coordinates of the Range object, as shown in the figure and code below and anchors are used in combination with either mutate or stretch. By default, the start coordinate will be anchored, so regardless of strand. For behavior similar to GenomicRanges::resize, use anchor_5p.

rng <- as_iranges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8)))
grng <- as_granges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8),
                          seqnames = "seq1",
                          strand = c("+", "*", "-")))
mutate(rng, width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        10        10
##   [2]         2        11        10
##   [3]         3        12        10
mutate(anchor_start(rng), width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        10        10
##   [2]         2        11        10
##   [3]         3        12        10
mutate(anchor_end(rng), width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        -4         5        10
##   [2]        -7         2        10
##   [3]        -1         8        10
mutate(anchor_center(rng), width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        -2         7        10
##   [2]        -3         6        10
##   [3]         1        10        10
mutate(anchor_3p(grng), width = 10) # leave negative strand fixed
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     seq1      -4-5      +
##   [2]     seq1      -7-2      *
##   [3]     seq1      3-12      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
mutate(anchor_5p(grng), width = 10) # leave positive strand fixed
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     seq1      1-10      +
##   [2]     seq1      2-11      *
##   [3]     seq1      -1-8      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Similarly, you can modify the width of an interval using the stretch verb. Without anchoring, this function will extend the interval in either direction by an integer amount. With anchoring, either the start, end or midpoint are preserved.

rng2 <- stretch(anchor_center(rng), 10)
rng2
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        -4        10        15
##   [2]        -3         7        11
##   [3]        -2        13        16
stretch(anchor_end(rng2), 10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       -14        10        25
##   [2]       -13         7        21
##   [3]       -12        13        26
stretch(anchor_start(rng2), 10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        -4        20        25
##   [2]        -3        17        21
##   [3]        -2        23        26
stretch(anchor_3p(grng), 10)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     seq1      -9-5      +
##   [2]     seq1      -8-2      *
##   [3]     seq1      3-18      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
stretch(anchor_5p(grng), 10)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     seq1      1-15      +
##   [2]     seq1      2-12      *
##   [3]     seq1      -7-8      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Ranges can be shifted left or right. If strand information is available we can also shift upstream or downstream.

shift_left(rng, 100)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       -99       -95         5
##   [2]       -98       -98         1
##   [3]       -97       -92         6
shift_right(rng, 100)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       101       105         5
##   [2]       102       102         1
##   [3]       103       108         6
shift_upstream(grng, 100)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     seq1   -99--95      +
##   [2]     seq1       -98      *
##   [3]     seq1   103-108      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift_downstream(grng, 100)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     seq1   101-105      +
##   [2]     seq1       102      *
##   [3]     seq1   -97--92      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

4 Grouping Ranges

plyranges introduces a new class of Ranges called RangesGrouped, this is a similar idea to the grouped data.frame\tibble in dplyr.

Grouping can act on either the core components or the metadata columns of a Ranges object.

It is most effective when combined with other verbs such as mutate(), summarise(), filter(), reduce_ranges() or disjoin_ranges().

grng <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
         strand = sample(c("+", "-"), 7, replace = TRUE),
         gc = runif(7),
         start = 1:7,
         width = 10) %>%
  as_granges()

grng_by_strand <- grng %>%
  group_by(strand)

grng_by_strand
## GRanges object with 7 ranges and 1 metadata column:
## Groups: strand [2]
##       seqnames    ranges strand |        gc
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2      1-10      - |  0.889454
##   [2]     chr2      2-11      + |  0.180407
##   [3]     chr1      3-12      - |  0.629391
##   [4]     chr2      4-13      + |  0.989564
##   [5]     chr1      5-14      - |  0.130289
##   [6]     chr1      6-15      - |  0.330661
##   [7]     chr2      7-16      - |  0.865121
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

5 Restricting Ranges

The verb filter can be used to restrict rows in the Ranges. Note that grouping will cause the filter to act within each group of the data.

grng %>% filter(gc < 0.3)
## GRanges object with 2 ranges and 1 metadata column:
##       seqnames    ranges strand |        gc
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2      2-11      + |  0.180407
##   [2]     chr1      5-14      - |  0.130289
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# filtering by group
grng_by_strand %>% filter(gc == max(gc))
## GRanges object with 2 ranges and 1 metadata column:
## Groups: strand [2]
##       seqnames    ranges strand |        gc
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2      1-10      - |  0.889454
##   [2]     chr2      4-13      + |  0.989564
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

We also provide the convenience methods filter_by_overlaps and filter_by_non_overlaps for restricting by any overlapping Ranges.

ir0 <- data.frame(start = c(5,10, 15,20), width = 5) %>%
  as_iranges()
ir1 <- data.frame(start = 2:6, width = 3:7) %>%
  as_iranges()
ir0
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         5         9         5
##   [2]        10        14         5
##   [3]        15        19         5
##   [4]        20        24         5
ir1
## IRanges object with 5 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         2         4         3
##   [2]         3         6         4
##   [3]         4         8         5
##   [4]         5        10         6
##   [5]         6        12         7
ir0 %>% filter_by_overlaps(ir1)
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         5         9         5
##   [2]        10        14         5
ir0 %>% filter_by_non_overlaps(ir1)
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        15        19         5
##   [2]        20        24         5

6 Summarising Ranges

The summarise function will return a DataFrame because the information required to return a Ranges object is lost. It is often most useful to use summarise() in combination with the group_by() family of functions.

ir1 <- ir1 %>%
  mutate(gc = runif(length(.)))

ir0 %>%
  group_by_overlaps(ir1) %>%
  summarise(gc = mean(gc))
## DataFrame with 2 rows and 2 columns
##       query        gc
##   <integer> <numeric>
## 1         1  0.675555
## 2         2  0.635795

7 Joins, or another way at looking at overlaps between Ranges

A join acts on two GRanges objects, a query and a subject.

query <- data.frame(seqnames = "chr1",
               strand = c("+", "-"),
               start = c(1, 9),
               end =  c(7, 10),
               key.a = letters[1:2]) %>%
  as_granges()

subject <- data.frame(seqnames = "chr1",
               strand = c("-", "+"),
               start = c(2, 6),
               end = c(4, 8),
               key.b = LETTERS[1:2]) %>%
  as_granges()
Query and Subject Ranges

Figure 1: Query and Subject Ranges

The join operator is relational in the sense that metadata from the query and subject ranges is retained in the joined range. All join operators in the plyranges DSL generate a set of hits based on overlap or proximity of ranges and use those hits to merge the two datasets in different ways. There are four supported matching algorithms: overlap, nearest, precede, and follow. We can further restrict the matching by whether the query is completely within the subject, and adding the directed suffix ensures that matching ranges have the same direction (strand).