1 Introduction

The plyinteractions package introduces tidy methods for the GInteractions class defined in the InteractionSet package (Lun, Perry, and Ing-Simmons, 2016).

1.1 GInteractions objects

GInteractions are objects describing interactions between two parallel sets of genomic ranges.

library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
library(InteractionSet)
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
anchor1 <- GRanges("chr1:10-20:+")
anchor2 <- GRanges("chr1:50-60:-")
gi <- GInteractions(anchor1, anchor2)

gi
#> GInteractions object with 1 interaction and 0 metadata columns:
#>       seqnames1   ranges1     seqnames2   ranges2
#>           <Rle> <IRanges>         <Rle> <IRanges>
#>   [1]      chr1     10-20 ---      chr1     50-60
#>   -------
#>   regions: 2 ranges and 0 metadata columns
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The InteractionSet package provides basic methods to interact with this class, but does not support tidy grammar principles.

1.2 Tidy grammar principles

The grammar of tidy genomic data transformation defined in plyranges and available for GInteractions currently supports:

  • dplyr verbs (for GInteractions and GroupedGInteractions):

    • Group genomic interactions with group_by();
    • Summarize grouped genomic interactions with summarize();
    • Tally/count grouped genomic interactions with tally and count();
    • Modify genomic interactions with mutate();
    • Subset genomic interactions with filter() using <data-masking> and logical expressions;
    • Pick out any columns from the associated metadata with select() using <tidy-select> arguments;
    • Subset using indices with slice();
    • Order genomic interactions with arrange() using categorical/numerical variables.
  • plyranges verbs (for PinnedGInteractions and AnchoredPinnedGInteractions):

    • Stretch specific anchors of genomic interactions to a given width with stretch();
    • anchor_*() functions to control how stretching is performed;
    • Shift specific anchors of genomic interactions with shift();
    • Obtain flanking GRanges from specific anchors of genomic interactions with flank().

2 Importing genomic interactions in R

plyinteractions provides a consistent interface for importing genomic interactions from pairs and bedpe files into GInteractions in R, following grammar of tidy data manipulation defined in the tidyverse ecosystem.

2.1 From bed-like text files

Tidy genomic data maniuplation implies that we first parse genomic files stored on disk as tabular data frames.

## This uses an example `bedpe` file provided in the `rtracklayer` package
bedpe_file <- system.file("tests", "test.bedpe", package = "rtracklayer")
bedpe_df <- read.delim(bedpe_file, header = FALSE, sep = '\t')

bedpe_df
#>      V1        V2        V3    V4        V5        V6                      V7 V8 V9 V10
#> 1  chr7 118965072 118965122  chr7 118970079 118970129 TUPAC_0001:3:1:0:1452#0 37  +   -
#> 2 chr11  46765606  46765656 chr10  46769934  46769984 TUPAC_0001:3:1:0:1472#0 37  +   -
#> 3 chr20  54704674  54704724 chr20  54708987  54709037 TUPAC_0001:3:1:1:1833#0 37  +   -

Genomic interactions in tabular format are not easy to manipulate. We can easily parse a data.frame into a GInteractions object using the as_ginteractions() function.

library(plyinteractions)
#> 
#> Attaching package: 'plyinteractions'
#> The following object is masked from 'package:matrixStats':
#> 
#>     count
#> The following object is masked from 'package:IRanges':
#> 
#>     slice
#> The following object is masked from 'package:S4Vectors':
#> 
#>     rename
#> The following object is masked from 'package:stats':
#> 
#>     filter
gi <- bedpe_df |> 
    as_ginteractions(
        seqnames1 = V1, start1 = V2, end1 = V3, strand1 = V9, 
        seqnames2 = V4, start2 = V5, end2 = V6, strand2 = V10, 
        starts.in.df.are.0based = TRUE
    )
#> Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
#>   - in 'x': chr11
#>   - in 'y': chr10
#>   Make sure to always combine/compare objects based on the same reference
#>   genome (use suppressWarnings() to suppress this warning).

gi
#> GInteractions object with 3 interactions and 2 metadata columns:
#>       seqnames1             ranges1 strand1     seqnames2             ranges2 strand2 |                     V7        V8
#>           <Rle>           <IRanges>   <Rle>         <Rle>           <IRanges>   <Rle> |            <character> <integer>
#>   [1]      chr7 118965073-118965122       + ---      chr7 118970080-118970129       - | TUPAC_0001:3:1:0:145..        37
#>   [2]     chr11   46765607-46765656       + ---     chr10   46769935-46769984       - | TUPAC_0001:3:1:0:147..        37
#>   [3]     chr20   54704675-54704724       + ---     chr20   54708988-54709037       - | TUPAC_0001:3:1:1:183..        37
#>   -------
#>   regions: 6 ranges and 0 metadata columns
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths

The columns containing information for core fields of the future GInteractions object (e.g. seqnames1, strand2, …) can be specified using the key = value (supported by quasiquotation).

2.2 From pairs files

The pairs file format has been formally defined by the 4DN consortium. Its specifications are available here.

It can be imported in R as a data.frame using read.delim() or any other tabular data import functions (including fread() or vroom() for larger files), and readily coerced into GInteractions with as_ginteractions().

## This uses an example `pairs` file provided in this package
pairs_file <- system.file('extdata', 'pairs.gz', package = 'plyinteractions') 
pairs_df <- read.delim(pairs_file, sep = "\t", header = FALSE, comment.char = "#")
head(pairs_df)
#>                                           V1 V2  V3 V4     V5 V6 V7   V8   V9
#> 1  NS500150:527:HHGYNBGXF:3:21611:19085:3986 II 105 II  48548  +  - 1358 1681
#> 2  NS500150:527:HHGYNBGXF:4:13604:19734:2406 II 113 II  45003  -  + 1358 1658
#> 3 NS500150:527:HHGYNBGXF:2:11108:25178:11036 II 119 II 687251  -  + 1358 5550
#> 4   NS500150:527:HHGYNBGXF:1:22301:8468:1586 II 160 II  26124  +  - 1358 1510
#> 5  NS500150:527:HHGYNBGXF:4:23606:24037:2076 II 169 II  39052  +  + 1358 1613
#> 6  NS500150:527:HHGYNBGXF:1:12110:9220:19806 II 177 II  10285  +  - 1358 1416
pairs <- as_ginteractions(pairs_df, 
    seqnames1 = V2, start1 = V3, strand1 = V6, 
    seqnames2 = V4, start2 = V5, strand2 = V7, 
    width1 = 1, width2 = 1, 
    keep.extra.columns = FALSE
)
pairs
#> GInteractions object with 50000 interactions and 0 metadata columns:
#>           seqnames1   ranges1 strand1     seqnames2   ranges2 strand2
#>               <Rle> <IRanges>   <Rle>         <Rle> <IRanges>   <Rle>
#>       [1]        II       105       + ---        II     48548       -
#>       [2]        II       113       - ---        II     45003       +
#>       [3]        II       119       - ---        II    687251       +
#>       [4]        II       160       + ---        II     26124       -
#>       [5]        II       169       + ---        II     39052       +
#>       ...       ...       ...     ... ...       ...       ...     ...
#>   [49996]        II     86996       + ---        II    487591       +
#>   [49997]        II     86997       + ---        II     96353       -
#>   [49998]        II     86997       + ---        II    114748       -
#>   [49999]        II     86998       + ---        II     88955       +
#>   [50000]        II     86999       + ---        II     87513       +
#>   -------
#>   regions: 62911 ranges and 0 metadata columns
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2.3 Reverting from GInteractions to tabular data frames

The reverse operation to coerce GInteractions back to a tabular form is also possible using the as_tibble() function from the tibble package:

tibble::as_tibble(gi)
#> # A tibble: 3 × 12
#>   seqnames1    start1      end1 width1 strand1 seqnames2    start2      end2 width2 strand2 V7                         V8
#>   <fct>         <int>     <int>  <int> <fct>   <fct>         <int>     <int>  <int> <fct>   <chr>                   <int>
#> 1 chr7      118965073 118965122     50 +       chr7      118970080 118970129     50 -       TUPAC_0001:3:1:0:1452#0    37
#> 2 chr11      46765607  46765656     50 +       chr10      46769935  46769984     50 -       TUPAC_0001:3:1:0:1472#0    37
#> 3 chr20      54704675  54704724     50 +       chr20      54708988  54709037     50 -       TUPAC_0001:3:1:1:1833#0    37

3 Getter functions

3.1 anchors{12}

A GInteractions object consists of two sets of anchors: anchors1 and anchors2. Each set can be accessed with the corresponding function (anchors1() or anchors2()):