Contents

1 Preamble

1.1 The methods package

S4 class system implemented in the methods package

The ingredients:

  • The core components: classes, generic functions and methods

  • The glue: method dispatch

S4 classes are sometimes called formal classes

The result: a rich and complex system (218 symbols exported from the methods package). Can be overwhelming. The good news is that we only need a small subset of the system.

1.2 Strengths of S4 (compared to S3)

1.2.1 Multiple dispatch

1.2.2 Object validation

1.2.3 API vs internals

Unlike other OO programming languages, S4 offers no provision for encapsulation: the internals of an object (i.e. all its slots) are visible and directly accessible. However, the Bioconductor approach is to consider that the internals are the business of the developer and that the end user should never need to access them directly or even look at them. Access should be only via accessor functions (getters or setters).

This has the following advantages:

  • The developer has the freedom to modify the internals of an object without breaking user code.

  • Carefully crafted setters will make sure that the object remains valid when being modified by the user.

  • This separation between what an object looks from the outside and what’s really inside allows the developer to use tricks at the internals level to make objects more compact in memory and/or more efficient.

So from a user point of view what matters is the semantic of an object i.e. what it represents and how it can be manipulated (API).

Even developers who implement a new class by extending an existing class should do it in a way that is agnostic about the internals of the class they extend. S4 makes this possible (it’s a great feature). More on this later.

1.2.4 Powerful coercion system

1.2.5 Virtual/concrete classes

A virtual class is a class that cannot be instanciated. Typically used as the parent of one or more concrete classes:

setClass("A", representation("VIRTUAL"))  # virtual class
setClass("A1", contains="A", slots=...some slots...)
setClass("A2", contains="A", slots=...some other slots...)

For example the GRanges and GPos classes are both subclasses of virtual class GenomicRanges:

suppressPackageStartupMessages(library(GenomicRanges))
showClass("GenomicRanges")
## Virtual Class "GenomicRanges" [package "GenomicRanges"]
## 
## Slots:
##                                           
## Name:    elementMetadata          metadata
## Class: DataTable_OR_NULL              list
## 
## Extends: 
## Class "Vector", directly
## Class "GenomicRanges_OR_missing", directly
## Class "GenomicRanges_OR_GRangesList", directly
## Class "Annotated", by class "Vector", distance 2
## 
## Known Subclasses: "GRanges", "GPos", "DelegatingGenomicRanges", "GNCList"

Many methods can be implemented in a way that works for the 2 kinds of objects.

"names" method:

selectMethod("names", "GRanges")
## Method Definition:
## 
## function (x) 
## names(ranges(x))
## <environment: namespace:GenomicRanges>
## 
## Signatures:
##         x              
## target  "GRanges"      
## defined "GenomicRanges"

Will also work on GPos objects because the ranges() getter works on these objects (and like for GRanges objects, it also returns a names IRanges object). So instead of defining 2 "names" methods (one for GRanges and one for GPos objects) we define one method for GenomicRanges objects. GRanges and GPos objects inherit it.

1.2.6 Virtual classes with no slots

A virtual class can have slots (like GenomicRanges), or no slots (like class A in our first example). Typically the slots of a virtual class are not enough to represent an object (this is why the class is made virtual). The concrete subclasses add the missing slots.

1.2.7 Multiple inheritance

Powerful but can lead to a class hierarchy that is very hard to maintain if not used carefully.

showClass("CompressedIRangesList")
## Class "CompressedIRangesList" [package "IRanges"]
## 
## Slots:
##                                                             
## Name:        elementType   elementMetadata          metadata
## Class:         character DataTable_OR_NULL              list
##                                           
## Name:         unlistData      partitioning
## Class:               ANY PartitioningByEnd
## 
## Extends: 
## Class "IRangesList", directly
## Class "CompressedList", directly
## Class "RangesList", by class "IRangesList", distance 2
## Class "List", by class "CompressedList", distance 2
## Class "Vector", by class "CompressedList", distance 3
## Class "Annotated", by class "CompressedList", distance 4
## 
## Known Subclasses: "CompressedNormalIRangesList"

S4 imposes no restrictions but I think we never had the need to define a class with more than 2 direct parents.

1.2.8 Class unions

Just a special kind of virtual class with no slots.

showClass("GenomicRanges_OR_GRangesList")
## Virtual Class "GenomicRanges_OR_GRangesList" [package "GenomicRanges"]
## 
## No Slots, prototype of class "S4"
## 
## Known Subclasses: 
## Class "GenomicRanges", directly
## Class "GRangesList", directly
## Class "GRanges", by class "GenomicRanges", distance 2
## Class "GPos", by class "GenomicRanges", distance 2
## Class "DelegatingGenomicRanges", by class "GenomicRanges", distance 2
## Class "GNCList", by class "GenomicRanges", distance 2

(This class is used in the definition of RangedSummarizedExperiment objects.)

1.2.9 Reference classes

Rarely needed. As a general principle, R objects should have a pass-by-copy semantic so passing them to a function is guaranteed to leave the original object unmodified. Pass-by-reference semantic is almost never needed and should be avoided.

1.3 A first example

Open A quick overview of the S4 class system (PDF) on the S4Vectors landing page and go to the Implementing an S4 class (in 4 slides) section.

Exercise

  • Implement the SNPLocations class (setClass() statement, validity method, and "length" method only). Allow the genome slot to contain an NA
setClass("SNPLocations",
    slots=c(
        genome="character",  # a single string or NA
        snpid="character",   # a character vector of length N
        chrom="character",   # a character vector of length N
        pos="integer"        # an integer vector of length N
    )
)

setValidity("SNPLocations",
    function(object)
    {
        if (length(object@genome) != 1)
            return("'genome' slot must have length 1")
        slot_lengths <- c(length(object@snpid),
                          length(object@chrom),
                          length(object@pos))
        if (length(unique(slot_lengths)) != 1)
            return("'snpid', 'chrom' and 'pos' slots must have the same length")
        TRUE
    }
)

setMethod("length", "SNPLocations", function(x) length(x@snpid))
  • Call new("SNPLocations"). What happens? Does this look right?

  • How could we change the behavior of new("SNPLocations") so it generates a valid object?

setClass("SNPLocations",
    slots=c(
        genome="character",
        snpid="character",
        chrom="character",
        pos="integer"
    ),
    prototype=list(genome=NA_character_)
)

2 Design and implementation of S4 objects

2.1 Two approaches

  • From scratch (a rare situation)

  • Extend an existing class (less work)

2.2 What to implement?

2.2.1 Core stuff

  • Class def (setClass statement)

  • Validity method

  • Constructor function (named as the class, should be documented and user friendly).

  • "show" method

2.2.2 API

At a minimum some getters Try to reuse existing generics instead of introducing new ones e.g.:

  • organism(), score(), strand() (defined in BiocGenerics)

  • genome(), seqinfo() (defined in GenomeInfoDb)

Depending on the shape of the objects, methods for some of the following core generic functions (from base R):

  • length(), names(), subsetting ([, [[), dim(), dimnames(), c()

Some setters (unless the objects are not intended to be modified)

  • Setters for the corresponding getters

  • Methods for some of names<-, subassignment ([<-, [[<-), dimnames<-

Specialized operations As methods for existing generics (e.g. normalize defined in BiocGenerics), or as new generics.

2.2.3 Coercion methods

setAs("SNPLocations", "data.frame",
    function(from)
        data.frame(snpid=from@snpid, chrom=from@chrom, pos=from@pos)
)

Exercise

Implement a coercion method from SNPLocations to GRanges.

2.3 About the shape of an object

Most objects fall in one of these categories

  • vector-like objects: length()

Examples: atomic vectors and ordinary list in base R; Rle, Hits, IRanges, GRanges, GRangesList, DNAString, DNAStringSet objects

  • list-like objects: vector-like + [[

Examples: ordinary list and data frames in base R. CharacterList, IRanges, IRangesList, GRangesList, DNAStringSet in Bioconductor.

  • matrix-like, array-like objects: dim() and length(x) (= prod(dim(x)))

Examples: matrix and array objects in base R. Matrix objects in the Matrix package. DelayedArray objects in the DelayedArray package.

  • data-frame-like objects: dim() + list-like along the columns

Examples: data.frame objects in base R. DataFrame objects in Bioconductor.

These categories are not mutually exclusive:

  • list-like objects are vector-like objects

  • array-like objects are also vector-like objects

  • data-frame-like objects are list-like objects: length(x) is ncol(x)), names(x) is colnames(x), x[[i]] is x[ , i] They are NOT matrix-like objects.

These categories reflect the shape of the object

  • linear shape: vector-like, list-like

  • rectangular shape: matrix-like, data-frame-like, SummarizedExperiment

  • multi-dimensional shape: array-like

Rectangular and multi-dimensional objects have an “implicit linear shape”. Think of it as the “backbone” of the object.

For example, 3 different linear shapes can be associated with a rectangular shape. They are described by the following “backbones”:

  1. “matrix-like backbone”:

    run from the top-left element to the bottom-right element of the rectangle, going thru the entire rectangle by running down along all the columns (continuing from the top of next column when reaching the bottom of a column)

This is the backbone of matrix-like objects. Array-like objects generalize this to N dimensions: backbone runs along the first dimension first (fastest changing dimension in R is the first one), then along the 2nd dimension, and so on…)

  1. “data-frame-like backbone”:

    run horizontally from first to last column of the rectangle

This is the backbone of data-frame-like objects.

  1. Bioconductor introduces a 3rd type of backbone for rectangular objects:

    run vertically from first to last row of the rectangle

This is the “SummarizedExperiment-like backbone”.

When implementing a class that represents rectangular objects, the developper should decide what the backbone of the objects will be. Concretely that means deciding what the "length" method will return: prod(dim(x)), ncol(x), or nrow(x)? Once this choice is made, things like names(x), x[i], and c() should behave consistently (e.g. c(x, y) should concatenate the 2 objects along their backbones).

Making it clear what the backbone is in the documentation will help the user understand (and predict) the behavior of length(x), names(x), x[i], c(), etc…

Exercise

What’s the shape of SNPLocations objects?

2.4 Subsetting

Subsetting is a powerful and very flexible operation in R.

4 subsetting operators: [, [[, [<-, [[<-

Different types of subsetting:

  • Single-bracket vs double-bracket
    • Single-bracket: [
    • Double-bracket: [[ (sometimes called list-style subsetting).
  • Linear vs multi-dimensional
    • Linear: only 1 subscript is used: x[i] or x[[i]]. Sometimes called 1D-style subsetting.
    • Multi-dimensional: more than 1 subscript is used: x[i_1, i_2, ..., i_n] or x[[i_1, i_2, ..., i_n]]. Some (or all) subscripts can be missing e.g. x[ , 3, ].
  • Extraction vs replacement
    • Extraction: y <- x[...] or y <- x[[...]]
    • Replacement: x[...] <- value or x[[...]] <- value. Subsetting is on the left side of the assignment. value is called the right value or replacement value. This form of subsetting calls the [<- or [[<- operators. Sometimes called subassignment.

R syntax allows any combination of these types e.g.:

  • y <- x[i, j, k]: Single-bracket multi-dimensional subsetting

  • x[i] <- value: Single-bracket linear subassignment

  • x[[i, j]] <- value: Double-bracket two-dimensional subassignment

  • etc…

==> 8 possible combinations.

Even though all these forms are syntactically valid, they don’t necessarily make sense for all objects (and they should fail in that case).

For multi-dimensional objects that support it, linear subsetting is expected to be along the backbone of the object (1 <= i <= length(x)).

Some examples:

  • Ordinary matrix, array, data.frame, DataFrame objects support the 8 forms. Gotcha: Implicit linear shape of matrix-like and data-frame-like objects differ, and so do linear subsetting on them.

  • SummuarizedExperiment objects support rectangular (x[i, j]) and linear (x[i]) subsetting and their replacement versions. x[i] and x[i] <- value are equivalent to x[i, ] and x[i, ] <- value, respectively. Gotcha: [[ is also supported but behaves in a non-standard way!

  • GRanges objects support linear [ and [<-. As a convenience, rectangular subsetting is also supported (even though GRanges objects are NOT rectangular).

Exercise

Find the general man page for subsetting in base R.

Find the specific man page for subsetting a data frame.

Look at the signatures of the [, [[, [<-, and [[<- generic functions.

Exercise

What form of subsetting SNPLocations objects would naturally support?

Implement it.

2.5 About API design

2.5.1 General recommendations

Avoid surprising the user.

Pay particular attention to how length(), names(), [, [[, c(), dim(), and dimnames() behave.

Also pay attention to 2 properties that are considered good properties:

  • Should the transformation I’m implementing preserve positionality?

  • Should it behave like an endomorphism?

2.5.2 Transformations that preserve positionality

Positionality is preserved when the output is parallel to the input, that is, it has the same length as the input and its i-th element corresponds to the i-th element in the input.

Examples:

  • All the operations in base R that are said to be vectorized e.g.:

    • is.na(), duplicated(), nchar(), chartr(), tolower(), grepl(), regexpr(), rank()

    • Functions from the Math group: log(), sqrt(), cumsum(), …

    • Binary operators: +, -, *, /, ^, ==, <, &, |, … Vectorized with respect to the longest operand.

    • match(), %in% (with respect to the 1st argument)

  • seqnames(), start(), end(), width(), strand() getters on a GRanges or GRangesList object.

  • Intra-range transformations in Bioconductor e.g. shift(), flank(), …

gr <- GRanges(c("chr1:51-70", "chr1:75-100", "chr1:6-55"))
gr
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1 [51,  70]      *
##   [2]     chr1 [75, 100]      *
##   [3]     chr1 [ 6,  55]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift(gr, 1000)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames       ranges strand
##          <Rle>    <IRanges>  <Rle>
##   [1]     chr1 [1051, 1070]      *
##   [2]     chr1 [1075, 1100]      *
##   [3]     chr1 [1006, 1055]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
  • Biostrings::translate() on a DNAStringSet object (not on a DNAString).

These operations typically propagate the names.

Also expected to propagate the metadata columns when applied to Bioconductor objects.

Do NOT preserve positionality:

  • grep(), sort(), order()

  • Summarization functions on atomic vectors e.g. min(), max(), sum(), mean(), prod(), all(), any(), … However they DO preserve positionality when applied to NumericList or LogicalList objects from Bioconductor!

  • Set operations e.g. union(), intersect(), …

  • Inter-range transformations in Bioconductor (e.g. reduce(), disjoin(), …) do not preserve positionality in general:

reduce(gr)
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1 [ 6,  70]      *
##   [2]     chr1 [75, 100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

These operations do NOT propagate the names or metadata columns.

Exercise

Try unique(), sort(), order() on NumericList, CharacterList, and RleList objects.

How do you extract the unique values object-wise?

Count the values in a NumericList object that are greater than a given value.

2.5.3 To be or not to be an endomorphism

An endomorphism is a transformation that preserves the class of the object.

Examples:

  • [ (when drop=FALSE), except linear subsetting on an array-like object

  • Replacement methods: [<- and [[<- and other setters (e.g. names<-, mcols<-, seqinfo<-, …) With some notable exceptions:

    • [<- on an atomic vector when the storage.mode of the vector cannot represent the right-value (e.g. x[2] <- 3.75 on an integer vector, or x[2] <- "a" on a numeric vector)

    • rowRanges() setter on a SummarizedExperiment instance.

  • c() on linear objects (e.g. on Hits, GRanges, GRangesList, DNAStringSet objects, …) With a notable exception on factors (this is bad!)

  • rbind() and cbind() on rectangular objects

  • Intra-range transformations

NOT endomorphisms:

  • c() on array-like objects

  • Inter-range transformations are not endomorphisms in general:

gpos <- GPos(gr)
gpos
## GPos object with 96 positions and 0 metadata columns:
##        seqnames       pos strand
##           <Rle> <integer>  <Rle>
##    [1]     chr1        51      *
##    [2]     chr1        52      *
##    [3]     chr1        53      *
##    [4]     chr1        54      *
##    [5]     chr1        55      *
##    ...      ...       ...    ...
##   [92]     chr1        51      *
##   [93]     chr1        52      *
##   [94]     chr1        53      *
##   [95]     chr1        54      *
##   [96]     chr1        55      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
reduce(gpos)
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1 [ 6,  70]      *
##   [2]     chr1 [75, 100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
  • Biostrings::translate()

2.6 Reusing existing classes

It’s very likely that Bioconductor already has a container suited for the kind of data that you want to handle in your package.

If your data is rectangular, the first choice would be SummarizedExperiment.

If not, then maybe extending an existing class by adding a few slots is all you need to do.

Many packages have already done this with SummarizedExperiment e.g.:

  • DESeq2 with the DESeqDataSet and DESeqTransform containers

  • minfi with the MethylSet, RGChannelSet, RatioSet, GenomicMethylSet, and GenomicRatioSet containers

  • VariantAnnotation with the VCF class.

  • InteractionSet with the InteractionSet container.

  • GenomicFiles with the GenomicFiles container.

  • SGSeq with the SGVariantCounts and SGFeatureCounts containers.

  • AllelicImbalance with the ASEset, RiskVariant, RegionSummary, DetectedAI, and LinkVariantAlmlof containers.

  • and many more…

Ask on the bioc-devel mailing list if you need help finding the right container for your data (to use as-is or to extend).

The SummarizedExperiment vignette (available here) has a section dedicated to how to extend the RangedSummarizedExperiment class.

Starting a new class from scratch should be the last resort. Even in that case, it’s very likely that you could extend one of the virtual classes that are at the bottom of our core class hierarchy.

For example:

  • By extending the Annotated class you get the metadata() getter and setter out-of-the-box.

  • By extending the Vector class (only if your objects are vector-like objects), you get the above (Vector extends Annotated), plus:
    • length() out-of-the box
    • mcols() getter and setter out-of-the box
    • Validation of the parallel slots almost out-of-the box (via definition of a "parallelSlotNames" method)
    • Automatic subsetting of the parallel slots if you decide that your object should support subsetting
  • By extending the List class (only if your objects are list-like objects), you get all the above (List extends Vector), plus:
    • elementType() almost out-of-the box (via adding a prototype in your class definition)
    • as.list() out-of-the box
    • access to a mechanism that facilitates implementation of list-style subsetting ([[ and [[<-).

Exercise

  • Modify the definition of the SNPLocations class to make it extend Vector.

  • What to we gain by doing this?

2.7 Extending a class without adding any slot

Sometimes we need to extend a class not to add slots to it, but to add constraints on the objects.

For example NormalIRanges objects have the same slots as IRanges objects but NormalIRanges objects have the obligation to be normal (see Normality section in ?isNormal for what that means).

A simpler example is the following: let’s say we want to implement a class that represent IRanges objects of constant width. A simple option is to just extend the IRanges class:

setClass("ConstantWidthIRanges", contains="IRanges")

and add the “constant width” constraint via a validity method:

setValidity("ConstantWidthIRanges",
    function(object)
    {
        w <- width(object)
        if (length(w) != 0L && any(w != w[[1]]))
            return("object width is not constant")
        TRUE
    }
)

(Should we discuss length(w) != 0L && any(w != w[[1]] vs length(unique(w)) > 1?)

The S4 system provides automatic coercion methods from parent to child (promotion) and from child to parent (downgrade).

There is a gotcha with these methods though.

Let’s look at promotion:

x <- as(IRanges(6:8, width=10), "ConstantWidthIRanges")
y <- as(IRanges(6:8, end=10), "ConstantWidthIRanges")

Surprisingly the coercion worked for y but produced an invalid ConstantWidthIRanges object:

validObject(x)  # TRUE
validObject(y)  # error!

Why?

Let’s look at downgrade:

x2 <- as(x, "IRanges")
y2 <- as(y, "IRanges")

Works fine (both objects are valid IRanges instances).

So coercion from IRanges to ConstantWidthIRanges is broken. How can we fix it?

Note that there are other approaches for implementing constant-width IRanges objects.

2.8 Overriding existing methods

When we extend a class, we get a lot of things that work out-of-the-box (thanks to inheritance).

A few things don’t work exactly as expected though, so we need to fix them. The standard mechanism for doing this is by overriding existing methods.

3 important things to keep in mind when overriding methods:

  1. We only need to do this for methods that don’t behave as expected.

  2. We should NEVER override a method to change its semantic! For example implementing a "width" method for ConstantWidthIRanges objects that return the width as a single number would be a very bad idea. Why? What should we do if we want something like that?

  3. Last but not least: sometimes we do it to improve performance!

Example: "sum" method for CompressedIntegerList objects.

2.9 Extending vs composition

Let’s say we want to implement a class for representing a view on an array i.e. a multi-dimensional window on an array. The window can be represented by a set of ranges, one range per dimension e.g. [1-4, 1-5] for the topleft corner of a matrix. For an array with N dimensions, we need N ranges to represent the view. We’ll call this kind of object a “viewport” (by analogy with the terminology used in the computer graphics world).

In our viewport objects, we also want to store the dimensions of “the reference array” i.e. the array on top of which the viewport is defined (a.k.a. “the underlying array”). This will be useful for validation purposes.

Consider the 2 following implementations:

setClass("ArrayViewport1",
    contains="IRanges",
    representation(
        refdim="integer"
    )
)

setClass("ArrayViewport2",
    representation(
        refdim="integer",
        ranges="IRanges"
    )
)

One extends the IRanges class so inherits the IRanges slots and full IRanges API. is(x, "IRanges") will be TRUE on these objects. This says: “an ArrayViewport1 object is an IRanges object so can be used anywhere an IRanges object is expected”.

The other one uses composition. Unlike ArrayViewport1 objects, ArrayViewport2 objects are not considered to be IRanges object. But they have one inside them (stored in the 'ranges' slot).

In the end, the 2 containers store the same data but their semantics are different. Which one to choose?

Now I want my viewport objects to have the shape of an array i.e. dim() should return the dimensions of the sub-array delimitted by the viewport. Can I make the right choice now?

2.10 Documenting an S4 class

  • Talk about accessors, not about slots.

  • Focus on semantics, not on implementation details.

  • If your class extends a class that is already well documented, refer the reader to the man page of that class and add a link to it (typically in the \seealso section). Don’t repeat this information in your man page.

  • Add \alias{MyClass} (in addition to \alias{class:MyClass} and \alias{MyClass-class}) to make it easier for the end-user to find your man page.

See ?GPos in the GenomicRanges package for an example.

3 ArrayGrid objects dissected

library(DelayedArray)
## Loading required package: matrixStats
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
## 
##     apply
grid <- ArrayRegularGrid(c(10L, 10L), spacings=c(5L, 4L))
grid
## 2 x 3 ArrayRegularGrid object on a 10 x 10 array:
##      [,1]        [,2]        [,3]        
## [1,] [1-5, 1-4]  [1-5, 5-8]  [1-5, 9-10] 
## [2,] [6-10, 1-4] [6-10, 5-8] [6-10, 9-10]

4 Resources

5 Acknowledgements

Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)