`r knitr::opts_chunk$set(tidy=FALSE)` # Extended case study: an S4 "Annotated Matrix" class Why bother? - Interoperability -- packages in the same work flow re-use the same data strutures, e.g., `GRanges` for describing regions of interest in sequencing experiments - Programmer efficiency -- classes enable re-use, avoiding implementing functionality that already exists. - Often, correct way to implement a class is not to -- re-use existing classes instead, even if it's not quite perfect Motivation - Numerical data typically augmented by important information about rows and columns - Rows ('regions of interested'): gene symbols, genome coordinates, significance values from other tests - Columns ('samples'): study identifiers, treatment groups, covariates - Significant risks of mis-aligning row and column data, with catastrophic and real-world consequences ## Declaration 'Is a' versus 'has a' - 'is a' caries a lot of baggage and introduces a lot of constraints - Particularly challenging when thinking about extending base R objects, where the 'API' is not well-defined Starting our class definition - A `matrix()` with row- and column `data.frame()` annotations - Construct with `setClass`; returns a simple generating function. ```{r setClass} .AnnMat <- setClass("AnnMat", representation(matData="matrix", rowData="data.frame", colData="data.frame")) ``` - 'Is a' (inheritance) relationship via `contains=` argument; multiple in inheritance possible. - In use ```{r generating-function} am0 <- .AnnMat() am1 <- .AnnMat( matData=matrix(1:10, 2, dimnames=list(letters[1:2], LETTERS[1:5])), rowData=data.frame(roi_id=1:2), colData=data.frame( sample_id=1:5, treatment=c("A", "A", "B", "B", "B"))) ``` A simple method: accessors ("getters") - Definition of generics ```{r getters} setGeneric("rowData", function(x, ...) standardGeneric("rowData")) setGeneric("colData", function(x, ...) standardGeneric("colData")) setGeneric("matData", function(x, ...) standardGeneric("matData")) ``` - Then methods implemented of class-specific methods on the generics ```{r getter-methods} setMethod("rowData", "AnnMat", function(x, ...) x@rowData) setMethod("colData", "AnnMat", function(x, ...) x@colData) setMethod("matData", "AnnMat", function(x, ...) x@matData) ``` A simple method: `dim` and `dimnames` - Discover existing generic, for signature ```{r dim-getGeneic} getGeneric("dim") ``` - Implement a method on our class ```{r dim-setMethod} setMethod("dim", "AnnMat", function(x) dim(matData(x))) setMethod("dimnames", "AnnMat", function(x) dimnames(matData(x))) ``` - **Question** Hey neat, `nrow()` and `ncol()` (and `rownames()` and `colnames()`) for free! Why is that? ```{r nrow-for-free} dim(am1) nrow(am1) ``` A simple method: `show` - Purpose: brief summary when displaying (printing) during interactive use - Existing generic: `getGeneric("show")` - Many existing methods: `showMethods("show", where=search())` - Our implementation: brief summary, with an eye toward re-use by derived classes. Avoid direct slot access ```{r show} setMethod("show", "AnnMat", function(object) { cat("class:", class(object), "\n") cat("dim:", dim(object), "\n") cat("rowData names():", names(rowData(object)), "\n") cat("colData names():", names(colData(object)), "\n") }) am1 ``` - **Question** Is there a better overall philosophy for `show`? A more complicated method: updating ('replacement', 'setter') methods - Provide the illusion and simple syntax for in-place modification - A familiar example: update the value of a column in a `data.frame` ```{r update} df <- data.frame(x=1:5, y=5:1) df[,"x"] <- log(df$x) ``` - R denotes `df[,"x"]` as `[.data.frame`, "the subset method for data.frame", and `df[,"x"] <- value` as `[<-.data.frame` "the subset-replace method for data.frame". - There actually _is_ a function `[<-.data.frame` ```{r subset-assign-df} head(get("[<-.data.frame")) ``` R's parser translates `df[,"x"] <- value` to `[<-.data.frame(x, , "x", value) and actually modifies (a copied, if necessary) first argument. - Replacement methods, e.g., `matData<-`, signature takes the object to be updated, additional optional arguments, and the value to update the argument with ```{r setter-matData} setGeneric("matData<-", function(x, ..., value) standardGeneric("matData<-")) ``` Dispatch on one or both of `x`, `value` - Impelement as a method that dispatches on both the object and value, updates the slot, and returns the updated object. ```{r setter-matData-impl} setReplaceMethod("matData", c("AnnMat", "matrix"), function(x, ..., value) { x@matData <- value x }) ``` **Exercise**: walk through how that assignment in the body works. - Another replacement method, for `dimnames<-` (the generic already exists; what is it?) ```{r setter-dimnames-impl} setReplaceMethod("dimnames", c("AnnMat", "list"), function(x, value) { dimnames(matData(x)) <- value value }) ```` - **Exercise**: walk through how that assignment in the body of the method works - Hey neat, we get `rownames<-` and `colnames<-` for free! A more complicated operation: validity - Constraints on row, column and matrix dimensions: all must be equal - `validity` argument to `setClass`, or `setValidity()` function call. - Validity function is weird - Each class in hierarchy visited, so no need to test for super-class properties - returns TRUE if the object is valid, a text string defining the transgression otherwise. - Evaluated frequently, so needs to be efficient / light-weight ```{r setValidity} setValidity("AnnMat", function(object) { msg <- NULL if (nrow(rowData(object)) != nrow(matData(object))) msg <- c(msg, "number of rowData rows and matData rows differ") if (nrow(colData(object)) != ncol(matData(object))) msg <- c(msg, "number of colData rows and matData columns differ") if (is.null(msg)) TRUE else msg }) ``` In action: ```{r validity-action} .AnnMat(matData=matrix(1:10, 2), rowData=data.frame(roi_id=1:2), colData=data.frame(sample_id=1:5)) cat(try({ .AnnMat(matData=matrix(1:10, 2), rowData=data.frame(roi_id=1:5), colData=data.frame(sample_id=1:2)) })) ``` A more complicated method: subsetting - Why do we need this? Part of the informal matrix 'API' expected by a user - Discovery ```{r subsetting} getGeneric("[") ``` - Possible methods multiply -- `x` times `i` times `j`; e.g., `i` could be integer, logical, character, ... - One approach -- _facade_ of methods that do minimal work to translate into a common base function - Special variable classes: `ANY`, `missing` - Exploit default `initialize` function, which acts as a copy constructor that updates slots in its first argument with values provided by named arguments. - Pass '...' to allow derived classes to use this method ```{r subset-facade} setMethod("[", c("AnnMat", "ANY", "ANY"), function(x, i, j, ..., drop=TRUE) { ## FIXME: warn user about ignoring 'drop'? initialize(x, matData=matData(x)[i, j, drop=FALSE], rowData=rowData(x)[i,,drop=FALSE], colData=colData(x)[,j,drop=FALSE], ...) }) setMethod("[", c("AnnMat", "ANY", "missing"), function(x, i, j, ..., drop=TRUE) { initialize(x, matData=matData(x)[i,,drop=FALSE], rowData=rowData(x)[i,,drop=FALSE]) }) setMethod("[", c("AnnMat", "missing", "ANY"), function(x, i, j, ..., drop=TRUE) { initialize(x, matData=matData(x)[,j,drop=FALSE], colData=colData(x)[j,,drop=FALSE]) }) setMethod("[", c("AnnMat", "missing", "missing"), function(x, i, j, ..., drop=TRUE) { initialize(x, ...) }) ``` - **Exercise**: create some simple unit tests for these methods. - Seems 'good enough' for numeric or logical indexes, what about character? What else have we agreed to in the matrix API?