Contents

1 R as we know it

1.1 Memory management: copy-on-change; garbage collection

Copy on change

  • It seems like R variables do not share any memory

    x <- y <- 1:5
    x
    ## [1] 1 2 3 4 5
    y
    ## [1] 1 2 3 4 5
    x[1] <- 2L
    x
    ## [1] 2 2 3 4 5
    y
    ## [1] 1 2 3 4 5
    x <- 1:5
    fun = function(z) { z[1] = 2L; z }
    fun(x)
    ## [1] 2 2 3 4 5
    x
    ## [1] 1 2 3 4 5
    • Different from many programming languages
    • Very convenient for users – they don’t have to think about ‘change-at-a-distance’.
  • R uses a ‘named’ concept, rather than strict ‘reference counting’

    • Counts the number of references to a location in memory…
    • But can only count to 2
    • So ‘2’ means that at some point in the memory location’s history, 2 or more symbols referenced that location.
  • In action

    x = 1:5
    .Internal(inspect(x))
    ## @480c138 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    • @41de7d8: memory address.
    • 13 INTSXP: integer ‘S-expression’ (more later).
    • [NAM(1)]: One symbol (x) references the location in memory.
    x = y = 1:5
    .Internal(inspect(x))
    ## @48f2f18 13 INTSXP g0c3 [MARK,NAM(2)] (len=5, tl=0) 1,2,3,4,5
    .Internal(inspect(y))
    ## @48f2f18 13 INTSXP g0c3 [MARK,NAM(2)] (len=5, tl=0) 1,2,3,4,5
    • x, y point to same location in memory
    • [NAM(2)]: (at least) two symbols reference the location
  • ‘Copy-on-change’: when updating a vector, make a copy if NAM() == 2.

    ## copy on change
    x = y = 1:5
    .Internal(inspect(x))
    ## @3f94c98 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    x[1] = 2L
    .Internal(inspect(x))
    ## @3cf5db0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 2,2,3,4,5
    .Internal(inspect(y))
    ## @3f94c98 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    ## no copy necessary
    x = 1:5
    .Internal(inspect(x))
    ## @3bf3838 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    x[1] = 2L
    .Internal(inspect(x))
    ## @3aef798 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 2,2,3,4,5

Garbage collection (approximate description)

  • S-expressions (SEXP) allocated by the system to a pool of available S-expressions
  • x <- 1 moves the SEXP from allocated to ‘in use’ pool, level 0 (the 0 in g0c4 of .Internal(inspect(x)))
  • After rm(x), a symbol no longer references the SEXP and it is a candidate for re-use.
  • gc() (manual, or more typically triggered automatically when R runs out of SEXP in the available pool)

    • Visits all SEXP in g0 pool and marks as ‘unreferenced’
    • Visits all symbols in use and marks their SEXP as ‘in use’
    • Visits (second time) all SEXP in g0 pool. Moves unreferenced symbols to the ‘available’ pool, moves ‘in use’ to the g1 pool.
  • Note: SEXP belong to ‘generations’ and classes

    y = 1
    fun = function() { x <- 1; 2 }
    fun(); gc()    # scans x, y; collects x, ages y
    ## [1] 2
    ##          used (Mb) gc trigger (Mb) max used (Mb)
    ## Ncells 503648 26.9     940480 50.3   940480 50.3
    ## Vcells 939506  7.2    1941543 14.9  1199408  9.2
    fun(); gc()    # scans & collects x, does not visit y
    ## [1] 2
    ##          used (Mb) gc trigger (Mb) max used (Mb)
    ## Ncells 503747 27.0     940480 50.3   940480 50.3
    ## Vcells 939776  7.2    1941543 14.9  1199408  9.2
    • g0 generation are recently allocated, and checked for suitability for garbage collection most frequently – typically symbols allocated in functions and no longer in scope; ripe for collection.
    • g1 has survived increasingly many garbage collections – package or globals symbols that seldom become eligible for collection.
    • Net effect: g0 nodes enriched for SEXP that can collected.
  • Resources: R Internals manual e.g., the write barrier, Luke Tierney’s notes; memory.c.

Performance consequences

  • Allocation and copying are comparatively expensive
  • Some common use patterns involve a lot of copying!

    • c(), rbind(), cbind() in a loop
    • Modifying a data.frame(), list(), or S4 object.

1.1.1 Exercises

Use .Internal(inspect()) to follow memory allocated to x in the following loop.

x <- integer()
n <- 5
for (i in 1:n)
    x <- c(x, i)

Answer:

n = 5
x = integer(); .Internal(inspect(x))
for (i in 1:n) {
    x <- c(x, i)
    .Internal(inspect(x))
}

x = integer(); .Internal(inspect(x))
for (i in 1:n) {
    x[i] = i
    .Internal(inspect(x))
}
x
  • If an integer occupies about 8 bytes of memory, and each iteration in the loop copies the current allocation into the new allocation, how many bytes are copied in a loop with n iterations?

  • Interpret the output of tracemem() for memory allocations when updating a value in a data.frame(). How much of the data.frame is being copied?

    df <- mtcars
    tracemem(df)        # The overall data.frame
    ## [1] "<0x45779e0>"
    tracemem(df[[1]])   # The specific column
    ## [1] "<0x1766760>"
    df[1, 1]
    ## [1] 21
    df[1, 1] = 22.1
    ## tracemem[0x45779e0 -> 0x44fcd68]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x44fcd68 -> 0x447ce60]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x447ce60 -> 0x447d058]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x1766760 -> 0x264a210]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
    df[2, 2] = 22.2
    ## tracemem[0x447d058 -> 0x4464e78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x4464e78 -> 0x4464f20]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x4464f20 -> 0x4465070]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
  • Compare this to memory allocations when modifying a similarly sized matrix

    m <- m1 <- matrix(0, nrow(mtcars), ncol(mtcars))
    tracemem(m)         # A matrix is a vector with dims attributes
    ## [1] "<0x477afd0>"
    m[1, 1] = 22.1
    ## tracemem[0x477afd0 -> 0x45b3820]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
    m[2, 1] = 22.2
  • And to a ‘vectorized’ update

    df <- mtcars
    tracemem(df); tracemem(df[[1]])
    ## [1] "<0x45779e0>"
    ## [1] "<0x1766760>"
    df[1:2, 1] <- c(22.1, 22.2)
    ## tracemem[0x45779e0 -> 0x3f57e58]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x3f57e58 -> 0x3f57f00]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x3f57f00 -> 0x3f58050]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
    ## tracemem[0x1766760 -> 0x3d04bd0]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
    m <- m1 <- matrix(0, nrow(mtcars), ncol(mtcars))
    tracemem(m)         # A matrix is a vector with dims attributes
    ## [1] "<0x4615310>"
    m[1:2, 1] <- c(22.1, 22.2)
    ## tracemem[0x4615310 -> 0x3cd6510]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

1.2 Correct, robust, simple, fast: practical techniques

Here’s a simple function, implemented poorly

f <- function(n) {
    x <- integer()
    for (i in 1:n)
        x <- c(x, i)
    x
}

Correct

  • Most important!
  • identical() – very strict
  • all.equal() – allows for numerical approximation
  • Unit tests!

    library(testthat)
    test_that("f() is correct", {
        expect_identical(f(5), 1:5)
    })

Robust

  • Edge cases, e.g., length 0 iterations, NA values, … Here we add some additional tests, and put them in a function for easy reuse.

    tests <- function(f) {
        test_that("f() is correct and robust", {
            expect_identical(f(5), 1:5)
            expect_identical(f(1), 1L)
            expect_identical(f(0), integer(0))
        })
    }
  • Our code fails the unit tests, because 1:n fails with n == 0

    tryCatch(tests(f), error = conditionMessage)
    ## [1] "Test failed: 'f() is correct and robust'\n* f(0) not identical to integer(0).\nLengths differ: 2 vs 0\n"
  • Revised code, robust to common use case

    f1 = function(n) {
        x <- integer()
        for (i in seq_len(n))
            x = c(x, i)
        x
    }
    tests(f1)

Fast

  • system.time(), but can be considerable run-to-run variation
  • microbenchmark

    library(microbenchmark)
    res <- microbenchmark(f1(100), f1(200), f1(400), f1(800), f1(1600))
    plot(res)

  • Scales quadratically, because of copying. Solutions?

    f2 <- function(n) {
        sapply(seq_len(n), function(i) i)
    }
    tests(f2)  # fails! n == 0
    f2a <- function(n) {
        if (n == 0) {
            integer()
        } else {
            sapply(seq_len(n), function(i) i)
        }
    }
    tests(f2a)
    f3 <- function(n) {
        vapply(seq_len(n), function(i) i, integer(1))
    }
    test_that("f() is correct", {
        expect_identical(f3(5), 1:5)
        expect_identical(f3(1), 1L)
        expect_identical(f3(0), integer(0))
    })
    tests(f3)
    f4 <- function(n) {
        x <- integer(n)
        for (i in seq_len(n))
            x[[i]] = i
        x
    }
    tests(f4)
    f5 <- seq_len
    tests(f5)
    n <- 1000
    microbenchmark(f1(n), f2a(n), f3(n), f4(n), f5(n))

1.3 Symbol resolution and evaluation

Environments

  • Current environment

    environment()
    ## <environment: R_GlobalEnv>
  • surprisingly central to R

    env <- new.env()
    env[["foo"]] <- 123
    env[["foo"]]
    ## [1] 123
    get("foo", env)      # equivalent (not really!)
    ## [1] 123
    ls(env)
    ## [1] "foo"
  • hash table of key-value pairs
  • Key is called a name or symbol in R
  • constant-time lookup

  • ‘reference’ semantics (almost all else in R is ‘copy-on-change’ semantics)

    another <- env
    env[["bar"]] <- 456
    another[["bar"]]
    ## [1] 456
  • All environments have a ‘parent’ environment

    parent.env(env)
    ## <environment: R_GlobalEnv>
    • special environment: emptyenv()
    • parent environment is set during construction (argument to new.env())
  • Environment can be chained together as a linked list

    top <- new.env(parent = emptyenv())
    mid <- new.env(parent = top)
    bot <- new.env(parent = mid)
  • Symbol (key) look-up

    top[["top"]] <- 1
    mid[["mid"]] <- 2
    bot[["bot"]] <- 3
  • Symbols not found in the reference enviroment are looked for in the parent environment

    get("bot", bot)
    ## [1] 3
    get("mid", bot)
    ## [1] 2
    tryCatch({
        get("mid", bot, inherits=FALSE) # restricts search to current env
    }, error = conditionMessage)
    ## [1] "object 'mid' not found"
    bot[["mid"]]                        # NULL; `[[` restricts to current env
    ## NULL
  • Assignment is to the current environment

    bot[["mid"]] <- 5
    bot[["mid"]]
    ## [1] 5
    mid[["mid"]]
    ## [1] 2
  • Functions are intimately connected to environments

    fun <- function() {
        environment()
    }
    fun()              # an environment defines the body of a function
    ## <environment: 0x593b1a8>
    fun()              # new function, so new environment
    ## <environment: 0x59a57f0>
  • What happens to the environment in a function?

    • garbage collection (later!) when no longer referenced
  • If a function body has an environment, there must be a parent environment

    fun <- function() {
        env <- environment()
        list(env, parent.env(env))
    }
    fun()
    ## [[1]]
    ## <environment: 0x4bb9690>
    ## 
    ## [[2]]
    ## <environment: R_GlobalEnv>
    • Is the parent of the environment of a function the environment in which the function was called from, or the environment in which the function was defined?
    fun <- function() {
        fun1 <- function() {
            env1 <- environment()    # envrionment created by fun1
            par1 <- parent.env(env1) # parent environment
            list(env1 = env1, par1 = par1)
        }
        env <- environment()        # environment created by fun
        par <- parent.env(env)
        list(env = env, par = par, fun1=fun1)
    }
    fun()
    ## $env
    ## <environment: 0x3d44050>
    ## 
    ## $par
    ## <environment: R_GlobalEnv>
    ## 
    ## $fun1
    ## function () 
    ## {
    ##     env1 <- environment()
    ##     par1 <- parent.env(env1)
    ##     list(env1 = env1, par1 = par1)
    ## }
    ## <environment: 0x3d44050>
    • The parent of a function’s environment is the environment in which it was defined (different from the environment from which it was called)

    • Lookup, assignment (<-) and assignment to the parent(s) environment (<<-)

    fun = function() {
        x0 <- 0
        x1 <- 0
        fun1 <- function() {
            y <- 1     # create local variable y
            x1 <<- 2   # assign to first occurrence of 'x' in parent env(s)
            z <<- 3    # no z in parent env(s), so created in top env
            c(x0 = x0, x1 = x1, y = y, z = z)
        }
    
        result <- fun1()
        list(x0 = x0, x1 = x1, result = result)
    }
  • This is called lexical scope

  • When is lexical scope used?

Package namespaces, imports, and search()

  • The `search() path

    search()
    ##  [1] ".GlobalEnv"             "package:microbenchmark"
    ##  [3] "package:testthat"       "package:BiocStyle"     
    ##  [5] "package:stats"          "package:graphics"      
    ##  [7] "package:grDevices"      "package:utils"         
    ##  [9] "package:datasets"       "package:methods"       
    ## [11] "Autoloads"              "package:base"
  • library(BiocGenerics): loading and attaching a package

    1. Package NAMESPACE is loaded into the current R session

    2. An environment containing exported symbols from the package is attached to the R search path

  • ‘Attaching’ a package means that the .GlobalEnv’s parent points to the new package exported symbol environment. The new package exported symbol environment points to the package previously pointed to by .GlobalEnv

  • A package consists of a NAMESPACE

    • A NAMESPACE is an environment.
    • The parent of the NAMESPACE is an environment containing imported symbols.
    • The parent of the environment with imported symbols is the base environment.
    • The parent of the base environment is the .GlobalEnv.
    • …and so on down the search path.
  • Symbol resolution in packages follows lexical scope

    • Resolved in the function, then package NAMESPACE, then imports environment, then base then .GlobalEnv, then …

Evaluation, lazy and non-standard evaluation

1.3.1 Exercises

Exercise: bank account

  • Congratulations, you own a bank! Here’s a function that creates an account, and provides access to it

    account <- function() {
       balance <- 0
    
       available <- function() {}
    
       deposit <- function(amt) {}
    
       withdraw <- function(amt) {}
    
       list(available = available, deposit = deposit, withdraw = withdraw)
    }

Answer:

account <- function() {
   balance <- 0

   available <- function() {
       balance
   }

   deposit <- function(amt) {
       balance <<- balance + amt
       balance
   }

   withdraw <- function(amt) {
       if (amt > balance)
           stop("insufficient funds")
       deposit(-amt)
   }

   list(available = available, deposit = deposit, withdraw = withdraw)
}
  • Implement for the following test case, using the concept of lexical scope:

    library(testthat)
    test_that("I understand lexical scope", {
        acct1 <- account()
        expect_equal(acct1$deposit(10), 10)
        expect_equal(acct1$deposit(12), 22)
        expect_equal(acct1$available(), 22)
        expect_equal(acct1$withdraw(20), 2)
        expect_error(acct1$withdraw(20), "insufficient funds")
    })
  • What happens when you get a second customer?

    test_that("functions create local scope", {
        acct1 <- account(); acct1$deposit(10)
        acct2 <- account(); acct2$deposit(20)
        expect_equal(acct1$available(), 10)
        expect_equal(acct2$available(), 20)
    })

Answer:

account <- function() {
   balance <- 0

   available <- function() {
       balance
   }

   deposit <- function(amt) {
       balance <<- balance + amt
       funs
   }

   withdraw <- function(amt) {
       if (amt > balance)
           stop("insufficient funds")
       deposit(-amt)
   }

   funs = list(
       available = available, deposit = deposit, withdraw = withdraw
   )

   funs
}

Exercise: bank account 2

  • What changes to implementation would be require to allow

    test_that("endomorphisms rock", {
        expect_equal(account()$deposit(10)$withdraw(5)$available(), 5)
    })

Exercise: package-local options

  • Implement ‘package-local’ options that the user can set via functions. Do this by using local() to create and populate an environment that contains the package-local variables as well as accessor functions get() and set() to retrieve / assign variables, e.g., tolerance. Here’s a start, and some unit tests

    .myoptions <- local({
        tolerance = 1e-4
    
        get = function() {
            tolerance
        }
    
        set = function(value) {
        }
    
        ## ...
    })
    
    ## setTolerance = ...

Answer:

.myoptions <- local({
    tolerance = 1e-4
    get = function() tolerance
    list(
        get = get,
        set = function(value) {
            ovalue <- get()
            tolerance <<- value
            ovalue
        }
    )
})

getTolerance <- .myoptions$get
setTolerance <- .myoptions$set
  • Here are some unit tests

    test_that("local options can be set", {
        expect_equal(getTolerance(), 1e-4)
        ## setTolerance() returns previous value, to ease resetting
        expect_equal(otol <- setTolerance(1e-5), 1e-4)
        expect_equal(setTolerance(otol), 1e-5)
    })

1.4 Debugging

f <- function(x) {
    ## ...
    g()
}

g <- function(x) {
    ## ...
    if (log(x) < 0) {
        TRUE
    } else {
        FALSE
    }
}
  • Invoke traceback() after error.

    f(-1)
    traceback()
  • debugonce(g): enter the ‘browser’ when the function is invoked. See ?browser for what can be done. Variant: debug(g)undebug(g).

    debugonce(g)
    f(-1)
    ## Browser[2]>
    ## ...
  • browser(). Edit the source code (not very practical for complex or package code)

    g <- function(x) {
        ## ...
        browser()
        if (log(x) < 0) {
            TRUE
        } else {
            FALSE
        }
    }
  • options(error=recover). When done, options(error=NULL). Actually, any function accepting 0 arguments is possible for recover, e.g., options(error=traceback).

    f(-1)
    ## Error: all(x >= 0) is not TRUE
    ##
    ## Enter a frame number, or 0 to exit
    ##
    ## 1: f(-1)
    ## 2: #2: stopifnot(all(x >= 0))
    ##
    ## Selection: 2
    ## Called from: f(-1)
    ## Browse[1]> c
    ##
    ## Enter a frame number, or 0 to exit
    ##
    ## 1: f(-1)
    ## 2: #2: stopifnot(all(x >= 0))
    ##
    ## Selection:
  • trace() – especially useful for tracing execution (e.g., tracer = quote(print(argname))), and for S4 methods (argument signature=)

    ## like debug
    trace(g, tracer=browser)
    f(-1)
    trace(g, quote({ print(x); print(log(x)) }))
    for (i in runif(1000, -1, 100)) f(i)
    
    trace(g, quote(if (x < 0) browser()))
    for (i in runif(1000, -1, 100)) f(i)
    library(GenomicRanges)
    showMethods("findOverlaps")
    selectMethod("findOverlaps", c("GRanges", "GRanges"))
    trace(findOverlaps, browser, signature=c("GenomicRanges", "GenomicRanges"))

Found a bug, now what?

  • Recover more gracefully. (Somehow not a good solution – would rather avoid the problem than treat the symptom).

    g <- function(x) {
        ## ...
        tryCatch({
            if (log(x) < 0)
                TRUE
            else
                FALSE
        }, error = function(e) {
           NA
        })
  • Program more cleverly / robustly – in our trivial example, the if() is irrelevant, and introduces a cryptic error message!

    g <- function(x) {
        ## ...
        log(x) < 0  # NA, instead of error from if (NA) (!)
  • Best: assert pre-conditions (and program more cleverly).

    f <- function(x) {
        stopifnot(is.numeric(x), length(x) == 1, !is.na(x), x >= 0)
        ## ...
        g(x)
    }

1.5 Packages

Structure: A simple directory with specific files and structure

  • DESCRIPTION and NAMESPACE files (required)
    • DESCRIPTION: key-value pairs describing the package, e.g., ‘Title:’, ‘Version:’, ‘License:’, etc.
  • R/ directory
    • Files containing functions
    • Following a logical organization with consistent formatting
  • man/ directory
    • Files in ‘.Rd’ format
    • Structured in parallel to R/ directory, R/foo.R –> R/foo.Rd
  • vignettes/ directory
    • One or more vignettes, R markdown (.Rmd) or LaTeX / Sweave (.Rnw)
    • Markdown much prefered for user-oriented documentation.
    • My preference: no ‘derived’ files, e.g., HTML produced from Rmd.
  • inst/
    • data/: R objects, e.g., ‘.rds’ file.
    • extdata/: Non-R data resources, e.g., SQL data base.
    • …/: any other files or folders that need to be _inst_alled into the R library.
  • src/: C, C++, Fortran, or other ‘foreign’ code.

Creation

  • devtools::create("MyPackage")

Development

  • devtools::load_all() to reduce the edit-install-run loop.
  • devtools::use_testthat() and devtools::test() for testthat unit tests.
  • devtools::document() (roxygen2) for documentation.
  • devtools::test() (run unit tests); devtools::check()
  • devtools::install()

N.B.

  • ALL of the steps above (package creation, installation, writing .Rd documentation files, build, check) can be done ‘easily’ from scratch.
  • Final arbiter:

    R CMD build MyPackage   # source for version 0.99.1 of package
    R CMD check MyPackage_0.99.1.tar.gz

Mangement

2 Finding the inner R: the C API and implementation

2.1 Behind the curtain

S-expressions

  • R ‘atomic’ vectors (numeric(), logical(), etc), lists, environments, functions, etc., are represented by C structs
  • The C struct is called an ‘S-expression’, represented by the symbol SEXP (which I pronounce ‘S-exp’)
  • SEXP is defined in Rinternals.h
  • An SEXP is polymorhpic – the interpretation of the struct depends on type information contained in the SEXP itself. Types are enumerated as INTSXP, REALSXP, …
    • STRSXP: character(), each element of a character vector is a CHARSXP.
    • VECSXP: list(); LISTSXP is a ‘pair-list’ primarily used to represent function arguments.
    • EXTPTRSXP: an R reference to arbitrary data that is not represented in R, e.g., an C++ class.
  • All SEXP contain sexpinfo, with named, garbage collect, marked, and other status, as well as actual data (e.g., the integers of an integer() vector.
  • Macros (actually, functions in package code) control access to the SEXP data structure. Several styles
    • e.g., LENGTH(), Rf_length()
    • Generally, prefer the Rf_* style.

From R to C

  • Several routes – .Internal(), .External(), .C(), .Call(), …
  • .C(): R does some work to coerce R types, e.g., integer() to C types, int *; useful for many light-weight purposes.
    • character() and more complicated objects difficult to deal with.
    • Return value must be allocated and passed from R.
  • .Call(): R arguments are represented as SEXP
    • Flexible; main focus of serious development.
    • Rcpp uses this convention, but wraps the details of R’s SEXP in C++ idioms.

The public API

2.2 Old-school approaches to foreign langauges

When to worry?

  • Access third-party libraries
  • Novel algorithms
  • Simpler reasoning
  • Copy-free operations

.C()

.Call()

2.2.1 Example: uuid

  • uuid: globally unique id.
  • Provided by Boost uuid ‘header only’ C++ library.
  1. Create a package and ‘src’ directory. Add the LinkingTo: tag to indicate that the package is going to link to (C++) header files provided by the BH (’Boost Headers’) package.

    devtools::create("/tmp/uuid", list(LinkingTo = "BH"))
    ## Creating package 'uuid' in '/tmp'
    ## No DESCRIPTION found. Creating with values:
    ## Package: uuid
    ## Title: What the Package Does (one line, title case)
    ## Version: 0.0.0.9000
    ## Authors@R: person("First", "Last", email = "first.last@example.com", role = c("aut", "cre"))
    ## Description: What the package does (one paragraph).
    ## Depends: R (>= 3.4.0)
    ## License: What license is it under?
    ## Encoding: UTF-8
    ## LazyData: true
    ## LinkingTo: BH
    ## * Creating `uuid.Rproj` from template.
    ## * Adding `.Rproj.user`, `.Rhistory`, `.RData` to ./.gitignore
    dir.create("/tmp/uuid/src")
  2. Write unit tests

    devtools::use_testthat("/tmp/uuid")
    ## * Adding testthat to Suggests
    ## * Creating `tests/testthat`.
    ## * Creating `tests/testthat.R` from template.
    cat('
    context("uuid")
    
    test_that("uuid returns different values", {
        expect_true(is.character(uuid()))
        expect_true(length(uuid()) == 1L)
        uu <- unique(unlist(replicate(100, uuid())))
        expect_identical(length(uu), 100L)
    })',
        file = "/tmp/uuid/tests/testthat/test_uuid.R"
    )
  3. Implement the ‘C++’ layer

    cat('
    #include <boost/uuid/uuid_generators.hpp>
    #include <boost/uuid/uuid_io.hpp>
    
    static boost::uuids::random_generator uuid_generator =
        boost::uuids::random_generator();
    
    std::string uuid_generate()
    {
        return boost::uuids::to_string(uuid_generator());
    }
    ',
        file = "/tmp/uuid/src/uuid.cpp"
    )
  4. Implement the interface from C++ to R’s C layer.

    cat('
    #include <Rinternals.h>
    
    SEXP uuid()
    {
        return mkString(uuid_generate().c_str());
    }
    ',
        file = "/tmp/uuid/src/uuid.cpp",
        append = TRUE
    )
  5. Implement the interface from C to R.

    cat('
    #include <R_ext/Rdynload.h>
    
    extern "C" {
    
        static const R_CallMethodDef callMethods[] = {
            {".uuid", (DL_FUNC) &uuid, 0},
            {NULL, NULL, 0}
        };
    
        void R_init_uuid(DllInfo *info)
        {
            R_registerRoutines(info, NULL, callMethods, NULL, NULL);
        }
    }',
        file = "/tmp/uuid/src/uuid.cpp",
        append = TRUE
    )
  6. Implement the end-user API in R.

    cat("
    #' @useDynLib uuid, .registration = TRUE
    #' @export
    uuid <- function()
         .Call(.uuid)
    ",
        file = "/tmp/uuid/R/uuid.R"
    )
  7. Document, test, install, use!

    devtools::document("/tmp/uuid")
    ## Updating uuid documentation
    ## Loading uuid
    ## Re-compiling uuid
    ## '/home/mtmorgan/bin/R-3-4-branch/bin/R' --no-site-file --no-environ  \
    ##   --no-save --no-restore --quiet CMD INSTALL '/tmp/uuid'  \
    ##   --library='/tmp/RtmpPZvmxO/devtools_install_20c366d77c59' --no-R  \
    ##   --no-data --no-help --no-demo --no-inst --no-docs --no-exec  \
    ##   --no-multiarch --no-test-load
    ## 
    ## Updating roxygen version in /tmp/uuid/DESCRIPTION
    ## Writing NAMESPACE
    devtools::test("/tmp/uuid")
    ## Loading uuid
    ## Testing uuid
    ## uuid: ...
    ## 
    ## DONE ========================================================================
    devtools::install("/tmp/uuid")
    ## Installing uuid
    ## '/home/mtmorgan/bin/R-3-4-branch/bin/R' --no-site-file --no-environ  \
    ##   --no-save --no-restore --quiet CMD INSTALL '/tmp/uuid'  \
    ##   --library='/home/mtmorgan/R/x86_64-pc-linux-gnu-library/3.4-Bioc-3.6'  \
    ##   --install-tests
    ## 
    ## Reloading installed uuid

2.2.2 Exercises

Add the following unit test and implement appropriate functionality

test_that("uuid returns multiple values", {
    n <- 5
    result <- uuid(5)
    expect_true(is.character(result))
    expect_identical(length(result), 5L)
    expect_identical(length(unique(result)), 5L)
})

Do this in several stages, from ‘easy’ to harder

  1. Implement the R infrastructure by adding an argument n to the definition of the R-level uuid(). Provide n with a default value so that the original unit tests continue to work, i.e., that devtools::test("/tmp/uuid") still passes the original tests.

  2. Pass the argument n to C. Do this by adding n to the R-level .Call(). Then modify the C-level uuid() to accept an SEXP as argument, so the signature is SEXP uuid(SEXP n_sexp). Update the callMethods[] array so that the uuid function takes 1 argument. Test that the source code compiles using, e.g., devtools::install("/tmp/uuid") or by running the unit tests again.

  3. In C, modify uuid() to check that the argument is in fact a scalar integer. Do this by using the macro IS_SCALAR() and function Rf_asInteger() and the constant R_NaInt; see the header file ‘Rinternals.h’ in R.home("include"). Signal an error with Rf_error(). You’ll add code that looks like

    bool test = IS_SCALAR(n_sexp, INTSXP) && (R_NaInt != Rf_asInteger(n_sexp));
    if (!test)
        Rf_error("'n' cannot be coerced to integer(1) and not NA");
    int n = Rf_asInteger(n_sexp);
  4. Modify uuid() to allocate an SEXP to contain the result – a STRSXP of length n. This allocated SEXP needs to be protected from garbage collection, because it is an R memory allocation not associated with a symbol. Protection is provided by the PROTECT() macro, which is traditional wrapped around functions that return SEXP.

    SEXP result = PROTECT(Rf_allocVector(n, STRSXP));
  5. Fill the return vector using a C loop, constructing a CHARSXP for each element using mkChar() and setting the ith element of result using the function SET_STRING_ELT(). Although at the R level the first element of a vector is element number 1, in C the first element is 0.

    for (int i = 0; i < n; ++i)
        SET_STRING_ELT(result, i, mkChar(uuid_generate().c_str()));
  6. Our function is done; use UNPROTECT() to remove the protection on the allocated SEXP (the calling function, in this case R itself, is resposible for PROTECTing any SEXP it recieves) and return the result to R.

    UNPROTECT(1);
    return result;
  7. (Optional) As a nicety for our users, coerce their R-level argument n to an integer (so, e.g., they can provide 5 rather than 5L as the argument). This can be done at the R level using as.integer(), or at the C level with the following. Consider the merits of each approach.

    n_sexp = PROTECT(Rf_coerceVector(n_sexp, INTSXP));

2.3 Rcpp

  1. Create a new package using Rcpp::Rcpp.package.skeleton(). Edit the DESCRIPTION file to include BH in the LinkingTo: field.

  2. Add a file src/rcpp_uuid.cpp with the boost-related code

    #include <boost/uuid/uuid_generators.hpp>
    #include <boost/uuid/uuid_io.hpp>
    
    static boost::uuids::random_generator uuid_generator =
        boost::uuids::random_generator();
    
    std::string uuid_generate()
    {
        return boost::uuids::to_string(uuid_generator());
    }
  3. Include the Rcpp.h header and use the Rcpp namespace

    #include <Rcpp.h>
    using namespace Rcpp;
  4. Implement the function using Rcpp structures like CharacterVector; no need to worry about coercing the argument, PROTECTion, etc. Decorate the function with the // [[Rcpp::export]] attribute.

    // [[Rcpp::export]]
    CharacterVector rcpp_uuid(int n) {
        CharacterVector uuids(n);
        for (int i = 0; i < uuids.size(); ++i)
            uuids[i] = uuid_generate();
        return uuids;
    }
  5. In R, make Rcpp process the attributes; check out src/RcppExports.cpp and R/RcppExports.R

> Rcpp::compileAttributes()

  1. Install and use!

    Rcppuuid$ R CMD INSTALL .
    [...]
    Rcppuuid$ R --vanilla -e "Rcppuuid::rcpp_uuid(3)"
    > Rcppuuid::rcpp_uuid(3)
    [1] "cb9cc7e8-bfcb-4262-b05b-2f1c0a4bc196"
    [2] "7b592e1f-8341-4da0-af04-b387726af8ea"
    [3] "7b2ade09-6ff2-4e76-b564-31b1a626e864"

2.4 Fun (?!) with C-level debugging

Setup

  • gdb on linux, lldb on Mac; [gdb-to-lldb][gdb-lldb] command map.
  • Code compiled with debugging symbols and without optimizations
  • One approach:

    $ cat ~/.R/Makevars
    CFLAGS = -g -O0
    CXXFLAGS = -g -O0
    CXX11FLAGS = -g -O0
  • Alternative: configure R with appropriate CFLAGS, etc.

Overall

  • Get and install a copy of our uuid package from GitHub

    $ git clone https://github.com/Bioconductor/BiocAdvanced
    $ cd BiocAdvanced
    $ git checkout Zurich-2017
    $ cd inst/uuid
    $ R CMD INSTALL .
  • Start R and launch debugger

    $ R -d gdb
    (gdb)
  • Start the R process, load packages, etc., break into the debugger

    (gdb) run
    > library(uuid)
    > uuid()
    Error in uuid() : invalid type/length (symbol/16) in vector allocation
  • Break into the debugger, set a breakpoint at the point where R throws an error, and continue

    > ^C
    (gdb) break Rf_error
    (gdb) continue
    >
  • Trigger the error

    > uuid()
    Breakpoint 1, Rf_error (
        format=0x643020 "invalid type/length (%s/%d) in vector allocation")
        at /home/mtmorgan/src/R-3-4-branch/src/main/errors.c:821
    821 {
    (gdb)
  • Discover where you are in the call stack. Mostly we’re in R’s source code where bugs are unlikely (though not impossible!) three up is our own code

    (gdb) where
    #0  Rf_error (
        format=0x643020 "invalid type/length (%s/%d) in vector allocation")
        at /home/mtmorgan/src/R-3-4-branch/src/main/errors.c:821
    #1  0x0000000000525dbc in Rf_allocVector3 (type=1, length=16, allocator=0x0)
        at /home/mtmorgan/src/R-3-4-branch/src/main/memory.c:2612
    #2  0x000000000050e5c0 in Rf_allocVector (type=1, length=16)
        at /home/mtmorgan/src/R-3-4-branch/src/include/Rinlinedfuns.h:196
    #3  0x00007fffed786780 in uuid (n_sexp=0x21a16a8) at uuid.cpp:26
    #4  0x000000000048b7e3 in R_doDotCall (ofun=0x7fffed7866ac <uuid(SEXPREC*)>,
        nargs=1, cargs=0x7fffffffbdd0, call=0x19f7d48)
        at /home/mtmorgan/src/R-3-4-branch/src/main/dotcode.c:570
  • Navigate and look around

3 Acknowledgements

Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)