1R as we know it

1.1 Memory management: copy-on-change; garbage collection

Copy on change

• It seems like R variables do not share any memory

x <- y <- 1:5
x
## [1] 1 2 3 4 5
y
## [1] 1 2 3 4 5
x[1] <- 2L
x
## [1] 2 2 3 4 5
y
## [1] 1 2 3 4 5
x <- 1:5
fun = function(z) { z[1] = 2L; z }
fun(x)
## [1] 2 2 3 4 5
x
## [1] 1 2 3 4 5
• Different from many programming languages
• Very convenient for users – they don’t have to think about ‘change-at-a-distance’.
• R uses a ‘named’ concept, rather than strict ‘reference counting’

• Counts the number of references to a location in memory…
• But can only count to 2
• So ‘2’ means that at some point in the memory location’s history, 2 or more symbols referenced that location.
• In action

x = 1:5
.Internal(inspect(x))
## @480c138 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
• @41de7d8: memory address.
• 13 INTSXP: integer ‘S-expression’ (more later).
• [NAM(1)]: One symbol (x) references the location in memory.
x = y = 1:5
.Internal(inspect(x))
## @48f2f18 13 INTSXP g0c3 [MARK,NAM(2)] (len=5, tl=0) 1,2,3,4,5
.Internal(inspect(y))
## @48f2f18 13 INTSXP g0c3 [MARK,NAM(2)] (len=5, tl=0) 1,2,3,4,5
• x, y point to same location in memory
• [NAM(2)]: (at least) two symbols reference the location
• ‘Copy-on-change’: when updating a vector, make a copy if NAM() == 2.

## copy on change
x = y = 1:5
.Internal(inspect(x))
## @3f94c98 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
x[1] = 2L
.Internal(inspect(x))
## @3cf5db0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 2,2,3,4,5
.Internal(inspect(y))
## @3f94c98 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## no copy necessary
x = 1:5
.Internal(inspect(x))
## @3bf3838 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
x[1] = 2L
.Internal(inspect(x))
## @3aef798 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 2,2,3,4,5

Garbage collection (approximate description)

• S-expressions (SEXP) allocated by the system to a pool of available S-expressions
• x <- 1 moves the SEXP from allocated to ‘in use’ pool, level 0 (the 0 in g0c4 of .Internal(inspect(x)))
• After rm(x), a symbol no longer references the SEXP and it is a candidate for re-use.
• gc() (manual, or more typically triggered automatically when R runs out of SEXP in the available pool)

• Visits all SEXP in g0 pool and marks as ‘unreferenced’
• Visits all symbols in use and marks their SEXP as ‘in use’
• Visits (second time) all SEXP in g0 pool. Moves unreferenced symbols to the ‘available’ pool, moves ‘in use’ to the g1 pool.
• Note: SEXP belong to ‘generations’ and classes

y = 1
fun = function() { x <- 1; 2 }
fun(); gc()    # scans x, y; collects x, ages y
## [1] 2
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 503648 26.9     940480 50.3   940480 50.3
## Vcells 939506  7.2    1941543 14.9  1199408  9.2
fun(); gc()    # scans & collects x, does not visit y
## [1] 2
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 503747 27.0     940480 50.3   940480 50.3
## Vcells 939776  7.2    1941543 14.9  1199408  9.2
• g0 generation are recently allocated, and checked for suitability for garbage collection most frequently – typically symbols allocated in functions and no longer in scope; ripe for collection.
• g1 has survived increasingly many garbage collections – package or globals symbols that seldom become eligible for collection.
• Net effect: g0 nodes enriched for SEXP that can collected.
• Resources: R Internals manual e.g., the write barrier, Luke Tierney’s notes; memory.c.

Performance consequences

• Allocation and copying are comparatively expensive
• Some common use patterns involve a lot of copying!

• c(), rbind(), cbind() in a loop
• Modifying a data.frame(), list(), or S4 object.

1.1.1 Exercises

Use .Internal(inspect()) to follow memory allocated to x in the following loop.

x <- integer()
n <- 5
for (i in 1:n)
x <- c(x, i)

n = 5
x = integer(); .Internal(inspect(x))
for (i in 1:n) {
x <- c(x, i)
.Internal(inspect(x))
}

x = integer(); .Internal(inspect(x))
for (i in 1:n) {
x[i] = i
.Internal(inspect(x))
}
x
• If an integer occupies about 8 bytes of memory, and each iteration in the loop copies the current allocation into the new allocation, how many bytes are copied in a loop with n iterations?

• Interpret the output of tracemem() for memory allocations when updating a value in a data.frame(). How much of the data.frame is being copied?

df <- mtcars
tracemem(df)        # The overall data.frame
## [1] "<0x45779e0>"
tracemem(df[[1]])   # The specific column
## [1] "<0x1766760>"
df[1, 1]
## [1] 21
df[1, 1] = 22.1
## tracemem[0x45779e0 -> 0x44fcd68]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x44fcd68 -> 0x447ce60]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x447ce60 -> 0x447d058]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x1766760 -> 0x264a210]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
df[2, 2] = 22.2
## tracemem[0x447d058 -> 0x4464e78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x4464e78 -> 0x4464f20]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x4464f20 -> 0x4465070]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
• Compare this to memory allocations when modifying a similarly sized matrix

m <- m1 <- matrix(0, nrow(mtcars), ncol(mtcars))
tracemem(m)         # A matrix is a vector with dims attributes
## [1] "<0x477afd0>"
m[1, 1] = 22.1
## tracemem[0x477afd0 -> 0x45b3820]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
m[2, 1] = 22.2
• And to a ‘vectorized’ update

df <- mtcars
tracemem(df); tracemem(df[[1]])
## [1] "<0x45779e0>"
## [1] "<0x1766760>"
df[1:2, 1] <- c(22.1, 22.2)
## tracemem[0x45779e0 -> 0x3f57e58]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x3f57e58 -> 0x3f57f00]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x3f57f00 -> 0x3f58050]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
## tracemem[0x1766760 -> 0x3d04bd0]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>
m <- m1 <- matrix(0, nrow(mtcars), ncol(mtcars))
tracemem(m)         # A matrix is a vector with dims attributes
## [1] "<0x4615310>"
m[1:2, 1] <- c(22.1, 22.2)
## tracemem[0x4615310 -> 0x3cd6510]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

1.2 Correct, robust, simple, fast: practical techniques

Here’s a simple function, implemented poorly

f <- function(n) {
x <- integer()
for (i in 1:n)
x <- c(x, i)
x
}

Correct

• Most important!
• identical() – very strict
• all.equal() – allows for numerical approximation
• Unit tests!

library(testthat)
test_that("f() is correct", {
expect_identical(f(5), 1:5)
})

Robust

• Edge cases, e.g., length 0 iterations, NA values, … Here we add some additional tests, and put them in a function for easy reuse.

tests <- function(f) {
test_that("f() is correct and robust", {
expect_identical(f(5), 1:5)
expect_identical(f(1), 1L)
expect_identical(f(0), integer(0))
})
}
• Our code fails the unit tests, because 1:n fails with n == 0

tryCatch(tests(f), error = conditionMessage)
## [1] "Test failed: 'f() is correct and robust'\n* f(0) not identical to integer(0).\nLengths differ: 2 vs 0\n"
• Revised code, robust to common use case

f1 = function(n) {
x <- integer()
for (i in seq_len(n))
x = c(x, i)
x
}
tests(f1)

Fast

• system.time(), but can be considerable run-to-run variation
• microbenchmark

library(microbenchmark)
res <- microbenchmark(f1(100), f1(200), f1(400), f1(800), f1(1600))
plot(res)

• Scales quadratically, because of copying. Solutions?

f2 <- function(n) {
sapply(seq_len(n), function(i) i)
}
tests(f2)  # fails! n == 0
f2a <- function(n) {
if (n == 0) {
integer()
} else {
sapply(seq_len(n), function(i) i)
}
}
tests(f2a)
f3 <- function(n) {
vapply(seq_len(n), function(i) i, integer(1))
}
test_that("f() is correct", {
expect_identical(f3(5), 1:5)
expect_identical(f3(1), 1L)
expect_identical(f3(0), integer(0))
})
tests(f3)
f4 <- function(n) {
x <- integer(n)
for (i in seq_len(n))
x[[i]] = i
x
}
tests(f4)
f5 <- seq_len
tests(f5)
n <- 1000
microbenchmark(f1(n), f2a(n), f3(n), f4(n), f5(n))

1.3 Symbol resolution and evaluation

Environments

• Current environment

environment()
## <environment: R_GlobalEnv>
• surprisingly central to R

env <- new.env()
env[["foo"]] <- 123
env[["foo"]]
## [1] 123
get("foo", env)      # equivalent (not really!)
## [1] 123
ls(env)
## [1] "foo"
• hash table of key-value pairs
• Key is called a name or symbol in R
• constant-time lookup

• ‘reference’ semantics (almost all else in R is ‘copy-on-change’ semantics)

another <- env
env[["bar"]] <- 456
another[["bar"]]
## [1] 456
• All environments have a ‘parent’ environment

parent.env(env)
## <environment: R_GlobalEnv>
• special environment: emptyenv()
• parent environment is set during construction (argument to new.env())
• Environment can be chained together as a linked list

top <- new.env(parent = emptyenv())
mid <- new.env(parent = top)
bot <- new.env(parent = mid)
• Symbol (key) look-up

top[["top"]] <- 1
mid[["mid"]] <- 2
bot[["bot"]] <- 3
• Symbols not found in the reference enviroment are looked for in the parent environment

get("bot", bot)
## [1] 3
get("mid", bot)
## [1] 2
tryCatch({
get("mid", bot, inherits=FALSE) # restricts search to current env
}, error = conditionMessage)
## [1] "object 'mid' not found"
bot[["mid"]]                        # NULL; [[ restricts to current env
## NULL
• Assignment is to the current environment

bot[["mid"]] <- 5
bot[["mid"]]
## [1] 5
mid[["mid"]]
## [1] 2
• Functions are intimately connected to environments

fun <- function() {
environment()
}
fun()              # an environment defines the body of a function
## <environment: 0x593b1a8>
fun()              # new function, so new environment
## <environment: 0x59a57f0>
• What happens to the environment in a function?

• garbage collection (later!) when no longer referenced
• If a function body has an environment, there must be a parent environment

fun <- function() {
env <- environment()
list(env, parent.env(env))
}
fun()
## [[1]]
## <environment: 0x4bb9690>
##
## [[2]]
## <environment: R_GlobalEnv>
• Is the parent of the environment of a function the environment in which the function was called from, or the environment in which the function was defined?
fun <- function() {
fun1 <- function() {
env1 <- environment()    # envrionment created by fun1
par1 <- parent.env(env1) # parent environment
list(env1 = env1, par1 = par1)
}
env <- environment()        # environment created by fun
par <- parent.env(env)
list(env = env, par = par, fun1=fun1)
}
fun()
## $env ## <environment: 0x3d44050> ## ##$par
## <environment: R_GlobalEnv>
##
## $fun1 ## function () ## { ## env1 <- environment() ## par1 <- parent.env(env1) ## list(env1 = env1, par1 = par1) ## } ## <environment: 0x3d44050> • The parent of a function’s environment is the environment in which it was defined (different from the environment from which it was called) • Lookup, assignment (<-) and assignment to the parent(s) environment (<<-) fun = function() { x0 <- 0 x1 <- 0 fun1 <- function() { y <- 1 # create local variable y x1 <<- 2 # assign to first occurrence of 'x' in parent env(s) z <<- 3 # no z in parent env(s), so created in top env c(x0 = x0, x1 = x1, y = y, z = z) } result <- fun1() list(x0 = x0, x1 = x1, result = result) } • This is called lexical scope • When is lexical scope used? Package namespaces, imports, and search() • The search() path search() ## [1] ".GlobalEnv" "package:microbenchmark" ## [3] "package:testthat" "package:BiocStyle" ## [5] "package:stats" "package:graphics" ## [7] "package:grDevices" "package:utils" ## [9] "package:datasets" "package:methods" ## [11] "Autoloads" "package:base" • library(BiocGenerics): loading and attaching a package 1. Package NAMESPACE is loaded into the current R session 2. An environment containing exported symbols from the package is attached to the R search path • ‘Attaching’ a package means that the .GlobalEnv’s parent points to the new package exported symbol environment. The new package exported symbol environment points to the package previously pointed to by .GlobalEnv • A package consists of a NAMESPACE • A NAMESPACE is an environment. • The parent of the NAMESPACE is an environment containing imported symbols. • The parent of the environment with imported symbols is the base environment. • The parent of the base environment is the .GlobalEnv. • …and so on down the search path. • Symbol resolution in packages follows lexical scope • Resolved in the function, then package NAMESPACE, then imports environment, then base then .GlobalEnv, then … Evaluation, lazy and non-standard evaluation 1.3.1 Exercises Exercise: bank account • Congratulations, you own a bank! Here’s a function that creates an account, and provides access to it account <- function() { balance <- 0 available <- function() {} deposit <- function(amt) {} withdraw <- function(amt) {} list(available = available, deposit = deposit, withdraw = withdraw) } Answer: account <- function() { balance <- 0 available <- function() { balance } deposit <- function(amt) { balance <<- balance + amt balance } withdraw <- function(amt) { if (amt > balance) stop("insufficient funds") deposit(-amt) } list(available = available, deposit = deposit, withdraw = withdraw) } • Implement for the following test case, using the concept of lexical scope: library(testthat) test_that("I understand lexical scope", { acct1 <- account() expect_equal(acct1$deposit(10), 10)
expect_equal(acct1$deposit(12), 22) expect_equal(acct1$available(), 22)
expect_equal(acct1$withdraw(20), 2) expect_error(acct1$withdraw(20), "insufficient funds")
})
• What happens when you get a second customer?

test_that("functions create local scope", {
acct1 <- account(); acct1$deposit(10) acct2 <- account(); acct2$deposit(20)
expect_equal(acct1$available(), 10) expect_equal(acct2$available(), 20)
})

account <- function() {
balance <- 0

available <- function() {
balance
}

deposit <- function(amt) {
balance <<- balance + amt
funs
}

withdraw <- function(amt) {
if (amt > balance)
stop("insufficient funds")
deposit(-amt)
}

funs = list(
available = available, deposit = deposit, withdraw = withdraw
)

funs
}

Exercise: bank account 2

• What changes to implementation would be require to allow

test_that("endomorphisms rock", {
expect_equal(account()$deposit(10)$withdraw(5)$available(), 5) }) Exercise: package-local options • Implement ‘package-local’ options that the user can set via functions. Do this by using local() to create and populate an environment that contains the package-local variables as well as accessor functions get() and set() to retrieve / assign variables, e.g., tolerance. Here’s a start, and some unit tests .myoptions <- local({ tolerance = 1e-4 get = function() { tolerance } set = function(value) { } ## ... }) ## setTolerance = ... Answer: .myoptions <- local({ tolerance = 1e-4 get = function() tolerance list( get = get, set = function(value) { ovalue <- get() tolerance <<- value ovalue } ) }) getTolerance <- .myoptions$get
[...]
Rcppuuid$R --vanilla -e "Rcppuuid::rcpp_uuid(3)" > Rcppuuid::rcpp_uuid(3) [1] "cb9cc7e8-bfcb-4262-b05b-2f1c0a4bc196" [2] "7b592e1f-8341-4da0-af04-b387726af8ea" [3] "7b2ade09-6ff2-4e76-b564-31b1a626e864" 2.4 Fun (?!) with C-level debugging Setup • gdb on linux, lldb on Mac; [gdb-to-lldb][gdb-lldb] command map. • Code compiled with debugging symbols and without optimizations • One approach: $ cat ~/.R/Makevars
CFLAGS = -g -O0
CXXFLAGS = -g -O0
CXX11FLAGS = -g -O0
• Alternative: configure R with appropriate CFLAGS, etc.

Overall

• Get and install a copy of our uuid package from GitHub

$git clone https://github.com/Bioconductor/BiocAdvanced$ cd BiocAdvanced
$git checkout Zurich-2017$ cd inst/uuid
$R CMD INSTALL . • Start R and launch debugger $ R -d gdb
(gdb)
• Start the R process, load packages, etc., break into the debugger

(gdb) run
> library(uuid)
> uuid()
Error in uuid() : invalid type/length (symbol/16) in vector allocation
• Break into the debugger, set a breakpoint at the point where R throws an error, and continue

> ^C
(gdb) break Rf_error
(gdb) continue
>
• Trigger the error

> uuid()
Breakpoint 1, Rf_error (
format=0x643020 "invalid type/length (%s/%d) in vector allocation")
at /home/mtmorgan/src/R-3-4-branch/src/main/errors.c:821
821 {
(gdb)
• Discover where you are in the call stack. Mostly we’re in R’s source code where bugs are unlikely (though not impossible!) three up is our own code

(gdb) where
#0  Rf_error (
format=0x643020 "invalid type/length (%s/%d) in vector allocation")
at /home/mtmorgan/src/R-3-4-branch/src/main/errors.c:821
#1  0x0000000000525dbc in Rf_allocVector3 (type=1, length=16, allocator=0x0)
at /home/mtmorgan/src/R-3-4-branch/src/main/memory.c:2612
#2  0x000000000050e5c0 in Rf_allocVector (type=1, length=16)
at /home/mtmorgan/src/R-3-4-branch/src/include/Rinlinedfuns.h:196
#3  0x00007fffed786780 in uuid (n_sexp=0x21a16a8) at uuid.cpp:26
#4  0x000000000048b7e3 in R_doDotCall (ofun=0x7fffed7866ac <uuid(SEXPREC*)>,
nargs=1, cargs=0x7fffffffbdd0, call=0x19f7d48)
at /home/mtmorgan/src/R-3-4-branch/src/main/dotcode.c:570`
• Navigate and look around

3 Acknowledgements

Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)