Interfacing to C

Motivation and relevance

  1. Access to existing libraries (e.g., samtools)

  2. Fast or memory efficient implementation of advanced algorithms (e.g., findOverlaps).

Exercise: sum a numeric vector, in C

.C

  1. Create a file in your home directory ~/c_sum.c containing the following lines of code

    void c_sum(const double *x, const int *length, double *answer)
    {
       answer[0] = 0;
       for (int i = 0; i < length[0]; ++i)
           answer[0] += x[i];
    }
    

    We'll pass the arguments x, length, and answer in to C from R; everything in R is a vector, so all values seen in C are pointers.

  2. Compile this function to a 'dynamic library' by opening a shell, navigating to the directory where c_sum.c is located, and evaluating the command

    R CMD SHLIB c_sum.c
    

    This creates (if there are no syntax or other errors!) a file c_sum.so.

  3. In R, write a short wrapper function to call c_sum

    c_sum <- function(x) 
       .C("c_sum", as.numeric(x), length(x), answer=numeric(1))$answer
    

    The first argument to .C is the name of the C function we want to invoke. The remainder of the arguments are the values we pass to C; all are vectors in R, hence pointers in C. The type of the R argument must match the type of the C signature; there is no type-checking provided (e.g., it would be a mistake to pass the integer vector x <- 1:10, without first coercing to numeric). x and length(x) MUST BE treated as read-only in C; we've enforced this by declaring them as const, e.g., const double *. We'll modify the final argument, but to do so we must be confident that there are no other references to it in R, i.e., that it's NAMED level is 0.

    Arguments to .C can be named; names are not relevant at the C level, but are useful in interpretting the return value. The return value is a list() of the original arguments, perhaps updated during the call The paradigm adopted above names the argument that we wish to use in subsequent calculations, and selects that value from the returned list.

  4. Use the function after loading the dynamic library

    dyn.load("~/c_sum.so")
    c_sum(1:100)
    
  5. Compare the time required for c_sum() to sum a large vector, e.g., 100M values, with the time required for base R's sum(). Any ideas on the difference? Break c_sum() in various ways.

  6. (Advanced) Explore C headers provided with R

    R.home("include")
    
    ## [1] "/Library/Frameworks/R.framework/Resources/include"
    

.Call

The .C interface is useful but quite restricted, e.g., we had to pass in the value that we returned, rather than creating it in C.

  1. Add the following line at the begining of c_sum.c

    #include <Rinternals.h>
    

    This includes the C header file that defines function prototypes and other information required to work with R. Explore the header file (at R.home("include")) at your leisure.

  2. Add the following function after c_sum:

    SEXP call_sum(SEXP x)
    {
       SEXP answer = PROTECT(Rf_allocVector(REALSXP, 1));
       int len = Rf_length(x);
       c_sum(REAL(x), &len, REAL(answer));
       UNPROTECT(1);
       return answer;
    }
    

    This defines a function that takes an SEXP (S-expression, i.e., any R object; an SEXP is a pointer to a C struct) as an argument, and returns an SEXP.

    The first line declares a new SEXP, allocates memory to hold information for a REALSXP of length 1 (the equivalent of numeric(1) in R), and PROTECTs the allocation from garbage collection.

    The second line creates an integer variable, and uses an accessor to reach in to x to discover its length.

    The third line uses accessors to reach in to the various SEXP to retrieve the underlying data; these are passed to our previously defined, pure C function c_sum.

    In the third line we prepare to hand the memory we've allocated back to R for memory management; we don't need to PROTECT it any more.

    The final line returns our allocated SEXP to R.

  3. Compile as before

  4. Write a short wrapper. Because we're able to determine the length of the object we've passed in, and because we can allocate and return memory from C, we just need to pass in the vector we're adding.

    call_sum <- function(x) .Call("call_sum", as.numeric(x))
    
  5. Load the DLL and invoke call_sum as before. It may be necessary to start an new R session to replace the previously loaded shared object.

  6. Compare the output and performance of sum(), c_sum(), and call_sum().

  7. Move the vector coercion as.numeric() in R to Rf_coerceVector(x, REALSXP) in C. Do you need to PROTECT the result of the coerion? Any thoughts on whether it's better to coerce in R, or in C?

Rcpp

.Call() is flexible, but exposes the details of R's internal representation and memory management to the programmer. The Rcpp package hides this complexity behind C++ classes, trading off specialized understanding of R's internals for more general understanding of C++.

  1. Start a new file “rcpp_sum.cpp” with the following

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    NumericVector rcpp_sum(NumericVector x)
    {
       NumericVector ans(1);
       for (int i = 0; i < x.size(); ++i)
           ans[0] += x[i];
       return ans;
    }
    

    The file starts by including the headers for Rcpp. using namespace Rcpp is a C++ directive to manage where symbols (e.g., NumericVector) will be discovered. // [[Rcpp::export]] is an Rcpp attribute that indicates that the following function should be made available in R (not all functions need to be exposed to the R user).

    NumericVector is a C++ class representing R's numeric() type, so the signature to rcpp_sum() says that it expects a single numeric() vector argument, and will return a numeric() vector to R. Rcpp is wired so that type coercion and memory management are automatic – if the user invokes rcpp_sum() with an integer vector (Rcpp type IntegerVector), it will be coerced to type NumericVector. NumericVector behaves like standard C++ containers, so has a size() method and array subset operations [] that behave in predictable ways.

    Rcpp implements NumericVector and friends on top of R's interface, but hides from us the need to explicitly allocate and manage memory.

  2. It's possible to compile this as before, but let's instead do this from within R

    library(Rcpp)
    sourceCpp("~/rcpp_sum.cpp")
    

    This creates an object in R, rccp_sum(), that can be invoked immediately.

    rcpp_sum(1:10)
    
  3. Compare the output and performance of rcpp_sum() with sum() and the other functions we've written. What issues have been addressed by Rcpp? What problems remain?