7.2 Good programming techniques

A major benefit of using R (as opposed to C or Fortran, say), is that coding time is greatly reduced. However if we are not careful, it’s very easy to write programs that are incredibly slow. While optimisations such as going parallel can easily double speed, poor code can easily run 100s of times slower. For this reason a priority of an efficient programmer should be to avoid the following common mistakes. If you spend any time programming in R, then reading (Burns 2011) should be considered essential reading.

7.2.1 General tips

The key to making R code run fast is to access the underlying C/Fortran routines as quickly as possible. For example, suppose that x is a standard R vector of length n. Then

x = x + 1

involves a single function call to the + function. Whereas,

for(i in 1:n) {
  x[i] = x[i] + 1 
}

has

  • n function calls to +;
  • n function calls to the [ function;
  • n function calls to the [<- function (used in the assignment operation);
  • A function call to for and the : operator.

It isn’t that the for loop is slow, rather it is because we calling many more functions. This point is indirectly tackled again in the section on vectorised code.

Another general technique is to be careful with memory allocation. In fact this could be considered the number \(1\) rule when programming in R. If possible always pre-allocate your vector or data frame then fill in the values. Let’s consider three methods of creating a sequence of numbers.

Method 1 creates an empty vector, and grows the object

method1 = function(n) {
  myvec = NULL
  for(i in 1:n)
    myvec = c(myvec, i)
  myvec
}

Method 2 creates an object of the final length and then changes the values in the object by subscripting:

method2 = function(n) {
  myvec = numeric(n)
  for(i in 1:n)
    myvec[i] = i
  myvec
}

Method 3 directly creates the final object

method3 = function(n) 1:n

To compare the three methods we use the benchmark function from the previous chapter

n = 1e4
benchmark(replications=10, 
          method1(n), method2(n), method3(n),
          columns=c("test", "elapsed"))
##         test elapsed
## 1 method1(n)   1.721
## 2 method2(n)   0.082
## 3 method3(n)   0.000

The table below shows the timing in seconds on my machine for these three methods for a selection of values of \(n\). The relationships for varying \(n\) are all roughly linear on a log-log scale, but the timings between methods are drastically different. Notice that the timings are no longer trivial. When \(n=10^7\), method 1 takes around an hour whilst method 2 takes \(2\) seconds and method 3 is almost instantaneous.

Time in seconds to create sequences. When \(n=10^7\), method 1 takes around an hour while methods 2 takes 2 seconds and method 3 almost instantaneous.
\(n\) Method 1 Method 2 Method 3
\(10^5\) \(\phantom{000}0.208\) \(0.024\) \(0.000\)
\(10^6\) \(\phantom{00}25.500\) \(0.220\) \(0.000\)
\(10^7\) \(3827.0000\) \(2.212\) \(0.000\)

7.2.2 Caching variables

A straightforward method for speeding up code is to calculate objects once and reuse the value when necessary. This could be as simple with replacing log(x) in multiple function calls with the object log_x that is defined once and reused. This small saving in time, quickly multiplies when the cached variable is used inside a for loop.

A more advanced form of caching is use the memoise package. If a function is called multiple times with the same input, it may be possible to speed things up by keeping a cache of known answers that it can retrieve. The memoise package allows us easily store the value of function call and returns the cached result when the function is called again with the same arguments. This package trades off memory versus speed, since the memoised function stores all previous inputs and outputs. To cache a function, we simply pass the function to the memoise function.

The classic memoise example is the factorial function. Another example is to limit use to a web resource. For example, suppose we are developing a shiny (an interactive graphic) application where the user can fit regression line to data. The user can remove points and refit the line. An example function would be

## Argument indicates row to remove
plot_mpg = function(row_to_remove) {
  data(mpg, package="ggplot2")
  mpg = mpg[-row_to_remove,]
  plot(mpg$cty, mpg$hwy)
  lines(lowess(mpg$cty, mpg$hwy), col=2)
}

We can use memoise speed up by caching results. A quick benchmark

m_plot = memoise(plot_mpg)
benchmark(m_plot(10), plot_mpg(10), columns = c("test", "relative", "elapsed"))
#         test relative elapsed
#1   m_plot(10)    1.000   0.007
#2 plot_mpg(10)  481.857   3.373

suggests that we can obtain a 500-fold speed-up.

7.2.3 Function closures

More advanced caching is available using function closures. A closure in R is an object that contains functions bound to the environment the closure was created in. Technically all functions in R have this property, but we use the term function closure to denote functions where the environment is not .GlobalEnv. One of the environments associated with function is known as the enclosing environment, that is, where was the function created. We can determine the enclosing environment using the environment function

environment(plot_mpg)
## <environment: R_GlobalEnv>

The plot_mpg function’s enclosing environment is the .GlobalEnv. This is important for variable scope, i.e. where should be look for a particular object. Consider the function f

f = function() {
  x = 5
  function() {
    x
  }
}

When we call the function f, the object returned is a function. While the enclosing environment of f is .GlobalEnv, the enclosing environment of the returned function is something different

g = f()
environment(g)
## <environment: 0x84e24a0>

When we call this new function g,

x = 10
g()
## [1] 5

The value returned is obtained from environment(g) is 5, not .GlobalEnv. This environment allows to cache variables between function calls. The counter function is basic example of this feature

counter = function() {
  no = 0
  count = function() {
    no <<- no + 1
    no
  }
}

When we call the function, we retain object values between function calls

sc = counter()
sc()
## [1] 1
sc()
## [1] 2

The key points of the counter function are

  • The counter function returns a function

    sc = counter()
    sc()
    ## [1] 1
  • The enclosing environment of sc is not .GlobalEnv instead, it’s the binding environment of sc.
  • The function sc has an environment that can be used to store/cache values
  • The operator <<- is used to alter the no.

We can exploit function closures to simplify our code. Suppose we wished to simulate a games of Snakes and Ladders. We could have function that checked if we landed on a Snake, and if so move

check_snake = function(square) {
   switch(as.character(square), 
       '16'=6,  '49'=12, '47'=26, '56'=48, '62'=19, 
       '64'=60, '87'=24, '93'=73, '96'=76, '98'=78, 
       square)
}

If we then wanted to determine how often we landed on a Snake, we could use a function closure to keep track

check_snake = function() {
  no_of_snakes = 0
  function(square) {
    new_square = switch(as.character(square), 
       '16'=6,  '49'=12, '47'=26, '56'=48, '62'=19, 
       '64'=60, '87'=24, '93'=73, '96'=76, '98'=78, 
       square)
    no_of_snakes = no_of_snakes + (new_square != square)
    new_square
  }
}

By keeping the variable no_of_snakes attached to the check_snake function, enables us to have cleaner code.

7.2.4 Vectorised code

When writing code in R, you need to remember that you are using R and not C (or even Fortran 77!). For example,

## Change 1000 uniform random numbers
x = runif(1000) + 1
logsum = 0
for(i in 1:length(x))
  logsum = logsum + log(x[i])

is a piece R code that has a strong, unhealthy influence from C. Instead we should write

logsum = sum(log(x))

Writing code this way has a number of benefits.

  • It’s faster. When \(n = 10^7\) the ``R way’’ is about forty times faster.
  • It’s neater.
  • It doesn’t contain a bug when x is of length \(0\).

Another common example is sub-setting a vector. When writing in C, we would have something like:

ans = NULL
for(i in 1:length(x)) {
  if(x[i] < 0) 
    ans = c(ans, x[i])
}

This of course can be done simply with

ans = x[x < 0]
Example of Monte-Carlo integration. To estimate the area under the curve throw random points at the graph and count the number of points that lie under the curve.

Figure 7.2: Example of Monte-Carlo integration. To estimate the area under the curve throw random points at the graph and count the number of points that lie under the curve.

7.2.4.1 Example: Monte-Carlo integration

It’s also important to make full use of R functions that use vectors. For example, suppose we wish to estimate \[ \int_0^1 x^2 dx \] using a basic Monte-Carlo method. Essentially, we throw darts at the curve and count the number of darts that fall below the curve (as in 7.2).

Monte Carlo Integration

  1. Initialise: hits = 0
  2. for i in 1:N
  3. \(~~~\) Generate two random numbers, \(U_1, U_2\), between 0 and 1
  4. \(~~~\) If \(U_2 < U_1^2\), then hits = hits + 1
  5. end for
  6. Area estimate = hits/N

A standard C approach to implementing this Monte-Carlo algorithm would be something like:

N = 500000
f = function(N){
  hits = 0
  for(i in 1:N)  {
    u1 = runif(1); u2 = runif(1)
    if(u1^2 > u2)
      hits = hits + 1
  }
  return(hits/N)
}

In R this takes a few seconds:

system.time(f(N))
##    user  system elapsed 
##   2.223   0.000   2.223

In contrast, a more R-centric approach would be the following:

f1 = function(N){
  hits = sum(runif(N)^2 > runif(N))
  return(hits/N)
}

f1 is around \(30\) times faster than f, illustrating the efficiency gains that can be made by vectorising your code:

system.time(f1(N))
##    user  system elapsed 
##   0.048   0.000   0.048