- 1 Building the book
- 2 Introduction
- 3 Efficient set-up
- 4 Efficient hardware
- 5 Efficient workflow
- 6 Efficient collaboration
- 7 Efficient programming
- 8 Efficient Rcpp
- 9 Efficient Memory
- 10 Efficient Learning

A major benefit of using R (as opposed to C or Fortran, say), is that coding time is greatly reduced. However if we are not careful, it’s very easy to write programs that are incredibly slow. While optimisations such as going parallel can easily double speed, poor code can easily run 100s of times slower. For this reason a priority of an efficient programmer should be to avoid the following common mistakes. If you spend any time programming in R, then reading (Burns 2011) should be considered essential reading.

The key to making R code run fast is to access the underlying C/Fortran routines as quickly as possible. For example, suppose that `x`

is a standard R vector of length `n`

. Then

`x = x + 1`

involves a single function call to the `+`

function. Whereas,

```
for(i in 1:n) {
x[i] = x[i] + 1
}
```

has

`n`

function calls to`+`

;`n`

function calls to the`[`

function;`n`

function calls to the`[<-`

function (used in the assignment operation);- A function call to
`for`

and the`:`

operator.

It isn’t that the `for`

loop is slow, rather it is because we calling many more functions. This point is indirectly tackled again in the section on vectorised code.

Another general technique is to be careful with memory allocation. In fact this could be considered the number \(1\) rule when programming in R. If possible always pre-allocate your vector or data frame then fill in the values. Let’s consider three methods of creating a sequence of numbers.

**Method 1** creates an empty vector, and grows the object

```
method1 = function(n) {
myvec = NULL
for(i in 1:n)
myvec = c(myvec, i)
myvec
}
```

**Method 2** creates an object of the final length and then changes the values in the object by subscripting:

```
method2 = function(n) {
myvec = numeric(n)
for(i in 1:n)
myvec[i] = i
myvec
}
```

**Method 3** directly creates the final object

`method3 = function(n) 1:n`

To compare the three methods we use the `benchmark`

function from the previous chapter

```
n = 1e4
benchmark(replications=10,
method1(n), method2(n), method3(n),
columns=c("test", "elapsed"))
```

```
## test elapsed
## 1 method1(n) 1.721
## 2 method2(n) 0.082
## 3 method3(n) 0.000
```

The table below shows the timing in seconds on my machine for these three methods for a selection of values of \(n\). The relationships for varying \(n\) are all roughly linear on a log-log scale, but the timings between methods are drastically different. Notice that the timings are no longer trivial. When \(n=10^7\), method 1 takes around an hour whilst method 2 takes \(2\) seconds and method 3 is almost instantaneous.

\(n\) | Method 1 | Method 2 | Method 3 |
---|---|---|---|

\(10^5\) | \(\phantom{000}0.208\) | \(0.024\) | \(0.000\) |

\(10^6\) | \(\phantom{00}25.500\) | \(0.220\) | \(0.000\) |

\(10^7\) | \(3827.0000\) | \(2.212\) | \(0.000\) |

A straightforward method for speeding up code is to calculate objects once and reuse the value when necessary. This could be as simple with replacing `log(x)`

in multiple function calls with the object `log_x`

that is defined once and reused. This small saving in time, quickly multiplies when the cached variable is used inside a `for`

loop.

A more advanced form of caching is use the **memoise** package. If a function is called multiple times with the same input, it may be possible to speed things up by keeping a cache of known answers that it can retrieve. The **memoise** package allows us easily store the value of function call and returns the cached result when the function is called again with the same arguments. This package trades off memory versus speed, since the memoised function stores all previous inputs and outputs. To cache a function, we simply pass the function to the **memoise** function.

The classic memoise example is the factorial function. Another example is to limit use to a web resource. For example, suppose we are developing a shiny (an interactive graphic) application where the user can fit regression line to data. The user can remove points and refit the line. An example function would be

```
## Argument indicates row to remove
plot_mpg = function(row_to_remove) {
data(mpg, package="ggplot2")
mpg = mpg[-row_to_remove,]
plot(mpg$cty, mpg$hwy)
lines(lowess(mpg$cty, mpg$hwy), col=2)
}
```

We can use **memoise** speed up by caching results. A quick benchmark

```
m_plot = memoise(plot_mpg)
benchmark(m_plot(10), plot_mpg(10), columns = c("test", "relative", "elapsed"))
# test relative elapsed
#1 m_plot(10) 1.000 0.007
#2 plot_mpg(10) 481.857 3.373
```

suggests that we can obtain a 500-fold speed-up.

More advanced caching is available using *function closures*. A closure in R is an object that contains functions bound to the environment the closure was created in. Technically all functions in R have this property, but we use the term function closure to denote functions where the environment is not `.GlobalEnv`

. One of the environments associated with function is known as the enclosing environment, that is, where was the function created. We can determine the enclosing environment using the `environment`

function

`environment(plot_mpg)`

`## <environment: R_GlobalEnv>`

The `plot_mpg`

function’s enclosing environment is the `.GlobalEnv`

. This is important for variable scope, i.e. where should be look for a particular object. Consider the function `f`

```
f = function() {
x = 5
function() {
x
}
}
```

When we call the function `f`

, the object returned is a function. While the enclosing environment of `f`

is `.GlobalEnv`

, the enclosing environment of the **returned** function is something different

```
g = f()
environment(g)
```

`## <environment: 0x84e24a0>`

When we call this new function `g`

,

```
x = 10
g()
```

`## [1] 5`

The value returned is obtained from `environment(g)`

is 5, not `.GlobalEnv`

. This environment allows to cache variables between function calls. The `counter`

function is basic example of this feature

```
counter = function() {
no = 0
count = function() {
no <<- no + 1
no
}
}
```

When we call the function, we retain object values between function calls

```
sc = counter()
sc()
```

`## [1] 1`

`sc()`

`## [1] 2`

The key points of the `counter`

function are

The counter function returns a function

`sc = counter() sc()`

`## [1] 1`

- The enclosing environment of
`sc`

is not`.GlobalEnv`

instead, it’s the binding environment of`sc`

. - The function
`sc`

has an environment that can be used to store/cache values The operator

`<<-`

is used to alter the`no`

.

We can exploit function closures to simplify our code. Suppose we wished to simulate a games of Snakes and Ladders. We could have function that checked if we landed on a Snake, and if so move

```
check_snake = function(square) {
switch(as.character(square),
'16'=6, '49'=12, '47'=26, '56'=48, '62'=19,
'64'=60, '87'=24, '93'=73, '96'=76, '98'=78,
square)
}
```

If we then wanted to determine how often we landed on a Snake, we could use a function closure to keep track

```
check_snake = function() {
no_of_snakes = 0
function(square) {
new_square = switch(as.character(square),
'16'=6, '49'=12, '47'=26, '56'=48, '62'=19,
'64'=60, '87'=24, '93'=73, '96'=76, '98'=78,
square)
no_of_snakes = no_of_snakes + (new_square != square)
new_square
}
}
```

By keeping the variable `no_of_snakes`

attached to the `check_snake`

function, enables us to have cleaner code.

When writing code in R, you need to remember that you are using R and not C (or even Fortran 77!). For example,

```
## Change 1000 uniform random numbers
x = runif(1000) + 1
logsum = 0
for(i in 1:length(x))
logsum = logsum + log(x[i])
```

is a piece R code that has a strong, unhealthy influence from C. Instead we should write

`logsum = sum(log(x))`

Writing code this way has a number of benefits.

- It’s faster. When \(n = 10^7\) the ``R way’’ is about forty times faster.
- It’s neater.
- It doesn’t contain a bug when
`x`

is of length \(0\).

Another common example is sub-setting a vector. When writing in C, we would have something like:

```
ans = NULL
for(i in 1:length(x)) {
if(x[i] < 0)
ans = c(ans, x[i])
}
```

This of course can be done simply with

`ans = x[x < 0]`

It’s also important to make full use of R functions that use vectors. For example, suppose we wish to estimate \[ \int_0^1 x^2 dx \] using a basic Monte-Carlo method. Essentially, we throw darts at the curve and count the number of darts that fall below the curve (as in 7.2).

*Monte Carlo Integration*

- Initialise:
`hits = 0`

**for i in 1:N**- \(~~~\) Generate two random numbers, \(U_1, U_2\), between 0 and 1
- \(~~~\) If \(U_2 < U_1^2\), then
`hits = hits + 1`

**end for**- Area estimate =
`hits/N`

A standard C approach to implementing this Monte-Carlo algorithm would be something like:

```
N = 500000
f = function(N){
hits = 0
for(i in 1:N) {
u1 = runif(1); u2 = runif(1)
if(u1^2 > u2)
hits = hits + 1
}
return(hits/N)
}
```

In R this takes a few seconds:

`system.time(f(N))`

```
## user system elapsed
## 2.223 0.000 2.223
```

In contrast, a more R-centric approach would be the following:

```
f1 = function(N){
hits = sum(runif(N)^2 > runif(N))
return(hits/N)
}
```

`f1`

is around \(30\) times faster than `f`

, illustrating the efficiency gains that can be made by vectorising your code:

`system.time(f1(N))`

```
## user system elapsed
## 0.048 0.000 0.048
```