
A data type is an object that has a set of predefined characteristics, such as a number or a character. When programming in C or FORTRAN, the data type of every object must be specified by the user. The advantage is that it allows the compiler to perform type-specific optimisations. The downside is verbose and fragile code, which is inefficient to type. In R, data types are less critical, but understanding them helps when debugging and optimising for computational efficiency: essentially, we have a trade-off between CPU run time and developer thinking time. In this chapter, we will pick out the key data types from an efficiency perspective. Chapter 2 of Advanced R Programming (Wickham 2014a) provides a more comprehensive treatment.

The vector is a fundamental data structure in R. Confusingly, there are two varieties:

- Atomic vectors, where all elements have the same type; these are usually created using the `c()` function;
- Lists, where elements can have different types.

To test if an object is a vector, we must use `is.atomic(x) || is.list(x)`. The more obvious choice for determining if an object is a vector, `is.vector(x)`, only returns `TRUE` if an object is a vector with no attributes other than names. For example, when we use the `table` function

`x = table(rpois(100, 5))`

the object `x` has additional attributes (such as `dim`), so `is.vector(x)` returns `FALSE`. But the contents of `x` are clearly a vector, so `is.atomic(x)` returns `TRUE`.
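We can verify this behaviour directly; a short sketch re-creating the `table` example above:

```
x = table(rpois(100, 5))  # a one-dimensional contingency table
is.vector(x)              # FALSE: x has attributes beyond names
is.atomic(x)              # TRUE: the underlying data is an atomic vector
attributes(x)             # reveals the dim, dimnames and class attributes
```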

The core vector data types are logicals, integers, doubles and characters. When an atomic vector is created with a mixture of types, the output type is coerced to the highest type in the following hierarchy:

`logical < integer < double < character`

This means that any vector containing a character string will be coerced to the character class.
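A quick sketch of this coercion hierarchy in action:

```
typeof(c(TRUE, 1L))  # "integer":   logical is promoted to integer
typeof(c(1L, 2.5))   # "double":    integer is promoted to double
typeof(c(1, "a"))    # "character": any character forces character
```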

Numbers in R are usually stored in double-precision floating-point format - see Braun and Murdoch (2007) and Goldberg (1991). The term ‘double’ refers to the fact that on \(32\) bit systems (for which the format was developed) two memory locations are used to store a single number. Each double-precision number occupies \(8\) bytes and is accurate to around \(17\) decimal places (R does not print all of these, as you will see by typing `pi`). Somewhat surprisingly, when we run the command

`x = 1`

we have created an atomic vector containing a single double-precision floating-point number. When comparing floating-point numbers we should be particularly careful, since

```
y = sqrt(2)*sqrt(2)
y == 2
```

`## [1] FALSE`

This is because the value of `y` is not exactly \(2\); instead it’s **almost** \(2\):

`sprintf("%.16f", y)`

`## [1] "2.0000000000000004"`

To compare numbers in R it is advisable to use `all.equal` and set an appropriate tolerance, e.g.

`all.equal(y, 2, tolerance = 1e-9)`

`## [1] TRUE`

Although double precision is the most common way of storing numbers, R does have other ways:

`single`: R doesn’t have a single-precision data type. Instead, all real numbers are stored in double-precision format. The functions `as.single` and `single` are identical to `as.double` and `double`, except that they set the attribute `Csingle`, which is used in the `.C` and `.Fortran` interfaces.

`integer`: Integers primarily exist to be passed to C or Fortran code. Typically we don’t worry about creating integers. However, they are occasionally used to optimise sub-setting operations. When we subset a data frame or matrix, we are interacting with C code. For example, if we look at the arguments of the `head` function

`args(head.matrix)`

```
## function (x, n = 6L, ...)
## NULL
```

The default argument is `6L` (the `L` is short for Literal and is used to create an integer). Since this function is called by almost everyone that uses R, this low-level optimisation is useful. To illustrate the speed increase, suppose we are selecting the first \(100\) rows from a data frame (`clock_speed`, from the **efficient** package). The speed increase is illustrated below, using the **microbenchmark** package:

```
library("microbenchmark")
data("clock_speed", package = "efficient")
s_int = 1:100; s = seq(1, 100, 1.0)
microbenchmark(clock_speed[s_int, 2L], clock_speed[s, 2.0], times = 1000000L)
```

```
## Unit: microseconds
##                    expr   min    lq  mean median    uq   max neval cld
##  clock_speed[s_int, 2L] 11.79 13.43 15.30  13.81 14.22 87979 1e+06   a
##     clock_speed[s, 2]   12.79 14.37 16.04  14.76 15.18 21964 1e+06   b
```

The above result shows that using integers is slightly faster, but probably not worth worrying about.

`numeric`: The function `numeric()` is identical to `double()`; it creates a double-precision number. However, `is.numeric()` isn’t the same as `is.double()`, since `is.numeric()` returns `TRUE` for both integer and double types.
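A short illustration of the distinction:

```
# is.numeric() is TRUE for both integer and double vectors,
# whereas is.double() is TRUE only for doubles
is.double(1L)    # FALSE
is.numeric(1L)   # TRUE
is.numeric(1.5)  # TRUE
```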

To find out the type of data stored in an R vector, use the command `typeof()`:

`typeof(c("a", "b"))`

`## [1] "character"`

A good way of learning more advanced programming concepts is to examine the source code of R.

- What are the data types of `c(1, 2, 3)` and `1:3`?
- Have a look at the following function definitions: `tail.matrix`, `lm`.
- How does the function `seq.int`, which is used in the `tail.matrix` function, differ from the standard `seq` function?

A factor is useful when you know all of the possible values a variable may take. For example, suppose our data set relates to months of the year:

`m = c("January", "December", "March")`

If we sort `m` in the usual way, `sort(m)`, we use standard alpha-numeric ordering, placing December first. While this is completely correct, it is not that helpful. We can use factors to remedy this problem by specifying the admissible levels:

```
## month.name contains the 12 months
fac_m = factor(m, levels=month.name)
sort(fac_m)
```

```
## [1] January March December
## 12 Levels: January February March April May June July August ... December
```

Most users interact with factors via the `read.csv` function, where character columns are automatically converted to factors. It is generally recommended to avoid this feature using the `stringsAsFactors = FALSE` argument. Although this argument can also be placed in the global `options()` list, doing so leads to non-portable code, so it should be avoided.

Although factors look similar to character vectors, they are actually integers. This leads to initially surprising behaviour:

`c(m)`

`## [1] "January" "December" "March"`

`c(fac_m)`

`## [1] 1 12 3`

In this case the `c()` function is using the underlying integer representation of the factor. Overall, factors are useful, but can lead to unwanted side-effects if we are not careful.
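To recover the original values rather than the integer codes, convert explicitly; a short sketch using `fac_m` from above:

```
m = c("January", "December", "March")
fac_m = factor(m, levels = month.name)
as.integer(fac_m)    # 1 12 3: positions in the levels vector
as.character(fac_m)  # "January" "December" "March": the labels
```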

In early versions of R, storing character data as a factor was more space efficient. However, since identical character strings now share storage, the space gained by using factors is minimal.

A data frame is a tabular (two dimensional or ‘rectangular’) object in which the columns may be composed of differing vector types, such as `numeric`, `logical`, `character` and so on. Matrices can only accept a single data type for all cells, as explained in the next section. Data frames are the workhorses of R. Many R functions, such as `boxplot`, `lm` and `ggplot`, expect your data set to be in a data frame. As a general rule, columns in your data should be variables and rows should be the thing of interest. This is illustrated in the `USArrests` data set:

`head(USArrests, 2)`

```
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
```

Note that each row corresponds to a particular state and each column to a variable. One particular trap to be wary of is that when using `read.csv` and `read.table`, characters are automatically converted to factors. One can avoid this pitfall by using the argument `stringsAsFactors = FALSE`.

Since working with R frequently involves interacting with data frames, it’s useful to be fluent in a few key functions:

Name | Description |
---|---|
`dim` | Data frame dimensions |
`ncol`/`nrow` | No. of columns/rows |
`NCOL`/`NROW` | As above, but also works with vectors |
`cbind`/`rbind` | Column/row bind |
`head`/`tail` | Select the first/last few rows |
`colnames`/`rownames` | Column and row names |

When loading a dataset called `df` into R, a typical workflow would be:

- Check dimensions using `dim(df)`;
- Look at the first/last few rows using `head(df)` and `tail(df)`;
- Rename columns using `colnames(df) =`.
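Using the built-in `USArrests` data set as a stand-in for a freshly loaded `df`, this workflow might look like:

```
df = USArrests           # stand-in for a freshly loaded dataset
dim(df)                  # 50 rows, 4 columns
head(df, 2); tail(df, 2) # inspect the first and last rows
colnames(df) = c("murder", "assault", "urban_pop", "rape")
```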

A matrix is similar to a data frame: it is a two-dimensional object, and sub-setting and other functions work in the same way. However, all matrix columns must have the same type. Matrices tend to be used during statistical calculations. Linear regression using `lm()`, for example, internally converts the data to a matrix before calculating the results; any characters are thus recoded as numeric dummy variables.

Matrices are generally faster than data frames. The datasets `ex_mat` and `ex_df` from the **efficient** package each have \(1000\) rows and \(100\) columns and contain the same random numbers. However, selecting rows from the data frame is around \(150\) times slower than from the matrix. This illustrates why matrices are preferred to data frames for efficient modelling in R:

```
library("rbenchmark")
data(ex_mat, ex_df, package = "efficient")
benchmark(replications = 10000,
          ex_mat[1, ], ex_df[1, ],
          columns = c("test", "elapsed", "relative"))
```

```
##          test elapsed relative
## 2  ex_df[1, ]   6.308   137.13
## 1 ex_mat[1, ]   0.046     1.00
```

R has three built-in object-oriented systems. These systems differ in how classes and methods are defined. The easiest and oldest is the S3 system, where S3 refers to the third version of S, the language on which R’s syntax is largely based. There have never been S1 or S2 classes in R.

The S3 system implements a generic-function object-oriented (OO) system. This type of OO differs from the message-passing style of Java and C++. In a message-passing framework, messages/methods are sent to objects and the object determines which function to call, e.g. `normal.rand(1)`. The S3 class system is different: in S3, the generic function decides which method to call - it would have the form `rand(normal, 1)`.

The S3 system is based on the class of an object. In this system, a class is just an attribute. The S3 class(es) of an object can be determined with the `class` function:

`class(USArrests)`

`## [1] "data.frame"`

The S3 system can be used to great effect. For example, a `data.frame` is simply a standard R list with class `data.frame`. When we pass an object to a *generic* function, the function first examines the class of the object and then decides what to do: it dispatches to another method. The generic `summary` function, for example, contains the following:

`summary`

```
## function (object, ...)
## UseMethod("summary")
## <bytecode: 0x5f5c890>
## <environment: namespace:base>
```

Note that the only operational line is `UseMethod("summary")`. This handles the method dispatch based on the object’s class. So when `summary(USArrests)` is executed, the generic `summary` function passes `USArrests` to the function `summary.data.frame`.
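To see dispatch in action, we can define a method for a made-up class (the class name `my_class` below is purely illustrative):

```
x = rnorm(10)
class(x) = "my_class"   # a class is just an attribute
# Define a summary method for our hypothetical class
summary.my_class = function(object, ...) {
  cat("my_class object of length", length(object), "\n")
}
summary(x)              # UseMethod dispatches to summary.my_class
```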

This simple mechanism enables us to quickly create our own functions. Consider the distance object:

`dist_usa = dist(USArrests)`

`dist_usa` has class `dist`. To visualise the distances, we can create an image method. First we’ll check if the existing `image` function is generic:

`image`

```
## function (x, ...)
## UseMethod("image")
## <bytecode: 0x6dcfd78>
## <environment: namespace:graphics>
```

Since `image` is already a generic function, we just have to create a specific `dist` method:

```
image.dist = function(x, ...) {
  x_mat = as.matrix(x)
  image(x_mat, main = attr(x, "method"), ...)
}
```

The `...` argument allows us to pass arguments to the `image` method, such as `axes` (see Figure 7.1).

Many S3 methods work in the same way as the simple `image.dist` function created above: the object is converted into a standard format, then passed to the standard method. Creating S3 methods for standard functions such as `summary`, `mean` and `plot` provides a nice, uniform interface to a wide variety of data types.

- Use a combination of `unclass` and `str` on a data frame to confirm that it is a list.
- Use the function `length` on a data frame. What is returned? Why?

Even when our data set is small, the analysis can generate large objects. For example, suppose we want to perform standard cluster analysis. Using the built-in data set `USArrests`, we calculate a distance matrix:

`dist_usa = dist(USArrests)`

The resulting object `dist_usa` measures the similarity between states with respect to the input data. Since there are \(50\) states in the `USArrests` data set, this corresponds to a matrix with \(50\) rows and \(50\) columns. Intuitively, since the matrix is symmetric around the diagonal, it makes sense to exploit this characteristic for efficiency, allowing storage to be halved. If we examine the object with `str(dist_usa)`, it becomes apparent that the data is efficiently stored as a vector with some attributes.
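We can confirm this storage saving directly: a `dist` object keeps only the lower triangle, i.e. \(50 \times 49 / 2 = 1225\) values instead of \(2500\):

```
dist_usa = dist(USArrests)
length(dist_usa)                  # 1225 = 50 * 49 / 2
object.size(dist_usa)             # noticeably smaller than...
object.size(as.matrix(dist_usa))  # ...the full 50 x 50 matrix
```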

Another efficient data structure is a sparse matrix. This is simply a matrix in which most of the elements are zero. Conversely, if most elements are non-zero, the matrix is considered dense. The proportion of elements that are zero is called the sparsity. Large sparse matrices often crop up when performing numerical calculations. Typically, our data isn’t sparse, but the data structures we create from it may be. There are a number of techniques/methods used to store sparse matrices, and methods for creating them can be found in the **Matrix** package. The `dist` object above does not need a sparse representation, however, since its structure is regular.
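As a brief sketch of what the **Matrix** package offers (the dimensions and values below are arbitrary):

```
library("Matrix")  # ships with R as a recommended package
N = 10000
# A sparse N x N matrix with non-zero entries only on the diagonal;
# only those N values are stored, not all N^2 cells
sp = sparseMatrix(i = 1:N, j = 1:N, x = 2)
print(object.size(sp), units = "Kb")
# A dense equivalent would need N^2 * 8 bytes, i.e. roughly 800 Mb
```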