7.1 Data types

A data type is an object that has a set of predefined characteristics, such as a number or a character. When programming in C or FORTRAN, the data type of every object must be specified by the user. The advantage is that it allows the compiler to perform type-specific optimisation. The downside is verbose and fragile code, which is inefficient to type. In R data types are less critical, but understanding them will help you debug and optimise for computational efficiency. Essentially, we have a trade-off between CPU run time and developer thinking time. In this chapter, we will pick out the key data types from an efficiency perspective. Chapter 2 of Advanced R Programming (Wickham 2014a) provides a more comprehensive treatment.

7.1.1 Vectors

The vector is a fundamental data structure in R. Confusingly there are two varieties:

  • Atomic vectors are where all elements have the same type and are usually created using the c() function;
  • Lists are where elements can have different types.

To test if an object is a vector, we must use is.atomic(x) || is.list(x). The more obvious choice for determining if an object is a vector, is.vector(x), only returns TRUE if an object is a vector with no attributes other than names. For example, when we use the table function

x = table(rpois(100, 5))

the object x has additional attributes (such as dim), so is.vector(x) returns FALSE. But the contents of x are clearly a vector, so is.atomic(x) returns TRUE.
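
The difference between the two checks can be verified directly. The following is a minimal sketch; the helper name is_vec is hypothetical and is not part of base R.

```r
# Hypothetical helper: TRUE for atomic vectors and lists,
# regardless of the attributes they carry
is_vec = function(x) is.atomic(x) || is.list(x)

x = table(rpois(100, 5))
is.vector(x)          # FALSE: x carries dim, dimnames and class attributes
is_vec(x)             # TRUE: the underlying data is an atomic vector
is_vec(list(1, "a"))  # TRUE: lists count as vectors too
```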

The core vector data types are logicals, integers, doubles and characters. When an atomic vector is created with a mixture of types, the output type is coerced to the highest type in the following hierarchy:

logical < integer < double < character 

This means that any vector containing a character string will be coerced to character.
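
The coercion hierarchy can be checked with typeof(). This is a small sketch using made-up values, moving one step up the hierarchy at a time:

```r
typeof(c(TRUE, 2L))    # "integer": the logical is promoted
typeof(c(2L, 3.5))     # "double": the integer is promoted
typeof(c(3.5, "a"))    # "character": the double is promoted

# With all four types present, every element becomes a string
mixed = c(TRUE, 2L, 3.5, "a")
mixed
```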

7.1.1.1 Numerics: doubles and integers

Numbers in R are usually stored in double-precision floating-point format; see Braun and Murdoch (2007) and Goldberg (1991). The term ‘double’ refers to the fact that on \(32\) bit systems (for which the format was developed) two memory locations are used to store a single number. Each double-precision number occupies \(8\) bytes and is accurate to roughly \(15\) to \(17\) significant decimal digits (R does not print all of these, as you will see by typing pi). Somewhat surprisingly, when we run the command

x = 1

we have created an atomic vector, containing a single double-precision floating point number. When comparing floating point numbers, we should be particularly careful, since

y = sqrt(2)*sqrt(2)
y == 2
## [1] FALSE

This is because the value of y is not exactly \(2\); instead it’s almost \(2\):

sprintf("%.16f", y)
## [1] "2.0000000000000004"

To compare numbers in R it is advisable to use all.equal and set an appropriate tolerance, e.g.

all.equal(y, 2, tolerance = 1e-9)
## [1] TRUE
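
When the comparison is used inside a condition, it helps to know that all.equal() returns either TRUE or a character string describing the difference, never FALSE. A minimal sketch of a wrapper follows; the name approx_equal and the default tolerance are assumptions made for illustration.

```r
y = sqrt(2) * sqrt(2)

# Hypothetical helper: isTRUE() turns the "TRUE or description" result
# of all.equal() into a plain logical that is safe to use in if()
approx_equal = function(a, b, tol = 1e-9) {
  isTRUE(all.equal(a, b, tolerance = tol))
}

approx_equal(y, 2)    # TRUE
approx_equal(y, 2.1)  # FALSE
```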

Although double precision is the most common way of storing numbers, R does have other ways of storing them:

  • single: R doesn’t have a single precision data type. Instead, all real numbers are stored in double precision format. The functions as.single and single are identical to as.double and double except they set the attribute Csingle that is used in the .C and .Fortran interface.

  • integer: Integers primarily exist to be passed to C or Fortran code. Typically we don’t worry about creating integers. However they are occasionally used to optimise sub-setting operations. When we subset a data frame or matrix, we are interacting with C code. For example, if we look at the arguments for the head function

args(head.matrix)
## function (x, n = 6L, ...) 
## NULL

The default argument is 6L (the L suffix creates an integer rather than a double). Since this function is called by almost everyone who uses R, this low level optimisation is useful. To illustrate the speed increase, suppose we are selecting the first \(100\) rows from a data frame (clock_speed, from the efficient package). The speed increase is illustrated below, using the microbenchmark package:

library("microbenchmark")
## s_int is an integer vector; s holds the same values stored as doubles
s_int = 1:100; s = seq(1, 100, 1.0)
microbenchmark(clock_speed[s_int, 2L], clock_speed[s, 2.0], times=1000000L)
## Unit: microseconds
##                    expr   min    lq  mean median    uq   max neval cld
##  clock_speed[s_int, 2L] 11.79 13.43 15.30  13.81 14.22 87979 1e+06  a 
##       clock_speed[s, 2] 12.79 14.37 16.04  14.76 15.18 21964 1e+06   b

The above result shows that using integers is slightly faster, but probably not worth worrying about.

  • numeric: The function numeric() is identical to double(); it creates a double-precision vector. However, is.numeric() isn’t the same as is.double(); instead is.numeric() returns TRUE for both integer and double types.

To find out the type of data stored in an R vector use the command typeof():

typeof(c("a", "b"))
## [1] "character"
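
The distinctions between the numeric types described above can be confirmed in the same way. A short sketch:

```r
typeof(1)        # "double": a bare number is stored in double precision
typeof(1L)       # "integer": the L suffix requests an integer
is.numeric(1L)   # TRUE: is.numeric() accepts integers and doubles
is.double(1L)    # FALSE: an integer is not a double

# numeric() and double() create the same zero-filled double vector
same = identical(numeric(3), double(3))
same
```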

7.1.1.2 Exercises

A good way of learning how to use more advanced programming concepts is to examine the source code of R.

  1. What are the data types of c(1, 2, 3) and 1:3?
  2. Have a look at the following function definitions:
    • tail.matrix
    • lm
  3. How does the function seq.int, which was used in the tail.matrix function, differ from the standard seq function?

7.1.2 Factors

A factor is useful when you know all of the possible values a variable may take. For example, suppose our data set related to months of the year

m = c("January", "December", "March")

If we sort m in the usual way sort(m), we use standard alpha-numeric ordering, placing December first. While this is completely correct, it is also not that helpful. We can use factors to remedy this problem by specifying the admissible levels

## month.name contains the 12 months
fac_m = factor(m, levels=month.name)
sort(fac_m)
## [1] January  March    December
## 12 Levels: January February March April May June July August ... December

Most users interact with factors via the read.csv function where, in versions of R before 4.0.0, character columns are automatically converted to factors by default. In those versions it is generally recommended to avoid this feature using the stringsAsFactors=FALSE argument. Although this argument can also be placed in the global options() list, this leads to non-portable code, so should be avoided.

Although factors look similar to character vectors, they are actually integers. This leads to initially surprising behaviour

c(m)
## [1] "January"  "December" "March"
c(fac_m)
## [1]  1 12  3

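In this case the c() function is using the underlying integer representation of the factor. (From R 4.1.0 onwards, c() applied to a factor returns a factor instead; as.integer(fac_m) exposes the integer codes in any version.) Overall factors are useful, but can lead to unwanted side-effects if we are not careful.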
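
The integer representation can be inspected explicitly, without relying on the behaviour of c(). A minimal sketch, reusing the m and fac_m objects defined above:

```r
m = c("January", "December", "March")
fac_m = factor(m, levels = month.name)

typeof(fac_m)         # "integer": the storage type of a factor
as.integer(fac_m)     # 1 12 3: positions within the levels
as.character(fac_m)   # recovers the original strings
```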

In early versions of R, storing character data as a factor was more space efficient. However, since identical character strings now share storage, the space gain from factors is now small.

7.1.3 Data frames

A data frame is a tabular (two dimensional or ‘rectangular’) object in which the columns may be composed of differing vector types such as numeric, logical, character and so on. Matrices can only accept a single data type for all cells, as explained in the next section. Data frames are the workhorses of R. Many R functions, such as boxplot, lm and ggplot, expect your data set to be in a data frame. As a general rule, columns in your data should be variables and rows should be the things of interest (the observations). This is illustrated in the USArrests data set:

head(USArrests, 2)
##         Murder Assault UrbanPop Rape
## Alabama   13.2     236       58 21.2
## Alaska    10.0     263       48 44.5

Note that each row corresponds to a particular state and each column to a variable. One particular trap to be wary of is that, in versions of R before 4.0.0, read.csv and read.table automatically convert character columns to factors. One can avoid this pitfall by using the argument stringsAsFactors = FALSE.

Since working with R frequently involves interacting with data frames, it’s useful to be fluent in a few key functions:

Useful data frame functions.
Name               Description
dim                Data frame dimensions
ncol/nrow          No. of columns/rows
NCOL/NROW          As above, but also works with vectors
cbind/rbind        Column/row bind
head/tail          Select the first/last few rows
colnames/rownames  Column and row names

When loading a dataset called df into R, a typical workflow would be:

  • Check dimensions using dim(df);
  • Look at the first/last few rows using head(df) and tail(df);
  • Rename columns using colnames(df) =.
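
The workflow above might look as follows in practice. This is only a sketch: the built-in USArrests data stands in for a freshly loaded data set, and the lower-case renaming rule is an arbitrary choice made for illustration.

```r
df = USArrests      # stand-in for a newly loaded data set

dim(df)             # 50 rows and 4 columns
head(df, 3)         # first few rows
tail(df, 3)         # last few rows

# Illustrative renaming rule: convert all column names to lower case
colnames(df) = tolower(colnames(df))
colnames(df)
```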

7.1.4 Matrix

A matrix is similar to a data frame: it is a two dimensional object and sub-setting and other functions work in the same way. However all matrix columns must have the same type. Matrices tend to be used during statistical calculations. Linear regression using lm(), for example, internally converts the data to a matrix before calculating the results; any characters are thus recoded as numeric dummy variables.
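
The single-type restriction is easy to see when converting a data frame with as.matrix(). A small sketch, using a made-up two-column data frame alongside USArrests:

```r
# Mixed column types: everything is coerced to character
df_mixed = data.frame(x = 1:2, y = c("a", "b"))
m_mixed = as.matrix(df_mixed)
typeof(m_mixed)   # "character"

# Integer and double columns: everything is coerced to double
m_num = as.matrix(USArrests)
typeof(m_num)     # "double"
dim(m_num)        # dimensions are unchanged by the conversion
```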

Matrices are generally faster than data frames. The datasets ex_mat and ex_df from the efficient package each have \(1000\) rows and \(100\) columns. They contain the same random numbers. However, selecting rows from a data frame is around \(150\) times slower than selecting them from a matrix. This illustrates the reason for using matrices instead of data frames for efficient modelling in R:

library("rbenchmark")  ## provides benchmark()
data(ex_mat, ex_df, package="efficient")
benchmark(replications=10000, 
          ex_mat[1,], ex_df[1,], 
          columns=c("test", "elapsed", "relative"))
##          test elapsed relative
## 2  ex_df[1, ]   6.308   137.13
## 1 ex_mat[1, ]   0.046     1.00

7.1.5 S3 objects

R has three built-in object oriented systems. These systems differ in how classes and methods are defined. The easiest and oldest system is the S3 system. S3 refers to the third version of S. The syntax of R is largely based on this version of S. In R there have never been S1 or S2 classes.

The S3 system implements a generic-function object oriented (OO) system. This type of OO is different to the message-passing style of Java and C++. In a message-passing framework, messages/methods are sent to objects and the object determines which function to call, e.g. normal.rand(1). The S3 class system is different. In S3, the generic function decides which method to call - it would have the form rand(normal, 1).

The S3 system is based on the class of an object. In this system, a class is just an attribute. The S3 class(es) of an object can be determined with the class function.

class(USArrests)
## [1] "data.frame"

The S3 system can be used to great effect. For example, a data.frame is simply a standard R list, with class data.frame. When we pass an object to a generic function, the function first examines the class of the object, and then decides what to do: it dispatches to another method. The generic summary function, for example, contains the following:

summary
## function (object, ...) 
## UseMethod("summary")
## <bytecode: 0x5f5c890>
## <environment: namespace:base>

Note that the only operational line is UseMethod("summary"). This handles the method dispatch based on the object’s class. So when summary(USArrests) is executed, the generic summary function passes USArrests to the function summary.data.frame.
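
The same dispatch mechanism can be reproduced with a toy generic. In the sketch below, the generic area and the classes circle and square are hypothetical names invented for illustration; only UseMethod() and structure() are base R.

```r
# Hypothetical generic: UseMethod() picks a method from the class of `shape`
area = function(shape, ...) UseMethod("area")

# Methods are ordinary functions named generic.class
area.circle = function(shape, ...) pi * shape$r^2
area.square = function(shape, ...) shape$side^2

# Fallback used when no class-specific method exists
area.default = function(shape, ...) stop("no area method for this class")

circ = structure(list(r = 1), class = "circle")
sq = structure(list(side = 2), class = "square")

area(circ)   # dispatches to area.circle
area(sq)     # dispatches to area.square
```

Calling area() on an object of any other class falls through to area.default, which raises an error.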

This simple mechanism enables us to quickly create our own functions. Consider the distance object:

dist_usa = dist(USArrests)

dist_usa has class dist. To visualise the distances, we can create an image method. First we’ll check if the existing image function is generic, via

image
## function (x, ...) 
## UseMethod("image")
## <bytecode: 0x6dcfd78>
## <environment: namespace:graphics>

Since image is already a generic function, we just have to create a specific dist method

image.dist = function(x, ...) {
  x_mat = as.matrix(x)
  image(x_mat, main=attr(x, "method"), ...)  
}

The ... argument allows us to pass arguments to the main image method, such as axes (see figure 7.1).

Figure 7.1: S3 image method for data of class dist.

Many S3 methods work in the same way as the simple image.dist function created above: the object is converted into a standard format, then passed to the standard method. Creating S3 methods for standard functions such as summary, mean, and plot provides a nice uniform interface to a wide variety of data types.
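
The convert-then-delegate pattern can be sketched for an existing generic such as mean. The class name readings and its list layout are assumptions made purely for illustration.

```r
# Hypothetical class: a list holding raw values plus a unit label
temps = structure(list(values = c(20.5, 22, NA), unit = "C"),
                  class = "readings")

# Method for the existing generic mean(): extract the standard
# numeric vector, then hand over to the default method
mean.readings = function(x, ...) mean(x$values, ...)

# Extra arguments travel through ... to the default method
avg = mean(temps, na.rm = TRUE)
avg
```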

7.1.5.1 Exercises

  1. Use a combination of unclass and str on a data frame to confirm that it is a list.
  2. Use the function length on a data frame. What is returned? Why?

7.1.6 Efficient data structures

Even when our data set is small, the analysis can generate large objects. For example, suppose we want to perform standard cluster analysis. Using the built-in data set USArrests, we calculate a distance matrix:

dist_usa = dist(USArrests)

The resulting object dist_usa measures the distance between each pair of states with respect to the input data. Since there are \(50\) states in the USArrests data set, this results in a matrix with \(50\) columns and \(50\) rows. Intuitively, since the matrix dist_usa is symmetric around the diagonal, it makes sense to exploit this characteristic for efficiency, allowing storage to be halved. If we examine the object dist_usa, with str(dist_usa), it becomes apparent that the data is efficiently stored as a vector with some attributes.
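
The saving can be quantified. For \(n\) items, a dist object keeps only the \(n(n-1)/2\) entries below the diagonal. A short sketch comparing it with the full matrix:

```r
dist_usa = dist(USArrests)

n = attr(dist_usa, "Size")       # number of states: 50
length(dist_usa)                 # 1225, which is n * (n - 1) / 2

# Expanding to a full matrix stores every distance twice, plus the diagonal
full = as.matrix(dist_usa)
dim(full)                        # 50 50
object.size(dist_usa) < object.size(full)
```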

Another efficient data structure is a sparse matrix. This is simply a matrix in which most of the elements are zero. Conversely, if most elements are non-zero, the matrix is considered dense. The proportion of zero elements is called the sparsity. Large sparse matrices often crop up when performing numerical calculations. Typically, our data isn’t sparse but the resulting data structures we create may be sparse. There are a number of techniques/methods used to store sparse matrices. Methods for creating sparse matrices can be found in the Matrix package. For this dist object, since the structure is regular, storing the lower triangle as a plain vector is sufficient.
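
As a rough illustration of the storage difference, the sketch below builds a small sparse matrix with sparseMatrix() from the Matrix package and compares it with its dense equivalent. The dimensions and the three non-zero entries are arbitrary values chosen for the example.

```r
library("Matrix")

# A 1000 x 1000 matrix with only three non-zero entries on the diagonal
sp = sparseMatrix(i = c(1, 500, 1000), j = c(1, 500, 1000),
                  x = c(1, 2, 3), dims = c(1000, 1000))
nnzero(sp)                 # number of non-zero elements

# The dense version stores all one million cells explicitly
dense = as.matrix(sp)
object.size(sp) < object.size(dense)
```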