4.6 Data processing with data.table

data.table is a mature package for fast data processing that presents an alternative to dplyr. There is some controversy about which is more appropriate for different tasks,¹ so it should be stated at the outset that dplyr and data.table are not mutually exclusive competitors. Both are excellent packages and the important thing from an efficiency perspective is that they can help speed up data processing tasks.

The foundational object class of data.table is the data.table. Like dplyr’s tbl_df, data.table’s data.table objects behave in the same way as the base data.frame class. However, the data.table paradigm has some unique features that make it highly computationally efficient for many common tasks in data analysis. Building on the subsetting methods using [ and filter() presented in Section 4.5.4, we’ll see data.table’s unique approach to subsetting. Like base R, data.table uses square brackets, but you do not need to refer to the object name inside the brackets:

library("data.table")
idata = readRDS("data/idata-renamed.Rds")
idata_dt = data.table(idata) # convert to data.table class
aus3a = idata_dt[Country == "Australia"]
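As a minimal illustration of this syntax, the sketch below uses a small synthetic table in place of idata (the column Value is invented for the example) to show that the data.table subset selects the same rows as the equivalent base R expression:

```r
library("data.table")

# Small synthetic stand-in for idata; Value is an illustrative column
df = data.frame(Country = c("Australia", "Brazil", "Australia"),
                Value = c(1, 2, 3))
dt = data.table(df)

dt_sub = dt[Country == "Australia"]      # no df$ prefix, no trailing comma needed
df_sub = df[df$Country == "Australia", ] # base R equivalent

identical(dt_sub$Value, df_sub$Value)    # TRUE: the same rows are selected
```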

To boost performance, one can set ‘keys’. These are ‘supercharged rownames’ which order the table based on one or more variables. This allows a binary search algorithm to subset the rows of interest, which is much, much faster than the vector scan approach used in base R (see vignette("datatable-keys-fast-subset")). data.table uses the key values for subsetting by default, so the variable does not need to be mentioned again. Instead, when keys are set, the search criteria are provided as a list (invoked below with the concise .() syntax).

setkey(idata_dt, Country)
aus3b = idata_dt[.("Australia")]
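The .() list can hold more than one value, so several key levels can be retrieved in a single binary search. A sketch on a small synthetic table (the column Value is assumed for illustration; the same pattern applies to idata_dt):

```r
library("data.table")

dt = data.table(Country = c("Australia", "Brazil", "Chile"), Value = 1:3)
setkey(dt, Country)   # one-off sort by Country, marking it as the key
key(dt)               # "Country"

# .() is shorthand for list(); multiple key values are matched in one search
dt[.(c("Australia", "Chile"))]
```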

The result is the same, so why add the extra stage of setting the key? The reason is that this one-off sorting operation can lead to substantial performance gains in situations where repeatedly subsetting rows on large datasets consumes a large proportion of computational time in your workflow. This is illustrated in Figure 4.4, which compares four methods of subsetting incrementally larger versions of the idata dataset.

Figure 4.4: Benchmark illustrating the performance gains to be expected for different dataset sizes.
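A benchmark along these lines can be sketched with base R’s system.time(), here on a synthetic stand-in for idata (the benchmark behind the figure uses larger datasets and more careful timing; the column names are invented for illustration):

```r
library("data.table")

n = 1e5
df = data.frame(Country = sample(c("Australia", "Brazil", "Chile"), n, replace = TRUE),
                Value = rnorm(n))
dt  = data.table(df)                  # unkeyed data.table
dtk = copy(dt); setkey(dtk, Country)  # keyed copy (one-off sorting cost)

reps = 100
t_base = system.time(for (i in seq_len(reps)) df[df$Country == "Australia", ])["elapsed"]
t_dt   = system.time(for (i in seq_len(reps)) dt[Country == "Australia"])["elapsed"]
t_key  = system.time(for (i in seq_len(reps)) dtk[.("Australia")])["elapsed"]

c(base = t_base, data.table = t_dt, keyed = t_key)
```

All three approaches return the same rows; only the time taken differs, and the gap grows with dataset size.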

Figure 4.4 demonstrates that data.table is much faster than base R and dplyr at subsetting. As with using external packages to read in data (see Section 4.3.1), the relative benefits of data.table improve with dataset size, approaching a ~70 fold improvement on base R and a ~50 fold improvement on dplyr as the dataset size reaches half a gigabyte. Interestingly, even the ‘non-key’ implementation of data.table’s subset method is faster than the alternatives: this is because data.table creates a key internally by default before subsetting. Avoiding this repeated key creation accounts for the further ~10 fold speed-up in cases where the key has been pre-generated.

This section has introduced data.table as a complementary approach to base and dplyr methods for data processing and illustrated the performance gains of using keys for subsetting tables. data.table is a mature and powerful package which uses clever computational principles implemented in C to provide efficient methods for a number of other operations for data analysis. These include highly efficient data reshaping, dataset merging (also known as joining, as with left_join() in dplyr) and grouping. These are explained in the vignettes datatable-intro and datatable-reshape. The datatable-reference-semantics vignette explains data.table’s unique syntax.
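To give a flavour of these operations, the sketch below shows grouped aggregation and a keyed join on small synthetic tables (the table and column names are invented for illustration):

```r
library("data.table")

sales = data.table(Country = c("Australia", "Brazil", "Australia", "Brazil"),
                   Value   = c(10, 20, 30, 40))

# The j expression is evaluated per group: the data.table analogue of
# dplyr's group_by() followed by summarise()
totals = sales[, .(Total = sum(Value)), by = Country]

regions = data.table(Country = c("Australia", "Brazil"),
                     Region  = c("Oceania", "Americas"))

# The X[Y] syntax performs a join on the keys, similar to left_join() in dplyr
setkey(totals, Country)
setkey(regions, Country)
totals[regions]
```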

  1. One question on the Stack Overflow website, titled ‘data.table vs dplyr’, illustrates this controversy and delves into the philosophy underlying each approach.