5.6 Updating column classes

The class of an R object is critical to the way it is handled. If a class is incorrectly specified (if numbers are treated as characters or factors, for example), R is likely to generate warnings or errors. Try typing mean(idata$gini), for example.
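
A quick way to diagnose such problems is to inspect the column classes directly (a minimal check, assuming idata has been loaded as in the previous sections):

class(idata$gini) # the class of a single column
sapply(idata, class) # the classes of all columns at once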

We can re-assign the classes of the numeric variables one by one:

idata$gini = as.numeric(idata$gini)
## Warning: NAs introduced by coercion
mean(idata$gini, na.rm = TRUE) # now the mean is calculated
## [1] 40.50363

However, the purpose of programming languages is to automate arduous tasks and reduce typing. The following command re-classifies all of the numeric variables using the apply function (we’ll see more of apply’s relatives later):

idata[5:9] = apply(idata[5:9], 2,
  function(x) as.numeric(x))
countries = group_by(idata, Country)
summarise(countries, gini = mean(gini, na.rm  = TRUE))
## Source: local data frame [176 x 2]
## 
##         Country     gini
##           (chr)    (dbl)
## 1   Afghanistan      NaN
## 2       Albania 30.43167
## 3       Algeria 37.76000
## 4        Angola 50.65000
## 5     Argentina 48.06739
## 6       Armenia 33.72929
## 7     Australia 33.14167
## 8       Austria 29.15167
## 9    Azerbaijan 24.79000
## 10 Bahamas, The      NaN
## ..          ...      ...
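
As an aside, the same coercion could also be achieved with lapply, one of apply’s relatives (a minimal alternative sketch; it converts the same columns while keeping the result as a data frame):

idata[5:9] = lapply(idata[5:9], as.numeric)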

Note that summarise is highly versatile, and can be used to return a customised range of summary statistics:

summarise(countries,
  # number of rows per country
  obs = n(), 
  med_t10 = median(top10, na.rm  = TRUE),
  # standard deviation
  sdev = sd(gini, na.rm  = TRUE), 
  # number with gini > 30
  n30 = sum(gini > 30, na.rm  = TRUE), 
  sdn30 = sd(gini[ gini > 30 ], na.rm  = TRUE),
  # range
  dif = max(gini, na.rm  = TRUE) - min(gini, na.rm  = TRUE)
  )
## Source: local data frame [176 x 7]
## 
##         Country   obs med_t10      sdev   n30      sdn30   dif
##           (chr) (int)   (dbl)     (dbl) (int)      (dbl) (dbl)
## 1   Afghanistan    40      NA       NaN     0         NA    NA
## 2       Albania    40  24.435  1.252524     3  0.3642801  2.78
## 3       Algeria    40  29.780  3.436539     2  3.4365390  4.86
## 4        Angola    40  38.555 11.299566     2 11.2995664 15.98
## 5     Argentina    40  36.320  3.182462    23  3.1824622 11.00
## 6       Armenia    40  27.835  4.019532    12  3.9567778 14.84
## 7     Australia    40  24.785  1.075089     6  1.0750891  2.81
## 8       Austria    40  23.120  3.120849     4  0.6859300  8.48
## 9    Azerbaijan    40  17.960  9.479029     3  1.7386489 20.27
## 10 Bahamas, The    40      NA       NaN     0         NA    NA
## ..          ...   ...     ...       ...   ...        ...   ...

To showcase the power of summarise used on a grouped_df, the above code reports a wide range of customised summary statistics per country:

  • the number of rows in each country group
  • the median proportion of income earned by the top 10%
  • the standard deviation of gini indices
  • the number of years in which the gini index was greater than 30
  • the standard deviation of gini index values over 30
  • the range of gini index values reported for each country.

Challenge: explore dplyr’s documentation, starting with the introductory vignette, accessed by entering vignette("introduction"), and test out its capabilities on the idata dataset. (More vignette names can be discovered by typing vignette(package = "dplyr").)

5.6.1 Chaining operations with dplyr

Another interesting feature of dplyr is its ability to chain operations together. This overcomes one of the aesthetic issues with R code: you can end up with very long commands with many functions nested inside each other to answer relatively simple questions.

What were, on average, the 5 most unequal years for countries containing the letter g?

Here’s how chains work to organise the analysis in a logical step-by-step manner:

idata %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)
## Selecting by gini
## Source: local data frame [5 x 2]
## 
##    Year   gini
##   (int)  (dbl)
## 1  1980 46.850
## 2  1993 45.996
## 3  2013 44.550
## 4  1981 43.650
## 5  2012 43.560

The above chain consists of five stages (plus the initial dataset), each of which corresponds to a new line and a dplyr function; the %>% ‘pipes’ linking them are illustrated after the list:

  1. Filter the countries we’re interested in (any selection criteria could be used in place of grepl("g", Country)).
  2. Group the output by year.
  3. Summarise, for each year, the mean gini index.
  4. Arrange the results by average gini index, in descending order.
  5. Select only the top 5 most unequal years.
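
The %>% pipe operator, which dplyr imports from the magrittr package, passes the object on its left as the first argument of the function on its right. A minimal illustration of this equivalence (assuming dplyr is loaded):

mean(idata$gini, na.rm = TRUE) # a conventional function call
idata$gini %>% mean(na.rm = TRUE) # the same call written as a pipe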

To see why this method is preferable to the nested function approach, take a look at the latter. Even after indenting properly it looks terrible and is almost impossible to understand!

top_n(
  arrange(
    summarise(
      group_by(
        filter(idata, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)

This section has provided only a taster of what is possible with dplyr and why it makes sense from both code-writing and computational efficiency perspectives. For a more detailed account of data processing with R using this approach we recommend R for Data Science (Grolemund and Wickham 2016).

5.6.2 data.table

data.table is a mature package for fast data processing that presents an alternative to dplyr. There is some controversy about which is more appropriate for different tasks, so it should be stated at the outset that dplyr and data.table are not mutually exclusive competitors: neither must be ‘better’ than the other. Both are excellent packages and the important thing from an efficiency perspective is that they can help speed up data processing tasks.

data.table does have some unique features that make it very fast at certain tasks, however, and these are worth being aware of. Building on the filter() example above, we’ll explore data.table’s unique approach to subsetting.

library(data.table)
idata = readRDS("data/idata-renamed.Rds")
idata_dt = data.table(idata) # convert to the data.table class
setkey(idata_dt, Country) # set a 'key' on Country to enable fast subsetting
aus3 = idata_dt["Australia",] # subset rows whose key matches "Australia"
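
Because a key has been set, the last line above uses a fast binary search on Country rather than scanning every row. For comparison, the same rows could be selected with base R or dplyr (a quick sketch; the results should match apart from row order, since setkey sorts the table):

aus_base = idata[idata$Country == "Australia", ] # base R vector scan
aus_dplyr = filter(idata, Country == "Australia") # dplyr equivalent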

Figure 5.4: Benchmark illustrating the performance gains to be expected for different dataset sizes.

Figure 5.4 illustrates the speed improvement benefits of data.table for different dataset sizes. As with the readr vs base R comparison, the results show that the relative benefits of data.table improve with dataset size.
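
The code used to generate the benchmark is not reproduced here, but a comparison along these lines can be run with the microbenchmark package (assuming it is installed), using the objects created above:

library(microbenchmark)
microbenchmark(
  base = idata[idata$Country == "Australia", ],
  dplyr = filter(idata, Country == "Australia"),
  data.table = idata_dt["Australia", ]
)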