The *class* of R objects is critical to how they perform. If a class is incorrectly specified (if numbers are treated as factors, for example), R will likely generate error messages. Try typing `mean(idata$gini)`, for example.
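A related pitfall is worth illustrating with a toy vector (hypothetical, not part of the `idata` example): calling `as.numeric()` directly on a factor returns the underlying integer level codes, not the values they represent, so the conversion should go via `as.character()`:

```r
# Numbers accidentally stored as a factor, e.g. after reading a messy CSV
f <- factor(c("3.1", "2.5", "4.0"))

as.numeric(f)                # level codes (2 1 3), not the values!
as.numeric(as.character(f))  # correct: 3.1 2.5 4.0
```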

We can re-assign the classes of the numeric variables one by one:

```
idata$gini = as.numeric(idata$gini)
```

```
## Warning: NAs introduced by coercion
```

```
mean(idata$gini, na.rm = TRUE) # now the mean is calculated
```

```
## [1] 40.50363
```

However, the purpose of programming languages is to *automate* arduous tasks and reduce typing. The following command re-classifies all of the numeric variables using the `apply` function (we'll see more of `apply`'s relatives later):

```
idata[5:9] = apply(idata[5:9], 2,
                   function(x) as.numeric(x))
```
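Note that `apply()` converts a data frame to a matrix internally; `lapply()` operates column-by-column and keeps the data-frame structure. A sketch on a hypothetical toy data frame (standing in for `idata[5:9]`):

```r
# Toy data frame with character columns that should be numeric
df <- data.frame(gini  = c("30.4", "37.8"),
                 top10 = c("24.4", "29.8"),
                 stringsAsFactors = FALSE)

# lapply() keeps the data-frame structure, avoiding apply()'s
# intermediate matrix conversion; df[] preserves the class of df
df[] <- lapply(df, as.numeric)
```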

With the numeric variables correctly classified, we can use **dplyr** to group the data by country and calculate the mean gini index for each:

```
countries = group_by(idata, Country)
summarise(countries, gini = mean(gini, na.rm = TRUE))
```

```
## Source: local data frame [176 x 2]
## 
##         Country     gini
##           (chr)    (dbl)
## 1   Afghanistan      NaN
## 2       Albania 30.43167
## 3       Algeria 37.76000
## 4        Angola 50.65000
## 5     Argentina 48.06739
## 6       Armenia 33.72929
## 7     Australia 33.14167
## 8       Austria 29.15167
## 9    Azerbaijan 24.79000
## 10 Bahamas, The      NaN
## ..          ...      ...
```

Note that `summarise` is highly versatile, and can be used to return a customised range of summary statistics:

```
summarise(countries,
          # number of rows per country
          obs = n(),
          med_t10 = median(top10, na.rm = TRUE),
          # standard deviation
          sdev = sd(gini, na.rm = TRUE),
          # number with gini > 30
          n30 = sum(gini > 30, na.rm = TRUE),
          sdn30 = sd(gini[gini > 30], na.rm = TRUE),
          # range
          dif = max(gini, na.rm = TRUE) - min(gini, na.rm = TRUE)
          )
```

```
## Source: local data frame [176 x 7]
## 
##         Country   obs med_t10      sdev   n30      sdn30   dif
##           (chr) (int)   (dbl)     (dbl) (int)      (dbl) (dbl)
## 1   Afghanistan    40      NA       NaN     0         NA    NA
## 2       Albania    40  24.435  1.252524     3  0.3642801  2.78
## 3       Algeria    40  29.780  3.436539     2  3.4365390  4.86
## 4        Angola    40  38.555 11.299566     2 11.2995664 15.98
## 5     Argentina    40  36.320  3.182462    23  3.1824622 11.00
## 6       Armenia    40  27.835  4.019532    12  3.9567778 14.84
## 7     Australia    40  24.785  1.075089     6  1.0750891  2.81
## 8       Austria    40  23.120  3.120849     4  0.6859300  8.48
## 9    Azerbaijan    40  17.960  9.479029     3  1.7386489 20.27
## 10 Bahamas, The    40      NA       NaN     0         NA    NA
## ..          ...   ...     ...       ...   ...        ...   ...
```

To showcase the power of `summarise` used on a `grouped_df`, the above code reports a wide range of customised summary statistics *per country*:

- the number of rows in each country group
- the median proportion of income earned by the top 10%
- the standard deviation of gini indices
- the number of years in which the gini index was greater than 30
- the standard deviation of gini index values over 30
- the range of gini index values reported for each country.
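For comparison, a grouped summary like the first one above can be sketched in base R with `tapply()`, shown here on a hypothetical toy data frame rather than the real `idata`:

```r
# Toy stand-in for idata (hypothetical values)
d <- data.frame(Country = c("Albania", "Albania", "Algeria"),
                gini    = c(30.0, 30.8, NA))

# Mean gini per country, ignoring missing values;
# a group with no non-missing values yields NaN, as in the dplyr output
means <- tapply(d$gini, d$Country, mean, na.rm = TRUE)
```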

Challenge: explore **dplyr**'s documentation, starting with the introductory vignette, accessed by entering `vignette("introduction")`, and test out its capabilities on the `idata` dataset. (More vignette names can be discovered by typing `vignette(package = "dplyr")`.)

Another interesting feature of **dplyr** is its ability to chain operations together. This overcomes one of the aesthetic issues with R code: you can end up with very long commands, with many functions nested inside each other, to answer relatively simple questions.

What were, on average, the 5 most unequal years for countries containing the letter g?

Here’s how chains work to organise the analysis in a logical step-by-step manner:

```
idata %>%
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)
```

`## Selecting by gini`

```
## Source: local data frame [5 x 2]
##
## Year gini
## (int) (dbl)
## 1 1980 46.850
## 2 1993 45.996
## 3 2013 44.550
## 4 1981 43.650
## 5 2012 43.560
```
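The same step-by-step logic can be sketched in base R using the native pipe (`|>`) and lambda shorthand (`\(x)`), both available from R 4.1, shown here on a hypothetical toy data frame rather than the real `idata`:

```r
# Hypothetical toy data: Angola and Hungary contain a lowercase "g"
d <- data.frame(
  Country = rep(c("Angola", "France", "Hungary"), each = 2),
  Year    = rep(c(1990, 1991), times = 3),
  gini    = c(50, 52, 30, 31, 40, 42)
)

most_unequal <- d |>
  subset(grepl("g", Country)) |>               # keep Angola and Hungary
  (\(x) aggregate(gini ~ Year, x, mean))() |>  # mean gini per year
  (\(x) x[order(-x$gini), ])() |>              # sort, most unequal first
  head(n = 1)                                  # top year only
```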

The above chain consists of six stages, each of which corresponds to a new line and **dplyr** function:

- Filter out the countries we're interested in (any selection criteria could be used in place of `grepl("g", Country)`).
- Group the output by year.
- Summarise, for each year, the mean gini index.
- Arrange the results by average gini index.
- Select only the top 5 most unequal years.

To see why this method is preferable to the nested function approach, take a look at the latter. Even after indenting properly it looks terrible and is almost impossible to understand!

```
top_n(
  arrange(
    summarise(
      group_by(
        filter(idata, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm = TRUE)),
    desc(gini)),
  n = 5)
```

This section has provided only a taster of what is possible with **dplyr** and why it makes sense from both code-writing and computational-efficiency perspectives. For a more detailed account of data processing with R using this approach we recommend *R for Data Science* (Grolemund and Wickham 2016).

**data.table** is a mature package for fast data processing that presents an alternative to **dplyr**. There is some controversy about which is more appropriate for different tasks^{19}, so it should be stated at the outset that **dplyr** and **data.table** are not mutually exclusive competitors: neither must be 'better' than the other. Both are excellent packages, and the important thing from an efficiency perspective is that they can help speed up data processing tasks.

**data.table** does, however, have some unique features that allow it to accomplish certain tasks very efficiently, and these are worth being aware of. Building on the `filter()` example above, we'll see **data.table**'s unique approach to subsetting.

```
library(data.table)
idata = readRDS("data/idata-renamed.Rds")
idata_dt = data.table(idata)   # convert to data.table
setkey(idata_dt, Country)      # set the key: sorts the table by Country
aus3 = idata_dt["Australia", ] # fast keyed subset on Country
```
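The keyed lookup returns the same rows as a base R logical subset; the difference is that `setkey()` sorts the table so **data.table** can use a binary search rather than scanning every row. A base R sketch on a hypothetical toy data frame:

```r
# Hypothetical toy data frame standing in for idata
idata_toy <- data.frame(Country = c("Australia", "Austria", "Australia"),
                        gini    = c(33.1, 29.2, 34.0))

# Base R equivalent of idata_dt["Australia", ]: a full vector scan of
# the Country column, versus data.table's binary search on the sorted key
aus_base <- idata_toy[idata_toy$Country == "Australia", ]
```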

Figure 5.4 illustrates the speed benefits of **data.table** for different dataset sizes. As with the **readr** vs base R comparison, the results show that the relative benefits of **data.table** improve with dataset size.