6 Efficient data carpentry

There are many words for data processing. You can clean, hack, manipulate, munge, refine and tidy your dataset, ready for the next stage, typically modelling and visualisation. Each word says something about perceptions towards the process: data processing is often seen as dirty work, an unpleasant necessity that must be endured before the real, fun and important work begins. This perception is wrong. Getting your data ‘ship shape’ is a respectable and in some cases vital skill. For this reason we use the more admirable term data carpentry.

The metaphor is not accidental. Carpentry is the process of taking rough pieces of wood and working with care, diligence and precision to create a finished product. A carpenter does not hack at the wood at random. He or she will inspect the raw material and select the right tool for the job. In the same way data carpentry is the process of taking rough, raw and to some extent randomly arranged input data and creating neatly organised and tidy data. Learning the skill of data carpentry early will yield benefits for years to come. “Give me six hours to chop down a tree and I will spend the first four sharpening the axe” as Abraham Lincoln put it, continuing the metaphor.

Data processing is a critical stage in any project involving datasets from external sources, i.e. most real world applications. In the same way that technical debt, discussed in the previous chapter, can cripple your workflow, working with messy data can lead to project management hell.

Fortunately, done efficiently, at the outset of your project (rather than half way through, when it may be too late), and using appropriate tools, this data processing stage can be highly rewarding. More importantly from an efficiency perspective, working with clean data will be beneficial for every subsequent stage of your R project. So, for data intensive applications, this could be the most important chapter of the book. In it we cover the following topics:

  • Tidying data with tidyr
  • Processing data with dplyr
  • Working with databases
  • Data processing with data.table

6.1 Top 5 tips for efficient data carpentry

  • Give data processing the proper time and attention it needs as it could save hours of frustration in the long run.

  • Tidy your data carefully at the earliest stage of the analysis process, perhaps using functions provided by tidyr.

  • Use the tbl_df data frame class defined by the tibble package, which is also the default object type of dplyr, to make data frames efficient to print and easy to work with.

  • Don’t rely on base R for data processing. We recommend dplyr for most applications, although data.table may be optimal in contexts where speed is critical.

  • Using the %>% ‘pipe’ operator in combination with dplyr’s verbs can help clarify complex data processing workflows, easing the writing of analysis code and communication with others.

6.2 Efficient data frames with tibble

tibble is a package that defines a new data frame class for R, the tbl_df. These ‘tibble diffs’ (as their inventor suggests they should be pronounced) are like the base class data.frame, but with more user friendly printing, subsetting, and factor handling.

A tibble data frame is an S3 object with three classes, tbl_df, tbl, and data.frame. Since the object has the data.frame tag, this means that if a tbl_df or tbl method isn’t available, the object will be passed on to the appropriate data.frame function.

To create a tibble data frame, we use the tibble() function:

library("tibble")
tibble(x = 1:3, y = c("A", "B", "C"))
#> # A tibble: 3 x 2
#>       x     y
#>   <int> <chr>
#> 1     1     A
#> 2     2     B
#> 3     3     C

The example above illustrates the main differences between the tibble and base R approach to data frames:

  • When printed, the tibble diff reports the class of each variable. data.frame objects do not.
  • Character vectors are not coerced into factors when they are incorporated into a tbl_df, as can be seen by the <chr> heading between the variable name and the second column. By contrast, data.frame() coerces characters into factors which can cause problems further down the line.
  • When printing a tibble diff to screen, only the first ten rows are displayed. The number of columns printed depends on the window size.

Other differences can be found in the associated help page - help("tibble").
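We can confirm the class structure described at the start of this section by inspecting a tibble directly (a quick check, not part of the original text):

tb = tibble(x = 1:3, y = c("A", "B", "C"))
class(tb) # returns "tbl_df", "tbl" and "data.frame", in that order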

You can create a tibble data frame row-by-row using the frame_data function.
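A minimal sketch of this row-wise approach (assuming the tibble package is still loaded; column headings are given as formulas, followed by the values row by row):

frame_data(
  ~colA, ~colB,
  "a",   1,
  "b",   2
)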

Exercise

Create the following data frame

df_base = data.frame(colA = "A")

Try and guess the output of the following commands

print(df_base)
df_base$colA
df_base$col
df_base$colB

Now create a tibble data frame and repeat the above commands.

6.3 Tidying data with tidyr

A key skill in data analysis is understanding the structure of datasets and being able to ‘reshape’ them. This is important from a workflow efficiency perspective: more than half of a data analyst’s time can be spent re-formatting datasets (H. Wickham 2014b), so getting it into a suitable form early could save hours in the future. Converting data into a ‘tidy’ form is also advantageous from a computational efficiency perspective: it is usually faster to run analysis and plotting commands on tidy data.

Data tidying includes data cleaning and data reshaping. Data cleaning is the process of re-formatting and labelling messy data. Packages including stringi and stringr can help update messy character strings using regular expressions; assertive and assertr packages can perform diagnostic checks for data integrity at the outset of a data analysis project. A common data cleaning task is the conversion of non-standard text strings into date formats as described in the lubridate vignette (see vignette("lubridate")). Tidying is a broader concept, however, and also includes re-shaping data so that it is in a form more conducive to data analysis and modelling. The process of reshaping is illustrated by Tables 6.1 and 6.2, provided by H. Wickham (2014b).

These tables may look different, but they contain precisely the same information. Column names in the ‘wide’ form in Table 6.1 became a new variable in the ‘long’ form in Table 6.2. According to the concept of ‘tidy data’, the long form is correct. Note that ‘correct’ here is used in the context of data analysis and graphical visualisation. For tabular presentation the wide form may be better. Wide data can also be less memory consuming because it requires fewer cells.

Tidy data has the following characteristics (H. Wickham 2014b):

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Because there is only one observational unit in the example (religions), it can be described in a single table. Large and complex datasets are usually represented by multiple tables, with unique identifiers or ‘keys’ to join them together (Codd 1979).

Two common operations facilitated by tidyr are gathering and splitting columns.

  • Gathering: this means making ‘wide’ tables ‘long’ by converting column names to a new variable. This is done with the function gather() (the inverse of which is spread()), as illustrated in Tables 6.1 and 6.2 and in the code block below:
library("tidyr")
library("readr")
raw = read_csv("data/pew.csv") # read in the 'wide' dataset
dim(raw)
#> [1] 18 10
rawt = gather(raw, Income, Count, -religion)
dim(rawt)
#> [1] 162   3
rawt[1:3,]
#> # A tibble: 3 x 3
#>   religion Income Count
#>      <chr>  <chr> <int>
#> 1 Agnostic  <$10k    27
#> 2  Atheist  <$10k    12
#> 3 Buddhist  <$10k    27

Note that the dimensions of the data change from 18 rows and 10 columns to 162 rows and only 3 columns. Note also that when we print the object rawt[1:3,], the class of each variable is given (chr, fctr and int refer to character, factor and integer classes, respectively). This is because readr creates tbl objects from the tibble package.

Table 6.1: First 6 rows of the aggregated ‘pew’ dataset from Wickham (2014b) in an ‘untidy’ form.

religion   <$10k   $10–20k   $20–30k
Agnostic      27        34        60
Atheist       12        27        37
Buddhist      27        21        30

Table 6.2: Long form of the Pew dataset represented above.

religion   Income    Count
Agnostic   <$10k        27
Atheist    <$10k        12
Buddhist   <$10k        27
Agnostic   $10–20k      34
Atheist    $10–20k      27
Buddhist   $10–20k      21
Agnostic   $20–30k      60
Atheist    $20–30k      37
Buddhist   $20–30k      30

  • Splitting: this means taking a variable that is really two variables combined and creating two separate columns from it. A classic example is age-sex variables (e.g. m0-10 and f0-10 to represent males and females in the 0 to 10 age band). Splitting such variables can be done with the separate() function, as illustrated in Tables 6.3 and 6.4.
agesex = c("m0-10", "f0-10") # create compound variable
n = c(3, 5) # create a value for each observation
agesex_df = data.frame(agesex, n) # create a data frame
separate(agesex_df, agesex, c("sex", "age"), 1)
#>   sex  age n
#> 1   m 0-10 3
#> 2   f 0-10 5
Table 6.3: Joined age and sex variables in one column.

agesex   n
m0-10    3
f0-10    5

Table 6.4: Age and sex variables separated by the function separate.

sex   age    n
m     0-10   3
f     0-10   5

There are other tidying operations that tidyr can perform, as described in the package’s vignette (vignette("tidy-data")). The wider issue of data manipulation is a large topic with major potential implications for efficiency (Spector 2008) and this section only covers some of the key operations. More important is understanding the principles behind converting messy data into standard output forms. These same principles can also be applied to the representation of model results: the broom package provides a standard output format for model results, easing interpretation (see the broom vignette).
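As a brief illustration (not an example from the original text), broom’s tidy() function converts a model object into a tidy data frame:

library("broom")
mod = lm(mpg ~ wt, data = mtcars) # a simple linear model on a built-in dataset
tidy(mod) # coefficients, standard errors and p-values as a data frame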

Note that the above code used non-standard evaluation, meaning that variable names are not surrounded by quote marks, for ease of typing and autocompletion. This is fine when using R interactively. But when you’d like to use R non-interactively, code is generally more robust using standard evaluation.

Usually it is more efficient to use the non-standard evaluation version of variable names, as these can be auto-completed by RStudio. In some cases you may want to use standard evaluation and refer to variable names using quote marks. To do this, the suffix _ can be added to dplyr and tidyr function names to allow the use of standard evaluation. Thus the standard evaluation version of separate(agesex_df, agesex, c("sex", "age"), 1) is separate_(agesex_df, "agesex", c("sex", "age"), 1).

6.4 Efficient data processing with dplyr

After tidying your data, the next stage is generally data processing. This includes the creation of new data, for example a new column that is some function of existing columns, or data analysis, the process of asking directed questions of the data and exporting the results in a user-readable form.

Following the advice in Section 4.4, we have carefully selected an appropriate package for these tasks: dplyr, which roughly means ‘data pliers’. dplyr has a number of advantages over base R and data.table approaches to data processing:

  • dplyr is fast to run (due to its C++ backend) and intuitive to type
  • dplyr works well with tidy data, as described above
  • dplyr works well with databases, providing efficiency gains on large datasets

Furthermore, dplyr is efficient to learn (see Chapter 10). It has a small number of intuitively named functions, or ‘verbs’. These can be used in combination with each other, and the grouping function group_by(), to solve a wide range of data processing challenges (see Table 6.5 and the short example that follows it).

Table 6.5: dplyr verb functions.

dplyr function(s)    Description                                                 Base R functions
filter(), slice()    Subset rows by attribute (filter) or position (slice)       subset(), []
arrange()            Return data ordered by variable(s)                          sort()
select()             Subset columns                                              subset()
rename()             Rename columns                                              names()
distinct()           Return unique rows                                          unique()
mutate()             Create new variables (transmute drops existing variables)   [[]]
summarise()          Collapse data into a single row                             aggregate()
sample_n()           Return a sample of the data                                 sample()
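To give a flavour of this correspondence, the following minimal example (using the built-in mtcars dataset, not data from the book) performs the same row subsetting and ordering operations in base R and in dplyr:

library("dplyr")
# Subset rows where cyl equals 4: base R then dplyr
mtcars[mtcars$cyl == 4, ]
filter(mtcars, cyl == 4)
# Order rows by mpg: base R then dplyr
mtcars[order(mtcars$mpg), ]
arrange(mtcars, mpg)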

Unlike the base R analogues, dplyr’s data processing functions work in a consistent way. Each function takes a data frame object as its first argument and returns another data frame. Variables can be called directly without using the $ operator. dplyr was designed to be used with the ‘pipe’ operator %>% provided by the magrittr package, allowing each data processing stage to be represented as a new line. This is illustrated in the code chunk below, which loads a tidy country-level dataset of greenhouse gas emissions from the efficient package, and then identifies the countries with the greatest absolute growth in emissions from 1971 to 2012:

library("dplyr")
data("ghg_ems", package = "efficient")
top_table =
  ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))

The results, illustrated in Table 6.6, show that the USA has the highest growth and average emissions from the transport sector, followed closely by China. The aim of this code chunk, which you are not expected to understand completely, is to provide a taster of dplyr’s unique syntax.

Table 6.6: The top 3 countries in terms of average CO2 emissions from transport since 1971, and growth in transport emissions over that period (MTCO2e/yr).

Country         Mean   Growth
United States   1462      709
China            214      656
India             85      170

Building on the ‘learning by doing’ ethic, the remainder of this section works through these functions to process and begin to analyse a dataset on economic equality provided by the World Bank. The input dataset can be loaded as follows:

fname = system.file("extdata/world-bank-ineq.csv", package = "efficient")
idata = read_csv(fname)
idata # print the dataset 
#> # A tibble: 6,925 x 9
#>   Country Name Country Code  Year Year Code
#>          <chr>        <chr> <int>     <chr>
#> 1  Afghanistan          AFG  1974    YR1974
#> 2  Afghanistan          AFG  1975    YR1975
#> 3  Afghanistan          AFG  1976    YR1976
#> 4  Afghanistan          AFG  1977    YR1977
#> # ... with 6,921 more rows, and 5 more variables: Income share held by
#> #   highest 10% [SI.DST.10TH.10] <chr>, Income share held by lowest 10%
#> #   [SI.DST.FRST.10] <chr>, GINI index [SI.POV.GINI] <chr>, Survey mean
#> #   consumption or income per capita, bottom 40% (2005 PPP $ per day)
#> #   [SI.SPR.PC40] <chr>, Survey mean consumption or income per capita,
#> #   total population (2005 PPP $ per day) [SI.SPR.PCAP] <chr>

You should not expect to master dplyr’s functionality in one sitting: the package is large and can be seen as a language in its own right. Following the ‘walk before you run’ principle, we’ll start simple, by filtering and aggregating rows.

6.4.1 Renaming columns

Renaming data columns is a common task that can make writing code faster by using short, intuitive names. The dplyr function rename() makes this easy.

Note in this code block that the variable name is surrounded by back-quotes (`). This allows R to refer to column names that are non-standard. Note also the syntax: rename() takes the data frame as the first argument and then creates new variables by specifying new_variable_name = original_name.

library("dplyr")
idata = rename(idata, Country = `Country Name`)

To rename multiple columns the variable names are simply separated by commas. Both the dplyr and base R ways of doing this are illustrated below for clarity.

# The dplyr way (rename two variables)
idata = rename(idata,
 top10 = `Income share held by highest 10% [SI.DST.10TH.10]`,
 bot10 = `Income share held by lowest 10% [SI.DST.FRST.10]`)
# The base R way (rename five variables)
names(idata)[5:9] = c("top10", "bot10", "gini", "b40_cons", "gdp_percap")

6.4.2 Changing column classes

The class of R objects is critical to performance. If a class is incorrectly specified (e.g. if numbers are treated as factors or characters) this will lead to incorrect results. The class of all columns in a data.frame can be queried using the function str() (short for ‘structure’), as illustrated below, with the inequality data loaded previously.14

str(idata)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    6925 obs. of  9 variables:
#>  $ Country     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ Country Code: chr  "AFG" "AFG" "AFG" "AFG" ...
#>  $ Year        : int  1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 ...
#>  $ Year Code   : chr  "YR1974" "YR1975" "YR1976" "YR1977" ...
#>  $ top10       : chr  ".." ".." ".." ".." ...
#>  $ bot10       : chr  ".." ".." ".." ".." ...
#>  $ gini        : chr  ".." ".." ".." ".." ...
#>  $ b40_cons    : chr  ".." ".." ".." ".." ...
#>  $ gdp_percap  : chr  ".." ".." ".." ".." ...

This shows that although we loaded the data correctly, all columns are seen by R as characters. This means we cannot perform numerical calculations on the dataset: mean(idata$gini) fails.

Visual inspection of the data (e.g. via View(idata)) clearly shows that all columns except for 1 to 4 (Country, Country Code, Year and Year Code) should be numeric. We can re-assign the classes of the numeric variables one-by-one:

idata$gini = as.numeric(idata$gini)
#> Warning: NAs introduced by coercion
mean(idata$gini, na.rm = TRUE) # now the mean is calculated
#> [1] 40.5

However, the purpose of programming languages is to automate tasks and reduce typing. The following code chunk re-classifies all of the numeric variables using data.matrix(), which converts the data frame to a numeric matrix:

id = 5:9 # column ids to change
idata[id] = data.matrix(idata[id])
vapply(idata, class, character(1))
#>      Country Country Code         Year    Year Code        top10 
#>  "character"  "character"    "integer"  "character"    "numeric" 
#>        bot10         gini     b40_cons   gdp_percap 
#>    "numeric"    "numeric"    "numeric"    "numeric"

As is so often the case with R, there are many ways to solve the problem. Below is a one-liner using unlist() which converts list objects into vectors:

idata[id] = as.numeric(unlist(idata[id]))

Another one-liner to achieve the same result uses dplyr’s mutate_each function:

idata_mutate = mutate_each(idata, "as.numeric", id)

As with other operations there are other ways of achieving the same result in R, including the use of loops via apply() and for(). These are shown in the chapter’s source code.
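One possible loop-based version (an illustrative sketch, not necessarily the code in the chapter’s source) is:

for(i in id) {
  idata[[i]] = as.numeric(idata[[i]]) # convert each selected column in turn
}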

6.4.3 Filtering rows

dplyr offers an alternative and more flexible way of filtering data, using filter().

# Base R: idata[idata$Country == "Australia",]
aus2 = filter(idata, Country == "Australia")

In addition to being more flexible (see ?filter), filter is slightly faster than base R’s notation.15 Note that dplyr does not use the $ symbol: it knows that Country is a variable of idata. The first argument of dplyr functions is usually a data frame, and subsequent arguments (in this context, variable names) can be treated as vector objects.16

There are dplyr equivalents of many base R functions but these usually work slightly differently. The dplyr equivalent of aggregate, for example, is to use the grouping function group_by in combination with the general purpose function summarise (not to be confused with summary in base R), as we shall see in Section 6.4.5. For consistency, however, we next look at filtering columns.

6.4.4 Filtering columns

Large datasets often contain much worthless or blank information. This consumes RAM and reduces computational efficiency. Being able to focus quickly only on the variables of interest becomes especially important when handling large datasets.
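As an aside (not covered at this point in the original text), dplyr’s select() function, listed in Table 6.5, is the standard way to keep only the columns you are interested in. Using the renamed idata dataset from the previous sections, for example:

select(idata, Country, Year, gini, gdp_percap) # keep only four columns of interest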

Imagine that we have a text file called miniaa which is large enough to consume most of your computer’s RAM. We can load it with the following command:

data(miniaa, package="efficient") # load imaginary large data
dim(miniaa)
#> [1]   9 329

Note that the data frame has 329 columns, and imagine it has millions of rows instead of 9. That’s a lot of variables. Do we need them all? It’s worth taking a glimpse at this dataset to find out:

glimpse(miniaa)
#> Observations: 9
#> Variables: 329
#> $ NPI          
#>     <int> ...
#> ...

The majority of variables in this dataset only contain NA. To clean the giant dataset, removing the empty columns, we need to identify which variables these are.

# Identify the variables which are entirely NA
all_na = vapply(miniaa, function(x) all(is.na(x)), logical(1))
summary(all_na) # summary of the results
#>    Mode   FALSE    TRUE    NA's 
#> logical     321       8       0
miniaa1 = miniaa[!all_na] # subset the dataframe

The new miniaa object has fewer than a third of the original columns.

6.4.5 Data aggregation

Data aggregation is the process of creating summaries of data based on a grouping variable. The end result usually has the same number of rows as there are groups. Because aggregation is a way of condensing datasets it can be a very useful technique for making sense of large datasets. The following code finds the number of unique countries (country being the grouping variable) from the ghg_ems dataset stored in the efficient package.

# data available online, from github.com/csgillespie/efficient_pkg
data(ghg_ems, package = "efficient")
names(ghg_ems)
#> [1] "Country"        "Year"           "Electricity"    "Manufacturing" 
#> [5] "Transportation" "Other"          "Fugitive"
nrow(ghg_ems)
#> [1] 7896
length(unique(ghg_ems$Country))
#> [1] 188

Note that while there are almost 8000 rows, there are fewer than 200 countries.

Note that the ghg_ems column names were cleaned using the command word(names(ghg_ems), sep = " |/") from the stringr package before it was saved as a dataset within the efficient package. " |/" in this context means “any space or forward slash is a word break”. See the Pattern matching section in vignette("stringr") for more on this. For details about the ghg_ems dataset, see the documentation in ?efficient::ghg_ems.

To aggregate the dataset using the dplyr package, you divide the task in two: first group the dataset and then summarise it, as illustrated below.17

library("dplyr")
group_by(ghg_ems, Country) %>%
  summarise(mean_eco2 = mean(Electricity, na.rm  = TRUE))
#> # A tibble: 188 x 2
#>       Country mean_eco2
#>         <chr>     <dbl>
#> 1 Afghanistan       NaN
#> 2     Albania     0.641
#> 3     Algeria    23.015
#> 4      Angola     0.791
#> # ... with 184 more rows

The example above relates to a philosophical question in programming: how much work should one function do? The Unix philosophy states that programs should “do one thing well”. And shorter functions are easier to understand and debug. But having too many functions can also make your call stack confusing, and the code hard to maintain. In general, being modular and specific is advantageous for clarity, and this modular approach is illustrated in the above example with the dual group_by and summarise stages.

To reinforce the point, this operation is also performed below on the idata dataset:

data(idata_mutate, package="efficient")
countries = group_by(idata_mutate, Country)
summarise(countries, gini = mean(gini, na.rm  = TRUE))
#> # A tibble: 176 x 2
#>       Country  gini
#>         <chr> <dbl>
#> 1 Afghanistan   NaN
#> 2     Albania  30.4
#> 3     Algeria  37.8
#> 4      Angola  50.6
#> # ... with 172 more rows

Note that summarise is highly versatile, and can be used to return a customised range of summary statistics:

summarise(countries,
  # number of rows per country
  obs = n(), 
  med_t10 = median(top10, na.rm  = TRUE),
  # standard deviation
  sdev = sd(gini, na.rm  = TRUE), 
  # number with gini > 30
  n30 = sum(gini > 30, na.rm  = TRUE), 
  sdn30 = sd(gini[ gini > 30 ], na.rm  = TRUE),
  # range
  dif = max(gini, na.rm  = TRUE) - min(gini, na.rm  = TRUE)
  )
#> # A tibble: 176 x 7
#>       Country   obs med_t10  sdev   n30  sdn30   dif
#>         <chr> <int>   <dbl> <dbl> <int>  <dbl> <dbl>
#> 1 Afghanistan    40      NA   NaN     0     NA    NA
#> 2     Albania    40    24.4  1.25     3  0.364  2.78
#> 3     Algeria    40    29.8  3.44     2  3.437  4.86
#> 4      Angola    40    38.6 11.30     2 11.300 15.98
#> # ... with 172 more rows

To showcase the power of summarise used on a grouped_df, the above code reports a wide range of customised summary statistics per country:

  • the number of rows in each country group
  • standard deviation of gini indices
  • median proportion of income earned by the top 10%
  • the number of years in which the gini index was greater than 30
  • the standard deviation of gini index values over 30
  • the range of gini index values reported for each country.

Exercises

  1. Refer back to the greenhouse gas emissions example at the outset of section 6.4, in which we found the top 3 countries in terms of emissions growth in the transport sector. a) Explain in words what is going on in each line. b) Try to find the top 3 countries in terms of emissions in 2012 - how is the list different?

  2. Referring back to Section 6.4.1, rename the variables 4 to 8 using the dplyr function rename. Follow the pattern ECO2, MCO2 etc.

  3. Explore dplyr’s documentation, starting with the introductory vignette, accessed by entering vignette("introduction").

  4. Test additional dplyr ‘verbs’ on the idata dataset. (More vignette names can be discovered by typing vignette(package = "dplyr").)

6.4.6 Chaining operations

Another interesting feature of dplyr is its ability to chain operations together. This overcomes one of the aesthetic issues with R code: you can end up with very long commands with many functions nested inside each other to answer relatively simple questions.

What were, on average, the 5 most unequal years for countries containing the letter g?

Here’s how chains work to organise the analysis in a logical step-by-step manner:

idata_mutate %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)
#> Selecting by gini
#> # A tibble: 5 x 2
#>    Year  gini
#>   <int> <dbl>
#> 1  1980  46.9
#> 2  1993  46.0
#> 3  2013  44.5
#> 4  1981  43.6
#> # ... with 1 more rows

The above chain consists of five stages following the input data, each of which corresponds to a new line and dplyr function:

  1. Filter the rows down to the countries we’re interested in (any selection criteria could be used in place of grepl("g", Country)).
  2. Group the output by year.
  3. Summarise, for each year, the mean gini index.
  4. Arrange the results by average gini index.
  5. Select only the top 5 most unequal years.

To see why this method is preferable to the nested function approach, take a look at the latter. Even after indenting properly it looks terrible and is almost impossible to understand!

top_n(
  arrange(
    summarise(
      group_by(
        filter(idata, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)

This section has provided only a taster of what is possible with dplyr and why it makes sense from both code writing and computational efficiency perspectives. For a more detailed account of data processing with R using this approach we recommend R for Data Science (Grolemund and Wickham, n.d.).

6.5 Combining datasets

The usefulness of a dataset can sometimes be greatly enhanced by combining it with other data. If we could merge the global ghg_ems dataset with geographic data, for example, we could visualise the spatial distribution of climate pollution. For the purposes of this section we join ghg_ems to the world data provided by ggmap to illustrate the concepts and methods of data joining (also referred to as merging).

library("ggmap")
world = map_data("world")
names(world)
#> [1] "long"      "lat"       "group"     "order"     "region"    "subregion"

Visually compare this new dataset of the world with ghg_ems (e.g. via View(world); View(ghg_ems)). It is clear that the column region in the former contains the same information as Country in the latter. This will be the joining variable; renaming it in world will make the join more efficient.

world = rename(world, Country = region)
ghg_ems$All = rowSums(ghg_ems[3:7])

Ensure that both joining variables have the same class (combining character and factor columns can cause havoc).
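A quick way to check this (a minimal sketch, not from the book’s code) is to compare the classes directly:

class(world$Country) # should be "character"
class(ghg_ems$Country) # if this is a factor, coerce it with as.character() before joining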

How large is the overlap between ghg_ems$Country and world$Country? We can find out using the %in% operator, which tests whether each element of one vector is present in another. Specifically, we will find out how many unique country names from ghg_ems are present in the world dataset:

# save the result as 'm', for match
c_u = unique(ghg_ems$Country)
w_u = unique(world$Country)
summary({m = c_u %in% w_u})
#>    Mode   FALSE    TRUE    NA's 
#> logical      20     168       0

This comparison exercise has been fruitful: most of the countries in the ghg_ems dataset exist in the world dataset. But what about the 20 country names that do not match? We can identify these as follows:

(n = c_u[!m]) # country names in ghg_ems with no match in the world data
#>  [1] "Antigua & Barbuda"          "Bahamas, The"              
#>  [3] "Bosnia & Herzegovina"       "Congo, Dem. Rep."          
#>  [5] "Congo, Rep."                "Cote d'Ivoire"             
#>  [7] "European Union (15)"        "European Union (28)"       
#>  [9] "Gambia, The"                "Korea, Dem. Rep. (North)"  
#> [11] "Korea, Rep. (South)"        "Macedonia, FYR"            
#> [13] "Russian Federation"         "Saint Kitts & Nevis"       
#> [15] "Saint Vincent & Grenadines" "Sao Tome & Principe"       
#> [17] "Trinidad & Tobago"          "United Kingdom"            
#> [19] "United States"              "World"

It is clear from the output that some of the non-matches (e.g. the European Union) are not countries at all. However, others, such as ‘Gambia, The’ and the United States, clearly should have matches. Fuzzy matching can help to rectify this, as illustrated for the first non-matching country below:

grep(n[1], w_u) # no match
#> integer(0)
(fm = agrep(n[1], w_u, max.distance = 10))
#> [1] 15 16
(w_u1 = w_u[fm])
#> [1] "Antigua" "Barbuda"

What just happened? We verified with grep that the first non-matching country in the ghg_ems dataset was not present in the world country names. So we used the more powerful agrep to search for fuzzy matches (with the max.distance argument set to 10 after trial and error). The results show that the country Antigua & Barbuda from the ghg_ems data is matched by two entries in the world dataset. We can update the names in the dataset we are joining to accordingly:

world$Country[world$Country %in% w_u1] = n[1]

Fuzzy matching is still a laborious process that must be complemented by human judgement. It takes a human to know for sure that United States is represented as USA in the world dataset, without risking false matches via agrep. This can be fixed manually:

world$Country[world$Country == "USA"] = "United States"

To fix the remaining issues, we simply continued with the same method, using a for loop and verifying the results rather than doing everything by hand. The code used to match the remaining unmatched countries can be seen on the book’s GitHub page.
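The sketch below illustrates the general approach (hypothetical code, not the book’s actual implementation): loop over the remaining non-matching names, print the fuzzy-match candidates and verify each one by eye before renaming anything.

for(i in seq_along(n)) {
  candidates = agrep(n[i], w_u, max.distance = 10, value = TRUE)
  message(n[i], " -> ", paste(candidates, collapse = ", "))
}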

There is one more stage that is needed before global CO2 emissions can be mapped for any year: the data must be joined. The base function merge can do this but we strongly recommend using one of the join functions from dplyr, such as left_join (which keeps all rows in the original dataset) and inner_join (which keeps only rows with matches in both datasets), as illustrated below:

nrow({world_co2 = left_join(world, ghg_ems)})
#> Joining, by = "Country"
#> [1] 3330794
nrow(inner_join(world, ghg_ems))
#> Joining, by = "Country"
#> [1] 3310272

Note that inner_join removes rows from the world dataset which have no match in ghg_ems: if we were to plot the resulting dataset, the continent of Antarctica and a number of countries not represented in the ghg_ems dataset would be absent. Figure 6.1, produced using a modified version of the ggplot2 code below, shows that the results of this data carpentry were worth the effort.

world_co2_2012 = filter(world_co2, Year == 2012 | is.na(Year))
ggplot(world_co2_2012, aes(long, lat)) +
  geom_polygon(aes(fill = All, group = group))

Figure 6.1: The geographical distribution of carbon dioxide emissions in 2012.

6.6 Working with databases

Instead of loading all the data into RAM, as R does, databases query data from the hard-disk. This can allow a subset of a very large dataset to be defined and read into R quickly, without having to load it first. R can connect to databases in a number of ways, which are briefly touched on below. Databases are a large subject area undergoing rapid evolution. Rather than aiming at comprehensive coverage, we will provide pointers to developments that enable efficient access to a wide range of database types. An up-to-date history of R’s interfaces to databases can be found in the README of the DBI package, which provides a common interface and set of classes for driver packages (such as RSQLite).

RODBC is a veteran package for querying external databases from within R, using the Open Database Connectivity (ODBC) API. Its functionality is described in the package’s vignette (see vignette("RODBC")) and it connects to ‘traditional’ databases such as MySQL, PostgreSQL, Oracle and SQLite. More recently, the DBI package has created a unified structure for accessing databases, allowing drivers for specific database types to be added as modular packages. Packages that build on DBI, such as RMySQL, RPostgreSQL and RSQLite, can thus be seen as replacements for RODBC (see vignette("backend") for more on how DBI drivers work). Because the DBI syntax applies to a wide range of database types we use it here with a worked example.

Imagine you have access to a database that contains the ghg_ems data set.

# Connect to a database driver
library("RSQLite")
con = dbConnect(SQLite(), dbname = ghg_db) # Also username & password arguments
dbListTables(con)
rs = dbSendQuery(con, "SELECT * FROM `ghg_ems` WHERE (`Country` != 'World')")
df_head = dbFetch(rs, n = 6) # extract the first 6 rows

The above code chunk shows how the function dbConnect connects to an external database, in this case an SQLite database (for other database types, username and password arguments would also be supplied to establish the connection). Next we query which tables are available with dbListTables, query the database (without yet extracting the results to R) with dbSendQuery and, finally, load the results into R with dbFetch.

Be sure never to release your password by entering it directly into the command. Instead, we recommend saving sensitive information such as database passwords and API keys in .Renviron, described in Chapter 2. Assuming you had saved your password as the environment variable PSWRD, you could enter pwd = Sys.getenv("PSWRD") to minimise the risk of exposing your password through accidentally releasing the code or your session history.
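A hedged sketch of how this might look in practice (the variable name PSWRD and the connection details below are illustrative, not from the book):

# Assumes ~/.Renviron contains a line such as: PSWRD=your_database_password
# (restart R after editing .Renviron so the variable is picked up)
pwd = Sys.getenv("PSWRD")
# con = dbConnect(RMySQL::MySQL(), dbname = "ghg", user = "me",
#                 password = pwd, host = "localhost") # illustrative only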

Recently there has been a shift to the ‘noSQL’ approach for storing large datasets. This is illustrated by the emergence and uptake of software such as MongoDB and Apache Cassandra, which have R interfaces via the mongolite and RJDBC packages (the latter can connect to Apache Cassandra data stores and to any source compliant with the Java Database Connectivity (JDBC) API).

MonetDB is a recent alternative to traditional and noSQL approaches which offers substantial efficiency advantages for handling large datasets (Kersten et al. 2011). A tutorial on the MonetDB website provides an excellent introduction to handling databases from within R.

There are many wider considerations in relation to databases: who will manage and maintain the database? How will it be backed up locally (local copies should be stored to reduce reliance on the network)? What is the appropriate database for your project? These issues can have major implications for efficiency, especially on large, data intensive projects. However, we will not cover them here because they stray too far from the R ecosystem and into the universe of databases. Instead, we direct the interested reader towards further resources on the subject, including:

  • db-engines.com/en/: a website comparing the relative merits of different databases.
  • The databases vignette from the dplyr package.
  • Getting started with MongoDB in R, an introductory vignette on non-relational databases and map reduce from the mongolite package.

6.6.1 Databases and dplyr

To access a database in R via dplyr, one must use one of the src_ functions to create a source. Continuing with the SQLite example above, one would create a tbl object, that can be queried by dplyr as follows:

library("dplyr")
ghg_db = src_sqlite(ghg_db)
ghg_tbl = tbl(ghg_db, "ghg_ems")

The ghg_tbl object can then be queried in a similar way to a standard data frame. For example, suppose we wished to filter by Country. Then we use the filter function as before:

rm_world = ghg_tbl %>%
  filter(Country != "World")

In the above code, dplyr has actually generated the necessary SQL command, which can be examined using explain(rm_world). When working with databases, dplyr uses lazy evaluation: the SQL command associated with rm_world hasn’t yet been executed, which is why tail(rm_world) doesn’t work. By using lazy evaluation, dplyr is more efficient at handling large data structures since it avoids unnecessary copying. When you want your SQL command to be executed, use collect(rm_world), as shown below.
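For example (a minimal sketch continuing from the code above):

explain(rm_world) # show the SQL that dplyr has generated; the query has not yet run
rm_world_df = collect(rm_world) # execute the query and return the results as a local tibble
head(rm_world_df)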

Exercises

Follow the worked example below to create and query a database on land prices in the UK using dplyr as a front end to an SQLite database. The first stage is to read-in the data:

# See help("land_df", package="efficient") for detials
data(land_df, package="efficient")

The next stage is to create an SQLite database to hold the data:

# install.packages("RSQLite") # Requires RSQLite package
my_db = src_sqlite("land.sqlite3", create = TRUE)
land_sqlite = copy_to(my_db, land_df, indexes = list("trnsctnuni", "dtoftrnsfr"))
class(land_sqlite)
#> [1] "tbl_sqlite" "tbl_sql"    "tbl_lazy"   "tbl"

From the above code we can see that we have created a tbl. This can be accessed using dplyr in the same way as any data frame can. Now we can query the data. You can use SQL code to query the database directly or use standard dplyr verbs on the table.

# Method 1: using sql
tbl(my_db, sql('SELECT "price", "dtoftrnsfr", "postcode"  FROM land_df'))
#> Source:   query [?? x 3]
#> Database: sqlite 3.8.6 [land.sqlite3]
#> 
#>    price dtoftrnsfr postcode
#>    <int>      <dbl>    <chr>
#> 1  84000   1.15e+09  CW9 5EU
#> 2 123500   1.14e+09 TR13 8JH
#> 3 217950   1.17e+09 PL33 9DL
#> 4 147000   1.16e+09 EX39 5XT
#> # ... with more rows

# Method 2: using dplyr
select(land_sqlite, price, dtoftrnsfr, postcode, proprtytyp)
#> Source:   query [?? x 4]
#> Database: sqlite 3.8.6 [land.sqlite3]
#> 
#>    price dtoftrnsfr postcode proprtytyp
#>    <int>      <dbl>    <chr>      <chr>
#> 1  84000   1.15e+09  CW9 5EU          F
#> 2 123500   1.14e+09 TR13 8JH          T
#> 3 217950   1.17e+09 PL33 9DL          D
#> 4 147000   1.16e+09 EX39 5XT          T
#> # ... with more rows

6.7 Data processing with data.table

data.table is a mature package for fast data processing that presents an alternative to dplyr. There is some controversy about which is more appropriate for different tasks18, so it should be stated at the outset that the choice can be a matter of personal preference. Both are powerful and efficient packages that take time to learn, so it is best to learn one and stick with it, rather than have the duality of using two for similar purposes. There are situations in which one works better than another: dplyr provides a more consistent and flexible interface (e.g. with its interface to databases) so for most purposes we recommend dplyr. However, data.table has a few features that make it very fast for some operations, so it is worth at least being aware of it from an efficiency perspective.

This section provides a few examples to illustrate how data.table differs and (at the risk of inflaming the debate further) some benchmarks to explore which is more efficient. As emphasised throughout the book, efficient code writing is often more important than efficient execution on many everyday tasks so to some extent it’s a matter of preference.

The foundational object class of data.table is the data.table. Like dplyr’s tbl_df, data.table’s data.table objects behave in the same way as the base data.frame class. However, the data.table paradigm has some unique features that make it highly computationally efficient for many common tasks in data analysis. Building on the row subsetting methods using [ and filter() presented in Section 6.4.3, we’ll see data.table’s unique approach to subsetting. Like base R, data.table uses square brackets, but you do not need to refer to the object name inside the brackets:

library("data.table")
idata = readRDS("data/idata-renamed.Rds")
idata_dt = data.table(idata) # convert to data.table class
aus3a = idata_dt[Country == "Australia"]

To boost performance, one can set ‘keys’. These are ‘supercharged rownames’ which order the table based on one or more variables. This allows a binary search algorithm to subset the rows of interest, which is much, much faster than the vector scan approach used in base R (see vignette("datatable-keys-fast-subset")). data.table uses the key values for subsetting by default so the variable does not need to be mentioned again. Instead, using keys, the search criteria are provided as a list, invoked below with the concise .() syntax:

setkey(idata_dt, Country)
aus3b = idata_dt[.("Australia")]

The result is the same, so why add the extra stage of setting the key? The reason is that this one-off sorting operation can lead to substantial performance gains in situations where repeatedly subsetting rows on large datasets consumes a large proportion of computational time in your workflow. This is illustrated in Figure 6.2, which compares 4 methods of subsetting incrementally larger versions of the idata dataset.
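A sketch of how such a comparison might be run for a single dataset size (an illustrative sketch using the microbenchmark package, not the book’s benchmark code, and assuming idata and the keyed idata_dt from above):

library("microbenchmark")
microbenchmark(times = 10,
  base   = idata[idata$Country == "Australia", ],
  dplyr  = dplyr::filter(idata, Country == "Australia"),
  dt     = idata_dt[Country == "Australia"],  # data.table without using the key explicitly
  dt_key = idata_dt[.("Australia")]           # data.table using the pre-set key
)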


Figure 6.2: Benchmark illustrating the performance gains to be expected for different dataset sizes.

Figure 6.2 demonstrates that data.table is much faster than base R and dplyr at subsetting. As with using external packages to read in data (see Section 5.8), the relative benefits of data.table improve with dataset size, approaching a ~70 fold improvement on base R and a ~50 fold improvement on dplyr as the dataset size reaches half a Gigabyte. Interestingly, even the ‘non-key’ implementation of data.table’s subset method is faster than the alternatives: this is because data.table creates a key internally by default before subsetting. The process of creating the key accounts for the ~10 fold speed-up in cases where the key has been pre-generated.

In summary, this section has introduced data.table as a complementary approach to base and dplyr methods for data processing. It offers performance gains due to its implementation in C and use of keys for subsetting tables. data.table offers much more, however, including: highly efficient data reshaping; dataset merging (also known as joining, as with left_join in dplyr); and grouping, a brief taste of which is given below. For further information on data.table, we recommend reading the package’s datatable-intro, datatable-reshape and datatable-reference-semantics vignettes.
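As a brief taste of the grouping syntax just mentioned (a sketch, not code from the book, assuming the renamed gini column is present in idata_dt as above), the mean gini index per country can be computed entirely within the data.table:

idata_dt[, .(mean_gini = mean(gini, na.rm = TRUE)), by = Country]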

References

Wickham, Hadley. 2014b. “Tidy Data.” Journal of Statistical Software 59 (10).

Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.

Spector, Phil. 2008. Data Manipulation with R. Springer Science & Business Media.

Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds.” PVLDB Challenges and Visions 3.


  14. vapply(idata, class, character(1)) is an alternative way to query the class of columns in a dataset.

  15. Note that filter is also the name of a function used in the base stats library. Usually packages avoid using names already taken in base R but this is an exception.

  16. Note that this syntax is a defining feature of dplyr and many of its functions work in the same way. Later we’ll learn how this syntax can be used alongside the %>% ‘pipe’ command to write clear data manipulation commands.

  17. The equivalent code in base R is: e_ems = aggregate(ghg_ems$Electricity, list(ghg_ems$Country), mean, na.rm = TRUE, data = ghg_ems); nrow(e_ems).

  18. One question on the stackoverflow website titled ‘data.table vs dplyr’ illustrates this controversy and delves into the philosophy underlying each approach.