5.5 Data processing

Tidy data is easier to process than messy data. As with many aspects of R programming, there are many ways to process a dataset, some more efficient than others. Following our own advice to decide on appropriate packages early in the project (see Section 5.2), this section uses dplyr, which has a number of advantages compared with base R and data.table:

  • dplyr is fast to run and intuitive to type
  • dplyr works well with tidy data, as described above
  • dplyr works well with databases, providing efficiency gains on large datasets
To illustrate dplyr in action, we load a World Bank inequality dataset:

library("readr") # read_csv() is from the readr package
fname <- system.file("extdata/world-bank-ineq.csv", package = "efficient")
idata <- read_csv(fname)
idata # print the dataset in the dplyr (tibble) way

dplyr is much faster than base implementations of various operations, but it has the potential to be even faster, as parallelisation is planned and the multidplyr package, a parallel backend for dplyr, is under development.

Do not expect to learn the dplyr package in one sitting: the package is large and can be seen as a language in its own right. Following the ‘walk before you run’ principle, we’ll start simple, by filtering and aggregating rows, building on the previous section on tidying data.

5.5.1 Renaming columns

Renaming data columns is a common task that can make writing code faster by using short, intuitive names. The dplyr function rename() makes this easy.16

idata = rename(idata, Country = `Country Name`)

To rename multiple columns the variable names are simply separated by commas. The dplyr and base R ways of doing this are illustrated below for comparison.

# The dplyr way (rename two variables)
idata <- rename(idata,
 top10 = `Income share held by highest 10% [SI.DST.10TH.10]`,
 bot10 = `Income share held by lowest 10% [SI.DST.FRST.10]`)

# The base R way (rename five variables)
names(idata)[5:9] =
  c("top10", "bot10", "gini", "b40_cons", "gdp_percap")

Now that the columns have usefully short names, we save the result for future reference:

saveRDS(idata, "data/idata-renamed.Rds")
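The saved object can be loaded back into a later session with readRDS(). A minimal round-trip sketch (using a temporary file for illustration, rather than the path above):

```r
# Save and reload an R object with saveRDS()/readRDS();
# a temporary file is used here purely for illustration
tmp = tempfile(fileext = ".Rds")
saveRDS(mtcars, tmp)
identical(readRDS(tmp), mtcars) # TRUE
```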

5.5.2 Filtering rows

The standard way to subset data by rows in R is with square brackets, for example:

aus1 = idata[idata$Country == "Australia",]

dplyr offers an alternative and more flexible way of filtering data, using filter().

aus2 = filter(idata, Country == "Australia")

Note that dplyr does not rely on the $ symbol: it knows that Country is a variable of idata. Because idata was the first argument, dplyr assumes that any subsequent arguments are variables.17
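filter() also accepts multiple conditions, separated by commas, which are combined with & (logical AND). A sketch on a hypothetical toy data frame (the column names here are illustrative, not from the inequality dataset):

```r
library("dplyr")
# Toy data frame for illustration only
toy = data.frame(Country = c("Australia", "Australia", "Norway"),
                 Year = c(1999, 2005, 2005))
filter(toy, Country == "Australia", Year > 2000) # one matching row
```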

The dplyr equivalent of aggregate is to use the grouping function group_by in combination with the general purpose function summarise (not to be confused with summary in base R).

5.5.3 Filtering columns

Large datasets often contain much worthless or blank information, clogging up RAM and reducing computational efficiency. Being able to focus quickly only on the variables of interest becomes especially important when handling large datasets.
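One way to focus only on the variables of interest is dplyr's select(), which keeps just the named columns. A minimal sketch on a toy data frame (column names are illustrative):

```r
library("dplyr")
# select() keeps only the columns named in the call
mini = data.frame(Country = "Norway", gini = 25.9, junk = NA)
select(mini, Country, gini) # drops the junk column
```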

Imagine that we have a large (4+ GB) text file called miniaa, which is loaded with the following command:

fname = system.file("extdata/miniaa", package = "efficient")
df = read_csv(fname) # load imaginary large data
dim(df) # the dimensions of the data frame
## [1]   9 329

Note that the data frame has 329 columns (and imagine it has 4 million+ rows as the original does). That’s a lot of variables. Do we need them all? It’s worth taking a glimpse at this dataset to find out:

glimpse(df) # glimpse() shows column names, types and first values
## $ NPI                   (int) 1679576722, ...
## $ Entity Type Code      (int) 1, 1, 2,    ...
## $ Replacement NPI       (lgl) NA, NA, NA, ...
## ...

Looking at the output, it becomes clear that the majority of the variables contain only NA. To clean the giant dataset by removing the empty columns, we first need to identify which variables these are.

# Identify the variables which are entirely NA
all_na = sapply(df, function(x) all(is.na(x)))
summary(all_na) # summary of the results
##    Mode   FALSE    TRUE    NA's 
## logical      96     233       0
df = df[!all_na] # subset the dataframe

The new df object has fewer than a third of the original columns.
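The same subsetting can be written with base R's Filter(), which keeps the columns for which the supplied function returns TRUE; a sketch on a toy data frame:

```r
# Keep columns that are not entirely NA, using base R's Filter()
toy = data.frame(a = 1:3, b = NA, c = c(NA, 2, 3))
cleaned = Filter(function(x) !all(is.na(x)), toy)
names(cleaned) # "a" "c"
```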

Challenge: find out how much space was saved by the above operation using object.size()

5.5.4 Data aggregation

Data aggregation is the process of creating summaries of data based on a grouping variable. The end result usually has the same number of rows as there are groups. Because aggregation is a way of condensing datasets it can be a very useful technique for making sense of large datasets. The following code finds the average emissions per country (country being the grouping variable) from the ‘GHG’ dataset rescued from a spreadsheet and converted into a .csv file in the previous chapter.

fname = system.file("extdata/ghg-ems.csv", package = "efficient")
df = read.csv(fname)
names(df) # the variable names
## [1] "X"                                       
## [2] "Country"                                 
## [3] "Year"                                    
## [4] "Electricity.Heat..CO2...MtCO2."          
## [5] "Manufacturing.Construction..CO2...MtCO2."
## [6] "Transportation..CO2...MtCO2."            
## [7] "Other.Fuel.Combustion..CO2...MtCO2."     
## [8] "Fugitive.Emissions..CO2...MtCO2."
nrow(df) # the number of rows
## [1] 7896
length(unique(df$Country)) # the number of countries
## [1] 188

Challenge: rename the variables 4 to 8 so they are much shorter, following the pattern ECO2, MCO2 etc. That will make the code for manipulating the dataset easier to write

After the variable names have been updated, we can aggregate.18

e_ems = aggregate(df$ECO2, list(df$Country), mean, na.rm = TRUE)
nrow(e_ems) # one row per country
## [1] 188

Note that the resulting data frame has the same number of rows as there are countries: the aggregation has successfully reduced the number of rows we need to deal with. Now it is easier to find out per-country statistics, such as the three lowest emitters from electricity production:

head(e_ems[order(e_ems$x),], 3)
##     Group.1          x
## 77  Iceland 0.01785714
## 121   Nepal 0.02333333
## 18    Benin 0.04642857

Another way to specify the by argument is with the tilde (~). The following command creates the same object as e_ems, but with less typing.

e_ems = aggregate(ECO2 ~ Country, df, mean, na.rm = TRUE)

To aggregate the dataset using the dplyr package, the task is divided in two: group the dataset by country first, then summarise each group, as illustrated below:

group_by(df, Country) %>%
  summarise(mean_eco2 = mean(ECO2, na.rm = TRUE))
## Source: local data frame [188 x 2]
##              Country   mean_eco2
##               (fctr)       (dbl)
## 1        Afghanistan         NaN
## 2            Albania   0.6411905
## 3            Algeria  23.0147619
## 4             Angola   0.7914286
## 5  Antigua & Barbuda         NaN
## 6          Argentina  39.1054762
## 7            Armenia   1.8000000
## 8          Australia 150.5961905
## 9            Austria  17.3202381
## 10        Azerbaijan  16.0430435
## ..               ...         ...
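The pipeline can be extended with arrange() to sort the summary, mirroring the base R order() example above. A sketch on a toy dataset (column names chosen to match the real one):

```r
library("dplyr")
# Toy emissions data: chain summarise() with arrange() to rank groups
ems = data.frame(Country = rep(c("A", "B", "C"), each = 2),
                 ECO2 = c(5, 7, 1, 3, 9, 11))
ems %>%
  group_by(Country) %>%
  summarise(mean_eco2 = mean(ECO2, na.rm = TRUE)) %>%
  arrange(mean_eco2) # lowest emitters first: B, A, C
```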