5.4 Tidying data with tidyr

A key skill in data analysis is understanding the structure of datasets and being able to ‘reshape’ them. This is important from a workflow efficiency perspective: more than half of a data analyst’s time can be spent re-formatting datasets (Wickham 2014b). Converting data into a ‘tidy’ form (described below) is also advantageous from a computational efficiency perspective: it usually faster to run analysis and plotting commands on a few large vectors than many short vectors.

This is illustrated by Tables 5.3 and 5.4, provided by Wickham (2014b).

Table 5.3: First 6 rows of the aggregated ‘pew’ dataset from Wickham (2014a) in an ‘untidy’ form.
religion <$10k $10–20k $20–30k
Agnostic 27 34 60
Atheist 12 27 37
Buddhist 27 21 30
Table 5.4: First 3 and last rows of the ‘tidied’ Pew dataset.
religion Income Count
Agnostic <$10k 27
Atheist <$10k 12
Buddhist <$10k 27
Unaffiliated >150k 258

These tables may look different, but they contain precisely the same information Column names in the ‘flat’ form in Table 5.3 became a new variable in the ‘long’ form in Table 5.4. According to the concept of ‘tidy data’, the long form is correct. Note that ‘correct’ here is used in the context of data analysis and graphical visualisation. For tabular presentation (i.e. tables) the ‘wide’ or ‘untidy’ form may be better.

Tidy data has the following characteristics (Wickham 2014b):

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Because there is only one observational unit in the example (religions), it can be described in a single table. Large and complex datasets are usually represented by multiple tables, with unique identifiers or ‘keys’ to join them together (Codd 1979).

Two common operations facilitated by tidyr are gathering and splitting columns, as illustrated below.

  • ‘Wide’ tables can become ‘long’, so that column names become a new variable. This is illustrated in Table 5.3 and Table 5.4 and can be achieved with the function gather:15
raw = read_csv("data/pew.csv") # read in the 'wide' dataset
## [1] 18 10
rawt = gather(raw, Income, Count, -religion)
## [1] 162   3
## Source: local data frame [3 x 3]
##   religion Income Count
##      (chr)  (chr) (int)
## 1 Agnostic  <$10k    27
## 2  Atheist  <$10k    12
## 3 Buddhist  <$10k    27
  • Splitting compound variables in two. A classic example is age-sex variables (e.g. m0-10 and f0-15 to represent males and females in the 0 to 10 age band). Splitting such variables can be done with separate, as illustrated in the difference between Table 5.5 and ??:
agesex = c("m0-10", "f0-10") # create compound variable
n = c(3, 5) # create a value for each observation
df = data.frame(agesex, n) # create a data frame
separate(df, agesex, c("sex", "age"), 1)
##   sex  age n
## 1   m 0-10 3
## 2   f 0-10 5
Table 5.5: Joined age and sex variables in one column
agesex n
m0-10 3
f0-10 5
Table 5.6: Age and sex variables separated by the funtion separate.
sex age n
m 0-10 3
f 0-10 5

There are other tidying operations that tidyr can perform, as described in the package’s vignette (vignette("tidy-data")). Data manipulation is a large topic with major potential implications for efficiency (???).