5.3 Importing data

Before reading in data, it is worth considering a general principle for reproducible data management: never modify raw data files. Raw data should be treated as read-only and should carry information about its provenance. Keeping the original file name and including a comment about its origin are a couple of ways to improve reproducibility, even when the data are not publicly available. This is illustrated below with the functions download.file() and unzip(), used to download and unzip the dataset, illustrating how R can be used to automate processes that are conventionally done by hand. The result is data stored in the data directory, ready to be read in (note that part of the dataset is also stored in the efficient package).

url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip"
download.file(url, "voc_tsvs.zip") # download file
unzip("voc_tsvs.zip", exdir = "data") # unzip files
file.remove("voc_tsvs.zip") # tidy up by removing the zip file
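
If the script may be run more than once, it is worth guarding the download so the raw files are only fetched when they are not already present. A minimal sketch, reusing the url object defined above:

# Only download and extract the raw data if this has not been done already
if (!dir.exists("data")) {
  download.file(url, "voc_tsvs.zip")
  unzip("voc_tsvs.zip", exdir = "data")
  file.remove("voc_tsvs.zip")
}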

To avoid the file download stage, many functions for reading in data can accept URLs and read directly from the internet. This is illustrated below for read.csv():

url = "https://www.osha.gov/dep/fatcat/FatalitiesFY10.csv"
df = read.csv(url)

There are now many R packages designed specifically to assist with the download and import of data. The organisation rOpenSci supports a number of these. The example below illustrates this using the WDI package (which is not supported by rOpenSci) to access the World Bank’s World Development Indicators:

library("WDI") # load the WDI library (must be installed)
WDIsearch("CO2") # search for data on a topic
df = WDI(indicator = "EN.CO2.TRAN.ZS") # import data

There will be situations where you cannot download the data directly or when the data cannot be made available. In this case, simply providing a comment relating to the data’s origin (e.g. # Downloaded from http://example.com) before referring to the dataset can greatly improve the utility of the code to yourself and others.
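
For example, a script working with private data might begin as follows (the URL and file name here are hypothetical):

# Downloaded from http://example.com/private-data.csv (not publicly available)
df = read.csv("data/private-data.csv") # local copy of the unmodified raw file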

5.3.1 Fast data reading

There is often more than one way to read data into R. Even a simple .csv file can be imported using a range of methods, with implications for computational efficiency. This section looks at three approaches: base R’s reading functions such as read.csv, which are derived from read.table; the data.table approach, which uses the function fread; and the new readr package which provides read_csv and other read_ functions such as read_tsv.

Although this section is focussed on reading text files, it demonstrates the wider principle that the speed and flexibility advantages of additional read functions can be offset by the disadvantage of an additional package dependency (in terms of complexity and code maintenance) for small datasets. The real benefits kick in on large datasets. Of course, there are some data types that require a particular package to be loaded in R: the readstata13 package, for example, was developed solely to read in .dta files generated by versions 13 and above of Stata.

Figure 5.3 demonstrates that the relative performance gains of the data.table and readr approaches increase with data size, especially so for data with many rows. Below around 1 MB read.csv is actually faster than read_csv while fread is much faster than both, although these savings are likely to be inconsequential for such small datasets.

For files beyond 100 MB in size, fread and read_csv can be expected to be around 5 times faster than read.csv. This efficiency gain may be inconsequential for a one-off file of 100 MB running on a fast computer (which would still take less than a minute with read.csv), but could represent an important speed-up if you frequently load large text files.

Figure 5.3: Benchmarks of base vs data.table vs readr functions for reading csv files. The facets ranging from 2 to 200 represent the number of columns.

When tested on a large (4 GB) .csv file, it was found that fread and read_csv were almost identical in load times and that read.csv took around 5 times longer. Loading the file consumed more than 10 GB of RAM, making it unsuitable to run on many computers (see Section 4.2 for more on memory). Note that both readr and base methods can be made faster by pre-specifying the column types at the outset, as illustrated in the sketch below and described in the help files.
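
As a sketch of what pre-specifying column types can look like, the code below uses a few columns of the voyages dataset introduced in the next benchmark; the decision to read built as character is illustrative (it preserves the anomalous text entry discussed later) and unspecified columns are still guessed as usual:

library("readr")
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
# readr: supply a partial column specification via col_types
voyages_readr = read_tsv(fname, col_types = cols(
  number = col_integer(),
  boatname = col_character(),
  built = col_character() # keep the odd text entry rather than coercing it to NA
))
# base R equivalent: a named colClasses vector (unnamed columns are guessed)
voyages_base = read.csv(fname, sep = "\t",
                        colClasses = c(number = "integer", boatname = "character"))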

In some cases with R programming there is a trade-off between speed and robustness. This is illustrated below with reference to differences in how the readr, data.table and base R approaches handle unexpected values. Table 5.1 shows that read_tsv is around 3 times faster than read.csv, reinforcing the point (made with Figure 5.3) that the benefits of efficient functions increase with dataset size. This is a small (1 MB) dataset: the relative difference between fread and the read_ functions will tend to decrease as dataset size increases.

library("microbenchmark") # provides microbenchmark()
library("readr")          # provides read_tsv()
library("data.table")     # provides fread()
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
res_v <- microbenchmark(times = 10,
  base_read = voyages_base <- read.csv(fname, sep = "\t"),
  readr_read = voyages_readr <- read_tsv(fname),
  dt_fread = voyages_dt <- fread(fname))
Table 5.1: Execution times of base, readr and data.table functions for reading in a 1 MB dataset, relative to the mean execution time of fread (around 0.02 seconds on a modern computer).

| Function   | min  | mean | max  |
|------------|------|------|------|
| base_read  | 13.6 | 16.3 | 25.4 |
| readr_read | 3.7  | 5.3  | 14.3 |
| dt_fread   | 0.9  | 1.0  | 1.1  |

The benchmark above produces warning messages (not shown) for the read_tsv and fread functions but not the slowest base function read.csv. An exploration of these can shed light on the speed/robustness trade-off.

  • The readr function generates a warning for row 2841 in the built variable. This is because read_*() functions decide what class each variable is based on the first 1000 rows, rather than on all rows as base read.* functions do. Printing the result for the row that generated the warning shows that the read_tsv output is probably more sensible than the read.csv output: read.csv coerced the built field into a factor because of a single entry containing text, whereas read_tsv coerced the variable to numeric and set the offending value to NA, as illustrated below.
class(voyages_base$built) # coerced to a factor
## [1] "factor"
class(voyages_readr$built) # numeric based on first 1000 rows
## [1] "numeric"
voyages_base$built[2841] # contains the text responsible for coercion
## [1] 1721-01-01
## 182 Levels:  1 784 1,86 1135 1594 1600 1612 1613 1614 1615 1619 ... taken 1672
voyages_readr$built[2841] # an NA: text cannot be converted to numeric
## [1] NA
  • The data.table function fread generates 5 warning messages stating that columns 2, 4, 9, 10 and 11 were Bumped to type character on data row ..., with the offending rows printed in place of .... Instead of setting the offending values to NA, as readr does for the built column (column 9), fread coerces each affected column, which it had initially guessed to be numeric, to character.

To summarise, the differences between base, readr and data.table functions for reading in data go beyond code execution times. The functions read_csv and fread boost speed partially at the expense of robustness because they decide column classes based on a small sample of available data. The similarities and differences between the approaches are summarised for the Dutch shipping data in Table 5.2, which shows 4 main similarities and differences:

  • For uniform data such as the ‘number’ variable in Table 5.2, all reading methods yield the same result (integer in this case).
  • For columns that are obviously characters such as ‘boatname’, the base method results in factors (unless stringsAsFactors is set to FALSE) whereas fread and read_csv return characters.
  • For columns in which the first 1000 rows are of one type but which contain anomalies, such as ‘built’ and ‘departure_date’ in the shipping example, fread coerces the result to character. read_csv and siblings, by contrast, keep the class that is correct for the first 1000 rows and set the anomalous records to NA.
  • read_* functions generate objects of class tbl_df, an extension of the data.frame, as discussed in Section 5.5. fread generates objects of class data.table. These can be used as standard data frames but differ subtly in their behaviour.
Table 5.2: Column classes generated by the base, readr and data.table functions for selected variables of the 1 MB voyages dataset.

| Function   | number  | boatname  | built     | departure_date |
|------------|---------|-----------|-----------|----------------|
| base_read  | integer | factor    | factor    | factor         |
| readr_read | integer | character | numeric   | Date           |
| dt_fread   | integer | character | character | character      |
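
The classes reported in Table 5.2 can be checked directly from the objects created in the benchmark above, assuming voyages_base, voyages_readr and voyages_dt are still in the workspace:

# Compare the class each approach assigned to the 'built' column
sapply(list(base = voyages_base, readr = voyages_readr, dt = voyages_dt),
       function(x) class(x$built))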

The wider point associated with these tests is that functions that save time can also add considerations or complexities to your workflow. Taking a look at what is going on ‘under the hood’ of fast functions, as we have done in this section, can help you understand the knock-on consequences of choosing fast functions over slower functions from base R.

5.3.2 Preprocessing outside R

There are circumstances when datasets become too large to read directly into R. Reading in a 4 GB text file using the functions tested above, for example, consumed all available RAM on a 16 GB machine! To overcome the limitation that R reads all data directly into RAM, external stream processing tools can be used to preprocess large text files. The following command, using the shell command split, for example, would break a large multi-GB file into many smaller chunks, each of which is more manageable for R:

split -b100m bigfile.csv

The result is a series of files, set to 100 MB each with the -b100m argument in the above code. By default these will be called xaa, xab and so on, and could be read in one chunk at a time (e.g. using read.csv, fread or read_csv, described in Section 5.3.1) without crashing most modern computers.
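
A sketch of working through the chunks one at a time is given below. Note that split -b splits on bytes, so rows may be cut at chunk boundaries; splitting on lines instead (e.g. split -l 1000000 bigfile.csv) keeps rows intact, and the code assumes this has been done:

# Process each chunk in turn, keeping only the (small) result of each step
chunk_files = list.files(pattern = "^xa") # split names its output xaa, xab, ...
for (f in chunk_files) {
  chunk = data.table::fread(f) # note: only the first chunk contains the header row
  # ... summarise or filter the chunk here, retaining only the result ...
}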

Splitting a large file into individual chunks may allow it to be read into R, but it is not an efficient way to import large datasets because you will only ever have access to a non-random sample of the data this way. A more efficient way to work with very large datasets is via databases.

5.3.3 Working with databases

Instead of loading all the data into RAM, as R does, databases query data from the hard-disk. This can allow a subset of a very large dataset to be defined and read into R quickly, without having to load it first.

R can connect to databases in a number of ways. The most mature of these is via the RODBC package, which sets up links to external databases using the Open Database Connectivity (ODBC) API, as described in the package’s vignette (which can be accessed with vignette("RODBC") once the package has been installed). RODBC connects to ‘traditional’ databases such as MySQL, PostgreSQL, Oracle and SQLite.

The function used to set up a connection to an external database with RODBC is odbcConnect(), which takes a Data Source Name (dsn =), User ID (uid =) and password (pwd =) as required arguments. Be sure never to release your password by entering it directly into the command. Instead, we recommend saving sensitive information such as database passwords and API keys in .Renviron, as described in Section 3.3.3. Assuming you had saved your password as the environment variable PSWRD, you could enter pwd = Sys.getenv("PSWRD") to minimise the risk of exposing your password through accidentally releasing the code or your session history.
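
A minimal sketch of such a connection is given below; the DSN, user name, table and SQL query are hypothetical and will depend on the database in use:

library("RODBC")
con = odbcConnect(dsn = "my_database", uid = "my_user",
                  pwd = Sys.getenv("PSWRD")) # password read from .Renviron, never typed in the script
res = sqlQuery(con, "SELECT * FROM voyages WHERE built > 1700") # illustrative SQL
odbcClose(con) # close the connection when finished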

Recently there has been a shift towards the ‘noSQL’ approach to data storage for handling large datasets. This is illustrated by the emergence and uptake of software such as MongoDB and Apache Cassandra, which have R interfaces via the packages mongolite and RJDBC; the latter can connect to Apache Cassandra data stores and any source compliant with the Java Database Connectivity (JDBC) API.
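
As an illustration of the mongolite interface, the sketch below connects to a hypothetical local MongoDB server; the database and collection names are made up and a running MongoDB instance is assumed:

library("mongolite")
m = mongo(collection = "voyages", db = "voc",
          url = "mongodb://localhost") # connect to a (hypothetical) local MongoDB server
first_docs = m$find(limit = 10) # return the first 10 documents as a data frame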

MonetDB is a recent alternative to traditional and noSQL approaches which offers substantial efficiency advantages for handling large datasets (Kersten et al. 2011). A tutorial on the MonetDB website provides an excellent introduction to handling databases from within R. A new development showcased in this tutorial is the ability to interact with databases using exactly the same syntax used to interact with R objects stored in RAM. This innovation was made possible by dplyr, an R library for data processing that aims to provide a unified ‘front end’ for performing a wide range of analysis tasks on datasets via a variety of ‘back ends’ which do the number crunching. This is one of the main advantages of dplyr (see Section 5.5).
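
The sketch below illustrates this style of database ‘front end’ from dplyr using a local SQLite file accessed via DBI; the file and table names are hypothetical, and the dbplyr back-end package (plus RSQLite) must be installed for tbl() to work on a database connection:

library("dplyr")
con = DBI::dbConnect(RSQLite::SQLite(), "data/voc.sqlite") # open the (hypothetical) database file
voyages_db = tbl(con, "voyages") # a lazy table: no data is loaded into RAM yet
recent = voyages_db %>%
  filter(built > 1700) %>% # translated to SQL and executed inside the database
  collect()                # only now are the matching rows pulled into R
DBI::dbDisconnect(con)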