# 5 Efficient input/output

Input/output (I/O) is the process of getting information into a particular computer system (in this case R) and then exporting it to the ‘outside world’ again (in this case as a file format that other software can read). Data I/O will be needed on projects where data comes from, or goes to, external sources. Yet the majority of R resources and documentation start with the assumption that your data has already been loaded. But importing datasets into R, and exporting them to the world outside the R ecosystem, can be a time-consuming and frustrating process. If the process is tricky, slow or ultimately unsuccessful, this can represent a major inefficiency right at the outset of a project. Conversely, reading and writing your data efficiently will make it much easier for your R projects to interact with the outside world. This chapter explains how to efficiently read a wide range of datasets into R.

With the accelerating digital revolution and growth in open data, an increasing proportion of the world’s data can be downloaded from the internet. This trend is set to continue. Downloading and importing data from the web is therefore covered first. Next we briefly outline two developments for efficient data import: the rio package and the .feather data format. Benchmarks throughout the chapter demonstrate that choice of file format and packages for data I/O can have a huge impact on computational efficiency. The chapter finishes with an exploration of how functions for reading in files stored in common plain text file formats from the readr and data.table packages can improve load speeds when working with these files.

Before reading in a single line of data, however, it is worth considering a general principle for reproducible data management: never modify raw data files. Raw data should be seen as read-only, and contain information about its provenance. Keeping the original file name and commenting on its origin are a couple of ways to improve reproducibility, even when the data are not publicly available.

## 5.1 Top 5 tips for efficient data I/O

• Keep the names of local files downloaded from the internet unchanged. This will help you trace the provenance of the data in the future.

• R’s native file format is .Rds. These files can be imported and exported using readRDS and saveRDS for fast and space efficient data storage.

• Use import() from the rio package to efficiently import data from a wide range of formats, avoiding the hassle of loading format-specific libraries.

• Use readr or data.table versions of read.table() to efficiently import large text files.

• Use file.size() and object.size() to keep track of the size of files and R objects and take action if they get too big.
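
As a minimal sketch of the last tip, the snippet below writes a built-in dataset to a temporary file and compares its size on disk with its size in memory. The use of mtcars and tempfile() is an assumption made for illustration; any data frame and file path would do.

```r
# Write an example data frame (mtcars, from the datasets package) to a temporary file
tmp_csv = tempfile(fileext = ".csv")
write.csv(mtcars, tmp_csv)

# Size of the file on disk, in bytes
size_on_disk = file.size(tmp_csv)

# Size of the object in memory, in bytes
size_in_ram = as.numeric(object.size(mtcars))

# Human-readable summary of the memory used by the object
format(object.size(mtcars), units = "auto")
```

The two numbers usually differ, because the on-disk format and R's in-memory representation store the same values in different ways.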

## 5.2 Getting data from the internet

The code chunk below shows how the functions download.file (see footnote 1) and unzip can be used to download and unzip a dataset from the internet. R can automate processes that are often performed manually, e.g. through the graphical user interface of a web browser, with potential advantages for reproducibility and programmer efficiency. The result is data stored neatly in the data directory ready to be imported. Note we deliberately kept the file name intact to help with documentation, enhancing understanding of the data’s provenance. Note also that part of the dataset is stored in the efficient package.

Using R for basic file management can help create a reproducible workflow, as illustrated below. The data downloaded in the following code chunk is a multi-table dataset on Dutch naval expeditions used with permission from the CWI Database Architectures Group and described more fully at monetdb.org. From this dataset we primarily use the ‘voyages’ table, which lists Dutch shipping expeditions by their date of departure.

url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip"
download.file(url, "voc_tsvs.zip") # download file
unzip("voc_tsvs.zip", exdir = "data") # unzip files
file.remove("voc_tsvs.zip") # tidy up by removing the zip file

This workflow applies equally to downloading and loading single files. Note that one could make the code more concise by replacing the second line with df = read.csv(url). However, we recommend downloading the file to disk so that if for some reason the import fails (e.g. if you would like to skip the first few lines), you don’t have to keep downloading the file over and over again. The code below downloads and loads data on atmospheric concentrations of CO2. Note that this dataset is also available from the datasets package.

url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv"
download.file(url, "data/co2.csv")
df_co2 = read.csv("data/co2.csv")
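
The recommendation to download once and then work from the local copy can be sketched as a small helper. The function name download_once is hypothetical (it is not part of base R or any package mentioned here); it simply wraps file.exists and download.file.

```r
# Hypothetical helper: download a file only if it is not already on disk
download_once = function(url, destfile) {
  if (!file.exists(destfile)) {
    # Create the target directory if needed, then fetch the file
    dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
    download.file(url, destfile)
  }
  destfile # return the local path either way
}

# Usage (requires an internet connection the first time it is run):
# f = download_once(
#   "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv",
#   "data/co2.csv"
# )
# df_co2 = read.csv(f)
```

Re-running a script that uses such a helper only touches the network when the local file is missing, so import code can be debugged repeatedly without repeated downloads.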

There are now many R packages to assist with the download and import of data. The organisation ROpenSci supports a number of these. The example below illustrates this using the WDI package (not supported by ROpenSci) to access World Bank data on CO2 emissions in the transport sector:

library("WDI") # load the WDI library (must be installed)
WDIsearch("CO2") # search for data on a topic
co2_transport = WDI(indicator = "EN.CO2.TRAN.ZS") # import data

There will be situations where you cannot download the data directly or when the data cannot be made available. In this case, simply providing a comment relating to the data’s origin (e.g. # Downloaded from http://example.com) before referring to the dataset can greatly improve the utility of the code to yourself and others.

## 5.3 Versatile data import with rio

rio is ‘A Swiss-Army Knife for Data I/O’, providing easy-to-use and highly performant wrapper functions for importing a range of file formats. At the time of writing, these include .csv, .feather, .json, .dta, .xls, .xlsx and Google Sheets (see the package’s github page for up-to-date information). Below we illustrate two of rio’s key functions, import() and export().

library("rio")

# Specify a file
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")

# Import the file (uses the fread function from data.table)
voyages = import(fname)

# Export the file as an Excel spreadsheet
export(voyages, "output.xlsx")

The need to import and use .json data is becoming increasingly common, as it is a standard output format for many APIs. The jsonlite and geojsonio packages have been developed to make this as easy as possible.

### Exercises

The final line in the code chunk above shows a neat feature of rio and some other packages: the output format is determined by the suffix of the file name, which makes for concise code. Try opening the output.xlsx file with an editor such as LibreOffice Calc or Microsoft Excel to ensure that the export worked, before removing this rather inefficient and non-secure file format from your system to preserve precious disk space:

file.remove("output.xlsx")

## 5.4 Accessing data stored in packages

Most well documented packages provide some example data for you to play with. This can help demonstrate use cases in specific domains that use a particular data format. The command data(package = "package_name") will show the datasets in a package. Datasets provided by dplyr, for example, can be viewed with data(package = "dplyr").
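
The result of data(package = ...) can also be used programmatically. The sketch below uses the datasets package, which ships with R, as an assumed stand-in for any installed package.

```r
# data(package = ...) returns an object whose 'results' matrix has one row per dataset
pkg_data = data(package = "datasets")
dataset_names = pkg_data$results[, "Item"]
head(dataset_names)

# Load one of the listed datasets into the current environment by name
data("co2", package = "datasets")
class(co2)
```

The 'results' matrix also contains a Title column, which is useful for searching for datasets on a particular topic.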

Raw data (i.e. data which has not been converted into R’s native .Rds format) is usually located within the sub-folder extdata in R (which corresponds to inst/extdata when developing packages). The function system.file outputs file paths associated with specific packages. To see all of the external files within the readr package, for example, one could use the following command:

list.files(system.file("extdata", package = "readr"))
#> [1] "compound.log"      "epa78.txt"         "example.log"
#> [4] "fwf-sample.txt"    "massey-rating.txt" "mtcars.csv"
#> [7] "mtcars.csv.bz2"    "mtcars.csv.zip"

Further, to ‘look around’ to see what files are stored in a particular package, one could type the following, taking advantage of RStudio’s intellisense file completion capabilities (using copy and paste to enter the file path):

system.file(package = "readr")
"/home/robin/R/x86_64-pc-linux-gnu-library/3.3/readr/"

Hitting Tab after the second command should trigger RStudio to create a miniature pop-up box listing the files within the folder, as illustrated in figure 5.1.

## 5.5 The feather file format

Feather was developed as a collaboration between R and Python developers to create a fast, light and language agnostic format for storing data frames. The code chunk below shows how it can be used to save and then re-load the df_co2 dataset loaded previously in both R and Python:

# R: write the data frame to a feather file, then read it back in
library("feather")
write_feather(df_co2, "data/co2.feather")
df_co2_feather = read_feather("data/co2.feather")

# Python: read the same file into a data frame
import feather
path = 'data/co2.feather'
df_co2_feather = feather.read_dataframe(path)

## 5.6 Efficient data export: .Rdata or .Rds?

Once you have tidied your data (described in the next Section), it will hopefully be suitably shipshape to save. Beyond the raw data, which should also be saved, saving it after tidying is recommended to reduce the chance of having to run all the data cleaning code again. However, it may also make sense to save your data in a new format early on, not least because read and write speeds of proprietary formats can be very slow. A large .shp file, for example, can take more than ten times longer to load than a .Rds or .Rdata file.

.Rds and .RData are R’s native file formats. These are binary file formats optimised for speed and compression ratios. But what is the difference between them? The following code chunk demonstrates the key difference between these two (surprisingly little known and used) file formats:

save(df_co2, file = "data/co2.RData")
saveRDS(df_co2, "data/co2.Rds")
load("data/co2.RData")
df_co2_rds = readRDS("data/co2.Rds")
identical(df_co2, df_co2_rds)
#> [1] TRUE

The first method is the most widely used. It uses the save function, which takes any number of R objects and writes them to a file, which must be specified by the file = argument. save is like save.image, which saves all the objects currently loaded in R.

The second method is slightly less used but we recommend it. Apart from being slightly more concise for saving single R objects, the readRDS function is more flexible: as shown in the subsequent line, the resulting object can be assigned to any name. In this case we called it df_co2_rds (which we show to be identical to df_co2, loaded with the load command) but we could have called it anything or simply printed it to the console.

Using saveRDS is good practice because it forces you to specify object names. If you use save without care, you could forget the names of the objects you saved and accidentally overwrite objects that already existed.
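
The risk of accidental overwriting can be illustrated with a small self-contained sketch. The object x and the temporary files are assumptions made for the example.

```r
tmp_rdata = tempfile(fileext = ".RData")
tmp_rds = tempfile(fileext = ".Rds")

x = 1:3
save(x, file = tmp_rdata) # the name 'x' is stored with the data
saveRDS(x, tmp_rds)       # only the value is stored

x = "something else"      # 'x' now refers to a different object

# readRDS() returns the value, so we choose the name and 'x' is untouched
x_restored = readRDS(tmp_rds)

# load() restores objects under their saved names; loading into a new
# environment first lets us inspect the names without overwriting anything
e = new.env()
loaded_names = load(tmp_rdata, envir = e)
loaded_names

# Loading into the current environment silently replaces the current 'x'
load(tmp_rdata)
x
```

Loading into a separate environment, as done with e above, is one way to check what a .RData file contains before letting it modify your workspace.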

How space efficient are these file export methods? We can explore this question using the functions list.files and file.size, as illustrated below. The results, which also show how the relative space saving of native R formats increases with dataset size, are shown in Table 5.1.

files_co2 = list.files(path = "data", pattern = "co2.", full.names = TRUE)
filesize_co2 = data.frame(
  Format = gsub(pattern = "data/", replacement = "", files_co2),
  Size = file.size(files_co2) / 1000 # bytes to kilobytes
)

Table 5.1: Absolute (KB) and relative (compared with the smallest size for each column) disk usage for the 3 column, 468 row ‘co2’ dataset, saved in different formats. Columns headed 10x, 100x and 1000x show the results for disk usage after increasing the number of rows by 10, 100 and 1000 fold respectively.

| Format | Size | Rel | Size (10x) | Rel (10x) | Size (100x) | Rel (100x) | Size (1000x) | Rel (1000x) |
|---|---|---|---|---|---|---|---|---|
| co2.csv | 13.8 | 3.0 | 160.3 | 29.4 | 1649.7 | 147.3 | 16964.9 | 271 |
| co2.feather | 9.7 | 2.1 | 93.9 | 17.2 | 936.3 | 83.6 | 9360.3 | 150 |
| co2.RData | 4.7 | 1.0 | 5.5 | 1.0 | 11.2 | 1.0 | 62.6 | 1 |
| co2.Rds | 4.7 | 1.0 | 5.5 | 1.0 | 11.2 | 1.0 | 62.6 | 1 |

The results of this simple disk usage benchmark show that saving data in a compressed binary format can offer substantial advantages, both in hard-disk space and, if your data will be shared online, in download time and bandwidth usage. It is striking to note that R’s native formats can be over 100 times more space efficient than plain text (.csv) and other binary (.feather) formats. But how does each method compare from a computational efficiency perspective?
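
The contribution of compression to these space savings can be explored directly, because saveRDS has a compress argument. The following sketch assumes a synthetic, highly repetitive dataset built from mtcars; it is not the benchmark used for Table 5.1.

```r
# Repeat a built-in data frame to create a larger, highly repetitive dataset
df_big = do.call(rbind, replicate(100, mtcars, simplify = FALSE))

f_compressed = tempfile(fileext = ".Rds")
f_uncompressed = tempfile(fileext = ".Rds")

saveRDS(df_big, f_compressed)                     # gzip compression is the default
saveRDS(df_big, f_uncompressed, compress = FALSE) # same data, no compression

file.size(c(f_compressed, f_uncompressed)) / 1000 # sizes in KB
```

Repetitive data compress particularly well, so the gap between the two files is likely to be larger here than for data with little redundancy.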

## 5.7 A benchmark of methods for file import and export

The read and write times for the functions showcased above are presented in Table 5.2 and Table 5.3 respectively.

Table 5.2: Absolute and relative (compared with the fastest time for each column) read times for the 3 column, 468 row ‘co2’ dataset, saved with different functions. Columns headed 10x, 100x and 1000x show the read times after increasing the number of rows by 10, 100 and 1000 fold respectively.

| Function | Time | Rel | Time (10x) | Rel (10x) | Time (100x) | Rel (100x) | Time (1000x) | Rel (1000x) |
|---|---|---|---|---|---|---|---|---|
| read.csv | 1.8 | 15.1 | 13.1 | 110.8 | 141.5 | 250.7 | 1639.3 | 450.0 |
| read_feather | 0.1 | 1.0 | 0.1 | 1.0 | 0.6 | 1.0 | 3.6 | 1.0 |
| load | 0.2 | 1.4 | 0.4 | 3.3 | 2.6 | 4.6 | 28.1 | 7.7 |
| readRDS | 0.1 | 1.1 | 0.4 | 3.0 | 2.5 | 4.5 | 27.7 | 7.6 |

Table 5.3: Absolute and relative (compared with the fastest time for each column) write times for the 3 column, 468 row ‘co2’ dataset, saved with different functions. Columns headed 10x, 100x and 1000x show the write times after increasing the number of rows by 10, 100 and 1000 fold respectively.

| Function | Time | Rel | Time (10x) | Rel (10x) | Time (100x) | Rel (100x) | Time (1000x) | Rel (1000x) |
|---|---|---|---|---|---|---|---|---|
| write.csv | 2.8 | 6.9 | 21.1 | 33.0 | 197.3 | 27.5 | 2312.4 | 27.6 |
| write_feather | 0.4 | 1.0 | 0.6 | 1.0 | 7.2 | 1.0 | 142.8 | 1.7 |
| save | 1.7 | 4.2 | 1.6 | 2.5 | 10.5 | 1.5 | 95.3 | 1.1 |
| saveRDS | 1.7 | 4.2 | 1.5 | 2.3 | 9.5 | 1.3 | 83.9 | 1.0 |

The results show that the relative size of different formats is not a reliable predictor of data read and write times. This is due to the computational overheads of compression. Although the binary .feather format did not perform well in terms of file size, the function read_feather is faster than R’s native functions for reading .Rds and .RData formats, for the datasets used in the benchmark. write_feather is also faster than save and saveRDS for all but the largest dataset. In all cases, read.csv and write.csv are several times slower than the binary formats and this relative slowness generally worsens with increasing dataset size. In the next section we explore the performance of alternatives to these base R functions for reading and writing plain text data files.
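
A rough version of such a benchmark can be run with base R alone using system.time. The sketch below assumes a synthetic 50,000 row data frame and covers only the base functions; absolute timings will vary between machines and runs, so no results are shown.

```r
# Synthetic data frame used only for this illustration
df = data.frame(id = 1:5e4, value = rep(c(1.5, 2.5), times = 25000))
f_csv = tempfile(fileext = ".csv")
f_rds = tempfile(fileext = ".Rds")

# Elapsed write times in seconds
t_write_csv = system.time(write.csv(df, f_csv, row.names = FALSE))[["elapsed"]]
t_write_rds = system.time(saveRDS(df, f_rds))[["elapsed"]]

# Elapsed read times in seconds
t_read_csv = system.time(df_csv <- read.csv(f_csv))[["elapsed"]]
t_read_rds = system.time(df_rds <- readRDS(f_rds))[["elapsed"]]

c(write_csv = t_write_csv, write_rds = t_write_rds,
  read_csv = t_read_csv, read_rds = t_read_rds)
```

For more reliable comparisons each expression should be repeated several times, for example with the microbenchmark package used later in this chapter.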

## 5.8 Fast import of plain text formats

There is often more than one way to read data into R. A simple .csv file, for example, can be imported using a wide range of methods, with implications for computational efficiency. This section investigates methods for getting data into R, with a focus on delimited text formats, as these are ubiquitous, and a focus on three approaches: base R’s plain text reading functions such as read.delim, which are derived from read.table; the data.table approach, which uses the function fread; and the newer readr package which provides read_csv and other read_ functions such as read_tsv.

Note that a function ‘derived from’ another in this context means that it calls another function. The functions such as read.csv and read.delim in fact are wrappers for the more generic function read.table. This can be seen in the source code of read.csv, for example, which shows that the function is roughly the equivalent of read.table(file, header = TRUE, sep = ",").
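
This relationship can be checked on a small file. The sketch below assumes a simple, well-behaved .csv written from the first rows of mtcars; read.csv also sets different defaults for quote, fill and comment.char, so the two calls are not guaranteed to agree on messier files.

```r
# Write a small, clean csv file to a temporary location
tmp = tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp, row.names = FALSE)

# Read it with the wrapper and with the generic function it calls
via_read_csv = read.csv(tmp)
via_read_table = read.table(tmp, header = TRUE, sep = ",")

# Compare the two results
identical(via_read_csv, via_read_table)
```

Typing read.csv (without brackets) at the console prints its source code, which shows the full set of arguments passed on to read.table.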

Although this section is focussed on reading text files, it demonstrates the wider principle that, for small datasets, the speed and flexibility advantages of additional read functions can be offset by the disadvantages of an additional package dependency (in terms of complexity and maintaining the code). The real benefits kick in on large datasets. Of course, there are some data types that require a certain package to load in R: the readstata13 package, for example, was developed solely to read in .dta files generated by versions of Stata 13 and above.

Figure 5.2 demonstrates that the relative performance gains of the data.table and readr approaches increase with data size, especially so for data with many rows. Below around 1 MB read.delim is actually faster than read_csv while fread is much faster than both, although these savings are likely to be inconsequential for such small datasets.

For files beyond 100 MB in size fread and read_csv can be expected to be around 5 times faster than read.delim. This efficiency gain may be inconsequential for a one-off file of 100 MB running on a fast computer (which still takes less than a minute with read.csv), but could represent an important speed-up if you frequently load large text files.

When tested on a large (4 GB) .csv file it was found that fread and read_csv were almost identical in load times and that read.csv took around 5 times longer. Loading the file consumed more than 10 GB of RAM, making it unsuitable to run on many computers (see Section 8.3 for more on memory). Note that both readr and base methods can be made significantly faster by pre-specifying the column types at the outset (see below). Further details are provided by the help in ?read.table.

read.csv(file_name, colClasses = c("numeric", "numeric"))
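
Pre-specifying column types affects the result as well as the speed. The following sketch uses a tiny, made-up file with an identifier column whose leading zeros matter; the file contents are an assumption for illustration.

```r
# A two-row example file with an identifier and a numeric value
tmp = tempfile(fileext = ".csv")
writeLines(c("id,value", "001,1.5", "002,2.5"), tmp)

# Default: R guesses the types, so the leading zeros in 'id' are lost
guessed = read.csv(tmp)
sapply(guessed, class)

# Pre-specified: no guessing is needed and 'id' is kept as text
specified = read.csv(tmp, colClasses = c("character", "numeric"))
sapply(specified, class)
```

colClasses must supply one class per column (or a single value recycled to all columns), so it is worth inspecting the first few lines of a file before setting it.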

In some cases with R programming there is a trade-off between speed and robustness. This is illustrated below with reference to differences in how the readr, data.table and base R approaches handle unexpected values. Table 5.4 shows that read_tsv is around 3 times faster than the base function read.delim, reinforcing the point (made with Figure 5.2) that the benefits of efficient functions increase with dataset size. This is a small (1 MB) dataset: the relative difference between fread and read_ functions will tend to decrease as dataset size increases.

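library("microbenchmark")
library("readr")
library("data.table")
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
res_v = microbenchmark(times = 10,
  base_read = voyages_base <- read.delim(fname),  # base R
  readr_read = voyages_readr <- read_tsv(fname),  # readr
  dt_fread = voyages_dt <- fread(fname))          # data.table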

Table 5.4: Execution time of base, readr and data.table functions for reading in a 1 MB dataset relative to the mean execution time of fread, around 0.02 seconds on a modern computer.

| Function | min | mean | max |
|---|---|---|---|
| base_read | 10.7 | 11.1 | 11.4 |
| dt_fread | 1.0 | 1.0 | 1.0 |

The benchmark above produces warning messages (not shown) for the read_tsv and fread functions but not the slowest base function read.delim. An exploration of these functions can shed light on the speed/robustness trade-off.

• The readr function read_tsv generates a warning for row 2841 in the built variable. This is because read_*() decides what class each variable is based on the first 1000 rows, rather than all rows, as base read.* functions do.

As illustrated by printing the result for the row which generated a warning, the read_tsv output is more sensible than the read.delim output: read.delim coerced the date field into a factor based on a single text entry, whereas read_tsv coerced the variable into a numeric vector, as illustrated below.

class(voyages_base$built) # coerced to a factor
#> [1] "factor"
class(voyages_readr$built) # numeric based on first 1000 rows
#> [1] "numeric"
voyages_base$built[2841] # contains the text responsible for coercion
#> [1] 1721-01-01
#> 182 Levels: 1 784 1,86 1135 1594 1600 1612 1613 1614 1615 1619 ... taken 1672
voyages_readr$built[2841] # an NA: text cannot be converted to numeric
#> [1] NA

• The data.table function fread generates 5 warning messages stating that columns 2, 4, 9, 10 and 11 were Bumped to type character on data row ..., with the offending rows printed in place of .... Instead of changing the offending values to NA, as readr does for the built column (9), fread automatically converts any columns it thought of as numeric into characters. An additional feature of fread is that it can read in a selection of the columns, either by their index or name, using the select argument. This is illustrated below by reading in only half (the first 11) of the columns from the voyages dataset and comparing the result with fread’ing all the columns in.

microbenchmark(times = 5,
  with_select = fread(fname, select = 1:11),
  without_select = fread(fname)
)
#> Unit: milliseconds
#>            expr  min   lq mean median   uq  max neval
#>     with_select 10.5 10.6 10.6   10.6 10.7 10.7     5
#>  without_select 19.0 19.0 19.1   19.1 19.2 19.3     5

To summarise, the differences between base, readr and data.table functions for reading in data go beyond code execution times. The functions read_csv and fread boost speed partially at the expense of robustness because they decide column classes based on a small sample of available data. The similarities and differences between the approaches are summarised for the Dutch shipping data (described in a note at the beginning of this section) in Table 5.5.

Table 5.5: Column classes produced by different functions when reading in the 1 MB voyages dataset.

| Function | number | boatname | built | departure_date |
|---|---|---|---|---|
| base_read | integer | factor | factor | factor |
| dt_fread | integer | character | character | character |

Table 5.5 shows 4 main similarities and differences between the three types of read function:

• For uniform data such as the ‘number’ variable in Table 5.5, all reading methods yield the same result (integer in this case).
• For columns that are obviously characters such as ‘boatname’, the base method results in factors (unless stringsAsFactors is set to FALSE, which became the default in R 4.0.0) whereas fread and read_csv functions return characters.
• For columns in which the first 1000 rows are of one type but which contain anomalies, such as ‘built’ and ‘departure_date’ in the shipping example, fread coerces the result to characters. read_csv and siblings, by contrast, keep the class that is correct for the first 1000 rows and set the anomalous records to NA. This is illustrated in Table 5.5, where read_tsv produces a numeric class for the ‘built’ variable, ignoring the non-numeric text in row 2841.
• read_* functions generate objects of class tbl_df, an extension of the data.frame, as discussed in Section 6.4. fread generates objects of class data.table. These can be used as standard data frames but differ subtly in their behaviour.
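
Whichever function is used, it can be useful to locate the anomalous entries that trigger this kind of coercion. The base R sketch below does so for a tiny, made-up file; the column names mimic the voyages data but the values are hypothetical.

```r
# A small example file in which one 'built' entry is text rather than a year
tmp = tempfile(fileext = ".csv")
writeLines(c("number,built", "1,1721", "2,1730", "3,unknown", "4,1745"), tmp)

# Read every column as text so that nothing is coerced or dropped
raw = read.csv(tmp, colClasses = "character")

# Attempt the conversion and record which non-missing entries fail
built_num = suppressWarnings(as.numeric(raw$built))
bad_rows = which(is.na(built_num) & !is.na(raw$built))

bad_rows            # row indices of the anomalous entries
raw$built[bad_rows] # the offending text
```

Reading as text first trades some speed for transparency: nothing is converted until you have seen which values would be lost.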

The wider point associated with these tests is that functions that save time can also lead to additional considerations or complexities in your workflow. Taking a look at what is going on ‘under the hood’ of fast functions to increase speed, as we have done in this section, can help understand the knock-on consequences of choosing fast functions over slower functions from base R.

### 5.8.1 Preprocessing outside R

There are circumstances when datasets become too large to read directly into R. Reading in a 4 GB text file using the functions tested above, for example, consumed all available RAM on a 16 GB machine! To overcome the limitation that R reads all data directly into RAM, external stream processing tools can be used to preprocess large text files. The following command, using the shell command split, for example, would break a large multi-GB file into many 100 MB chunks, each of which is more manageable for R:

split -b100m bigfile.csv

The result is a series of files, set to 100 MB each with the -b100m argument in the above code. By default these will be called xaa, xab and so on, and could be read in one chunk at a time (e.g. using read.csv, fread or read_csv, described in the previous section) without crashing most modern computers.
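
Chunk-wise processing can also be done from within R by reading a fixed number of lines at a time from an open connection. This is a different technique from the split command above, sketched here under the assumption of a small stand-in file and a simple running total; note that split -b divides files by bytes and so can cut a row in two, whereas reading by lines keeps rows intact.

```r
# Create a small example file standing in for a very large one
big_file = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), big_file, row.names = FALSE)

chunk_size = 4 # number of rows to hold in memory at once
total = 0      # running result accumulated across chunks

con = file(big_file, open = "r")
col_names = strsplit(readLines(con, n = 1), ",")[[1]] # consume the header line
col_names = gsub('"', "", col_names)                  # strip quotes added by write.csv

repeat {
  lines = readLines(con, n = chunk_size)
  if (length(lines) == 0) break                       # end of file reached
  chunk = read.csv(text = lines, header = FALSE, col.names = col_names)
  total = total + sum(chunk$value)                    # process, then discard the chunk
}
close(con)

total
```

Only one chunk is held in memory at a time, so this pattern suits summaries that can be accumulated incrementally, such as sums and counts.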

Splitting a large file into individual chunks may allow it to be read into R. This is not an efficient way to import large datasets, however, because each chunk is a non-random sample of the data. A more efficient way to work with very large datasets is via databases, covered in the next chapter.

1. Since R 3.2.3 the base function download.file() can be used to download from secure (https://) connections on any operating system.