5 Efficient input/output
Input/output (I/O) is the process of getting information into a particular computer system (in this case R) and then exporting it to the ‘outside world’ again (in this case as a file format that other software can read). Data I/O will be needed on projects where data comes from, or goes to, external sources. Yet the majority of R resources and documentation start with the assumption that your data has already been loaded. But importing datasets into R, and exporting them to the world outside the R ecosystem, can be a time-consuming and frustrating process. If the process is tricky, slow or ultimately unsuccessful, this can represent a major inefficiency right at the outset of a project. Conversely, reading and writing your data efficiently will make it much easier for your R projects to interact with the outside world. This chapter explains how to efficiently read a wide range of datasets into R.
With the accelerating digital revolution and growth in open data, an increasing proportion of the world’s data can be downloaded from the internet. This trend is set to continue. Downloading and importing data from the web is therefore covered first. Next we briefly outline two developments for efficient data import: the rio package and the
.feather data format. Benchmarks throughout the chapter demonstrate that choice of file format and packages for data I/O can have a huge impact on computational efficiency. The chapter finishes with an exploration of how functions for reading in files stored in common plain text file formats from the readr and data.table packages can improve load speeds when working with these files.
Before reading in a single line of data, however, it is worth considering a general principle for reproducible data management: never modify raw data files. Raw data should be seen as read-only, and contain information about its provenance. Keeping the original file name and commenting on its origin are a couple of ways to improve reproducibility, even when the data are not publicly available.
5.1 Top 5 tips for efficient data I/O
Keep the names of local files download from the internet unchanged. This will help you traces the provenance of the data in the future.
R’s native file format is
.Rds. These files can imported and exported using
saveRDSfor fast and space efficient data storage.
import()from the rio package to efficiently import data from a wide range of formats, avoiding the hassle of loading format-specific libraries.
Use readr or data.table versions of
read.table()to efficiently import large text files.
object.size()to keep track of the size of files and R objects and take action if they get too big.
5.2 Getting data from the internet
The code chunk below shows how the functions
unzip can be used to download and unzip a dataset from the internet. R can automate processes that are often performed manually, e.g. through the graphical user interface of a web browser, with potential advantages for reproducibility and programmer efficiency. The result is data stored neatly in the
data directory ready to be imported. Note we deliberately kept the file name intact help with documentation, enhancing understanding of the data’s provenance. Note also that part of the dataset is stored in the efficient package.
Using R for basic file management can help create a reproducible workflow, as illustrated below. The data downloaded in the following code chunk is a multi-table dataset on Dutch naval expeditions used with permission from the CWI Database Architectures Group and described more fully at monetdb.org. From this dataset we primarily use the ‘voyages’ table with lists Dutch shipping expeditions by their date of departure.
url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip" download.file(url, "voc_tsvs.zip") # download file unzip("voc_tsvs.zip", exdir = "data") # unzip files file.remove("voc_tsvs.zip") # tidy up by removing the zip file
This workflow equally applies to downloading and loading single files. Note that one could make the code more concise by entering replacing the second line with
df = read.csv(url). However, we recommend downloading the file to disk so that if for some reason it fails (e.g. if you would like to skip the first few lines), you don’t have to keep downloading the file over and over again. The code below downloads and loads data on atmospheric concentrations of CO2. Note that this dataset is also available from the datasets package.
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv" download.file(url, "data/co2.csv")
df_co2 = read.csv("data/co2.csv")
There are now many R packages to assist with the download and import of data. The organisation ROpenSci supports a number of these. The example below illustrates this using the WDI package (not supported by ROpenSci) to accesses World Bank data on CO2 emissions in the transport sector:
library("WDI") # load the WDI library (must be installed) WDIsearch("CO2") # search for data on a topic co2_transport = WDI(indicator = "EN.CO2.TRAN.ZS") # import data
There will be situations where you cannot download the data directly or when the data cannot be made available. In this case, simply providing a comment relating to the data’s origin (e.g.
# Downloaded from http://example.com) before referring to the dataset can greatly improve the utility of the code to yourself and others.
5.3 Versatile data import with rio
rio is is a ‘A Swiss-Army Knife for Data I/O’, providing easy-to-use and highly performant wrapper functions for importing a range of file formats. At the time of writing, these include
.xlsx and Google Sheets (see the package’s github page for up-to-date information). Below we illustrate three of rio’s key functions.
library("rio") # Specify a file fname = system.file("extdata/voc_voyages.tsv", package = "efficient") # Import the file (uses the fread function from data.table) voyages = import(fname) # Export the file as an Excel spreadsheet export(voyages, "output.xlsx")
The ability to import and use
.json data is becoming increasingly common as it a standard output format for many APIs and the jsonlite and geojsonio packages have been developed to make this as easy as possible.
The final line in the code chunk above shows a neat feature of rio and some other packages: the output format is determined by the suffix of the file-name, which make for concise code. Try opening the
output.xlsx file with an editor such as LibreOffice Calc or Microsoft Excel to ensure that the export worked, before removing this rather inefficient and non-secure file format from your system to preserve precious disk space:
5.4 Accessing data stored in packages
Most well documented packages provide some example data for you to play with. This can help demonstrate use cases in specific domains, that uses a particular data format. The command
data(package = "package_name") will show the datasets in a package. Datasets provided by dplyr, for example, can be viewed with
data(package = "dplyr").
Raw data (i.e. data which has not been converted into R’s native
.Rds format) is usually located with the sub-folder
extdata in R (which corresponds to
inst/extdata when developing packages. The function
system.file outputs file paths associated with specific packages. To see all of the external files within the readr package, for example, one could use the following command:
list.files(system.file("extdata", package = "readr")) #>  "compound.log" "epa78.txt" "example.log" #>  "fwf-sample.txt" "massey-rating.txt" "mtcars.csv" #>  "mtcars.csv.bz2" "mtcars.csv.zip"
Further, to ‘look around’ to see what files are stored in a particular package, one could type the following, taking advantage of RStudio’s intellisense file completion capabilities (using copy and paste to enter the file path):
system.file(package = "readr") #>  "/home/robin/R/x86_64-pc-linux-gnu-library/3.3/readr" "/home/robin/R/x86_64-pc-linux-gnu-library/3.3/readr/"
Tab after the second command should trigger RStudio to create a miniature pop-up box listing the files within the folder, as illustrated in figure 5.1.
5.5 The feather file format
Feather was developed as collaboration between R and Python developers to create a fast, light and language agnostic format for storing data frames. The code chunk below shows how it can be used to save and then re-load the
df_co2 dataset loaded previously in both R and Python:
library("feather") write_feather(df_co2, "data/co2.feather") df_co2_feather = read_feather("data/co2.feather")
import feather import feather path = 'data/co2.feather' df_co2_feather = feather.read_dataframe(path)
5.6 Efficient data export: .Rdata or .Rds?
Once you have tidied you data (described in the next Section), it will hopefully be suitably ship shape to save. Beyond the raw data, which should also be saved, saving it after tidying is recommended to reduce the chance of having to run all the data cleaning code again. However, it may also make sense to save your data in a new format early on, not least because read and write speeds of proprietary formats can be very slow. A large
.shp file, for example, can take more than ten times longer to load than a
.RData are R’s native file format. This is a binary file format optimised for speed and compression ratios. But what is the difference between them? The follow code chunk demonstrates the key difference between these two (but surprisingly little known and used) file formats:
save(df_co2, file = "data/co2.RData") saveRDS(df_co2, "data/co2.Rds") load("data/co2.RData") df_co2_rds = readRDS("data/co2.Rds") identical(df_co2, df_co2_rds) #>  TRUE
The first method is the most widely used. It uses uses the
save function which takes any number of R objects and writes them to a file, which must be specified by the
file = argument.
save is like
save.image, which saves all the objects currently loaded in R.
The second method is slightly less used but we recommend it. Apart from being slightly more concise for saving single R objects, the
readRDS function is more flexible: as shown in the subsequent line, the resulting object can be assigned to any name. In this case we called it
df_co2_rds (which we show to be identical to
df_co2, loaded with the
load command) but we could have called it anything or simply printed it to the console.
saveRDS is good practice because it forces you to specify object names. If you use
save without care, you could forget the names of the objects you saved and accidentally overwrite objects that already existed.
How space efficient are these file export methods? We can explore this question using the functions
file.size, as illustrated below. The results, which also show how the relative space saving of native R formats increase with dataset size, are shown in Table 5.1.
files_co2 = list.files(path = "data", pattern = "co2.", full.names = TRUE) filesize_co2 = data.frame( Format = gsub(pattern = "data/", replacement = "", files_co2), Size = file.size(files_co2) / 1000 )
|Format||Size||Rel||Size (10x)||Rel (10x)||Size (100x)||Rel (100x)||Size (1000x)||Rel (1000x)|
The results of this simple disk usage benchmark show the advantages of saving data in a compressed binary format can be great, from hard-disk and, if your data will be shared on-line, data download time and bandwidth usage perspectives. It is striking to note that R’s native formats can be over 100 times more space efficient than plain text (.csv) and other binary (
.feather) formats. But how does each method compare from a computational efficiency perceptive?
5.7 A benchmark of methods for file import and export
|Function||Time||Rel||Time (10x)||Rel (10x)||Time (100x)||Rel (100x)||Time (1000x)||Rel (1000x)|
|Function||Time||Rel||Time (10x)||Rel (10x)||Time (100x)||Rel (100x)||Time (1000x)||Rel (1000x)|
The results show that the relative size of different formats is not a reliable predictor of data read and write times. This is due to the computational overheads of compression. Although the binary
.feather format did not perform well in terms of read and write times, the function
read_feather is faster than R’s native functions for saving
.RData formats, for the datasets used in the benchmark.
write_feather is also faster than
saveRDS for all but the largest dataset. In all cases,
write.csv is several times slower than the binary formats and this relative slowness worsens with increasing dataset size. In the next section we explore the performance of alternatives to these base R functions for reading and writing plain text data files.
5.8 Fast import of plain text formats
There is often more than one way to read data into R. A simple
.csv, for example, file can be imported using a wide range of methods, with implications for computational efficiency. This section investigates methods for getting data into R, with a focus on delimited text formats, as these are ubiquitous, and a focus on three approaches: base R’s plain text reading functions such as
read.delim, which are derived from
read.table; the data.table approach, which uses the function
fread; and the newer readr package which provides
read_csv and other
read_ functions such as
Note that a function ‘derived from’ another in this context means that it calls another function. The functions such as
read.delim in fact are wrappers for the more generic function
read.table. This can be seen in the source code of
read.csv, for example, which shows that the function is roughly the equivalent of
read.table(file, header = TRUE, sep = “,”).
Although this section is focussed on reading text files, it demonstrate the wider principle that the speed and flexibility advantages of additional read functions can be offset by the disadvantages of addition package dependency (in terms of complexity and maintaining the code) for small datasets. The real benefits kick in on large datasets. Of course, there are some data types that require a certain package to load in R: the readstata13 package, for example, was developed solely to read in
.dta files generated by versions of Stata 13 and above.
Figure 5.2 demonstrates that the relative performance gains of the data.table and readr approaches increase with data size, especially so for data with many rows. Below around 1 MB
read.delim is actually faster than
fread is much faster than both, although these savings are likely to be inconsequential for such small datasets.
For files beyond 100 MB in size
read_csv can be expected to be around 5 times faster than
read.delim. This efficiency gain may be inconsequential for a one-off file of 100 MB running on a fast computer (which still takes less than a minute with
read.csv), but could represent an important speed-up if you frequently load large text files.
When tested on a large (4 GB) .csv file it was found that
read_csv were almost identical in load times and that
read.csv took around 5 times longer. This consumed more than 10 GB of RAM, making it unsuitable to run on many computers (see Section 8.3 for more on memory). Note that both readr and base methods can be made significantly faster by pre-specifying the column types at the outset (see below). Further details are provided by the help in
read.csv(file_name, colClasses = c("numeric", "numeric"))
In some cases with R programming there is a trade-off between speed and robustness. This is illustrated below with reference to differences in how readr, data.table and base R approaches handle unexpected values. Table 5.4 shows that
read_tsv is around 3 times faster, re-enforcing the point that the benefits of efficient functions increase with dataset size (made with Figure 5.2). This is a small (1 MB) dataset: the relative difference between
read_ functions will tend to decrease as dataset size increases.
library("microbenchmark") library("readr") library("data.table") fname = system.file("extdata/voc_voyages.tsv", package = "efficient") res_v = microbenchmark(times = 10, base_read = voyages_base <- read.delim(fname), readr_read = voyages_readr <- read_tsv(fname), dt_fread = voyages_dt <- fread(fname))
The benchmark above produces warning messages (not shown) for the
fread functions but not the slowest base function
read.delim. An exploration of these functions can shed light on the speed/robustness trade-off.
- The readr function
read_csvgenerates a warning for row 2841 in the
builtvariable. This is because
read_*()decides what class each variable is based on the first 1000 rows, rather than all rows, as base
As illustrated by printing the result for the row which generated a warning, the
read_tsv output is more sensible than the
read.delim coerced the date field into a factor based on a single entry which is a text.
read_tsv coerced the variable into a numeric vector, as illustrated below.
class(voyages_base$built) # coerced to a factor #>  "factor" class(voyages_readr$built) # numeric based on first 1000 rows #>  "numeric" voyages_base$built # contains the text responsible for coercion #>  1721-01-01 #> 182 Levels: 1 784 1,86 1135 1594 1600 1612 1613 1614 1615 1619 ... taken 1672 voyages_readr$built # an NA: text cannot be converted to numeric #>  NA
- The data.table function
freadgenerates 5 warning messages stating that columns 2, 4, 9, 10 and 11 were
Bumped to type character on data row ..., with the offending rows printed in place of
.... Instead of changing the offending values to
NA, as readr does for the
freadautomatically converts any columns it thought of as numeric into characters. An additional feature of
freadis that it can read-in a selection of the columns, either by their index or name, using the
selectargument. This is illustrated below by reading in only half (the first 11) columns from the voyages dataset and comparing the result with
fread’ing all the columns in.
microbenchmark(times = 5, with_select = fread(fname, select = 1:11), without_select = fread(fname) ) #> Unit: milliseconds #> expr min lq mean median uq max neval #> with_select 10.5 10.6 10.6 10.6 10.7 10.7 5 #> without_select 19.0 19.0 19.1 19.1 19.2 19.3 5
To summarise, the differences between base, readr and data.table functions for reading in data go beyond code execution times. The functions
fread boost speed partially at the expense of robustness because they decide column classes based on a small sample of available data. The similarities and differences between the approaches are summarised for the Dutch shipping data (described in a note at the beginning of this section) in Table 5.5.
Table 5.5 shows 4 main similarities and differences between the three read types of read function:
- For uniform data such as the ‘number’ variable in Table 5.5, all reading methods yield the same result (integer in this case).
- For columns that are obviously characters such as ‘boatname’, the base method results in factors (unless
stringsAsFactorsis set to
read_csvfunctions return characters.
- For columns in which the first 1000 rows are of one type but which contain anomalies, such as ‘built’ and ‘departure_data’ in the shipping example,
freadcoerces the result to characters.
read_csvand siblings, by contrast, keep the class that is correct for the first 1000 rows and sets the anomalous records to
NA. This is illustrated in 5.5, where
numericclass for the ‘built’ variable, ignoring the non numeric text in row 2841.
read_*functions generate objects of class
tbl_df, an extension of the
data.frame, as discussed in Section 6.4.
freadgenerates objects of class
data.table. These can be used as standard data frames but differ subtly in their behaviour.
The wider point associated with these tests is that functions that save time can also lead to additional considerations or complexities your workflow. Taking a look at what is going on ‘under the hood’ of fast functions to increase speed, as we have done in this section, can help understand the knock-on consequences of choosing fast functions over slower functions from base R.
5.8.1 Preprocessing outside R
There are circumstances when datasets become too large to read directly into R. Reading in 4 GB text file using the functions tested above, for example, consumed all available RAM on an 16 GB machine! To overcome the limitation that R reads all data directly into RAM, external stream processing tools can be used to preprocess large text files. The following command, using the shell command
split, for example, would break a large multi GB file many one GB chunks, each of which is more manageable for R:
split -b100m bigfile.csv
The result is a series of files, set to 100 MB each with the
-b100m argument in the above code. By default these will be called
xab and which could be read in one chunk at a time (e.g. using
read_csv, described in the previous section) without crashing most modern computers.
Splitting a large file into individual chunks may allow it to be read into R. This is not an efficient way to import large datasets, however, because it results in a non-random sample of the data this way. A more efficient way to work with very large datasets is via databases, covered in the next chapter.
Since R 3.2.3 the base function
download.file()can be used to download from secure (
https://) connections on any operating system.↩