10 Efficient Learning

Berkun, Scott. 2005. The Art of Project Management. O’Reilly.

Braun, John, and Duncan J Murdoch. 2007. A First Course in Statistical Programming with R. Vol. 25. Cambridge University Press.

Burns, Patrick. 2011. The R Inferno. Lulu.com.

Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.

Eddelbuettel, Dirk. 2010. “Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for Use with R.” Mathematica.

Eddelbuettel, Dirk, Romain François, J. Allaire, John Chambers, Douglas Bates, and Kevin Ushey. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18.

Goldberg, David. 1991. “What Every Computer Scientist Should Know About Floating-Point Arithmetic.” ACM Computing Surveys (CSUR) 23 (1). ACM: 5–48.

Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. 1st edition. O’Reilly Media.

Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds.” PVLDB Challenges and Visions 3.

PMBoK, A. 2000. “Guide to the Project Management Body of Knowledge.” Project Management Institute, Pennsylvania USA.

Sekhon, Jasjeet S. 2006. “The Art of Benchmarking: Evaluating the Performance of R on Linux and OS X.” The Political Methodologist 14 (1): 15–19.

Wickham, Hadley. 2014a. Advanced R. CRC Press.

———. 2014b. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10 http://vita.had.co.nz/papers/tidy-data.html.

———. 2015. R Packages. O’Reilly Media, Inc.

  1. Benchmarking conducted for a presentation “R on Different Platforms” at useR! 2006 found that R was marginally faster on Windows than on Linux set-ups. Similar results were reported in an academic paper, with R completing statistical analyses faster on Linux than on Mac OS (Sekhon 2006). In 2015 Revolution R supported these results with slightly faster run times for certain benchmarks on Ubuntu than on Mac systems. The data from the benchmarkme package also suggest that running code under Linux is faster.

  2. See jason-french.com/blog/2013/03/11/installing-r-in-linux/ for more information on installing R on a variety of Linux distributions.

  3. In the previous section we specified only a few packages to update.

  4. See vignette("api-packages") from the httr package for more on this.

  5. Other open source R IDEs exist, including RKWard, Tinn-R and JGR. Emacs is another popular software environment, although it has a very steep learning curve.

  6. ‘Slots’ are sub-elements of an object analogous to a column in a data.frame but referred to with @ not $.

  7. See brodrigues.co/2014/11/11/benchmarks-r-blas-atlas-rro/, which finds Revolution R to be marginally faster than R using OpenBLAS and ATLAS BLAS implementations, and the article “Faster BLAS in R”, which does not.

  8. The Oxford Dictionary’s definition of workflow is similar, with a more industrial feel: “The sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion.”

  9. The importance of workflow has not gone unnoticed by the R community and there are a number of different suggestions to boost R productivity. Rob Hyndman, for example, advocates the strategy of using four self-contained scripts to break up R work into manageable chunks: load.R, clean.R, func.R and do.R.

  10. A number of programs have been developed to assist project management and planning, however. These include ProjectLibre and GanttProject.

  11. For a more comprehensive discussion of Gantt charts in R, please refer to stackoverflow.com/questions/3550341.

  12. An excellent overview of the ‘hadleyverse’ and its benefits is available from barryrowlingson.github.io/hadleyverse.

  13. Since R 3.2.3 the base function download.file() can be used to download from secure (https://) connections on any operating system.

  14. This is a multi-table dataset on Dutch naval expeditions used with permission from the CWI Database Architectures Group and described more fully at monetdb.org.

  15. Note that the dimensions of the data change from 18 observations across 10 columns to 162 rows in only 3 columns. When we print the object rawt[1:3,], the class of each variable is given (chr, fctr and int refer to character, factor and integer classes, respectively). This is because read_csv uses the tbl class from the dplyr package (described below).

  16. Note that in this code block the variable name is surrounded by back-quotes (`). This allows R to refer to column names that are non-standard. Note also the syntax: rename() takes the data.frame as the first object and then creates new variables by specifying new_variable_name = original_name.

  17. Note that this syntax is a defining feature of dplyr and many of its functions work in the same way. Later we’ll learn how this syntax can be used alongside the %>% ‘pipe’ command to write clear data manipulation commands.

  18. Note the first argument in the function is the vector we’re aiming to aggregate and the second is the grouping variable (in this case Countries). A quirk of R is that the grouping variable must be supplied as a list. Next we’ll see a way of writing this that is neater.

  19. One question on Stack Overflow titled ‘data.table vs dplyr’ has received much attention and sets out the advantages of each approach. The question and its answers do not settle the issue, and some responses may now be out of date, but they make for interesting reading and delve into the philosophy underlying each approach.

  20. The authors have yet to find a situation where byte compiled code runs significantly slower.
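
As a small illustration of footnote 6, the sketch below defines a toy S4 class and reads its slots with @, next to the comparable $ access on a data.frame. The class name Person and the slots name and age are hypothetical, chosen only for this example.

```r
library(methods)  # usually attached already; loaded explicitly for script use

# Hypothetical S4 class with two slots
setClass("Person", representation(name = "character", age = "numeric"))
p <- new("Person", name = "Ada", age = 36)

p@name          # slot access uses @
slot(p, "age")  # functional form of the same access
slotNames(p)    # lists the slots defined for the object

# The analogous access on a data.frame column uses $
df <- data.frame(name = "Ada", age = 36)
df$name
```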
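
The four-script strategy mentioned in footnote 9 can be sketched as follows. Only the file names load.R, clean.R, func.R and do.R come from the footnote; the contents of each script, the temporary project directory and the order in which the scripts are sourced are assumptions made for this illustration.

```r
# Create a throwaway project directory so the sketch is self-contained
project_dir <- file.path(tempdir(), "demo_project")
dir.create(project_dir, showWarnings = FALSE)

# Hypothetical contents for each of the four scripts
writeLines("raw <- data.frame(x = c(1, NA, 3))",
           file.path(project_dir, "load.R"))
writeLines("clean <- raw[!is.na(raw$x), , drop = FALSE]",
           file.path(project_dir, "clean.R"))
writeLines("total <- function(d) sum(d$x)",
           file.path(project_dir, "func.R"))
writeLines("result <- total(clean)",
           file.path(project_dir, "do.R"))

# Run the scripts in sequence inside one environment
env <- new.env()
for (script in c("load.R", "clean.R", "func.R", "do.R")) {
  source(file.path(project_dir, script), local = env)
}
env$result
```

Sourcing into a dedicated environment keeps the objects created by the scripts out of the global workspace, which is one possible design choice rather than part of the original suggestion.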
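
The change in dimensions described in footnote 15 follows from simple arithmetic: a wide table with one identifier column and nine value columns over 18 rows holds 18 × 9 = 162 values, so the long form has 162 rows and 3 columns. The sketch below reproduces this with synthetic data and base R only. The column names and values are made up, and the manual reshape stands in for whichever reshaping function the main text uses.

```r
# Synthetic wide table: 18 rows, 1 id column + 9 value columns
values <- matrix(seq_len(18 * 9), nrow = 18,
                 dimnames = list(NULL, paste0("bracket_", 1:9)))
wide <- data.frame(id = paste0("group_", 1:18), values)
dim(wide)

# Manual wide-to-long reshape: one row per (id, key) pair
long <- data.frame(
  id    = rep(wide$id, times = 9),
  key   = rep(names(wide)[-1], each = 18),
  value = unlist(wide[-1], use.names = FALSE)
)
dim(long)
```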
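
To make footnote 16 concrete, the following sketch builds a data.frame whose column names are non-standard and refers to them with back-quotes. The data and column names are invented for the example. The rename step uses dplyr::rename() when dplyr is installed and otherwise falls back to base R, so the fallback branch is an assumption of this sketch rather than part of the original code.

```r
# check.names = FALSE keeps the non-standard names as typed
df <- data.frame(`Country Name` = c("Kenya", "Peru"),
                 `2015 total`   = c(1.5, 2.5),
                 check.names = FALSE)

df$`Country Name`  # back-quotes allow names with spaces
df$`2015 total`    # ...and names that start with a digit

if (requireNamespace("dplyr", quietly = TRUE)) {
  # new_variable_name = original_name, with the data.frame first
  renamed <- dplyr::rename(df,
                           country    = `Country Name`,
                           total_2015 = `2015 total`)
} else {
  # Base R fallback with the same effect
  renamed <- df
  names(renamed) <- c("country", "total_2015")
}
names(renamed)
```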
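
The quirk noted in footnote 18 can be seen in a minimal sketch using base R's aggregate(). The vectors emissions and countries are hypothetical stand-ins for the data in the main text: the first argument is the vector to aggregate and the second is the grouping variable, which must be wrapped in a list.

```r
emissions <- c(10, 20, 30, 40)
countries <- c("A", "B", "A", "B")

# Grouping variable supplied as a (named) list
res <- aggregate(emissions, by = list(Country = countries), FUN = sum)
res

# Supplying the bare vector instead of a list fails
err <- tryCatch(
  aggregate(emissions, by = countries, FUN = sum),
  error = function(e) conditionMessage(e)
)
err
```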
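
Readers who want to check the claim in footnote 20 on their own machine could start from a sketch like the one below, which uses the compiler package shipped with R. The function mean_loop and the input size are arbitrary choices for illustration. Recent versions of R byte compile functions automatically, so the just-in-time compiler is switched off for the duration of the comparison. Timings vary between machines, so none are shown.

```r
library(compiler)

old_jit <- enableJIT(0)  # turn off JIT so the plain version stays uncompiled

# Deliberately loop-heavy function (hypothetical example)
mean_loop <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total / length(x)
}
mean_loop_cmp <- cmpfun(mean_loop)  # explicitly byte compiled copy

x <- as.numeric(1:1e5)
t_plain <- system.time(for (i in 1:20) mean_loop(x))["elapsed"]
t_cmp   <- system.time(for (i in 1:20) mean_loop_cmp(x))["elapsed"]

enableJIT(old_jit)  # restore the previous JIT level

c(plain = unname(t_plain), compiled = unname(t_cmp))
```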