Benchmarking text file parsers
Another notable alternative for handling and loading reasonably sized data from flat files into R is the data.table package. Although it has a unique syntax that differs from the traditional S-based R syntax, the package comes with great documentation, vignettes, and case studies on the impressive speedup it can offer for various database-like operations. These use cases and examples will be discussed in Chapter 3, Filtering and Summarizing Data, and Chapter 4, Restructuring Data.
The package ships a custom R function to read text files with improved performance:
> library(data.table)
> system.time(dt <- fread('hflights.csv'))
   user  system elapsed 
  0.153   0.003   0.158 
Loading the data was extremely quick compared to the preceding examples, although it resulted in an R object with a custom data.table class, which can easily be transformed to the traditional data.frame if needed:
> df <- as.data.frame(dt)
Or we can use the setDF function, which provides a very fast and in-place method of object conversion without actually copying the data in memory.
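A minimal sketch of this in-place approach follows; the dt2 copy is introduced here only for illustration, so that dt keeps its data.table class for the rest of the chapter:

> dt2 <- copy(dt)  # work on a copy, so that dt itself is left untouched
> setDF(dt2)       # converts dt2 to a data.frame in place, without duplicating the data
> class(dt2)
[1] "data.frame"

Independently of any such conversion, please also note: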
> is.data.frame(dt)
[1] TRUE
This means that a data.table object can fall back to act as a data.frame for traditional usage. Whether to leave the imported data as is or to transform it to a data.frame depends on how it will be used later: aggregating, merging, and restructuring data with the former is faster than with the standard data frame format in R. On the other hand, the user has to learn the custom syntax of data.table; for example, DT[i, j, by] stands for "take DT, subset it by i, then do j grouped by by". We will discuss this later in Chapter 3, Filtering and Summarizing Data.
Now, let's compare all the aforementioned data import methods: how fast are they? The final winner seems to be fread from data.table. First, we define the methods to be benchmarked by declaring the test functions:
> .read.csv.orig <- function() read.csv('hflights.csv')
> .read.csv.opt <- function() read.csv('hflights.csv',
+     colClasses = colClasses, nrows = 227496, comment.char = '',
+     stringsAsFactors = FALSE)
> .read.csv.sql <- function() read.csv.sql('hflights.csv')
> .read.csv.ffdf <- function() read.csv.ffdf(file = 'hflights.csv')
> .read.big.matrix <- function() read.big.matrix('hflights.csv',
+     header = TRUE)
> .fread <- function() fread('hflights.csv')
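Note that these helper functions rely on the packages introduced in the preceding examples of this chapter; if you are starting from a fresh R session, they have to be loaded first, together with microbenchmark for the timing itself, for example:

> library(sqldf)           # provides read.csv.sql
> library(ff)              # provides read.csv.ffdf
> library(bigmemory)       # provides read.big.matrix
> library(data.table)      # provides fread
> library(microbenchmark)  # provides microbenchmark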
Now, let's run all these functions 10 times each, instead of the several hundred iterations used previously, simply to save some time:
> res <- microbenchmark(.read.csv.orig(), .read.csv.opt(),
+   .read.csv.sql(), .read.csv.ffdf(), .read.big.matrix(), .fread(),
+   times = 10)
And print the results of the benchmark with a predefined number of digits:
> print(res, digits = 6)
Unit: milliseconds
               expr      min      lq   median       uq      max neval
   .read.csv.orig() 2109.643 2149.32 2186.433 2241.054 2421.392    10
    .read.csv.opt() 1525.997 1565.23 1618.294 1660.432 1703.049    10
    .read.csv.sql() 2234.375 2265.25 2283.736 2365.420 2599.062    10
   .read.csv.ffdf() 1878.964 1901.63 1947.959 2015.794 2078.970    10
 .read.big.matrix() 1579.845 1603.33 1647.621 1690.067 1937.661    10
           .fread()  153.289  154.84  164.994  197.034  207.279    10
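If we are interested only in the relative ordering rather than the raw timings, the medians can also be scaled to the fastest method; a small sketch, assuming the summary method of microbenchmark returns the timings as a data frame as in current versions of the package:

> s <- summary(res)
> round(s$median / min(s$median), 1)

Based on the medians above, fread loads the file roughly 10 to 14 times faster than any of the other approaches.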
Please note that we were dealing here with a dataset that fits into actual physical memory, while some of the benchmarked packages are designed and optimized for far larger databases. So it seems that optimizing the read.table function gives a great performance boost over the default settings, but if we are after really fast importing of reasonably sized data, the data.table package is the optimal solution.