Loading text files of a reasonable size
The title of this chapter might also be Hello, Big Data!, as we now concentrate on loading a relatively large amount of data into an R session. But what is Big Data, and what amount of data is problematic to handle in R? What counts as a reasonable size?
R was designed to process data that fits in the physical memory of a single computer, so handling datasets that are smaller than the actual accessible RAM should be fine. But please note that the memory required to process the data might grow during some computations, such as principal component analysis, which should also be taken into account. I will refer to this amount of data as reasonably sized datasets.
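As a quick, hedged illustration of how working memory can exceed the raw data size, the following sketch (using a simulated matrix rather than any dataset introduced in this chapter) compares the footprint of the input with that of the object returned by prcomp; the exact numbers will vary on your system:
> x <- matrix(rnorm(1e6), ncol = 10)          # one million doubles, roughly 8 MB
> print(object.size(x), units = 'MB')
> p <- prcomp(x)                              # the returned object stores the scores too,
> print(object.size(p), units = 'MB')         # so it needs roughly as much memory again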
Loading data from text files is pretty simple with R, and loading any reasonably sized dataset can be achieved by calling the good old read.table function. The only issue here might be the performance: how long does it take to read, for example, a quarter of a million rows of data? Let's see:
> library('hflights')
> write.csv(hflights, 'hflights.csv', row.names = FALSE)
Note
As a reminder, please note that all R commands and the returned output are formatted as earlier in this book. The commands start with > on the first line, and the remainder of multi-line expressions starts with +, just as in the R console. To copy and paste these commands on your machine, please download the code examples from the Packt homepage. For more details, please see the What you need for this book section in the Preface.
Yes, we have just written an 18.5 MB text file to your disk from the hflights package, which includes some data on all flights departing from Houston in 2011:
> str(hflights)
'data.frame': 227496 obs. of  21 variables:
 $ Year             : int  2011 2011 2011 2011 2011 2011 2011 ...
 $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
 $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 ...
 $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 ...
 $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
 $ FlightNum        : int  428 428 428 428 428 428 428 428 428 ...
 $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
 $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
 $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
 $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
 $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
 $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
 $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
 $ Distance         : int  224 224 224 224 224 224 224 224 224 ...
 $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
 $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
 $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr  "" "" "" "" ...
 $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
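If you would like to double-check the size of the exported file on your own disk, a minimal sketch follows; the exact value may differ slightly depending on your platform and line endings:
> file.info('hflights.csv')$size / 1024^2    # file size in megabytes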
Note
The hflights package provides an easy way to load a subset of the huge Airline Dataset of the Research and Innovative Technology Administration at the Bureau of Transportation Statistics. The original database includes the scheduled and actual departure/arrival times of all US flights since 1987, along with some other interesting information, and is often used to demonstrate machine learning and Big Data technologies. For more details on the dataset, please see the column descriptions and other metadata at http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0.
We will use this 21-column dataset to benchmark data import times. For example, let's see how long it takes to import the CSV file with read.csv:
> system.time(read.csv('hflights.csv'))
   user  system elapsed 
  1.730   0.007   1.738 
It took a bit more than one and a half seconds to load the data from an SSD here. That is quite okay, but we can achieve far better results by identifying and then specifying the classes of the columns instead of relying on the default type.convert (see the docs of read.table for more details, or search on StackOverflow, where the performance of read.csv seems to be a rather frequent and popular question):
> colClasses <- sapply(hflights, class)
> system.time(read.csv('hflights.csv', colClasses = colClasses))
   user  system elapsed 
  1.093   0.000   1.092 
It's much better! But should we trust this one observation? On our way to mastering data analysis in R, we should implement more reliable tests by simply replicating the task n times and summarizing the results of the simulation. This approach provides performance data with multiple observations, which can be used to identify statistically significant differences in the results. The microbenchmark package provides a nice framework for such tasks:
> library(microbenchmark)
> f <- function() read.csv('hflights.csv')
> g <- function() read.csv('hflights.csv', colClasses = colClasses,
+     nrows = 227496, comment.char = '')
> res <- microbenchmark(f(), g())
> res
Unit: milliseconds
 expr       min        lq   median       uq      max neval
  f() 1552.3383 1617.8611 1646.524 1708.393 2185.565   100
  g()  928.2675  957.3842  989.467 1044.571 1284.351   100
So we defined two functions: f stands for the default settings of read.csv, while in the g function we passed the aforementioned column classes along with two other parameters for increased performance. The comment.char argument tells R not to look for comments in the imported data file, while the nrows parameter defines the exact number of rows to read from the file, which saves some time and memory on allocation. Setting stringsAsFactors to FALSE might also speed up importing a bit.
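As a hypothetical sketch only (not benchmarked here; note that when colClasses already marks the text columns as character, stringsAsFactors makes little extra difference), such a fully tuned call might look like this:
> h <- function() read.csv('hflights.csv', colClasses = colClasses,
+     nrows = 227496, comment.char = '', stringsAsFactors = FALSE)
> microbenchmark(h(), times = 10)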
Note
Identifying the number of lines in the text file can be done with some third-party tools, such as wc on Unix; a slightly slower alternative is the countLines function from the R.utils package. Both approaches are sketched right after this note.
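A minimal sketch of the two line-counting approaches mentioned above (remember to subtract one for the header row before passing the result to nrows):
> system('wc -l hflights.csv')    # Unix command-line tool
> library(R.utils)
> countLines('hflights.csv')      # pure R alternative, somewhat slower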
But back to the results. Let's also visualize the medians and related descriptive statistics of the test cases, which were run 100 times each by default:
> boxplot(res, xlab = '',
+   main = expression(paste('Benchmarking ', italic('read.table'))))
The difference seems to be significant (please feel free to run some statistical tests to verify that, such as the sketch below), so we achieved a performance boost of more than 50 percent simply by fine-tuning the parameters of read.table.
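One hedged way to run such a test, assuming the long-format layout returned by microbenchmark (an expr column and a time column holding the raw timings in nanoseconds), is a non-parametric comparison of the two timing vectors:
> wilcox.test(time ~ expr, data = res)    # compares the f() and g() timing distributions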
Data files larger than the physical memory
Loading a larger amount of data into R from CSV files that would not fit in memory can be done with custom packages created for such cases. For example, both the sqldf package and the ff package have their own solutions to load data chunk by chunk into a custom data format. The former uses SQLite or another SQL-like database backend, while the latter creates a custom data frame of the ffdf class that can be stored on disk. The bigmemory package provides a similar approach. Some usage examples (with rough timings) follow:
> library(sqldf)
> system.time(read.csv.sql('hflights.csv'))
   user  system elapsed 
  2.293   0.090   2.384 
> library(ff)
> system.time(read.csv.ffdf(file = 'hflights.csv'))
   user  system elapsed 
  1.854   0.073   1.918 
> library(bigmemory)
> system.time(read.big.matrix('hflights.csv', header = TRUE))
   user  system elapsed 
  1.547   0.010   1.559 
Please note that header defaults to FALSE in read.big.matrix from the bigmemory package, so be sure to read the manuals of the referenced functions before running your own benchmarks. Some of these functions also support performance tuning similar to read.table. For further examples and use cases, please see the Large memory and out-of-memory data section of the High-Performance and Parallel Computing with R CRAN Task View at http://cran.r-project.org/web/views/HighPerformanceComputing.html.