Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Mastering Data analysis with R

You're reading from   Mastering Data analysis with R Gain sharp insights into your data and solve real-world data science problems with R—from data munging to modeling and visualization

Arrow left icon
Product type Paperback
Published in Sep 2015
Publisher Packt
ISBN-13 9781783982028
Length 396 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Gergely Daróczi Gergely Daróczi
Author Profile Icon Gergely Daróczi
Gergely Daróczi
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Hello, Data! 2. Getting Data from the Web FREE CHAPTER 3. Filtering and Summarizing Data 4. Restructuring Data 5. Building Models (authored by Renata Nemeth and Gergely Toth) 6. Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth) 7. Unstructured Data 8. Polishing Data 9. From Big to Small Data 10. Classification and Clustering 11. Social Network Analysis of the R Ecosystem 12. Analyzing Time-series 13. Data Around Us 14. Analyzing the R Community A. References Index

Loading a subset of text files

Sometimes we only need some parts of the dataset for an analysis, stored in a database backend or in flat files. In such situations, loading only the relevant subset of the data frame will result in much more speed improvement compared to any performance tweaks and custom packages discussed earlier.

Let's imagine we are only interested in flights to Nashville, where the annual useR! conference took place in 2012. This means we need only those rows of the CSV file where the Dest equals BNA (this International Air Transport Association airport code stands for Nashville International Airport).

Instead of loading the whole dataset in 160 to 2,000 milliseconds (see the previous section) and then dropping the unrelated rows (see in Chapter 3, Filtering and Summarizing Data), let's see the possible ways of filtering the data while loading it.

The already mentioned sqldf package can help with this task by specifying a SQL statement to be run on the temporary SQLite database created for the importing task:

> df <- read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = '\"BNA\"'")

This sql argument defaults to "select * from file", which means loading all fields of each row without any filters. Now we extended that with a filter statement. Please note that in our updated SQL statements, we also added the double quotes to the search term, as sqldf does not automatically recognize the quotes as special; it regards them as part of the fields. One may overcome this issue also by providing a custom filter argument, such as the following example on Unix-like systems:

> df <- read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = 'BNA'",
+   filter = 'tr -d ^\\" ')

The resulting data frame holds only 3,481 observations out of the 227,496 cases in the original dataset, and filtering inside the temporary SQLite database of course speeds up data importing a bit:

> system.time(read.csv.sql('hflights.csv'))
   user  system elapsed 
  2.117   0.070   2.191 
> system.time(read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = '\"BNA\"'"))
   user  system elapsed 
  1.700   0.043   1.745

The slight improvement is due to the fact that both R commands first loaded the CSV file to a temporary SQLite database; this process of course takes some time and cannot be eliminated from this process. To speed up this part of the evaluation, you can specify dbname as NULL for a performance boost. This way, the SQLite database would be created in memory instead of a tempfile, which might not be an optimal solution for larger datasets.

Filtering flat files before loading to R

Is there a faster or smarter way to load only a portion of such a text file? One might apply some regular expression-based filtering on the flat files before passing them to R. For example, grep or ack might be a great tool to do so in a Unix environment, but it's not available by default on Windows machines, and parsing CSV files by regular expressions might result in some unexpected side-effects as well. Believe me, you never want to write a CSV, JSON, or XML parser from scratch!

Anyway, a data scientist nowadays should be a real jack-of-all-trades when it comes to processing data, so here comes a quick and dirty example to show how one could read the filtered data in less than 100 milliseconds:

> system.time(system('cat hflights.csv | grep BNA', intern = TRUE))
   user  system elapsed 
  0.040   0.050   0.082

Well, that's a really great running time compared to any of our previous results! But what if we want to filter for flights with an arrival delay of more than 13.5 minutes?

Another way, and probably a more maintainable approach, would be to first load the data into a database backend, and query that when any subset of the data is needed. This way we could for example, simply populate a SQLite database in a file only once, and then later we could fetch any subsets in a fragment of read.csv.sql's default run time.

So let's create a persistent SQLite database:

> sqldf("attach 'hflights_db' as new")

This command has just created a file named to hflights_db in the current working directory. Next, let's create a table named hflights and populate the content of the CSV file to the database created earlier:

> read.csv.sql('hflights.csv',
+   sql = 'create table hflights as select * from file',
+   dbname = 'hflights_db')

No benchmarking was made so far, as these steps will be run only once, while the queries for sub-parts of the dataset will probably run multiple times later:

> system.time(df <- sqldf(
+   sql = "select * from hflights where Dest = '\"BNA\"'",
+   dbname = "hflights_db"))
   user  system elapsed 
  0.070   0.027   0.097

And we have just loaded the required subset of the database in less than 100 milliseconds! But we can do a lot better if we plan to often query the persistent database: why not dedicate a real database instance for our dataset instead of a simple file-based and server-less SQLite backend?

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image