Loading a subset of text files
Sometimes we only need a subset of a dataset for an analysis, whether it is stored in a database backend or in flat files. In such situations, loading only the relevant part of the data frame will speed things up far more than any of the performance tweaks and custom packages discussed earlier.
Let's imagine we are only interested in flights to Nashville, where the annual useR! conference took place in 2012. This means we need only those rows of the CSV file where the Dest field equals BNA (the International Air Transport Association airport code of Nashville International Airport).
Instead of loading the whole dataset in 160 to 2,000 milliseconds (see the previous section) and then dropping the unrelated rows (see Chapter 3, Filtering and Summarizing Data), let's look at the possible ways of filtering the data while loading it.
The already mentioned sqldf package can help with this task by specifying a SQL statement to be run on the temporary SQLite database created for the importing task:
> df <- read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = '\"BNA\"'")
The sql argument defaults to "select * from file", which means loading all fields of every row without any filter. Now we have extended it with a where clause. Please note that in the updated SQL statement we also added double quotes around the search term, as sqldf does not automatically recognize the quotes as special characters; it regards them as part of the fields. This issue can also be overcome by providing a custom filter argument, as in the following example on Unix-like systems:
> df <- read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = 'BNA'",
+   filter = 'tr -d ^\\" ')
The resulting data frame holds only 3,481 observations out of the 227,496 cases in the original dataset, and filtering inside the temporary SQLite database of course speeds up data importing a bit:
> system.time(read.csv.sql('hflights.csv'))
   user  system elapsed 
  2.117   0.070   2.191 
> system.time(read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = '\"BNA\"'"))
   user  system elapsed 
  1.700   0.043   1.745 
The slight improvement is due to the fact that both R commands first load the CSV file into a temporary SQLite database; this step takes some time and cannot be eliminated. To speed up this part of the evaluation, you can specify dbname as NULL for a performance boost. This way, the SQLite database is created in memory instead of in a tempfile, although this might not be an optimal solution for larger datasets.
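The following minimal sketch shows what the in-memory variant looks like; timings will vary by machine, so none are reproduced here:

## In-memory SQLite database instead of a temporary file: this trades
## memory usage for speed, so it is best suited to smaller datasets.
df <- read.csv.sql('hflights.csv',
    sql    = "select * from file where Dest = '\"BNA\"'",
    dbname = NULL)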
Filtering flat files before loading to R
Is there a faster or smarter way to load only a portion of such a text file? One might apply some regular expression-based filtering on the flat files before passing them to R. For example, grep or ack might be a great tool to do so in a Unix environment, but neither is available by default on Windows machines, and parsing CSV files with regular expressions might result in some unexpected side effects as well. Believe me, you never want to write a CSV, JSON, or XML parser from scratch!
Anyway, a data scientist nowadays should be a real jack-of-all-trades when it comes to processing data, so here comes a quick and dirty example to show how one could read the filtered data in less than 100 milliseconds:
> system.time(system('cat hflights.csv | grep BNA', intern = TRUE))
   user  system elapsed 
  0.040   0.050   0.082 
Well, that's a really great running time compared to any of our previous results! But what if we want to filter for flights with an arrival delay of more than 13.5 minutes?
Another way, and probably a more maintainable approach, would be to first load the data into a database backend and query that whenever a subset of the data is needed. This way we could, for example, populate a SQLite database in a file only once, and later fetch any subset in a fraction of read.csv.sql's default run time.
So let's create a persistent SQLite database:
> sqldf("attach 'hflights_db' as new")
This command has just created a file named hflights_db in the current working directory. Next, let's create a table named hflights and populate it with the content of the CSV file in the database created earlier:
> read.csv.sql('hflights.csv',
+   sql = 'create table hflights as select * from file',
+   dbname = 'hflights_db')
We have not benchmarked anything so far, as these steps will be run only once, while the queries for subsets of the dataset will probably run many times later:
> system.time(df <- sqldf(
+   "select * from hflights where Dest = '\"BNA\"'",
+   dbname = "hflights_db"))
   user  system elapsed 
  0.070   0.027   0.097 
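Returning to the arrival-delay question raised earlier, the same persistent database can serve numeric filters as well. The following is only a sketch: the column name is assumed to match the CSV header, and missing values are likely stored as the literal text 'NA' by this import method, so they are excluded before the numeric comparison:

## Sketch only: ArrDelay is assumed to be the relevant numeric column, and
## rows that were NA in R are stored here as the text 'NA', so filter them
## out before comparing against the 13.5-minute threshold.
df_delayed <- sqldf(
    "select * from hflights where ArrDelay != 'NA' and ArrDelay > 13.5",
    dbname = "hflights_db")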
And we have just loaded the required subset of the database in less than 100 milliseconds! But we can do a lot better if we plan to query the persistent database often: why not dedicate a real database instance to our dataset instead of a simple file-based and server-less SQLite backend?
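As a small preview of that direction, note that the very same file can already be queried through a standard DBI connection, which is the interface a dedicated database server would be accessed through as well. This is a minimal sketch, assuming the DBI and RSQLite packages are installed:

## Not part of the original workflow: querying the persistent SQLite file
## directly via DBI, the common interface shared by server-based backends.
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = 'hflights_db')
df  <- dbGetQuery(con, "select * from hflights where Dest = '\"BNA\"'")
dbDisconnect(con)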