Time for action – statistics about internal commercial flights in the USA from 1998 to 2011
SAP markets their HANA
database as being really, really fast. But how fast is it really, when there is a large amount of data in the system? The author decided to do a couple of tests to find out. Rather than creating some "fake" data, it was decided that using publicly available information would be better. This means that anyone else wanting to do some performance tests can repeat the experiment.
The US government, as a part of their "open" data initiative of making data available for analysis, released all the data concerning commercial flights inside the USA, starting in 1998 and going right through to the present day. You can find this at http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time.
After a couple of days of downloading the data (49 GB of CSV files), tidying it up a bit, and then loading it into SAP HANA and MySQL, it was possible to do a handful of comparisons on the two different database systems. To be fair, it has to be said that the MySQL machine has only 8 GB of memory, but it has an SSD, while the SAP HANA machine has 128 GB, but "real" disks.
As an indication of SAP HANA's data loading performance, each data file of around 1.5 GB and around five million lines, loaded using SAP HANA's native IMPORT FROM CSV FILE instruction, took approximately 45 seconds (on an average). That means around 100000 lines are loaded every second.
The first thing to note is the sheer data volume—we're talking about over 140 million lines of data, which is a reasonable dataset size. Many companies have this much data in their orders
or stock movement
tables, so doing a couple of stress tests on this dataset allows extrapolation on the performance which we can expect when we use SAP HANA for large applications.
Without explaining all the data fields available, we have some interesting information, such as the origin and destination airports, as well as the flight delay at departure and on arrival. With these fields, we can perform some interesting queries.
We'll start by taking a look at something relatively simple: count the number of flights leaving Los Angeles airport, per year. The SQL statement to achieve this is as follows; the syntax is same in any relational database. Don't worry, we won't be spending 300 pages writing SQL code.
SELECT YEAR, COUNT(*) FROM FLIGHT.ONTIME WHERE ORIGIN = 'LAX' GROUP BY YEAR
First up, let's see how MySQL fares with this query (age before beauty, as they say):
Two minutes for MySQL. Not bad at all. We could probably optimize this, at least a little, by creating an index on the ORIGIN
and YEAR
columns though.
Now how does SAP HANA do on the same query and same dataset:
Well the results of the SQL are the same—which proves no cheating is going on. The time taken to resolve a query on SAP HANA is slightly under 800 milliseconds and around 160 times faster than MySQL. Impressive stuff.
How about something a little more taxing? Let's count the number of flights per day of the week, and then sort them by the number of results.
To see how MySQL fares with this query, check the following screenshot:
Not too shabby I suppose. MySQL looks at the full 140 million-line dataset in less than three minutes. How much time does SAP HANA take for the same query?
Once again, we can see that SAP HANA returns the result in a couple of seconds, where MySQL needs nearly three minutes to do the same.
If we were to summarize these results, we could say that with SAP HANA:
The machine, working more efficiently, is less busy for less time, and can therefore do more work
With a more efficient database such as SAP HANA, your servers will be able to do even more work, getting necessary information into your users' hands faster, enabling them to spend more time analyzing data, rather than waiting for it.
What just happened?
We've just glanced through the simplest usage of SAP HANA: a "standard" (if rather fast) database. SAP HANA can be used to replace legacy databases in more or less any situation: the system provides connectors for several standard database-access protocols (ODBC and JDBC, notably), so just by unplugging your MySQL, PostgreSQL, or SQL server database, and replacing it with SAP HANA, you'll be getting immediate speedups.