Understanding the performance of SAP HANA
We noticed that SAP HANA is very fast, but how exactly are these performance measures achieved? In this section we'll take a look at some of the characteristics that make up a SAP HANA server and just how the performance that we have just seen is achieved.
Massively parallel execution
The first thing that SAP HANA does to speed up SQL queries is to break them down into manageable chunks. For the dataset we used in the previous examples, there were 140 million lines. The server that the SAP HANA
database was running on in the previous example had 12 processors, so to speed up the queries, SAP HANA split the data into 12 chunks (of just under 12 million lines each) and ran the 12 queries all at the same time. Once the results from each subquery came in, SAP HANA then combined the results to provide the overall result.
The advantages of working in this way are immediate: today's computers, especially servers, have several CPUs, each of which has many CPU cores. By putting as many cores to work as possible, SAP HANA maximizes the power which the server can provide to every incoming query.
Most traditional databases claim to work in a similar manner—they run a handful of processes, and so are able to serve several queries at the same time. But there's a critical difference to this: nobody gets the full power of the server. In our example, the only time the server would be fully used is when 12 queries arrive all at the same time (and when a query only takes 800 milliseconds to serve in HANA, this would be relatively rare).
If we imagine a server as an image, it might look like the following image for a query on a legacy database:
We can see that the server runs the query on one of the CPU cores, and the other 11 sit idle waiting for work.
The same query running on SAP HANA might be shown something like this:
The query is running on all 12 CPU cores, making the most possible use of the server's power—and returning results around 12 times faster.
If the only innovation were parallel execution of queries, then on our example server shown here, SAP HANA would be faster 11 times out of every 12 queries received if all queries were received at exactly the same time. In all other scenarios, SAP HANA would systematically return query results faster than the legacy database.
Of course, parallel execution isn't the only trick up SAP HANA's sleeve, there's also….
Column-oriented table storage
In any given database, no matter how sophisticated, simple, fast, or slow it is, when it is boiled down, you end up with files of information on the computer's hard disk. The speed at which the database program can get at, interpret, and generally unravel these files impacts the database's perceived speed directly.
Let's picture a table in a database system. This table is a simplified (just for comprehension's sake) version of the table used in the previous section, concerning the US flight records.
Date |
DayOfWeek |
Origin |
Destination |
FlightTime |
DepDelay |
ArrDelay |
---|---|---|---|---|---|---|
19880101 |
5 |
LAX |
JFK |
305 |
10 |
15 |
19880102 |
6 |
LAX |
JFK |
301 |
3 |
4 |
19880103 |
7 |
LAX |
JFK |
311 |
-4 |
7 |
19880104 |
1 |
LAX |
JFK |
282 |
37 |
15 |
19880105 |
2 |
LAX |
JFK |
300 |
0 |
0 |
19880106 |
3 |
LAX |
JFK |
294 |
4 |
-2 |
19880107 |
4 |
LAX |
JFK |
284 |
16 |
0 |
In just about every database system used today (with some exceptions, mostly research projects at universities and a couple of relatively small open-source projects), data is stored in the files basically as you would expect, and might look something like this on disk:
198801015LAXJFK3051015198801026LAXJFK30134198801037LAXJFK311-47198801041LAXJFK2823715198801052LAXJFK30000198801063LAXJFK2944-2198801074LAXJFK284160
The fields have been laid out in the file as they appear in the table, starting with the first row, first field, then second field, and so on to the end of the first row. The operation is then repeated for the other rows in the table. While this is fairly unreadable to us, to a computer program, interpreting this file structure is no trouble at all.
Storing data in this way has one huge advantage for a database server: when you need to add a new line to a table, you just open the file, skip to the end, and write out your data. Adding data to the file is extremely fast.
Inversely, however, reading data can be relatively slow. If we ask the database for the time of the flight from LAX to JFK on the 6th of January 1988 (294 minutes), it has to read the data from the file, parse it until it gets to the Date
field which contains 19880106, then skip to the next Origin
field, check it contains LAX, then the Destination
, checking it contains JFK, then to the FlightTime
field to return the result.
Imagine the work necessary when there are millions of lines in the table.
Databases use indexes to speed up these searches. We might create an index on the FlightDate
field, which would contain the address of the first character of the data for a given flight date. In the first line of the table, the FlightDate
field starts at position 1 in the file. For the second line, it is position 23, and for the third line, position 43. Our index file might therefore look like this: 19880101-1, 19980102-23, 19930103-43, and so on.
This will speed up our data reads, especially if we create an index for FlightDate, Origin, and Destination in the case of the example query we've been looking at. The database system will open the index file (which is a lot simpler than the real data file), find the starting character of the needed data, then open the real data file, skip right to the correct place in the file, and read the data.
Of course, this increase in speed when reading data has one large drawback, we now have two files to maintain when we insert data. The data file must be updated and then any index files need to be updated, too. The more indexes we create to improve reading performance, the more we degrade the writing performance of our database. This, traditionally, has been a big problem in databases used both for transaction creation (so called OLTP, or online transaction processing) and reporting (OLAP, or online analytic processing). To put things simply, writing data and reading data are just not easily optimized at the same time on the same server.
If a query comes in for which there is no index, then the whole data file has to be read anyway, leading to the database administrator's worst nightmare: a "full table scan". This does just what its name suggests, it reads the whole data file (since no index exists there might be another line for 19980106 right at the end of the file, but we won't know that until we've read the whole file).
As you've probably figured out by now, this storage system is not usually how SAP HANA does things. As well as being able to use row-based tables, SAP HANA usually stores data in a column-oriented manner.
Consider a data table such as the following one existing in the database:
Row1-Field1 |
Row1-Field2 |
Row1-Field3 |
Row2-Field1 |
Row2-Field2 |
Row2-Field3 |
Row3-Field1 |
Row3-Field2 |
Row3-Field3 |
Rather than storing the data as:
Row1-Field1, Row1-Field2, Row1-Field3, Row2-Field1, Row2-Field2, and so on.
A column-oriented database stores data as:
Row1-Field1, Row2-Field1, Row3-Field1, Row1-Field2, Row1-Field2, and so on.
If we take another look at the example table we saw earlier in this chapter, that would give us a storage file something like this:
Note
This layout is shown for example purposes, this layout type is true for some column-oriented databases, but SAP HANA does not store the data files exactly like the following storage file, it takes the storage to another level that will be explained in a moment.
198801011988010219880103198801041988010519880106198801075671234LAXLAXLAXLAXLAXLAXLAXJFKJFKJFKJFKJFKJFKJFK305301311282300294284103-43704161547150-20
What difference does this way of storing data make? When you read the table to do reporting, for example, all the values of a column are stored together, so if we need to find "all the rows for January 6th, 1988" then we don't need to look at the whole file, just the first part, which contains the first column. There's no need to read, skip forward, check, read, skip forward, and so on.
SAP HANA takes this one step further, storing each column in a separate file:
Col1:19880101198801021988010319880104198801051988010619880107 |
Col2:5671234 |
Col3:LAXLAXLAXLAXLAXLAXLAX |
Col4:JFKJFKJFKJFKJFKJFKJFK |
Col5:305301311282300294284 |
Col6:103-4370416 |
Col7:1547150-20 |
If SAP HANA needs to answer the query we saw earlier (the time of the flight from LAX to JFK on the 6th of January 1988), then the server can again cut the query into several components:
FlightDate = 19880106 |
Origin = LAX |
Destination = JFK |
Information to return = FLIGHTDATE |
These components can be run in parallel, one CPU handling each of the tests, finding the row numbers which match:
FlightDate = 19880106 – Matching row(s) = 6 |
Origin = LAX – Matching row(s) = 1,2,3,4,5,6,7 |
Destination = JFK – Matching Row(s) = 1,2,3,4,5,6,7 |
Finally all that needs to be done is to find the union of each sub-result set (row 6), and return the final result (which is 294 minutes).
Of course, if reading is optimized then we can wonder whether writing to the database is much slower. As it happens, by storing each column in a separate file, SAP HANA can provide the optimum speed we saw originally in the row-based database system. When a new database row needs to be added to the table, SAP HANA just has to open each individual file, add the appropriate column-value to the end of the file, and then close the file. This can even be done to the columns in parallel.
As we can see, storing data in columns can greatly speed up queries on the database, but column storage allows SAP HANA one more trick which would be difficult with a row-storage database and that is…
Data compression
If we take another look at our example table (admittedly, the example is rather extreme), we have in the ORIGIN
and DESTINATION
columns, seven values and they're all the same. It's rather a shame to store all seven values; it would be much more space-friendly to store each unique value, and a list of the rows it was present in, something like this:
Col1:19880101198801021988010319880104198801051988010619880107 |
Col2:5671234 |
Col3:LAX(7) |
Col4:JFK(7) |
Col5:305301311282300294284 |
Col6:103-4370416 |
Col7:1547150-20 |
The database could see that in column 3, there are seven occurrences of the value LAX. The space used by this column has just been divided by seven. Imagine the space savings on a database table of 200 million sales order lines, all of which are in USD. The savings in disk space, and therefore in the time it takes for the disk to find data are potentially huge.
It has to be noted that compression has an impact on performance, especially when data is added to the database. Column files need to be uncompressed, new data added, and then recompressed to maintain both compression quality and optimum read speed. In order not to penalize users when data is written to the database, SAP HANA keeps two separate versions of a column file. The main storage contains the compressed table contents, and the delta storage contains data freshly added to the database and which hasn't been compressed yet. When the system is idle, or when the delta storage represents a certain percentage of the main storage, the table is recompressed in the background.
Of course, we could ask why data compression is such a big deal when disks are so cheap nowadays, and that's where SAP HANA's (probably) best playing card can be brought out.
In-memory technology
SAP HANA tries as much as possible to not serve data from hard disks. They're just too slow. Whenever a table is requested in a query, SAP HANA loads the whole table into memory, and from then on, serves the query results directly from memory. Reading from memory has been, for many years, a few orders of magnitude faster than reading data from hard disks (SSDs have clawed back some of the hard disk losses, but they're still much, much slower than main memory).
Keeping all the data in memory means that accessing it is at least 150 times faster.
Of course, this means that SAP HANA can't really be run on "just any old server". How many machines have you seen with, say 128 GB of RAM? 256 GB? 1 Tera? Not many, and that's why SAP HANA isn't sold as a piece of software you can "just install".
SAP HANA as an appliance
SAP only sells HANA as an appliance. You cannot purchase SAP HANA as a DVD set from SAP. You have to purchase an appliance which is a certified server configuration from one of the SAP's hardware partners such as Dell, Cisco, or IBM. The hardware partner builds the machine to SAP's specifications, has it certified by SAP as conforming to the requirements of the SAP HANA database, and then installs SAP HANA on it. You can purchase the SAP HANA license from SAP, but the software and the hardware are actually sold together as a bundle: an appliance.
This method allows SAP much greater control over the SAP HANA ecosphere, guaranteeing a certain level of performance from the database, and keeping support calls to a reasonable level.
Since the servers are all configured to a certain level, SAP can more easily tell customers what performance they are likely to achieve from their SAP HANA installation.
Full information about SAP partners' available configurations for SAP HANA appliances are available from the sap.com website at http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/partners/index.epx and click on the SAP HANA available hardware configurations link.
This document requires a SAP Service Marketplace login for downloading, but here is a sample screenshot: