Understanding how QlikView stores its data
QlikView is really good at storing data. It operates on data in memory, so being able to store a lot of data in a relatively small amount of memory gives the product a great advantage—especially as Moore's Law continues to give us bigger and bigger servers.
Understanding how QlikView stores its data is fundamental in mastering QlikView development. Writing load script with this understanding will allow you to load data in the most efficient way so that you can create the best performing applications. Your users will love you.
A great primer
A great primer on how QlikView stores its data is available on Qlik Design Blog, written by Henric Cronström (http://community.qlik.com/blogs/qlikviewdesignblog/2012/11/20/symbol-tables-and-bit-stuffed-pointers).
Note
Henric joined QlikView in 1994, so he knows quite a bit about exactly how it works.
Looking at things from a simple level
From a simple level, consider the following small table:
First name |
Surname |
Country |
---|---|---|
John |
Smith |
USA |
Jane |
Smith |
USA |
John |
Doe |
Canada |
For the preceding table, QlikView will create three symbol tables like the following:
Index |
Value |
---|---|
1010 |
John |
1011 |
Jane |
Index |
Value |
---|---|
1110 |
Smith |
1111 |
Doe |
Index |
Value |
---|---|
110 |
USA |
111 |
Canada |
And the data table will look like the following:
First name |
Surname |
Country |
---|---|---|
1010 |
1110 |
110 |
1011 |
1110 |
110 |
1010 |
1111 |
111 |
This set of tables will take up less space than the original data table for the following three reasons:
The binary indexes are bit-stuffed in the data table—they only take up as much space as needed.
The binary index, even though repeated, will take up less space than the text values. The Unicode text just for "USA" takes up several bytes—the binary index takes less space than that.
Each, larger, text value is only stored once in the symbol tables.
So, to summarize, each field in the data model will be stored in a symbol table (unless, as we will see later, it is a sequential integer value) that contains the unique values and an index value. Every table that you create in the script—including any synthetic key tables—will be represented as a data table containing just the index pointers.
Note
Because the data table indexes are bit-stuffed, and because data is stored in bytes, adding another bit or two to the indexes may not actually increase the overall width of a data table record.
Exporting the memory statistics for a document
To help us understand what is going on in a particular QlikView document, we can export details about where all the memory is being used. This export file will tell us how much memory is being used by each field in the symbol tables, the data tables, chart objects, and so on.
Perform the following steps to export the memory statistics for a document:
To export the memory statistics, you need to open Document Properties from the Settings menu (Ctrl + Alt + D). On the General tab, click on the Memory Statistics button, as shown in the following screenshot:
After you click on the button, you will be prompted to enter file information. Once you have entered the path and filename, the file will be exported. It is a tab-delimited data file:
The easiest way to analyze this file is to import it into a new QlikView document:
We can now see exactly how much space our data is taking up in the symbol tables and in the data tables. We can also look at chart calculation performance to see whether there are long running calculations that we need to tune. Analyzing this data will allow us to make valuable decisions about where we can improve performance in our QlikView document.
One thing that we need to be cognizant of is that the memory usage and calculation time of charts will only be available if that chart has actually been opened. The calculation time of the charts may also not be accurate as it will usually only be correct if the chart has just been opened for the first time—subsequent openings and changes of selection will most probably be calculated from the cache, and a cache execution should execute a lot quicker than a non-cached execution. Other objects may also use similar expressions, and these will therefore already be cached. We can turn the cache off—although only for testing purposes, as it can really kill performance.