Understanding Hyper
In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in enterprises. Since the release of Tableau 10.5, we can make use of Hyper, a high-performance database engine that allows us to query source data faster than ever before. Hyper is Tableau's data-handling engine, and it is not well understood even by advanced developers, because it is not an overt part of day-to-day activities; however, if you want to truly grasp how to prepare data for Tableau, this understanding is crucial.
Hyper originally started as a research project at the University of Munich in 2008. In 2016, Tableau acquired the company and made it Tableau's dedicated data engine group, keeping its base and employees in Munich. Initially, in 10.5, Hyper replaced the earlier data-handling engine for extracts only. Live connections are still not touched by Hyper, but Tableau Prep Builder now runs on the Hyper engine too, with more use cases to follow.
What makes Hyper so fast? Let's have a look under the hood!
The Tableau data-handling engine
The vision shared by the founders of Hyper was to create a high-performing, next-generation database: one system, one state, no trade-offs, and no delays. And it worked; today, Hyper can serve general database purposes, data ingestion, and analytics at the same time.
Memory prices have decreased exponentially. Back in 1996, 1 GB of memory could cost $45,000 to produce. Today, far more than that ships in every phone, or even in a smartwatch, at a production cost as low as $2. The same goes for CPUs: transistor counts have increased according to Moore's law, while clock speeds have stagnated. Memory is cheap, but processing still needs to be improved.
Moore's Law is the observation made by Intel co-founder Gordon Moore that the number of transistors on a chip doubles every two years while the costs are halved. In 1965, Gordon Moore noticed that the number of transistors per square inch on integrated circuits had doubled every two years since their invention. Information on Moore's Law can be found on Investopedia at https://www.investopedia.com/terms/m/mooreslaw.asp.
While experimenting, the founders of Hyper observed that handwritten C code is faster than any existing database engine, so they came up with the idea of translating Tableau queries into LLVM code and optimizing that code simultaneously, all behind the scenes, so that the Tableau user won't notice it. This translation and optimization comes at a cost: a traditional database engine can start executing a query immediately, whereas Hyper must first translate the query into code, optimize that code, and compile it into machine code before it can be executed. So the big question is, is it still faster? As proven by many tests on Tableau Public and other workbooks, the answer is yes!
Furthermore, if a query is estimated to run faster without compilation to machine code, Tableau executes it right away in its own virtual machine (VM). On top of this, Hyper can utilize 99% of available CPUs, whereas other parallelized processes can only utilize 29% of available CPUs. This is due to the unique and innovative technique of morsel-driven parallelization.
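The core idea of morsel-driven parallelization is easiest to see in a sketch. The following Python snippet is a hypothetical illustration, not Hyper's actual implementation: instead of assigning each worker one large, fixed partition, the input is split into small "morsels" on a shared queue, and idle workers keep pulling morsels until the queue is drained, so no core sits idle waiting on a slow partition:

```python
# Hypothetical sketch of morsel-driven parallelism (not Hyper's real code):
# workers pull small chunks ("morsels") from a shared queue until it is empty.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

def morsel_scan(data, predicate, morsel_size=1024, workers=4):
    """Count rows matching predicate, scanning the data morsel by morsel."""
    morsels = Queue()
    for i in range(0, len(data), morsel_size):
        morsels.put(data[i:i + morsel_size])   # enqueue small work units

    def worker():
        hits = 0
        while True:
            try:
                morsel = morsels.get_nowait()  # pull the next available morsel
            except Empty:
                return hits                    # queue drained: worker finishes
            hits += sum(1 for row in morsel if predicate(row))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        return sum(f.result() for f in futures)
```

Because workers grab work on demand rather than being assigned equal slices up front, a worker that finishes early simply takes the next morsel, which is what keeps all cores busy.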
For those of you who want to know more about morsel-driven parallelization, the paper that later served as a baseline for the Hyper engine can be found at https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf.
If you want to know more about the Hyper engine, I highly recommend the video at https://youtu.be/h2av4CX0k6s.
Hyper parallelizes three steps of traditional data warehousing operations:
- Transactions and Continuous Data Ingestion (Online Transaction Processing, or OLTP)
- Analytics (Online Analytical Processing, or OLAP)
- Beyond Relational (Online Beyond Relational Processing, or OBRP)
Executing those steps simultaneously makes Hyper more efficient and more performant than traditional systems, where the three steps are separated and executed one after another.
To sum up, Hyper is a highly specialized database engine that allows us, as users, to get the best out of our queries. If you recall, in Chapter 1, Getting Up to Speed – A Review of the Basics, we saw that every change on a sheet or dashboard, including dragging and dropping pills, filters, and calculated fields, among others, is translated into queries. Those queries look much like SQL; in Tableau, however, we call the querying engine VizQL.
VizQL, another hidden gem in Tableau Desktop, is responsible for visualizing data in chart form and is executed fully in memory. The advantage is that no additional space is required on the database side. VizQL is generated when a user places a field on a shelf; it is then translated into SQL, MDX, or Tableau Query Language (TQL) and passed to the backend data source via a driver. Two aspects of the VizQL module are of primary importance:
- VizQL allows you to change field attributions on the fly
- VizQL enables table calculations
We'll discuss these two aspects in more detail in the following sections.
Changing field attribution
In this section, we'll demonstrate how changing a worksheet's field attribution will allow you more flexibility in your dashboard creation.
Let's look at the World Happiness Report. Please navigate to the workbook associated with this chapter at https://public.tableau.com/profile/marleen.meier and open the Score per country sheet. We create this worksheet by placing AVG(Happiness Score) and Country on the Columns and Rows shelves, respectively. AVG(Happiness Score) is, of course, treated as a measure in this case. Lastly, sort the countries by their happiness score, from highest to lowest.
Let's take a look at this in the following screenshot:
Figure 2.1: Happiness score by country
Next, please create a second worksheet called Score/Rank to analyze the scores relative to the ranks, by placing Happiness Score on Rows and Happiness Rank on Columns. Both pills should be continuous, and hence green.
In order to accomplish this, the user defines Happiness Rank as a Dimension, as shown in the following screenshot:
Figure 2.2: Ranking score per country
Please note that Columns and Rows have been moved to the left for better readability. This can be achieved by dragging and dropping the shelves.
In order to add steps to your visualization, click on Path in the Marks Card and select the second option, Step.
You can view the code generated by Tableau that is passed to the data source with the performance recorder, which is accessible through Help, then Settings and Performance, and then Start Performance Recording. See Chapter 13, Improving Performance, for additional details.
Studying the SQL generated by VizQL to create the preceding visualization is particularly insightful:
SELECT [Happiness Report$].[Happiness.Rank] AS [Happiness.Rank],
  AVG([Happiness Report$].[Happiness.Score]) AS [avg:Happiness.Score:ok]
FROM [dbo].[Happiness Report$] [Happiness Report$]
GROUP BY [Happiness Report$].[Happiness.Rank]
The GROUP BY clause clearly communicates that Happiness Rank is treated as a dimension, because grouping is only possible on dimensions. The takeaway is that VizQL enables the analyst to change the generated SQL by switching a field from measure to dimension, without altering the source metadata. This on-the-fly capability enables creative exploration of the data that is not possible with other tools, and avoids lengthy exercises attempting to define all possible uses for each field.
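The measure-versus-dimension switch can be mimicked outside Tableau. The following sketch uses Python's sqlite3 with a made-up three-row table (the table name, columns, and values are illustrative assumptions, not the chapter's actual data): treating the same column as a measure aggregates it away, while treating it as a dimension moves it into the GROUP BY, with no change to the stored data.

```python
# Hypothetical illustration: the same column used as a measure (aggregated)
# versus as a dimension (grouped). Only the generated SQL changes, not the
# underlying data -- the switch Tableau makes when a pill changes role.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE happiness (country TEXT, rank INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO happiness VALUES (?, ?, ?)",
    [("Norway", 1, 7.54), ("Denmark", 2, 7.52), ("Iceland", 3, 7.50)],
)

# rank as a measure: it is aggregated, so the query returns a single row
measure = conn.execute("SELECT AVG(rank) FROM happiness").fetchone()

# rank as a dimension: it appears in GROUP BY, so we get one row per rank
dimension = conn.execute(
    "SELECT rank, AVG(score) FROM happiness GROUP BY rank"
).fetchall()
```

Here `measure` is one aggregated row, while `dimension` has one row per distinct rank, mirroring the GROUP BY that VizQL emitted in the query above.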
The previous section showed how we can change a field's role in Tableau itself, without touching the data source or its metadata. In the next section, we will take a closer look at table calculations.
Table calculation
In this section, we will explore how VizQL's table calculations can be used to add data to a dashboard without adding any data to the data source.
In the following example, which can be viewed by opening Sheet 4 in this chapter's workbook, note that Freedom on the vertical axis is set to the Quick Table Calculation Moving Average. Calculating a moving average, running total, or other such comparison calculation can be quite challenging to accomplish in a data source: not only must a data architect consider which comparison calculations to include in the data source, but they must also determine the dimensions for which these calculations are relevant.
VizQL greatly simplifies such challenges using table calculations, as seen in the following screenshot:
Figure 2.3: Moving average
Taking a look at the relevant portion of SQL generated by the preceding worksheet shows that the table calculation is not performed by the data source. Instead, it is performed in Tableau by the VizQL module.
The following is the SQL query:
SELECT SUM([Happiness Report$].[Freedom]) AS [sum:Freedom:ok],
  [Happiness Report$].[Happiness.Rank] AS [Happiness.Rank]
FROM [dbo].[Happiness Report$] [Happiness Report$]
GROUP BY [Happiness Report$].[Happiness.Rank]
To reiterate, nothing in the preceding call to the data source generates the moving average. Only an aggregated total is returned, and Tableau calculates the moving average with VizQL.
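The division of labor can be made concrete with a minimal Python sketch. It assumes a centered window of three marks (previous, current, next), which is just one of the window options Tableau offers; the point is that the data source hands back only aggregated values, and the moving average is computed over those rows afterwards, in memory:

```python
# Minimal sketch of a table calculation: the moving average is computed
# client-side over already-aggregated rows, never in the SQL sent to the
# source. Assumes a centered 3-mark window, shrinking at the edges, which
# matches Tableau's default edge behavior for moving calculations.
def moving_average(values, window=3):
    half = window // 2
    out = []
    for i in range(len(values)):
        span = values[max(0, i - half): i + half + 1]  # window around mark i
        out.append(sum(span) / len(span))              # shrinks at the edges
    return out
```

For example, `moving_average([1, 2, 3, 4, 5])` averages each value with its neighbors, using a two-value window at either end where no third neighbor exists.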
Hyper takeaways
This overview of the Tableau data-handling engine demonstrates a flexible approach to interfacing with data. Knowledge of the data-handling engine is helpful if you want to understand the parameters for Tableau data readiness. Two major takeaways from this section are as follows:
- It is not necessary to explicitly define data types and roles for optimal Tableau usage.
- Comparison calculations such as moving averages and running totals can be addressed by table calculations in Tableau and thus do not need to be calculated in advance.
These two takeaways can reduce data preparation and data modeling efforts, and thus help us streamline the overall data mining life cycle. Don't worry too much about data types, or about data that can be calculated from the fields you already have in your database: Tableau can do that work for you. In the next section, we will discuss what you should consider from a data source perspective.