Data exploration, visualization, and discovery
Data visualization is not something that is done at the end of a long, costly Business Intelligence (BI) project. It is not the cute dashboard that we create to justify the investment in a new data warehouse and several Online Analytical Processing (OLAP) cubes. Data visualization is an integral part of a data exploration process that begins on the first day that we start extracting raw data.
The importance and effectiveness of data visualization during data exploration is illustrated by Anscombe's quartet. Each of the following scatterplots shows the correlation between two variables. Correlation can also be expressed numerically by means of the correlation coefficient, r. If we were to summarize each of the following scatterplots using r, we would discover that the number is the same for each one, 0.816 (an R-squared of roughly 0.67). It is only by visualizing the data in a two-dimensional space that we notice how differently each relationship behaves.
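As a rough numeric check of this claim, the following Python sketch computes the correlation coefficient for each of the four datasets. The data values are those published by Anscombe in 1973 and are not taken from this text. The names QUARTET, pearson_r, and results are chosen only for illustration.

```python
from math import sqrt

# Anscombe's quartet (values as published by Anscombe, 1973).
# Datasets I, II, and III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# One coefficient per dataset; for a simple linear fit, R-squared is r squared.
results = {name: pearson_r(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, r in results.items():
    print(f"Dataset {name}: r = {r:.3f}, R-squared = {r * r:.3f}")
```

Each dataset yields an r of roughly 0.82 and an R-squared of roughly 0.67, so the summary numbers alone give no hint that one dataset is curved, one is distorted by a single outlier, and one has almost no variation in x.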
Some tools make it cumbersome to visualize data as soon as it is extracted. Most traditional BI solutions have separate tools for each phase of their implementation process. They have one tool that extracts data, another that creates the OLAP cubes, and yet another that constructs visualizations.
QlikView allows us to extract, transform, model, and visualize data within the same tool. Since we can visualize data from the moment it is extracted and throughout the rest of the extraction, transformation, and load (ETL) process, we are more likely to discover data anomalies at an earlier stage in the development process. We can also share our discoveries more quickly with business users, and they in turn can give us important feedback before we invest too much time developing analytical applications that don't provide them with real value. Although QlikView is considered BI software, it stands out among its peers due to its extraordinary ability to explore, visualize, and discover data.
In contrast, the implementation of a traditional BI tool first focuses on organizing data into data warehouses and cubes that are based on business requirements created at the beginning of the project. Once we organize the data and distribute the first reports defined by the business requirements, we start, for the first time, to explore the data using data visualization. However, the first time business users see their new reports, the most important discovery that they make is that we've spent a great amount of time and resources developing something that doesn't fulfill their real requirements.
We can blame the business user or the business requirements process for this failure, but nobody can know exactly what they need if they have nothing tangible to start from. In a data discovery tool like QlikView, we can easily create prototypes, or what we later explain as Minimally Viable Products (MVPs), to allow business users to visualize the data within a matter of days. They use the MVP to better describe their needs, discover data inadequacies, and, among other things, confirm the business value of the analysis with their executive sponsors. Only after making and sharing these first discoveries do we invest more of our resources in iteratively building a more mature data analysis and visualization.
Note
Data Visualization Strategy 1: Use data visualization as an integral part of data exploration and discovery from the very beginning, and all throughout our project.
We've established a general data visualization strategy to support our data exploration and discovery. Now, let's review the strategies that we assign to the teams who are tasked with not only exploring the data directly, but also making sure everyone else in the business can perform their own data exploration.
Tip
I often come across customers who have data quality issues. They often struggle to decide whether to hold off investing in QlikView until they've cleaned the data or to invest in QlikView regardless of the poor data quality. Those who implement QlikView over poor-quality data and make the problem transparent tend to clean their data more quickly and more effectively.