Approaches to big data visualization
When it comes to big data, simple data visualization tools and their basic features quickly become inadequate. The concepts and models necessary to efficiently and effectively visualize big data can be daunting, but they are not unattainable.
Using workable approaches (studied in the following chapters of this book), the reader will review some of the most popular (or currently trending) tools, such as:
- Hadoop
- R
- Data Manager
- D3
- Tableau
- Python
- Splunk
This is done in an effort to meet the challenges of big data visualization and support better decision making.
It is expected that our readers will be data analysts, or will at least have a basic knowledge of data analysis and visualization, and are now interested in learning about the various alternatives for big data visualization in order to make their analyses more useful and more valuable, and, hopefully, to have some fun doing it!
Readers with some knowledge of big data platform tools (such as Hadoop) and some exposure to programming languages (such as R or Python) will get the most out of the remaining chapters, but all should benefit.
Access, speed, and storage
We've already touched on the 3Vs (plus veracity), which include the challenges of storing the large and ever-growing amounts (volumes) of data as well as being able to rapidly (with velocity) access, manipulate, and manage that data.
Chapter 2, Access, Speed, and Storage with Hadoop, of this book will expound on this topic and introduce Hadoop as the game-changing technology to use for this purpose.
Dealing with expanding data sizes may lead to perpetually adding resources to a machine to keep up with the growing size of the data. Typically, this is a short-lived solution.
When dealing with data too large to handle in a single machine's memory (that is, big data), a common approach is to sample the data, meaning that you construct a smaller dataset from the full dataset that you feel is reasonably representative of the whole. Using Hadoop, you have the ability to run many exploratory data analysis tasks on full datasets, without sampling, with the results efficiently returned to your machine or laptop.
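As a rough illustration of the sampling approach just described (a generic sketch, not one of the book's examples; the file name and sample fraction are hypothetical), you might read an oversized file in chunks and keep a small random sample of each chunk in Python with pandas:

```python
import pandas as pd

# Hypothetical file name and sampling fraction; adjust for your own data.
SOURCE_FILE = "big_dataset.csv"
SAMPLE_FRACTION = 0.01  # keep roughly 1% of the rows

samples = []
# Read the (too-large-for-memory) file in manageable chunks...
for chunk in pd.read_csv(SOURCE_FILE, chunksize=100_000):
    # ...and retain a small random sample of each chunk.
    samples.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=42))

# The concatenated samples form the (hopefully) representative smaller dataset.
sample_df = pd.concat(samples, ignore_index=True)
print(f"Sampled {len(sample_df)} rows for exploratory analysis")
```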
Entering Hadoop
Hadoop removes the restrictions and limitations that hardware places on the storage of big data by providing the ability to streamline data (from every touch point in any organizational data source, whether the data is structured or unstructured) for your needs across clusters of computers, using simple programming models; in practical terms, this makes the solution almost infinitely scalable.
The Hadoop online product documentation points out:
"Data which was previously too expensive to store, can now be stored and made available for analysis to improve business insights at 1/10 to 1/50 the cost on a per terabyte basis."
Refer to the following link for more information: www.mapr.com/why-hadoop/game-changer2016.
We'll cover working examples to demonstrate solutions for effectively storing and accessing big data, but the reader should take note that Hadoop also works well with smaller amounts of data (as well as infinitely large amounts), so you can be sure that none of the examples used in this book will need to be reworked based upon the actual size (or actual volume) of the data you may be using in your future analysis projects.
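To give a flavor of those simple programming models (this is a generic Hadoop Streaming sketch, assuming comma-separated input, and not necessarily the approach taken in Chapter 2), the map and reduce steps can be written as plain Python scripts that read from standard input and write to standard output:

```python
#!/usr/bin/env python
# mapper.py: a minimal Hadoop Streaming mapper (illustrative sketch only).
# Assumes comma-separated input and emits "<first_field>\t1" for each line.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if fields and fields[0]:
        print(f"{fields[0]}\t1")
```

```python
#!/usr/bin/env python
# reducer.py: a minimal Hadoop Streaming reducer (illustrative sketch only).
# Input arrives sorted by key; sums the counts emitted by mapper.py.
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, count = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        print(f"{current_key}\t{current_count}")
        current_count = 0
    current_key = key
    current_count += int(count)

if current_key is not None:
    print(f"{current_key}\t{current_count}")
```

These two scripts would typically be submitted to the cluster with the hadoop-streaming jar, along with the input and output HDFS paths; the same pair runs unchanged whether the input is a few megabytes or many terabytes.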
In an effort to paint a complete picture here (and we'll do this throughout all of the chapters), we will also take some time to consider the how and why of non-Hadoop (or alternative) solutions to the examples given, and how well they compare to a Hadoop solution.
Context
When it comes to performing data analytics, facts can be stupid and stubborn things. They can provide us with the business intelligence metrics we long for, but without predictive analytics based on contextual interpretation, we may find ourselves using skewed quantitative analysis that produces less-than-desirable results.
Note
The appropriate use of context in analytics makes all the difference toward achieving optimal results, as noted in a Business@American staff article, which is available at https://onlinebusiness.american.edu/how-do-we-use-data-for-good-add-context/.
Chapter 3, Context - Understanding Your Data Using R, of this book covers the importance of gaining an understanding of the data you are working with and, specifically, the challenges of establishing or adding context to big data, with working examples demonstrating solutions for effectively addressing the issues presented.
Adding context to data requires manipulating that data: reviewing it and perhaps reformatting it, adding calculations, aggregations, or additional columns, re-ordering it, and so on.
In Chapter 3, Context - Understanding Your Data Using R, we will introduce R as the tool of choice for performing this type of processing and manipulation of your data.
R is a language and environment very focused on statistical computing.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and so on) and graphical techniques, and it is highly extensible. You can refer to more information on this at www.r-project.org/about.html.
Beyond more sophisticated modeling techniques, such as time-series analysis, R also supports simple tasks such as creating a summary table, which can be used to determine data groupings.
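Chapter 3 builds these summaries in R; purely as a hedged illustration of the same idea in Python with pandas (the region and sales columns here are hypothetical), a summary table of groupings might look like this:

```python
import pandas as pd

# Hypothetical dataset; replace with your own data source.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "North"],
    "sales":  [120, 80, 95, 60, 130, 110],
})

# A simple summary table: row counts and total sales per grouping.
summary = df.groupby("region").agg(
    row_count=("sales", "size"),
    total_sales=("sales", "sum"),
)
print(summary)
```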
One thing to keep in mind is that R holds everything in machine memory.
This can become a problem if you are working with big data (even given today's relatively low resource costs).
With R, sampling is a popular method for dealing with big data. In Chapter 3, Context - Understanding Your Data Using R, our focus is on gaining the context of our data, so sampling is acceptable.
R is great for manipulating and cleaning data, producing probability statistics, as well as actually creating visualizations with data, so it's a good choice for establishing a context for your data.
Quality
It has been said that beauty is in the eye of the beholder, and the same can be said when trying to define data quality. What this means is that if the data meets your level of expectations, or at least the minimal requirements of a particular project, then it has some form or level of quality.
Data can have acceptable quality even if there are known complications with it. These complications can be overcome with processes we'll discuss later or, if appropriate, simply overlooked.
Even though your data may contain acceptable complications, the reader should make no mistake: any data visualization created from this data will only prove to be a valuable tool if the quality of that data is assured to be at the level required. However, when using large volumes of data, it can become extremely difficult to address the quality of the data.
There are many examples of the effects of poor data quality, such as the following, from an article by Sean Jackson (http://www.actian.com/about-us/blog/never-underestimate-importance-good-data-quality/):
"A business professional could not understand why response rates to campaigns and activities were so low. Nor why they couldn't really use analytics to get any competitive advantage. A quick investigation of their data and systems soon showed that a large section of the data they were using was either out-of-date, badly formatted, or just erroneous."
Data quality solutions must enable you to clean, manage, and make reliable data available across your organization.
Chapter 4, Addressing Big Data Quality, of this book offers working examples demonstrating solutions for effectively assessing and improving the level of quality of big data sources.
Typically, the first step in determining the quality of your data is performing a process referred to as profiling the data (mentioned earlier in this chapter). This is an overall auditing process that helps you examine your existing data sources and determine whether they meet the quality expectations, or standards, of your intended use or purpose.
Profiling is vitally important in that it can help you identify concerns that may exist within the data; attending to these up front (before going on and actually creating a data visualization) will save valuable time, rather than having to process and reprocess the data's quality problems later. More importantly, it can save you from creating and presenting a visualization that contains an inaccurate view of the data.
Data profiling becomes even more critical when working with unstructured raw data sources (or data that is a mix of structured and unstructured data) that have no referential integrity or any other quality controls. In addition, both single-source data (data sourced from only one place) and multisource data (a dataset sourced from more than one place) will most likely present additional opportunities for data concerns.
Concerns found in single sources are typically intensified when multiple sources need to be integrated into one dataset for a project. Not only may each source contain its own data concerns, but the same data in different sources may also be represented differently, overlap, or contradict.
Typical profiling tasks include the following:
- Identifying fields/columns within the data
- Listing field/column attributes and statistics such as column lengths and value distribution percentages
- Reviewing field/column value distributions
- Calculating null ratios
- Reporting of value statistics such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for date and time columns
- Identifying all the distinct values in the data
- Identifying patterns and pattern distributions within the data
The goal of these tasks (and others) is, as the name implies, to establish your data's profile: determining the characteristics, relationships, and patterns within the data and, hopefully, producing a clearer view of its content and quality.
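Chapter 4 carries out this profiling with R (and the data manager utility); as a rough sketch only, not the book's code, the same basic checks can be expressed in Python with pandas against a hypothetical input file and column:

```python
import pandas as pd

# Hypothetical input file; substitute your own data source.
df = pd.read_csv("customer_data.csv")

# Fields/columns present in the data and their inferred types.
print(df.dtypes)

# Minimum, maximum, average, and standard deviation for numeric columns.
print(df.describe())

# Null ratios per column.
print(df.isnull().mean())

# Distinct values and value-distribution percentages for one (hypothetical) column.
print(df["status_code"].value_counts(normalize=True))
```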
After profiling, one would most likely proceed with performing some form of scrubbing (also sometimes referred to as cleansing or in some cases preparing) of the data (to improve its quality, also mentioned earlier in this chapter).
The processes for cleansing data may be somewhat or even entirely different, depending upon the data's intended use. Because of this, defining what is to be considered an error is the critical first step, to be performed before any processing of the data. Even what is done to resolve the defined errors may differ, again based upon the data's intended use.
During the process of cleansing or scrubbing your data, you would perform tasks such as perhaps reformatting fields or adding missing values, and so on.
Generally, scrubbing is made up of the following efforts:
- Defining and determining errors within the data--what do you consider an error?
- Searching for and identifying error instances--once an error is defined, where do those errors exist in your data?
- Correcting the errors--removing them or updating them to acceptable values.
- Documenting error instances and error types--or labeling (how the error was determined and what was done to resolve it).
- Updating the entry mechanism to avoid future errors--creating a process to make sure future occurrences of this type are dealt with.
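Again, the book performs this scrubbing with R and the data manager utility; the following is only a rough Python (pandas) sketch of the first few steps, with a hypothetical file, hypothetical columns, and a hypothetical error definition:

```python
import pandas as pd

df = pd.read_csv("customer_data.csv")  # hypothetical input file

# 1. Define the error: ages outside a plausible range are considered invalid.
invalid_age = (df["age"] < 0) | (df["age"] > 120)

# 2. Identify the error instances and document them for later review.
error_log = df.loc[invalid_age, ["customer_id", "age"]]
error_log.to_csv("age_errors.csv", index=False)

# 3. Correct the errors: here we simply drop the offending rows
#    (updating them to acceptable values would be another option).
df = df.loc[~invalid_age].copy()

print(f"Flagged and removed {int(invalid_age.sum())} rows with invalid ages")
```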
In Chapter 4, Addressing Big Data Quality, we've elected to continue (from the previous chapter) to leverage the R programming language to accomplish some of the profiling work, and we also introduce and use the open source data manager utility for manipulating our data and addressing its quality.
Data manager is an excellent utility, available as a library of Java code, that is aimed at data synchronization work: moving data between different locations and different databases.
Displaying results
Data visualization is the practice of organizing and displaying data (manually or otherwise) in a pictorial or graphic format in an attempt to enable your audience to:
- See the results of your analysis efforts more clearly
- Simplify the complexities within the data you are using
- Understand and grasp a point that you are using the data to make
Not a new concept
This concept of using pictures--typography, color, contrast, and shape--to communicate or understand data is not new and has been around for literally centuries, from the manual creation of maps and graphs in the 17th century to the invention of the pie chart in the early 1800s.
Today, computers can be used to process large amounts of data lightning fast to make visualizations tremendously more valuable. Going forward, we can expect the data visualization process to continue to evolve, perhaps as more of a mixture of art and science rather than a numbers crunching technology.
Instant gratifications
An exciting example of this evolutionary process is how the industry has moved data visualization past generating and publishing charts and graphs for an audience to review and deliberate on, and has now established an expectation for interactive visualizations.
With interactive visualization, we can take the concept of data visualization much, much further by using technology to allow the audience to interact with the data. The user gains the self-service ability to drill down into the generated pictures, charts, and graphs (to access more, or more specific, details) and, interactively and in real time (or near real time), to change what data is displayed (perhaps a different time frame or event) and how it is processed and/or presented (maybe selecting a bar graph rather than a pie chart).
This allows visualizations to be much more effective and personalized.
In Chapter 5, Displaying Results with D3, we will go through the topic of displaying the results of big data analysis in a typical web browser, using Data Driven Documents (D3), in a variety of examples. D3 also gives you the ability to apply prebuilt data visualizations to your datasets.
Data-driven documents
Data Driven Documents is referred to within the open community as D3.
D3 is an open source library written in JavaScript. Its objective is to allow documents to be easily manipulated based upon data, using standard web browsing technologies (such as HTML or CSS). Its value-add is to provide you with full capabilities without having to build your own framework or strap yourself to some proprietary one.
These library components give you excellent tools for big data visualization and a data-driven approach to DOM manipulation. D3's functional style allows the reuse of library code modules that you've already built (or others have already built), adding pretty much any particular features you need or want (and leaving out those you don't). This creates a means that can become as powerful as you want it (or have the time to make it) to be: you can give a unique style to your data visualizations, manipulate the data, and make it all interactive--exactly how you want or need it to be.
Dashboards
As discussed earlier in this chapter, big data is collecting and accumulating daily, in fact minute by minute, and there is a realization that organizations rely on this information for a variety of reasons.
Various types of reporting formats are utilized on this data, including data dashboards.
As with everything, there are various opinions as to the most accurate definition of what a data dashboard is.
For example, A. Chiang writes:
"A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance."
Refer to the following link for more information: http://www.dashboardinsight.com/articles/digital-dashboards/fundamentals/what-is-a-dashboard.aspx.
Whatever the definition, any dashboard has the capacity for supplying timely, important information for its audience to use in decision making, if it is well designed and constructed.
It is critical that dashboards present data in a relevant, concise, and well-thought-out manner (not just as a collection of visual representations in a workbook or spreadsheet). In addition, dashboards must have a supporting infrastructure capable of refreshing the dashboard in a well-timed manner, as well as including some form of DQA. Making decisions based upon a dashboard with incorrectly presented, stale, or even incorrect data can lead to disaster.
Chapter 6, Dashboard for Big Data - Tableau, of this book offers an examination of the topic of effective dashboarding and includes working examples demonstrating solutions for effectively presenting the results of your big data analysis in a real-time dashboard format using Tableau.
Tableau is categorized as business intelligence software designed to help people see and understand data; more than just a code library, Tableau is considered to be a suite or a family of interactive data visualization products.
Tableau's structure gives us the ability to combine multiple views of data from multiple sources into a single, highly effective dashboard that can provide data consumers with much richer insights. Tableau also works with a variety of data formats (both structured and unstructured) and can handle the volumes of big data, literally petabytes or terabytes, millions or billions of rows, turning that big data into valuable visualizations for targeted audiences.
To address the velocity of today's big data world, you can use Tableau to connect directly to local and cloud data sources, or just import your data for fast in-memory (more on in-memory later in this book) performance.
Another goal of Tableau is self-service analytics (which we mentioned earlier in this chapter and will talk more about later on), where a user can have a dialog with selected data and ask questions (in real time, not in batch mode), using easy point-and-click analytics to mine big data intuitively and effectively, discovering understandings and opportunities that may exist within the dataset or datasets.
Some of the more exciting abilities Tableau offers include:
- Real-time drag-and-drop cluster analysis
- Cross data source joining
- Powerful data connectors
- Mobile enabled
- Real-time territory or region data exploration
Outliers
In Chapter 7, Dealing with Outliers Using Python, we will dive into outliers.
As was defined earlier in this chapter, an outlier is an observation point that is distant or vastly different from the other observed data points within the data.
Although outliers typically represent (only) about 1 to 5 percent of your data, when you're working with big data, investigating, or even just viewing, 1 to 5 percent of that data is rather difficult.
Investigation and adjudication
Outliers, you see, can be determined to be noninfluential or very influential to the point you are trying to make with your data visualization.
The act or process of making this determination is critically important to your analysis, but it is also very problematic when dealing with the larger volumes, many varieties, and velocities of big data. For example, a fundamental step in helping to make this determination is called sizing your samples: the mathematical process of calculating the percentage of outliers relative to the size of the data sample, which is not so simple a task when the data is in petabytes or terabytes!
Identifying and removing outliers can be tremendously complicated and there are many differences in opinions as to how to go about determining the percentage of outliers that exist in your dataset as well as determining their effect on the data and deciding what to do with them. It is, however, generally accepted that an automated process can be created that can facilitate at least the identification of outliers, possibly even through the use of visualization.
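Chapter 7 works through this in detail with Python; as a rough, hedged sketch (not the book's code), an automated identification pass using the common 1.5 x IQR rule of thumb, run against a hypothetical file and column, might look like this:

```python
import pandas as pd

df = pd.read_csv("sensor_readings.csv")   # hypothetical input file
values = df["response_time"]              # hypothetical numeric column

# A common rule of thumb: flag points beyond 1.5x the interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(values < lower) | (values > upper)]

# "Sizing the sample": the percentage of outliers relative to the dataset.
pct = 100 * len(outliers) / len(df)
print(f"{len(outliers)} outliers found ({pct:.2f}% of {len(df)} rows)")
```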
Carrying on, all the approaches for the investigation and adjudication of outliers, such as sorting, capping, graphing, and so on, require manipulating and processing the data using a tool that is feature-rich and robust.
This chapter offers working examples demonstrating solutions for effectively and efficiently identifying and dealing with big data outliers (as well as some other dataset anomalies) using Python.
Python is a scripting language that is extremely easy to learn and incredibly readable, since its coding syntax so closely resembles the English language.
According to the article The 9 most in-demand programming languages of 2016 by Bouwkamp, available at http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016, Python is listed among the most in-demand programming languages (at the time of writing).
Created by Guido van Rossum as far back as 1989, Python is actually very simple in nature, but it is also considered by the industry to be extremely powerful and fast, and it can run in almost any environment.
As per www.python.org:
"Open sourced (and free!), Python is part of the winning formula for productivity, software quality, and maintainability at many companies and institutions around the world."
There is a growing interest within the industry in utilizing the Python language for data analysis, and even for big data analysis. It is an exceptional choice for the data scientist's typical day-to-day activities, as it provides libraries, in fact a standard library (and even some packages focusing specifically on big data, such as Pydoop and SciPy), to accomplish almost anything you need or want to do with the data you have or are accumulating, including:
- Automations
- Building websites and web pages
- Accessing and manipulating data
- Calculating statistics
- Creating visualizations
- Reporting
- Building predictive and explanatory models
- Evaluating models on additional data
- Integrating models into production systems
As a final note here, Python's standard library is very extensive, offering a wide range of built-in modules that provide access to system functionality, as well as standardized solutions to many problems that occur in everyday programming, making it an obvious choice to explore for dealing with big data outliers and related processing.
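As a small taste of that standard library (a generic sketch, not one of the book's examples; the file and column names are hypothetical), the built-in csv and statistics modules alone can produce basic descriptive measures without any third-party packages:

```python
import csv
import statistics

# Hypothetical CSV file with a numeric "amount" column.
with open("transactions.csv", newline="") as f:
    amounts = [float(row["amount"]) for row in csv.DictReader(f)]

# Descriptive statistics straight from the standard library.
print("mean:  ", statistics.mean(amounts))
print("median:", statistics.median(amounts))
print("stdev: ", statistics.stdev(amounts))
```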
Operational intelligence
In Chapter 8, Big Data Operational Intelligence with Splunk, of this book, we concentrate on big data Operational Intelligence.
Operational intelligence (OI) is a type of analytics that attempts to deliver visibility and insight from (usually machine generated) operational or event data, running queries against streaming data feeds in real time, producing analytic results as operational instructions, which can be immediately acted upon by an organization, through manual or automated actions (a clear example of turning datasets into value!).
Sophisticated OI systems also provide the ability to associate metadata with certain metrics, process steps, channels, and so on, found within the data. With this ability, it becomes easy to acquire additional related information. For example, machine-generated operational data is typically full of unique identifiers and result or status codes. These codes or identifiers may be efficient for processing and storage, but they are not always easily interpreted by human beings. To make this data more readable (and therefore more valuable), we can associate additional, more user-friendly information with the data results--possibly in the form of a status or event description, or perhaps a product name or machine name.
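As a trivial illustration of that idea (a generic Python sketch; the codes, descriptions, and field names are hypothetical and not drawn from Splunk or any specific system), enriching terse status codes with readable descriptions can be as simple as a lookup:

```python
# Hypothetical mapping of terse machine status codes to readable descriptions.
STATUS_DESCRIPTIONS = {
    "E101": "Sensor offline",
    "E202": "Temperature threshold exceeded",
    "OK00": "Normal operation",
}

# Hypothetical raw events as they might arrive from a machine log.
raw_events = [
    {"machine_id": "M-17", "status": "E202"},
    {"machine_id": "M-04", "status": "OK00"},
]

# Associate the friendlier description with each event for reporting.
for event in raw_events:
    event["status_description"] = STATUS_DESCRIPTIONS.get(
        event["status"], "Unknown status"
    )
    print(event)
```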
Once there is an understanding of the challenges of applying basic analytics and visualization techniques to operational big data, the value of that data can be better, or more quickly, realized. In this chapter, we offer working examples demonstrating solutions for deriving value from operational or event big data through operational intelligence, using Splunk.
So, what is Splunk? H. Klein says:
"Splunk started out as a kind of "Google for Log files". It does a lot more... It stores all your logs and provides very fast search capabilities roughly in the same way Google does for the internet..." -- https://helgeklein.com/blog/2014/09/splunk-work/
Splunk software is a great tool to help unlock hidden value in machine generated, operational data (as well as other types of data). With Splunk, you can collect, index, search, analyze, and visualize all your data in one place, providing an integrated method to organize and extract real-time insights from massive amounts of (big data) machine data from virtually anywhere.
Splunk stores data in flat files, assigning indexes to the files; it doesn't require any database software running in the background to make this happen. Splunk calls these collections of flat files indexes. Splunk can index any type of time-series data (data with timestamps), making it an optimal choice for big data OI solutions. During data indexing, Splunk breaks the data into events based on the timestamps it identifies.
Although using simple search terms will work (for example, searching for a machine ID), Splunk also offers its own Search Processing Language (SPL). Splunk SPL (think of it as being somewhat like SQL) is an extremely powerful tool for searching enormous amounts of big data and performing statistical operations on what is relevant within a specific context.
There are multiple versions of Splunk, including a free version that is pretty much fully functional.