[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Osvaldo Martin titled Mastering Predictive Analytics with R, Second Edition. This book will help you leverage the flexibility and modularity of R to experiment with a range of different techniques and data types.[/box]
Our article will quickly walk you through all the fundamental characteristics of Big Data.
For you to determine if your data source qualifies as big data or as needing special handling, you can start by examining your data source in the following areas:
Let's examine each of these areas.
If you are talking about the number of rows or records, then most likely your data source is not a big data source since big data is typically measured in gigabytes, terabytes, and petabytes. However, space doesn't always mean big, as these size measurements can vary greatly in terms of both volume and functionality. Additionally, data sources of several million records may qualify as big data, given their structure (or lack of structure).
Data used in predictive models may be structured or unstructured (or both) and include transactions from databases, survey results, website logs, application messages, and so on (by using a data source consisting of a higher variety of data, you are usually able to cover a broader context for the analytics you derive from it). Variety, much like volume, is considered a normal qualifier for big data.
If the data source for your predictive analytics project is the result of integrating several sources, you most likely hit on both criteria of volume and variety and your data qualifies as big data. If your project uses data that is affected by governmental mandates, consumer requests is a historical analysis, you are almost certainty using big data. Government regulations usually require that certain types of data need to be stored for several years. Products can be consumer driven over the lifetime of the product and with today's trends, historical analysis data is usually available for more than five years. Again, all examples of big data sources.
You will often find that data sources typically fall into one of the following three categories:
1. Sources with little or no structure in the data (such as simple text files).
2. Sources containing both structured and unstructured data (like data that is sourced from document management systems or various websites, and so on).
3. Sources containing highly structured data (like transactional data stored in a relational database example).
How your data source is categorized will determine how you prepare and work with your data in each phase of your predictive analytics project.
Although data sources with structure can obviously still fall into the category of big data, it's data containing both structured and unstructured data (and of course totally unstructured data) that fit as big data and will require special handling and or pre-processing.
Finally, we should take a note here that other factors (other than those discussed already in the chapter) can qualify your project data source as being unwieldy, overly complex, or a big data source.
These include (but are not limited to):
Once you have determined that the data source that you will be using in your predictive analytics project seems to qualify as big (again as we are using the term here) then you can proceed with the process of deciding how to manage and manipulate that data source, based upon the known challenges this type of data demands, so as to be most effective.
In the next section, we will review some of these common problems, before we go on to offer useable solutions.
We have learned fundamental characteristics which define Big Data, to further use them for Analytics.
If you enjoyed our post, check out the book Mastering Predictive Analytics with R, Second Edition to learn complex machine learning models using R.