Packt+ | Advance your knowledge in tech

You're reading from Practical Real-time Data Processing and Analytics Distributed Computing and Event Processing using Apache Spark, Flink, Storm, and Kafka

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787281202

Length 360 pages

Edition 1st Edition

Languages

Processing

Tools

Apache Spark

Concepts

Data Analysis

Authors (2):

Shilpi Saxena

Saurabh Gupta

View More author details

Well to begin with, in simple terms, big data helps us deal with three V's – volume, velocity, and variety. Recently, two more V's were added to it, making it a five–dimensional paradigm; they are veracity and value

Volume: This dimension refers to the amount of data; look around you, huge amounts of data are being generated every second – it may be the email you send, Twitter, Facebook, or other social media, or it can just be all the videos, pictures, SMS messages, call records, and data from varied devices and sensors. We have scaled up the data–measuring metrics to terabytes, zettabytes and Yottabytes – they are all humongous figures. Look at Facebook alone; it's like ~10 billion messages on a day, consolidated across all users. We have ~5 billion likes a day and around ~400 million photographs are uploaded each day. Data statistics in terms of volume are startling; all of the data generated from the beginning of time to 2008 is kind of equivalent to what we generate in a day today, and I am sure soon it will be an hour. This volume aspect alone is making the traditional database dwarf to store and process this amount of data in reasonable and useful time frames, though a big data stack can be employed to store process and compute on amazingly large data sets in a cost–effective, distributed, and reliably efficient manner.
Velocity: This refers to the data generation speed, or the rate at which data is being generated. In today's world, where we mentioned that the volume of data has undergone a tremendous surge, this aspect is not lagging behind. We have loads of data because we are able to generate it so fast. Look at social media; things are circulated in seconds and they become viral, and the insight from social media is analysed in milliseconds by stock traders, and that can trigger lots of activity in terms of buying or selling. At a target point of sale counter it takes a few seconds for a credit card swipe, and within that fraudulent transaction processing, payment, bookkeeping, and acknowledgement is all done. Big data gives us the power to analyse the data at tremendous speed.
Variety: This dimension tackles the fact that the data can be unstructured. In the traditional database world, and even before that, we were used to having a very structured form of data that fitted neatly into tables. Today, more than 80% of data is unstructured – quotable examples are photos, video clips, social media updates, data from variety of sensors, voice recordings, and chat conversations. Big data lets you store and process this unstructured data in a very structured manner; in fact, it effaces the variety.
Veracity: It's all about validity and correctness of data. How accurate and usable is the data? Not everything out of millions and zillions of data records is corrected, accurate, and referable. That's what actual veracity is: how trustworthy the data is and what the quality of the data is. Examples of data with veracity include Facebook and Twitter posts with nonstandard acronyms or typos. Big data has brought the ability to run analytics on this kind of data to the table. One of the strong reasons for the volume of data is veracity.
Value: This is what the name suggests: the value that the data actually holds. It is unarguably the most important V or dimension of big data. The only motivation for going towards big data for processing super large data sets is to derive some valuable insight from it. In the end, it's all about cost and benefits.

Big data is a much talked about technology across businesses and the technical world today. There are myriad domains and industries that are convinced of its usefulness, but the implementation focus is primarily application-oriented, rather than infrastructure-oriented. The next section predominantly walks you through the same.