Practical Big Data Analytics

Hands-on techniques to implement enterprise analytics and machine learning using Hadoop, Spark, NoSQL and R

Product type Paperback
Published in Jan 2018
Publisher Packt
ISBN-13 9781783554393
Length 412 pages
Edition 1st Edition
Author (1): Nataraj Dasgupta
Table of Contents (13 chapters)

Preface
1. Too Big or Not Too Big
2. Big Data Mining for the Masses
3. The Analytics Toolkit
4. Big Data With Hadoop
5. Big Data Mining with NoSQL
6. Spark for Big Data Analytics
7. An Introduction to Machine Learning Concepts
8. Machine Learning Deep Dive
9. Enterprise Data Science
10. Closing Thoughts on Big Data
11. External Data Science Resources
12. Other Books You May Enjoy

Sources of big data

Technology today allows us to collect data at an astounding rate, in terms of both volume and variety. Many sources generate data, but in the context of big data, the primary sources are as follows:

  • Social networks: Arguably the primary source of big data today is the social networks that have proliferated over the past 5-10 years. This is by and large unstructured data, represented by millions of social media postings and other data generated second by second through user interactions on the web across the world. Growing access to the internet worldwide has, in turn, fed the growth of data in social networks.
  • Media: Largely a result of the growth of social networks, media represents the millions, if not billions, of audio and visual uploads that take place daily. Videos uploaded to YouTube, music recordings on SoundCloud, and pictures posted on Instagram are prime examples of media, whose volume continues to grow unrestrained.
  • Data warehouses: Companies have long invested in specialized data storage facilities commonly known as data warehouses. A data warehouse (DW) is essentially a collection of historical data that a company wishes to maintain and catalog for easy retrieval, whether for internal use or regulatory purposes. As industries gradually shift toward storing data in platforms such as Hadoop and NoSQL, more and more companies are moving data from their pre-existing data warehouses to some of the newer technologies. Company emails, accounting records, databases, and internal documents are some examples of DW data that is now being offloaded onto Hadoop or Hadoop-like platforms, which use multiple nodes to provide a highly available and fault-tolerant platform.
  • Sensors: A more recent phenomenon in the space of big data has been the collection of data from sensor devices. Sensors have always existed, and industries such as oil and gas have used drilling sensors for measurements at oil rigs for many decades. However, the advent of wearable devices such as Fitbit and Apple Watch, part of what is known as the Internet of Things (IoT), means that each individual can now stream data at the rate at which a few oil rigs did just 10 years ago.

Wearable devices can collect hundreds of measurements from an individual at any given point in time. Although sensor data is not yet a big data problem as such, as the industry keeps evolving it is likely to become more akin to the spontaneous data generated on the web through social network activity.

The 4Vs of big data

The 4Vs have become an overused topic in the context of big data and have started to lose some of their initial charm. Nevertheless, it helps to bear in mind what these Vs indicate, if only to have the background needed to carry on a conversation.

Broadly, the 4Vs indicate the following:

  • Volume: The amount of data that is being generated
  • Variety: The different types of data, such as textual, media, and sensor or streaming data
  • Velocity: The speed at which data is being generated, such as millions of messages being exchanged at any given time across social networks
  • Veracity: A more recent addition to the original 3Vs, this indicates the noise inherent in data, such as inconsistencies in recorded information that require additional validation
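
To make the four terms concrete, here is a minimal Python sketch that computes one rough figure per V for a small batch of records. It is an illustration only, not a method from the book: the record layout, the sample values, the function name summarize_4vs, and the choice to treat an empty payload as "noise" are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical records: (ISO timestamp, kind of data, payload).
# All values are made up for illustration.
records = [
    ("2018-01-01T00:00:00", "text", "hello world"),
    ("2018-01-01T00:00:01", "sensor", "72"),
    ("2018-01-01T00:00:01", "media", "photo.jpg"),
    ("2018-01-01T00:00:02", "sensor", ""),  # empty reading, counted as noise
    ("2018-01-01T00:00:04", "text", "big data"),
]


def summarize_4vs(batch):
    """Return one rough figure per V for a non-empty batch of records."""
    times = [datetime.fromisoformat(ts) for ts, _, _ in batch]
    span_seconds = (max(times) - min(times)).total_seconds()

    # Volume: total payload size in bytes.
    volume = sum(len(payload.encode("utf-8")) for _, _, payload in batch)

    # Variety: the distinct kinds of data seen in the batch.
    variety = sorted({kind for _, kind, _ in batch})

    # Velocity: records per second over the observed time span.
    # If every record shares one timestamp, fall back to the record count.
    velocity = len(batch) / span_seconds if span_seconds > 0 else float(len(batch))

    # Veracity: share of records that pass a (deliberately simple) validity
    # check. Assumption: an empty payload stands in for noisy data.
    noisy = sum(1 for _, _, payload in batch if not payload.strip())
    veracity = 1 - noisy / len(batch)

    return {
        "volume_bytes": volume,
        "variety": variety,
        "velocity_per_sec": velocity,
        "veracity": veracity,
    }


summary = summarize_4vs(records)
print(summary)
```

In a real system each of these figures would be far larger and would come from distributed tooling rather than an in-memory list, but the quantities being measured are the same.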