Practical Big Data Analytics

Hands-on techniques to implement enterprise analytics and machine learning using Hadoop, Spark, NoSQL and R

Product type Paperback
Published in Jan 2018
Publisher Packt
ISBN-13 9781783554393
Length 412 pages
Edition 1st Edition
Author (1):
Nataraj Dasgupta

Table of Contents (13 chapters)

Preface
1. Too Big or Not Too Big
2. Big Data Mining for the Masses
3. The Analytics Toolkit
4. Big Data With Hadoop
5. Big Data Mining with NoSQL
6. Spark for Big Data Analytics
7. An Introduction to Machine Learning Concepts
8. Machine Learning Deep Dive
9. Enterprise Data Science
10. Closing Thoughts on Big Data
11. External Data Science Resources
12. Other Books You May Enjoy

Why are we talking about big data now if data has always existed?

By the early 2000s, rapid advances in computing and related technologies, such as storage, allowed users to collect and store data with unprecedented efficiency. The internet added further impetus by providing a platform with seemingly unlimited capacity to exchange information on a global scale. Technology advanced at a breathtaking pace and led to major paradigm shifts, powered by social media, connected devices such as smartphones, and the availability of broadband connections and, by extension, user participation even in remote parts of the world.

By and large, the majority of this data is generated by web-based sources, such as social networks like Facebook and video-sharing sites like YouTube. In big data parlance, this is known as unstructured data: data that is not in a fixed format such as a spreadsheet and that cannot be easily stored in a traditional database system.

The simultaneous advances in computing capabilities meant that although data was being generated at a very high rate, it was still computationally feasible to analyze it. Machine learning algorithms that were once considered intractable, due to both data volume and algorithmic complexity, could now be run far more simply using new paradigms such as cluster or multinode processing, where they would earlier have required special-purpose machines.
Chart of data generated per minute. Credit: DOMO Inc.
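
The cluster or multinode processing mentioned above rests on a simple idea: split the data, let each worker process its own slice, then combine the partial results. The following is a minimal sketch of that pattern on a single machine, not how any particular cluster framework implements it. The function names (`split_into_chunks`, `count_words`, `merge_counts`, `word_count`) and the word-counting task are assumptions chosen for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Partition a list into roughly equal slices, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Map step: a worker counts words in its own slice only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(records, n_chunks=4, mapper=map):
    """Run the map step over every chunk, then reduce the partial counts.

    `mapper` is the built-in map by default; anything with the same
    calling convention (function, iterable) can be swapped in.
    """
    partials = mapper(count_words, split_into_chunks(records, n_chunks))
    return reduce(merge_counts, partials, Counter())


# Hypothetical toy input standing in for a large text dataset
lines = ["big data analytics", "data mining", "big data"]
totals = word_count(lines, n_chunks=2)
print(totals.most_common(2))  # [('data', 3), ('big', 2)]
```

Because the map step only ever sees one chunk, the chunks could in principle be handled by separate processes or machines. On one machine, passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `mapper` would be one way to try this.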

Definition of big data

Collectively, the volume of data being generated has come to be termed big data, and analytics on it, spanning a wide range of techniques from basic data mining to advanced machine learning, is known as big data analytics. There is no exact definition, because how large a dataset must be for a specific use case to qualify as big data analytics is relative. Rather, in a generic sense, performing analysis on large-scale datasets, on the order of tens or hundreds of gigabytes to petabytes, can be termed big data analytics. This can range from something as simple as finding the number of rows in a large dataset to applying a machine learning algorithm to it.
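
Even the simplest task named above, counting rows, has to be done differently once a dataset no longer fits in memory. As a rough sketch, assuming the data is a CSV file, the rows can be streamed one at a time rather than loaded all at once. The function name `count_rows` and the sample data are hypothetical.

```python
import csv
import io


def count_rows(lines, has_header=True):
    """Count data rows by streaming through an iterable of CSV lines.

    Only one row is held in memory at a time, so the same function
    works for a small sample or a file far larger than RAM.
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    return sum(1 for _ in reader)


# In-memory stand-in for a large file; a real dataset would be passed
# as an open file handle, e.g. open(path, newline="")
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
print(count_rows(sample))  # 3
```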

Building blocks of big data analytics

At a fundamental level, big data systems can be considered to have four major layers, each of which is indispensable. Textbooks and other literature outline many such layers, so the terminology can be ambiguous. Nevertheless, at a high level, the layers defined here are both intuitive and simple:

Big Data Analytics Layers

The levels are broken down as follows:

  • Hardware: Servers that provide the computing backbone, storage devices that store the data, and network connectivity across different server components are some of the elements that define the hardware stack. In essence, the systems that provide the computational and storage capabilities and systems that support the interoperability of these devices form the foundational layer of the building blocks.
  • Software: Software resources that facilitate analytics on the datasets hosted in the hardware layer, such as Hadoop and NoSQL systems, represent the next level in the big data stack. Analytics software can be classified into various subdivisions. Two of the primary high-level classifications are tools that facilitate:
    • Data mining: Software that provides facilities for aggregations, joins across datasets, and pivot tables on large datasets falls into this category. Standard NoSQL platforms such as Cassandra, Redis, and others are high-level data mining tools for big data analytics.
    • Statistical analytics: Platforms that provide analytics capabilities beyond simple data mining, such as Google TensorFlow or R, which can run algorithms ranging from simple regressions to advanced neural networks, fall into this category.
  • Data management: Data encryption, governance, access, compliance, and other features that any enterprise or production environment needs in order to manage and, in some ways, reduce operational complexity form the next basic layer. Although they are less tangible than hardware or software, data management tools provide a defined framework with which organizations can fulfill obligations such as security and compliance.
  • End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. A data platform, after all, is only as good as the extent to which it can be leveraged efficiently to address business-specific use cases. This is where the practitioner who uses the analytics platform to derive value comes into play. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities, while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
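
To make the data mining category in the list above more concrete, the sketch below shows an aggregation followed by a join across two datasets in plain Python. It is only an illustration of the operations themselves on toy data; it does not reflect how Cassandra, Redis, or any other platform executes them. The datasets and the names `total_by_customer` and `join_names` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy datasets standing in for two large tables
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 1, "amount": 60.0},
]
customers = {1: "Asha", 2: "Ben"}


def total_by_customer(order_rows):
    """Aggregation: sum the order amounts for each customer id."""
    totals = defaultdict(float)
    for row in order_rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def join_names(totals, customer_names):
    """Join: attach each customer's name to its aggregated total.

    Ids missing from `customer_names` are dropped (an inner join).
    """
    return {
        customer_names[cid]: total
        for cid, total in totals.items()
        if cid in customer_names
    }


report = join_names(total_by_customer(orders), customers)
print(report)  # {'Asha': 180.0, 'Ben': 35.5}
```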