You're reading from Big Data Analytics Real time analytics using Apache Spark and Hadoop

Product type Paperback

Published in Sep 2016

Publisher Packt

ISBN-13 9781785884696

Length 326 pages

Edition 1st Edition

Tools

Hadoop

Concepts

Big Data

Author (1):

Venkat Ankam


Table of Contents (12) Chapters

Preface

1. Big Data Analytics at a 10,000-Foot View

2. Getting Started with Apache Hadoop and Apache Spark

3. Deep Dive into Apache Spark

4. Big Data Analytics with Spark SQL, DataFrames, and Datasets

5. Real-Time Analytics with Spark Streaming and Structured Streaming

6. Notebooks and Dataflows with Spark and Hadoop

7. Machine Learning with Spark and Hadoop

8. Building Recommendation Systems with Spark and Mahout

9. Graph Analytics with GraphX

10. Interactive Analytics with SparkR

Index

Chapter 1. Big Data Analytics at a 10,000-Foot View

The goal of this book is to familiarize you with the tools and techniques built around Apache Spark, with a focus on Hadoop deployments and the tools used on the Hadoop platform. Most production implementations of Spark run on Hadoop clusters, and users face many integration challenges across the wide variety of tools used with Spark and Hadoop. This book addresses the integration challenges involving the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN), and explains the various tools used with Spark and Hadoop. It also discusses all the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR) and their integration with analytics components such as Jupyter, Zeppelin, Hive, and HBase, and with dataflow tools such as NiFi. A real-time example of a recommendation system using MLlib will help us understand data science techniques.

In this chapter, we will approach Big Data analytics from a broad perspective and try to understand what tools and techniques are used on the Apache Hadoop and Apache Spark platforms.

Big Data analytics is the process of analyzing Big Data to provide past, current, and future statistics and useful insights that can be used to make better business decisions.

Big Data analytics is broadly classified into two major categories, data analytics and data science, which are interconnected disciplines. This chapter will explain the differences between data analytics and data science. Current industry definitions for data analytics and data science vary according to their use cases, but let's try to understand what they accomplish.

Data analytics focuses on the collection and interpretation of data, typically with an emphasis on past and present statistics. Data science, on the other hand, focuses on the future: it uses exploratory analytics to provide recommendations based on models built from past and present data.
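As a rough illustration of this distinction (a minimal sketch, not taken from the book), the following Python snippet first summarizes past data, which is the data analytics view, and then fits a simple least-squares line to project the next value, which is the data science view. The sales figures and the function names `describe` and `forecast_next` are hypothetical and chosen only for illustration; on a real cluster the same ideas would more likely be expressed with tools such as Spark SQL and MLlib.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from the book).
monthly_sales = [100.0, 110.0, 120.0, 130.0]


def describe(values):
    """Data analytics view: summarize what has already happened."""
    return {
        "total": sum(values),
        "average": mean(values),
        "latest": values[-1],
    }


def forecast_next(values):
    """Data science view: fit a least-squares line and extrapolate one step."""
    n = len(values)
    xs = range(n)  # time index 0, 1, ..., n-1
    x_bar = mean(xs)
    y_bar = mean(values)
    # Ordinary least-squares slope and intercept for y = intercept + slope * x.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    # Predict the value at the next time index, n.
    return intercept + slope * n


summary = describe(monthly_sales)
prediction = forecast_next(monthly_sales)
print("Past and present:", summary)
print("Projected next month:", prediction)
```

The straight-line fit is only a stand-in for the predictive models discussed later; the point is that the first function reports on existing data while the second uses that data to say something about the future.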

Figure 1.1 explains the difference between data analytics and data science with respect to time and value achieved. It also shows the typical questions asked and the tools and techniques used. Data analytics mainly comprises two types of analytics: descriptive analytics and diagnostic analytics. Data science also comprises two types: predictive analytics and prescriptive analytics.


Figure 1.1: Data analytics versus data science

The following table explains the differences with respect to processes, tools, techniques, skill sets, and outputs:

	Data analytics	Data science
Perspective	Looking backward	Looking forward
Nature of work	Report and optimize	Explore, discover, investigate, and visualize
Output	Reports and dashboards	Data product
Typical tools used	Hive, Impala, Spark SQL, and HBase	MLlib and Mahout
Typical techniques used	ETL and exploratory analytics	Predictive analytics and sentiment analytics
Typical skill set necessary	Data engineering, SQL, and programming	Statistics, machine learning, and programming

This chapter will cover the following topics:

Big Data analytics and the role of Hadoop and Spark
Big Data science and the role of Hadoop and Spark
Tools and techniques
Real-life use cases
