Big Data Analytics: Real time analytics using Apache Spark and Hadoop

Venkat Ankam

4.7 (7 Ratings)

Paperback Sep 2016 326 pages 1st Edition

Big Data Analytics

Chapter 2. Getting Started with Apache Hadoop and Apache Spark

In this chapter, we will cover the basics of Hadoop and Spark, see how Spark differs from MapReduce, and get started with installing clusters and setting up the tools needed for analytics.

This chapter is divided into the following subtopics:

Introducing Apache Hadoop
Introducing Apache Spark
Discussing why we use Hadoop with Spark
Installing Hadoop and Spark clusters

Why Hadoop plus Spark?

Apache Spark works best when it is combined with Hadoop. To understand why, let's look at the features of each.

Hadoop features

| Feature | Details |
| --- | --- |
| Unlimited scalability | Stores unlimited data by scaling out HDFS; effectively manages cluster resources with YARN; runs multiple applications along with Spark; thousands of simultaneous users |
| Enterprise grade | Provides security with Kerberos authentication and ACLs authorization; data encryption; high reliability and integrity; multi-tenancy |
| Wide range of applications | Files: structured, semi-structured, and unstructured; streaming sources: Flume and Kafka; databases: any RDBMS and NoSQL database |

Spark features

| Feature | Details |
| --- | --- |
| Easy development | No boilerplate coding; multiple native APIs such as Java, Scala, Python, and R; REPL for Scala, Python, and R |
| Optimized performance | Caching; optimized shuffle; Catalyst Optimizer |
| Unification | Batch, SQL, machine learning, streaming, and graph processing... |
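The "Caching" entry under optimized performance can be illustrated with a toy sketch in plain Python. It is only an analogy and does not use Spark's API: the function names (`load_records`, `total_for`, `run_passes`) and the sample data are hypothetical, invented for this illustration. The idea is that several analysis passes over the same data can reuse one in-memory copy rather than re-reading the source each time.

```python
# Toy, Spark-free analogy for caching: reuse one in-memory copy of the data
# across several passes instead of re-reading the source for every pass.
# All names and data here are hypothetical and exist only for illustration.

load_calls = 0  # counts how often the "expensive" source is read


def load_records():
    """Stand-in for an expensive read, such as scanning files on disk."""
    global load_calls
    load_calls += 1
    return [("spark", 3), ("hadoop", 5), ("spark", 4), ("yarn", 1)]


def total_for(records, key):
    """One analysis pass: sum the values recorded for a single key."""
    return sum(value for name, value in records if name == key)


def run_passes(keys, cache):
    """Run one pass per key; with cache=True the source is read only once."""
    global load_calls
    load_calls = 0
    cached = load_records() if cache else None
    totals = {}
    for key in keys:
        records = cached if cache else load_records()
        totals[key] = total_for(records, key)
    return totals, load_calls


without_cache = run_passes(["spark", "hadoop", "yarn"], cache=False)
with_cache = run_passes(["spark", "hadoop", "yarn"], cache=True)
print(without_cache)  # same totals, source read 3 times
print(with_cache)     # same totals, source read once
```

Both runs produce the same totals; the only difference is how many times the source is read. In a real cluster the saving would depend on data size, available memory, and the storage level chosen, none of which this sketch models.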

Key benefits

This book is based on Apache Spark 2.0 and Hadoop 2.7, the latest versions at the time of publication, integrated with the most commonly used tools.
Learn all the Spark stack components, including the latest topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
Covers integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark Connector, GraphFrames, H2O, and Hivemall.

Description

This book aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, conventional Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce to Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building big data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications, and the new Structured Streaming concept is explained with an IoT (Internet of Things) use case.

Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark. Readers will also get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the data flow tool Apache NiFi, to analyze and visualize data.

Who is this book for?

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience in big data environments is not mandatory.

What you will learn

Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, along with the wide variety of tools used with Spark and Hadoop
Understand all the Hadoop and Spark ecosystem components
Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
See batch and real-time data analytics using Spark Core, Spark SQL, and conventional and Structured Streaming
Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

Customer reviews

Ravi Oct 11, 2016

Big Data Analytics is a very helpful book for anyone familiar with Hadoop technologies and also for beginners learning the Spark ecosystem.

Amazon Verified review

Subbaraju Cherukuri Oct 11, 2016

Big data had remained an enigma to many. This book by Venkat Ankam, a highly experienced & well respected Big data trainer and Architect deals with this very premise and bares it to turn it upside down. He uses simple, regularly used instructional language constructs to unravel the most popular big data technologies of Hadoop & Spark. The reader gets immersed into it as into a popular work of fiction. So engrossing is his style of presentation that one generally does not care for a few grammatical lacunae. It is so comprehensive that if you did not find a concept in it, you may treat that it is still in incubation. The author deals with both data analytics and data science in intricate detail and with mesmerising alacrity. The range of case studies discussed cover all one might ever come across. "The bible for every Hadoop and Spark engineer."

Amazon Customer Oct 11, 2016

Great book I read after a long time, Nice examples are provided, content provided in easily understandable format. I am very glad that I read this book. Thanks. Amruth puppala

Vin Oct 11, 2016

This is an excellent book for big data analytics concepts, and it covers them all in detail.

venkat Oct 10, 2016

The book has covered all aspects of the big data concepts that are required for anyone to compete in the big data market.
