Big Data Analytics

You're reading from Big Data Analytics: Real time analytics using Apache Spark and Hadoop

Product type: Paperback
Published in: Sep 2016
Publisher: Packt
ISBN-13: 9781785884696
Length: 326 pages
Edition: 1st Edition
Author: Venkat Ankam
Table of Contents

Preface
1. Big Data Analytics at a 10,000-Foot View
2. Getting Started with Apache Hadoop and Apache Spark
3. Deep Dive into Apache Spark
4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
5. Real-Time Analytics with Spark Streaming and Structured Streaming
6. Notebooks and Dataflows with Spark and Hadoop
7. Machine Learning with Spark and Hadoop
8. Building Recommendation Systems with Spark and Mahout
9. Graph Analytics with GraphX
10. Interactive Analytics with SparkR
Index

Tools and techniques

Let's take a look at the different tools and techniques used with Hadoop and Spark for Big Data analytics.

The Hadoop platform can be used for both storing data (HDFS) and processing it (MapReduce). Spark, by contrast, is a processing engine only: it has no storage layer of its own and works by reading data from an external store, such as HDFS, into memory.

The following summarizes the tools and techniques used at each stage of a typical Big Data analytics project:

Data collection

Tools used:
- Apache Flume for real-time data collection and aggregation
- Apache Sqoop for data import and export from relational data stores and NoSQL databases
- Apache Kafka for publish-subscribe messaging
- General-purpose tools such as FTP/Copy

Techniques used:
- Real-time data capture
- Export
- Import
- Message publishing
- Data APIs
- Screen scraping

Data storage and formats

Tools used:
- HDFS: Primary storage of Hadoop
- HBase: NoSQL database
- Parquet: Columnar format
- Avro: Serialization system on Hadoop
- SequenceFile: Binary key-value pairs
- RCFile: First columnar format in Hadoop
- ORC: Optimized RCFile
- XML and JSON: Standard data interchange formats
- Compression formats: Gzip, Snappy, LZO, Bzip2, Deflate, and others
- Unstructured data: text, images, videos, and so on

Techniques used:
- Data storage
- Data archival
- Data compression
- Data serialization
- Schema evolution

Data transformation and enrichment

Tools used:
- MapReduce: Hadoop's processing framework
- Spark: Compute engine
- Hive: Data warehouse and querying
- Pig: Data flow language
- Python: General-purpose language that supports functional-style programming
- Crunch, Cascading, Scalding, and Cascalog: Special MapReduce tools

Techniques used:
- Data munging
- Filtering
- Joining
- ETL
- File format conversion
- Anonymization
- Re-identification

Data analytics

Tools used:
- Hive: Data warehouse and querying
- Pig: Data flow language
- Tez: DAG-based execution engine, an alternative to MapReduce
- Impala: SQL query engine, an alternative to MapReduce-based querying
- Drill: SQL query engine, an alternative to MapReduce-based querying
- Apache Storm: Real-time compute engine
- Spark Core: Spark core compute engine
- Spark Streaming: Real-time compute engine
- Spark SQL: For SQL analytics
- Solr: Search platform
- Apache Zeppelin: Web-based notebook
- Jupyter Notebooks
- Databricks Cloud
- Apache NiFi: Data flow
- Spark-on-HBase connector
- Programming languages: Java, Scala, and Python

Techniques used:
- Online Analytical Processing (OLAP)
- Data mining
- Data visualization
- Complex event processing
- Real-time stream processing
- Full text search
- Interactive data analytics

Data science

Tools used:
- Python: General-purpose language that supports functional-style programming
- R: Statistical computing language
- Mahout: Hadoop's machine learning library
- MLlib: Spark's machine learning library
- GraphX and GraphFrames: Spark's graph processing framework and its DataFrame-based graph API

Techniques used:
- Predictive analytics
- Sentiment analytics
- Text and Natural Language Processing
- Network analytics
- Cluster analytics
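
To make a few of the data transformation and enrichment techniques listed above more concrete (filtering, joining, anonymization, and file format conversion), here is a minimal sketch in plain Python. It deliberately uses only the standard library rather than Spark or MapReduce, so it illustrates the techniques themselves and not how any of the listed tools implements them. The sample data, the `SALT` value, and the function names `pseudonymize`, `transform`, and `to_json_lines` are all hypothetical and chosen only for this illustration. Note also that a salted hash is strictly pseudonymization, a weaker guarantee than full anonymization.

```python
import csv
import hashlib
import io
import json

# Hypothetical sample inputs: orders as CSV text and a customer lookup table.
ORDERS_CSV = """order_id,customer_id,amount
1,c1,25.00
2,c2,0.00
3,c1,40.50
4,c3,12.25
"""

CUSTOMERS = {
    "c1": {"email": "ann@example.com", "country": "US"},
    "c2": {"email": "bob@example.com", "country": "DE"},
}

# Assumption: a real project would load this secret from secure configuration.
SALT = "demo-salt"


def pseudonymize(value: str, salt: str = SALT) -> str:
    """Replace an identifier with a salted SHA-256 digest, truncated to 12 hex characters."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def transform(orders_csv: str, customers: dict) -> list:
    """Filter, join, and pseudonymize order records."""
    result = []
    for row in csv.DictReader(io.StringIO(orders_csv)):
        amount = float(row["amount"])
        if amount <= 0:
            # Filtering: drop orders with no value.
            continue
        customer = customers.get(row["customer_id"])
        if customer is None:
            # Inner join: skip orders whose customer key has no match.
            continue
        result.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            # Enrichment: attribute taken from the joined customer table.
            "country": customer["country"],
            # Anonymization step: the e-mail address never reaches the output.
            "customer": pseudonymize(customer["email"]),
        })
    return result


def to_json_lines(records: list) -> str:
    """File format conversion: emit one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    print(to_json_lines(transform(ORDERS_CSV, CUSTOMERS)))
```

With the sample data, order 2 is removed by the filter and order 4 is removed by the join, so two records remain. In a real project the same steps would typically be expressed with one of the tools above, for example as DataFrame operations in Spark or as a Hive query.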
