Mastering Apache Spark 2.x

Advanced techniques in complex Big Data processing, streaming analytics and machine learning

Product type: Paperback
Published: Jul 2017
Publisher: Packt
ISBN-13: 9781786462749
Length: 354 pages
Edition: 2nd Edition
Author: Romeo Kienzler

Table of Contents (15 chapters)

Preface
1. A First Taste and What’s New in Apache Spark V2
2. Apache Spark SQL
3. The Catalyst Optimizer
4. Project Tungsten
5. Apache Spark Streaming
6. Structured Streaming
7. Apache Spark MLlib
8. Apache SparkML
9. Apache SystemML
10. Deep Learning on Apache Spark with DeepLearning4j and H2O
11. Apache Spark GraphX
12. Apache Spark GraphFrames
13. Apache Spark with Jupyter Notebooks on IBM DataScience Experience
14. Apache Spark on Kubernetes

What this book covers

Chapter 1, A First Taste and What's New in Apache Spark V2, provides an overview of Apache Spark, the functionality that is available within its modules, and how it can be extended. It covers the tools available in the Apache Spark ecosystem outside the standard Apache Spark modules for processing and storage. It also provides tips on performance tuning.

Chapter 2, Apache Spark SQL, shows how to create a schema in Spark SQL and how to query data efficiently using the relational API on DataFrames and Datasets, and explores SQL.

Chapter 3, The Catalyst Optimizer, explains what a cost-based optimizer in database systems is and why it is necessary. You will master the features and limitations of the Catalyst Optimizer in Apache Spark.

Chapter 4, Project Tungsten, explains why Project Tungsten is essential for Apache Spark and how memory management, cache-aware computation, and code generation are used to speed up execution.

Chapter 5, Apache Spark Streaming, covers continuous applications built with Apache Spark Streaming. You will learn how to process data incrementally and create actionable insights.

Chapter 6, Structured Streaming, talks about Structured Streaming – a new way of defining continuous applications using the DataFrame and Dataset APIs.

Chapter 7, Classical MLlib, introduces you to MLlib, the de facto standard for machine learning when using Apache Spark.

Chapter 8, Apache SparkML, introduces you to the DataFrame-based machine learning library of Apache Spark: the new first-class citizen when it comes to high-performance, massively parallel machine learning.

Chapter 9, Apache SystemML, introduces you to Apache SystemML, another machine learning library that runs on top of Apache Spark and incorporates advanced features such as a cost-based optimizer, hybrid execution plans, and low-level operator rewrites.

Chapter 10, Deep Learning on Apache Spark using H2O and DeepLearning4j, explains that deep learning is currently outperforming one traditional machine learning discipline after another. Three first-class open source deep learning libraries run on top of Apache Spark: H2O, DeepLearning4j, and Apache SystemML. The chapter explains what deep learning is and how to use it on top of Apache Spark with these libraries.

Chapter 11, Apache Spark GraphX, covers graph processing with Scala using GraphX. You will learn basic and advanced graph algorithms and how to use GraphX to execute them.

Chapter 12, Apache Spark GraphFrames, discusses graph processing with Scala using GraphFrames. You will learn basic and advanced graph algorithms and how GraphFrames differ from GraphX in execution.

Chapter 13, Apache Spark with Jupyter Notebooks on IBM DataScience Experience, introduces a Platform as a Service offering from IBM that is based entirely on an open source stack and on open standards. The main advantage is that there is no vendor lock-in: everything you learn here can be installed and used in other clouds, in a local datacenter, or on your local laptop or PC.

Chapter 14, Apache Spark on Kubernetes, explains that Platform as a Service cloud providers completely manage the operations part of an Apache Spark cluster for you. This is an advantage, but sometimes you need to access individual cluster nodes for debugging and tweaking, without taking on the complexity of maintaining a real cluster on bare-metal or virtual systems. In that case, Kubernetes might be the best solution. This chapter explains what Kubernetes is and how it can be used to set up an Apache Spark cluster.
