Data technologies
When Hadoop first appeared, the name referred to the combination of HDFS and the MapReduce processing paradigm, as that was the outline of the original paper, http://research.google.com/archive/mapreduce.html. Since that time, a plethora of technologies has emerged to complement Hadoop and, with the development of Apache YARN, we now see other processing paradigms, such as Spark, emerge alongside it.
Hadoop is now often used as a colloquialism for the entire big data software stack, so it is prudent at this point to define the scope of that stack for this book. The typical data architecture, along with a selection of the technologies we will visit throughout the book, is detailed as follows:
The relationships between these technologies form a dense topic, as there are complex interdependencies; for example, Spark in our stack relies on GeoMesa, which depends on Accumulo, which in turn depends on ZooKeeper and HDFS! Therefore, in order to manage these relationships, there are platforms available, such as Cloudera or Hortonworks HDP (http://hortonworks.com/products/sandbox/), which provide consolidated user interfaces and centralized configuration. The choice of platform is left to the reader; however, it is not recommended to install a few of the technologies standalone at first and then move to a managed platform, as the version conflicts encountered will be very complex to resolve. It is therefore usually easier to start with a clean machine and decide upfront which direction to take.
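To give a sense of how this dependency chain surfaces in practice, the following build.sbt fragment sketches how an unmanaged, Spark-based project might pin such a stack; the artifact coordinates and version numbers are illustrative assumptions only and should be checked against the documentation for your chosen distribution:

```scala
// build.sbt (sketch): coordinates and versions are placeholders, not a
// recommended combination; match them to your chosen platform/distribution
libraryDependencies ++= Seq(
  "org.apache.spark"         %% "spark-core"                 % "2.0.0" % "provided",
  "org.locationtech.geomesa" %% "geomesa-accumulo-datastore" % "1.3.1",
  "org.apache.accumulo"       % "accumulo-core"              % "1.7.2",
  "org.apache.zookeeper"      % "zookeeper"                  % "3.4.6",
  "org.apache.hadoop"         % "hadoop-client"              % "2.7.3"
)
```

Even in this small fragment, the transitive dependencies (Accumulo, ZooKeeper, Hadoop) must be version-aligned with GeoMesa and Spark, which is exactly the bookkeeping a managed platform takes care of for you.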
All of the software we use in this book is platform-agnostic and therefore fits into the general architecture described earlier. It can be installed independently and is relatively straightforward to use in single-server or multi-server environments without a managed product.
The role of Apache Spark
In many ways, Apache Spark is the glue that holds these components together, and it increasingly represents the hub of the software stack. It integrates with a wide variety of components, but none of them are hard-wired; indeed, even the underlying storage mechanism can be swapped out. Combining this flexibility with the ability to leverage different processing frameworks means that the original Hadoop technologies effectively become components rather than an imposing framework. The logical diagram of our architecture appears as follows:
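As an illustration of this storage flexibility, the following Scala sketch reads the same data from three different back-ends simply by changing the URI scheme; the host names, paths, and bucket name are hypothetical, and the S3 example assumes the hadoop-aws connector and credentials are configured:

```scala
import org.apache.spark.sql.SparkSession

// A single SparkSession; master("local[*]") is only for a quick local test
val spark = SparkSession.builder()
  .appName("storage-agnostic-read")
  .master("local[*]")
  .getOrCreate()

// Local filesystem
val localLines = spark.read.textFile("file:///tmp/events.log")

// HDFS: only the URI scheme and host change, not the processing code
val hdfsLines = spark.read.textFile("hdfs://namenode:8020/data/events.log")

// Amazon S3 (hypothetical bucket; requires hadoop-aws on the classpath)
val s3Lines = spark.read.textFile("s3a://my-bucket/data/events.log")
```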
As Spark has gained momentum and wide-scale industry acceptance, many of the original Hadoop implementations of various components have been refactored for Spark. Thus, to add further complexity to the picture, there are often several possible ways to programmatically leverage any particular component; not least the imperative and declarative versions, depending upon whether an API has been ported from the original Hadoop Java implementation. We have attempted to remain as true as possible to the Spark ethos throughout the remaining chapters.
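For example, the same word count can be written imperatively against the RDD API or declaratively against the DataFrame/Dataset API; the following minimal sketch contrasts the two styles (the input path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("imperative-vs-declarative")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Imperative (RDD): we spell out *how* to compute the result, step by step
val rddCounts = spark.sparkContext
  .textFile("hdfs:///data/corpus.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Declarative (Dataset/DataFrame): we state *what* we want and let the
// Catalyst optimizer decide how to execute it
val dfCounts = spark.read.textFile("hdfs:///data/corpus.txt")
  .flatMap(_.split("\\s+"))   // Dataset[String] with a single "value" column
  .groupBy("value")
  .count()
```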