Machine Learning with Apache Spark Quick Start Guide: Uncover patterns, derive actionable insights, and learn from big data using MLlib

What do you get with Print?

Instant access to your digital copy whilst your Print order is Shipped

Paperback book shipped to your preferred address

Redeem a companion digital copy on all Print orders

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Setting Up a Local Development Environment

In this chapter, we will install, configure, and deploy a local analytical development environment by provisioning a self-contained single-node cluster that will allow us to do the following:

Prototype and develop machine learning models and pipelines in Python
Demonstrate the functionality and usage of Apache Spark's machine learning library, MLlib, via the Spark Python API (PySpark)
Develop and test machine learning models on a single-node cluster using small sample datasets, and thereafter scale up to multi-node clusters processing much larger datasets with little or no code changes required

Our single-node cluster will host the following technologies:

Operating system: CentOS Linux 7
https://www.centos.org/download/
General Purpose Programming Languages:
- Java SE Development Kit (JDK) 8 (8u181)
  https://www.oracle.com/technetwork...

Key benefits

Make a hands-on start in the fields of Big Data, Distributed Technologies and Machine Learning

Learn how to design, develop and interpret the results of common Machine Learning algorithms

Uncover hidden patterns in your data in order to derive real actionable insights and business value

Description

Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be used for almost any purpose, from analyzing consumer habits to fighting disease and serious organized crime. Ultimately, we manage data in order to derive value from it, and many organizations around the world have traditionally invested in technology to help process their data faster and more efficiently. But we now live in an interconnected world driven by mass data creation and consumption where data is no longer rows and columns restricted to a spreadsheet, but an organic and evolving asset in its own right. With this realization comes major challenges for organizations: how do we manage the sheer size of data being created every second (think not only spreadsheets and databases, but also social media posts, images, videos, music, blogs and so on)? And once we can manage all of this data, how do we derive real value from it? The focus of Machine Learning with Apache Spark is to help us answer these questions in a hands-on manner. We introduce the latest scalable technologies to help us manage and process big data. We then introduce advanced analytical algorithms applied to real-world use cases in order to uncover patterns, derive actionable insights, and learn from this big data.

What you will learn

Understand how Spark fits in the context of the big data ecosystem

Understand how to deploy and configure a local development environment using Apache Spark

Understand how to design supervised and unsupervised learning models

Build models to perform NLP, deep learning, and cognitive services using Spark ML libraries

Design real-time machine learning pipelines in Apache Spark

Become familiar with advanced techniques for processing a large volume of data by applying machine learning algorithms

What do you get with Print?