You're reading from Mastering Machine Learning with Spark 2.x Harness the potential of machine learning, through spark

Product type Paperback

Published in Aug 2017

Publisher Packt

ISBN-13 9781785283451

Length 340 pages

Edition 1st Edition

Languages

Java

Tools

Apache Spark

Concepts

Machine Learning

Authors (3):

Alex Tellez

Michal Malohlava

Max Pumperla

View More author details

The sexiest role of the 21st century – data scientist?

At first, it's easy to paint a stereotypical picture of what a typical data scientist looks like: t-shirt, sweatpants, thick-rimmed glasses, and debugging a chunk of code in IntelliJ... you get the idea. Aesthetics aside, what are some of the traits of a data scientist? One of our favorite posters describing this role is shown here in the following diagram:

Figure 2 - What is a data scientist?

Math, statistics, and general knowledge of computer science is given, but one pitfall that we see among practitioners has to do with understanding the business problem, which goes back to asking intelligent questions of the data. It cannot be emphasized enough: asking more intelligent questions of the data is a function of the data scientist's understanding of the business problem and the limitations of the data; without this fundamental understanding, even the most intelligent algorithm would be unable to come to solid conclusions based on a wobbly foundation.

A day in the life of a data scientist

This will probably come as a shock to some of you—being a data scientist is more than reading academic papers, researching new tools, and model building until the wee hours of the morning, fueled on espresso; in fact, this is only a small percentage of the time that a data scientist gets to truly play (the espresso part however is 100% true for everyone)! Most part of the day, however, is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering or feature extractions tasks), and how best to present the findings to non data-sciencey people. This is where the true sausage making process takes place, and the best data scientists are the ones who relish in this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top-to-tail!

So, what (and who) is involved in asking questions about data? Sometimes, it is process of saving data into a relational database and running SQL queries to find insights into data: "for the millions of users that bought this particular product, what are the top 3 OTHER products also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions, like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects and is also where we see a proliferation of emerging technologies that look to make this Q and A system easier, with more functionality.

Some of the most popular, open source frameworks that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping and pointing out a clear distinction between big versus small data. Our general rule of thumb in the office goes as follows:

If you can open your dataset using Excel, you are working with small data.

Working with big data

What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to account for more than a single-node computation? If only things were that simple! There are many reasons why the scaling of algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names:

B
D
X
A
D
A

We would like to compute the number of occurrences of individual words in the file. If the file fits into a single machine, you can easily compute the number of occurrences by using a combination of the Unix tools, sort and uniq:

bash> sort file | uniq -c

The output is as shown ahead:

2 A
1 B
1 D
1 X

However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy. For example, compute the number of occurrences of individual words for every part of the file that fits into the memory and merge the results together. Hence, even simple tasks, such as counting the occurrences of names, in a distributed environment can become more complicated.

The machine learning algorithm using a distributed environment

Machine learning algorithms combine simple tasks into complex patterns, that are even more complicated in distributed environment. Let's take a simple decision tree algorithm (reference), for example. This particular algorithm creates a binary tree that tries to fit training data and minimize prediction errors. However, in order to do this, it has to decide about the branch of tree it has to send every data point to (don't worry, we'll cover the mechanics of how this algorithm works along with some very useful parameters that you can learn in later in the book). Let's demonstrate it with a simple example:

Figure 3 - Example of red and blue data points covering 2D space.

Consider the situation depicted in preceding figure. A two-dimensional board with many points colored in two colors: red and blue. The goal of the decision tree is to learn and generalize the shape of data and help decide about the color of a new point. In our example, we can easily see that the points almost follow a chessboard pattern. However, the algorithm has to figure out the structure by itself. It starts by finding the best position of a vertical or horizontal line, which would separate the red points from the blue points.

The found decision is stored in the tree root and the steps are recursively applied on both the partitions. The algorithm ends when there is a single point in the partition:

Figure 4 - The final decision tree and projection of its prediction to the original space of points.

Splitting of data into multiple machines

For now, let's assume that the number of points is huge and cannot fit into the memory of a single machine. Hence, we need multiple machines, and we have to partition data in such a way that each machine contains only a subset of data. This way, we solve the memory problem; however, it also means that we need to distribute the computation around a cluster of machines. This is the first difference from single-machine computing. If your data fits into a single machine memory, it is easy to make decisions about data, since the algorithm can access them all at once, but in the case of a distributed algorithm, this is not true anymore and the algorithm has to be "clever" about accessing the data. Since our goal is to build a decision tree that predicts the color of a new point in the board, we need to figure out how to make the tree that will be the same as a tree built on a single machine.

The naive solution is to build a trivial tree that separates the points based on machine boundaries. But this is obviously a bad solution, since data distribution does not reflect color points at all.

Another solution tries all the possible split decisions in the direction of the X and Y axes and tries to do the best in separating both colors, that is, divides the points into two groups and minimizes the number of points of another color. Imagine that the algorithm is testing the split via the line, X = 1.6. This means that the algorithm has to ask each machine in the cluster to report the result of splitting the machine's local data, merge the results, and decide whether it is the right splitting decision. If it finds an optimal split, it needs to inform all the machines about the decision in order to record which partition each point belongs to.

Compared with the single machine scenario, the distributed algorithm constructing decision tree is more complex and requires a way of distributing the computation among machines. Nowadays, with easy access to a cluster of machines and an increasing demand for the analysis of larger datasets, it becomes a standard requirement.

Even these two simple examples show that for a larger data, proper computation and distributed infrastructure is required, including the following:

A distributed data storage, that is, if the data cannot fit into a single node, we need a way to distribute and process them on multiple machines
A computation paradigm to process and transform the distributed data and to apply mathematical (and statistical) algorithms and workflows
Support to persist and reuse defined workflows and models
Support to deploy statistical models in production

In short, we need a framework that will support common data science tasks. It can be considered an unnecessary requirement, since data scientists prefer using existing tools, such as R, Weka, or Python's scikit. However, these tools are neither designed for large-scale distributed processing nor for the parallel processing of large data. Even though there are libraries for R or Python that support limited parallel or distributed programming, their main limitation is that the base platforms, that is R and Python, were not designed for this kind of data processing and computation.

From Hadoop MapReduce to Spark

With a growing amount of data, the single-machine tools were not able to satisfy the industry needs and thereby created a space for new data processing methods and tools, especially Hadoop MapReduce, which is based on an idea originally described in the Google paper, MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html). On the other hand, it is a generic framework without any explicit support or libraries to create machine learning workflows. Another limitation of classical MapReduce is that it performs many disk I/O operations during the computation instead of benefiting from machine memory.

As you have seen, there are several existing machine learning tools and distributed platforms, but none of them is an exact match for performing machine learning tasks with large data and distributed environment. All these claims open the doors for Apache Spark.

Enter the room, Apache Spark!

Created in 2010 at the UC Berkeley AMP Lab (Algorithms, Machines, People), the Apache Spark project was built with an eye for speed, ease of use, and advanced analytics. One key difference between Spark and other distributed frameworks such as Hadoop is that datasets can be cached in memory, which lends itself nicely to machine learning, given its iterative nature (more on this later!) and how data scientists are constantly accessing the same data many times over.

Spark can be run in a variety of ways, such as the following:

Local mode: This entails a single Java Virtual Machine (JVM) executed on a single host
Standalone Spark cluster: This entails multiple JVMs on multiple hosts
Via resource manager such as Yarn/Mesos: This application deployment is driven by a resource manager, which controls the allocation of nodes, application, distribution, and deployment

What is Databricks?

If you know about the Spark project, then chances are high that you have also heard of a company called Databricks. However, you might not know how Databricks and the Spark project are related to one another. In short, Databricks was founded by the creators of the Apache Spark project and accounts for over 75% of the code base for the Spark project. Aside from being a huge force behind the Spark project with respect to development, Databricks also offers various certifications in Spark for developers, administrators, trainers, and analysts alike. However, Databricks is not the only main contributor to the code base; companies such as IBM, Cloudera, and Microsoft also actively participate in Apache Spark development.

As a side note, Databricks also organizes the Spark Summit (in both Europe and the US), which is the premier Spark conference and a great place to learn about the latest developments in the project and how others are using Spark within their ecosystem.

Throughout this book, we will give recommended links that we read daily that offer great insights and also important changes with respect to the new versions of Spark. One of the best resources here is the Databricks blog, which is constantly being updated with great content. Be sure to regularly check this out at https://databricks.com/blog.

Also, here is a link to see the past Spark Summit talks, which you may find helpful:
http://slideshare.net/databricks.

Inside the box

So, you have downloaded the latest version of Spark (depending on how you plan on launching Spark) and you have run the standard Hello, World! example....what now?!

Spark comes equipped with five libraries, which can be used separately--or in unison--depending on the task we are trying to solve. Note that in this book, we plan on using a variety of different libraries, all within the same application so that you will have the maximum exposure to the Spark platform and better understand the benefits (and limitations) of each library. These five libraries are as follows:

Core: This is the Spark core infrastructure, providing primitives to represent and store data called Resilient Distributed Dataset (RDDs) and manipulate data with tasks and jobs.
SQL : This library provides user-friendly API over core RDDs by introducing DataFrames and SQL to manipulate with the data stored.
MLlib (Machine Learning Library) : This is Spark's very own machine learning library of algorithms developed in-house that can be used within your Spark application.
Graphx : This is used for graphs and graph-calculations; we will explore this particular library in depth in a later chapter.
Streaming : This library allows real-time streaming of data from various sources, such as Kafka, Twitter, Flume, and TCP sockets, to name a few. Many of the applications we will build in this book will leverage the MLlib and Streaming libraries to build our applications.

The Spark platform can also be extended by third-party packages. There are many of them, for example, support for reading CSV or Avro files, integration with Redshift, and Sparkling Water, which encapsulates the H2O machine learning library.