Scala for Machine Learning

Chapter 1. Getting Started

It is critical for any computer scientist to understand the different classes of machine learning algorithms and be able to select the ones that are relevant to the domain of their expertise and dataset. However, the application of these algorithms represents a small fraction of the overall effort needed to extract an accurate and performing model from input data. A common data mining workflow consists of the following sequential steps:

Loading the data.
Preprocessing, analyzing, and filtering the input data.
Discovering patterns, affinities, clusters, and classes.
Selecting the model features and the appropriate machine learning algorithm(s).
Refining and validating the model.
Improving the computational performance of the implementation.

As we will emphasize throughout this book, each stage of the process is critical to build the right model.

This first chapter introduces you to the taxonomy of machine learning algorithms, the tools and frameworks used in the book, and a simple application of logistic regression to get your feet wet.

Why machine learning?

The explosion in the number of digital devices generates an ever-increasing amount of data. The best analogy I can find to describe the need, desire, and urgency to extract knowledge from large datasets is the process of extracting a precious metal from a mine, and in some cases, extracting blood from a stone.

Knowledge is quite often defined as a model that can be constantly updated or tweaked as new data comes into play. Models are obviously domain-specific ranging from credit risk assessment, face recognition, maximization of quality of service, classification of pathological symptoms of disease, optimization of computer networks, and security intrusion detection, to customers' online behavior and purchase history.

Machine learning problems are categorized as classification, prediction, optimization, and regression.

Classification

The purpose of classification is to extract knowledge from historical data. For instance, a classifier can be built to identify a disease from a set of symptoms. The scientist collects information regarding the body temperature (continuous variable), congestion (discrete variables HIGH, MEDIUM, and LOW), and the actual diagnostic (flu). This dataset is used to create a model such as IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72), which doctors can use in their diagnostic.

Prediction

Once the model is extracted and validated against the past data, it can be used to draw inference from the future data. A doctor collects symptoms from a patient, such as body temperature and nasal congestion, and anticipates the state of his/her health.

Optimization

Some global optimization problems are intractable using traditional linear and non-linear optimization methods. Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search). You can imagine that fighting the spread of a new virus requires optimizing a process that may evolve over time as more symptoms and cases are uncovered.

Regression

Regression is a classification technique that is particularly suitable for a continuous model. Linear (least square), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model, or function, y= f (xj), to a dataset. Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.

Why Scala?

Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven dynamically into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for distributed computation of large amounts of data. Among the capabilities of the language, the following features are deemed essential to machine learning and statistical analysis.

Abstraction

Monoids and monads are important concepts in functional programming. Monads are derived from the category and group theory allowing developers to create a high-level abstraction as illustrated in Twitter's Algebird (https://github.com/twitter/algebird) or Google's Breeze Scala (https://github.com/dlwh/breeze) libraries.

A monoid defines a binary operation op on a dataset T with the property of closure, identity operation, and associativity.

Let's consider the + operation is defined for a set T using the following monoidal representation:

trait Monoid[T] {
  def zero: T 
  def op(a: T, b: T): c 
}

Monoids are associative operations. For instance, if ts1, ts2, and ts3 are three time series, then the property ts1 + (ts2 + ts3) = (ts1 + ts2) + ts2 is true. The associativity of a monoid operator is critical in regards to parallelization of computational workflows.

Monads are structures that can be seen either as containers by programmers or as a generalization of Monoids. The collections bundled with the Scala standard library (list, map, and so on) are constructed as monads [1:1]. Monads provide the ability for those collections to perform the following functions:

Create the collection.
Transform the elements of the collection.
Flatten nested collections.

A common categorical representation of a monad in Scala is a trait, Monad, parameterized with a container type M:

trait Monad[M[_]] {
  def apply[T])(a: T): M[T] 
  def flatMap[T, U](m: M[T])(f: T=>M[U]): M[U] 
}

Monads allow those collections or containers to be chained to generate a workflow. This property is applicable to any scientific computation [1:2].

Scalability

As seen previously, monoids and monads enable parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, Actors are the core elements that make Scala scalable. Actors act as coroutines, managing the underlying threads pool. Actors communicate through passing asynchronous messages. A distributed computing Scala framework such as Akka and Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Spark are described in detail in the last chapter of this book [1:3].

In a nutshell, a workflow is implemented as a sequence of activities or computational tasks. Those tasks consist of high-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala allows these observations to be partitioned by executing those tasks through a cluster of actors. Scala also supports message dispatching and routing of messages between local and remote actors. The engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with no code or very little code changes.

Deployment of a workflow as a distributed computation

In this diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 similar to a scheduler. These tasks are actually executed over multiple worker nodes that are implemented by the Scala actors. The master node exchanges messages with the workers to manage the state of the execution of the workflow as well as its reliability. High availability of these tasks is implemented through a hierarchy of supervising actors.

Configurability

Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits. One of the most commonly used dependency injection patterns, the cake pattern, is used throughout this book to create dynamic computation workflows and plots.

Maintainability

Scala embeds Domain Specific Languages (DSL) natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. The most notorious application of DSLs is the definition of the emulation of the syntax used in the MATLAB program, which data scientists are familiar with.

Computation on demand

Lazy methods and values allow developers to execute functions and allocate computing resources on demand. The Spark framework relies on lazy variables and methods to chain Resilient Distributed Datasets (RDD).

Taxonomy of machine learning algorithms

The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications such as genomics, social networking, advertising, or risk analysis generate a very large amount of data that can be analyzed or mined to extract knowledge or provide insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data [1:4].

Data mining is the process of extracting or identifying patterns in a dataset.

Unsupervised learning

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process known as density estimation in statistics is broken down into two categories: discovery of data clusters and discovery of latent factors. The methodology consists of processing input data to understand patterns similar to the natural learning process in infants or animals. Unsupervised learning does not require labeled data, and therefore, is easy to implement and execute because no expertise is needed to validate an output. However, it is possible to label the output of a clustering algorithm and use it for future classification.

Clustering

The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm is used to organize observations into clusters by minimizing the observations within a cluster and maximizing the observations between clusters. A clustering algorithm consists of the following steps:

Creating a model by making an assumption on the input data.
Selecting the objective function or goal of the clustering.
Evaluating one or more algorithms to optimize the objective function.

Data clustering is also known as data segmentation or data partitioning.

Dimension reduction

Dimension reduction techniques aim at finding the smallest but most relevant set of features that models dataset reliability. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.

There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The following taxonomy breaks down these techniques according to their purpose, although the list is far for being exhaustive, as shown in the following diagram:

Supervised learning

The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to extract a relation or function f x → y from a training set {x, y}. Supervised learning is far more accurate and reliable than any other learning strategy. However, a domain expert may be required to label (tag) data as a training set for certain types of problems.

Supervised machine learning algorithms can be broken into two categories:

Generative models
Discriminative models

Generative models

In order to simplify the description of statistics formulas, we adopt the following simplification: the probability of an event X is the same as the probability of the discrete random variable X to have a value x, p(X) = p(X=x). The notation of joint probability (resp. conditional probability) becomes p(X, Y) = p(X=x, Y=y) (resp. p(X|Y)=p(X=x | Y=y).

Generative models attempt to fit a joint probability distribution, p(X,Y), of two events (or random variables), X and Y, representing two sets of observed and hidden (latent) variables x and y. Discriminative models learn the conditional probability p(Y|X) of an event or random variable Y of hidden variables y, given an event or random variable X of observed variables x. Generative models are commonly introduced through the Bayes' rule. The conditional probability of an event Y, given an event X, is computed as the product of the conditional probability of the event X, given the event Y, and the probability of the event X normalized by the probability of event Y [1:5].

Tip

Join probability (if X and Y are independent):

Conditional probability:

The Bayes' rule:

The Bayes' rule is the foundation of the Naïve Bayes classifier, which is the topic of Chapter 5, Naïve Bayes Classifiers.

Discriminative models

Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.

Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation. Here is a brief guideline describing which type of models makes sense according to the objective or criteria of the project:

Objective	Generative models	Discriminative models
Accuracy	Highly dependent on the training set.	Probability estimates tend to be more accurate.
Modeling requirements	There is a need to model both observed and hidden variables, which requires a significant amount of training.	The quality of the training set does not have to be as rigorous as for generative models.
Computation cost	This is usually low. For example, any graphical method derived from the Bayes' rule has low overhead.	Most algorithms rely on optimization of a convex that introduces significant performance overhead.
Constraints	These models assume some degree of independence among the model features.	Most discriminative algorithms accommodate dependencies between features.

We can further refine the taxonomy of supervised learning algorithms by segregating between sequential and random variables for generative models and breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification):

Reinforcement learning

Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realms of robotics or game strategy. However, since the 90s, genetic-algorithms-based classifiers have become increasingly popular to solve problems that require collaboration with a domain expert. For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In its simplest form, these algorithms compute or estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert. The foremost challenge developers of reinforcement learning systems face is that the recommended action or policy may depend on partially observable states and how to deal with uncertainty.

Genetic algorithms are not usually considered part of the reinforcement learning toolbox. However, advanced models such as learning classifier systems use genetic algorithms to classify and reward the rules and policies.

As with the two previous learning strategies, reinforcement learning models can be categorized as Markovian or evolutionary:

This is a brief overview of machine learning algorithms with a suggested taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse through the list of references at the end of the book and find the documentation appropriate to your level of interest and understanding.

Tools and frameworks

Before getting your hands dirty, you need to download and deploy a minimum set of tools and libraries so as not to reinvent the wheel. A few key components have to be installed in order to compile and run the source code described throughout the book. We focus on open source and commonly available libraries, although you are invited to experiment with equivalent tools of your choice. The learning curve for the frameworks described here is minimal.

Java

The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64 . You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.

Scala

The code has been tested with Scala 2.10.4. We recommend using Scala version 2.10.3 or higher and SBT 0.13 or higher. Let's assume that Scala runtime (REPL) and libraries have been properly installed and environment variables SCALA_HOME and PATH have been updated. The description and installation instructions of the Scala plugin for Eclipse are available at http://scala-ide.org/docs/user/gettingstarted.html.

You can also download the Scala plugin for Intellij IDEA from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.

The ubiquitous simple build tool (sbt) will be our primary building engine. The syntax of the build file sbt/build.sbt conforms to version 0.13, and is used to compile and assemble the source code presented throughout this book.

Apache Commons Math

Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].

Description

This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily weaved into a machine learning problem. The examples used throughout the book require version 3.3 or higher.

The main components of Apache Commons Math are:

Functions, differentiation, and integral and ordinary differential equations
Statistics distribution
Linear and nonlinear optimization
Dense and Sparse vectors and matrices
Curve fitting, correlation, and regression

For more information, visit http://commons.apache.org/proper/commons-math.

Licensing

We need Apache Public License 2.0; the terms are available at http://www.apache.org/licenses/LICENSE-2.0.

Installation

The installation and deployment of the Commons Math library are quite simple:

Go to the download page, http://commons.apache.org/proper/commons-math/download_math.cgi.
Download the latest .jar files in the Binaries section, commons-math3-3.3-bin.zip (for version 3.3, for instance).
Unzip and install the .jar files.
Add commons-math3-3.3.jar to classpath as follows:
- For Mac OS X, use the command export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.3.jar
- For Windows, navigate to System property | Advanced system settings | Advanced | Environment variables…, then edit the entry of the CLASSPATH variable
Add the commons-math3-3.3.jar file to your IDE environment if needed (that is, for Eclipse, navigate to Project | Properties | Java Build Path | Libraries | Add External JARs).

You can also download commons-math3-3.3-src.zip from the Source section.

JFreeChart

JFreeChart is an open source chart and plotting Java library, widely used in the Java programmer community. It was originally created by David Gilbert [1:7].

Description

The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithms throughout the book, but you are encouraged to explore this great library on your own, as time permits.

Licensing

It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.

Installation

To install and deploy JFreeChart, perform the following steps:

Visit http://www.jfree.org/jfreechart.
Download the latest version from Source Forge at http://sourceforge.net/projects/jfreechart/files.
Unzip and install the .jar file.
Add jfreechart-1.0.17.jar (for version 1.0.17) to classpath as follows:
- For Mac OS, update the classpath by using export CLASSPATH=$CLASSPATH:/JFreeChart_path/ jfreechart-1.0.17.jar
- For Windows, go to System property | Advanced system settings | Advanced | Environment variables… and then edit the entry of the CLASSPATH variable
Add the jfreechart-1.0.17.jar file to your IDE environment, if needed.

Other libraries and frameworks

Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with the instructions to download them. Libraries related to the conditional random fields and support vector machines are described in the respective chapters.

Note

Why not use Scala algebra and numerical libraries

Libraries such as Breeze, ScalaNLP, and Algebird are great Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using simple common Java libraries [1:8].

Filter reviews by

All

Amazon verified reviews

matej fandl Feb 09, 2015

Studying machine learning during my university times and being an aspiring scala developer I picked this book up as an opportunity to learn scala while reading about what interests me the most in the field of computer science. This wont be an easy read for people not familiar with scala at all, but if you have some experience with the language and are interested in machine learning, I definitely recommend the book. It is a nice and quite deep dive into the topic. What I found very interesting was the optional math part available in each section. This book is also showing me where my understanding of scala is still superficial, the code is written a very good way.

Amazon Verified review

Amazon Customer Feb 20, 2015

Some technical books are heavy on theory, some are heavy on practicality. There are books that describe ‘why’ and others that show you ‘how’. Personally, I have always tended towards the ‘how’ side of things - the ‘cookbook’ approach is one I have always liked. The more practical the better.Scala For Machine Learning (full disclosure, I received an unpaid review copy) is as about as practical as it gets. Loaded with code examples, this book leaves you in no doubt that it will help you construct your own code, against your own data in the fastest time possible. But it does not take any short-cuts.Machine Learning and Scala is a broad subject to cover and SFML does an admirable job of taking you from the first steps of data preparation, right the way through to a artificial neural networks and genetic algorithms.I work with data - it has been my day-to-day working life for over twenty years, and there is never any shortage of new territory to cover. I am predominately data engineering focused, and whilst the data content is the most important thing, it is the technology that keeps me engaged. In the final chapter of SFML Nicolas covers a nice selection of frameworks for concurrent processing. This really piqued my interest - Apache Spark is a hot topic at the moment and is given a good few pages here, reviewing its Scala and Akka heritage and demonstrating its core design principles of in-memory persistence, scheduling laziness, distributed dataset actions and shared variables. Better yet Nicolas shows you how to get Spark up and running and executing your first K-means tasks.The bulk of the book is made up with detailed reviews, explanations and implementation guides for different machine learning algorithms and methodologies. Each section is made up of a wealth of detailed explanation, detailed implementation instructions and guidelines, honestly - at times - the information density can be a little overwhelming, but it is all hugely valuable, and much appreciated.Running through the whole book is, of course, an appreciation of Scala and its abilities to perform in a distributed manner, at scale. This is the sort work that Scala, in all of its object-functional glory, was designed to address. I have read other Scala books and the programming language has always been presented in a “you may be used to, but Scala…” fashion. I am happy to say that is not the case here, with solid examples, proper real world code and thorough debriefs - Scala is presented as it should be, a language in its own right, being dealt with on its own terms.If your work or study includes components of machine learning and / or data processing - and you are looking for a modern take on some time-honoured methods, this book is work picking up and consuming, avidly.

mathieu Dec 31, 2014

just wanted to let potential buyers know that the kindle version has none of the nuisances sometimes found in e-books. i just purchased this book (from amazon france) and did a quick check of the source code, equations, and diagrams. everything shows up perfectly. no comment on the content yet, but it looks quite interesting.

Amazon Customer Feb 13, 2017

Very detailed. Not for beginners in scala programming

Tomer Ben David Feb 12, 2015

My favorite book these days, using scala? doing machine learning? I don't want to read a separate book on machine learning and another one on advanced scala tailored for machine learning, can anyone please summarize scala + machine learning in the same book? Yes - This book does it. Still reading but so far so good! Perfect match for me.

Scala for Machine Learning: Leverage Scala and Machine Learning to construct and study systems that can learn from data

What do you get with a Packt Subscription?

Clustering

Dimension reduction

Generative models

Discriminative models

Description

Licensing

Installation

Description

Licensing

Installation

Description

What you will learn

Product Details

What do you get with a Packt Subscription?

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs