Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Scala for Machine Learning
Scala for Machine Learning

Scala for Machine Learning: Leverage Scala and Machine Learning to construct and study systems that can learn from data

eBook
€35.98 €39.99
Paperback
€49.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. €18.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Scala for Machine Learning

Chapter 1. Getting Started

It is critical for any computer scientist to understand the different classes of machine learning algorithms and be able to select the ones that are relevant to the domain of their expertise and dataset. However, the application of these algorithms represents a small fraction of the overall effort needed to extract an accurate and performing model from input data. A common data mining workflow consists of the following sequential steps:

  1. Loading the data.
  2. Preprocessing, analyzing, and filtering the input data.
  3. Discovering patterns, affinities, clusters, and classes.
  4. Selecting the model features and the appropriate machine learning algorithm(s).
  5. Refining and validating the model.
  6. Improving the computational performance of the implementation.

As we will emphasize throughout this book, each stage of the process is critical to build the right model.

This first chapter introduces you to the taxonomy of machine learning algorithms, the tools and frameworks used in the book, and a simple application of logistic regression to get your feet wet.

Mathematical notation for the curious

Each chapter contains a small section dedicated to the formulation of the algorithms for those interested in the mathematical concepts behind the science and art of machine learning. These sections are optional and defined within a tip box. For example, the mathematical expression of the mean and the variance of a variable X mentioned in a tip box will be as follows:

Tip

Mean value of a variable X = {x} is defined as:

Mathematical notation for the curious

The variance of a variable X = {x} is defined as:

Mathematical notation for the curious

Why machine learning?

The explosion in the number of digital devices generates an ever-increasing amount of data. The best analogy I can find to describe the need, desire, and urgency to extract knowledge from large datasets is the process of extracting a precious metal from a mine, and in some cases, extracting blood from a stone.

Knowledge is quite often defined as a model that can be constantly updated or tweaked as new data comes into play. Models are obviously domain-specific ranging from credit risk assessment, face recognition, maximization of quality of service, classification of pathological symptoms of disease, optimization of computer networks, and security intrusion detection, to customers' online behavior and purchase history.

Machine learning problems are categorized as classification, prediction, optimization, and regression.

Classification

The purpose of classification is to extract knowledge from historical data. For instance, a classifier can be built to identify a disease from a set of symptoms. The scientist collects information regarding the body temperature (continuous variable), congestion (discrete variables HIGH, MEDIUM, and LOW), and the actual diagnostic (flu). This dataset is used to create a model such as IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72), which doctors can use in their diagnostic.

Prediction

Once the model is extracted and validated against the past data, it can be used to draw inference from the future data. A doctor collects symptoms from a patient, such as body temperature and nasal congestion, and anticipates the state of his/her health.

Optimization

Some global optimization problems are intractable using traditional linear and non-linear optimization methods. Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search). You can imagine that fighting the spread of a new virus requires optimizing a process that may evolve over time as more symptoms and cases are uncovered.

Regression

Regression is a classification technique that is particularly suitable for a continuous model. Linear (least square), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model, or function, y= f (xj), to a dataset. Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.

Why Scala?

Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven dynamically into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for distributed computation of large amounts of data. Among the capabilities of the language, the following features are deemed essential to machine learning and statistical analysis.

Abstraction

Monoids and monads are important concepts in functional programming. Monads are derived from the category and group theory allowing developers to create a high-level abstraction as illustrated in Twitter's Algebird (https://github.com/twitter/algebird) or Google's Breeze Scala (https://github.com/dlwh/breeze) libraries.

A monoid defines a binary operation op on a dataset T with the property of closure, identity operation, and associativity.

Let's consider the + operation is defined for a set T using the following monoidal representation:

trait Monoid[T] {
  def zero: T 
  def op(a: T, b: T): c 
}

Monoids are associative operations. For instance, if ts1, ts2, and ts3 are three time series, then the property ts1 + (ts2 + ts3) = (ts1 + ts2) + ts2 is true. The associativity of a monoid operator is critical in regards to parallelization of computational workflows.

Monads are structures that can be seen either as containers by programmers or as a generalization of Monoids. The collections bundled with the Scala standard library (list, map, and so on) are constructed as monads [1:1]. Monads provide the ability for those collections to perform the following functions:

  1. Create the collection.
  2. Transform the elements of the collection.
  3. Flatten nested collections.

A common categorical representation of a monad in Scala is a trait, Monad, parameterized with a container type M:

trait Monad[M[_]] {
  def apply[T])(a: T): M[T] 
  def flatMap[T, U](m: M[T])(f: T=>M[U]): M[U] 
}

Monads allow those collections or containers to be chained to generate a workflow. This property is applicable to any scientific computation [1:2].

Scalability

As seen previously, monoids and monads enable parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, Actors are the core elements that make Scala scalable. Actors act as coroutines, managing the underlying threads pool. Actors communicate through passing asynchronous messages. A distributed computing Scala framework such as Akka and Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Spark are described in detail in the last chapter of this book [1:3].

In a nutshell, a workflow is implemented as a sequence of activities or computational tasks. Those tasks consist of high-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala allows these observations to be partitioned by executing those tasks through a cluster of actors. Scala also supports message dispatching and routing of messages between local and remote actors. The engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with no code or very little code changes.

Scalability

Deployment of a workflow as a distributed computation

In this diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 similar to a scheduler. These tasks are actually executed over multiple worker nodes that are implemented by the Scala actors. The master node exchanges messages with the workers to manage the state of the execution of the workflow as well as its reliability. High availability of these tasks is implemented through a hierarchy of supervising actors.

Configurability

Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits. One of the most commonly used dependency injection patterns, the cake pattern, is used throughout this book to create dynamic computation workflows and plots.

Maintainability

Scala embeds Domain Specific Languages (DSL) natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. The most notorious application of DSLs is the definition of the emulation of the syntax used in the MATLAB program, which data scientists are familiar with.

Computation on demand

Lazy methods and values allow developers to execute functions and allocate computing resources on demand. The Spark framework relies on lazy variables and methods to chain Resilient Distributed Datasets (RDD).

Model categorization

A model can be predictive, descriptive, or adaptive.

Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors. They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields such as marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a preselected training set.

Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models define the first level in knowledge discovery. They are generated through unsupervised learning.

A third category of models, known as adaptive modeling, is generated through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend and possibly execute actions in the attempt of solving a problem, optimizing an objective function, or resolving constraints.

Taxonomy of machine learning algorithms

The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications such as genomics, social networking, advertising, or risk analysis generate a very large amount of data that can be analyzed or mined to extract knowledge or provide insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data [1:4].

Data mining is the process of extracting or identifying patterns in a dataset.

Unsupervised learning

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process known as density estimation in statistics is broken down into two categories: discovery of data clusters and discovery of latent factors. The methodology consists of processing input data to understand patterns similar to the natural learning process in infants or animals. Unsupervised learning does not require labeled data, and therefore, is easy to implement and execute because no expertise is needed to validate an output. However, it is possible to label the output of a clustering algorithm and use it for future classification.

Clustering

The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm is used to organize observations into clusters by minimizing the observations within a cluster and maximizing the observations between clusters. A clustering algorithm consists of the following steps:

  1. Creating a model by making an assumption on the input data.
  2. Selecting the objective function or goal of the clustering.
  3. Evaluating one or more algorithms to optimize the objective function.

Data clustering is also known as data segmentation or data partitioning.

Dimension reduction

Dimension reduction techniques aim at finding the smallest but most relevant set of features that models dataset reliability. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.

There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The following taxonomy breaks down these techniques according to their purpose, although the list is far for being exhaustive, as shown in the following diagram:

Dimension reduction

Supervised learning

The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to extract a relation or function f x → y from a training set {x, y}. Supervised learning is far more accurate and reliable than any other learning strategy. However, a domain expert may be required to label (tag) data as a training set for certain types of problems.

Supervised machine learning algorithms can be broken into two categories:

  • Generative models
  • Discriminative models

Generative models

In order to simplify the description of statistics formulas, we adopt the following simplification: the probability of an event X is the same as the probability of the discrete random variable X to have a value x, p(X) = p(X=x). The notation of joint probability (resp. conditional probability) becomes p(X, Y) = p(X=x, Y=y) (resp. p(X|Y)=p(X=x | Y=y).

Generative models attempt to fit a joint probability distribution, p(X,Y), of two events (or random variables), X and Y, representing two sets of observed and hidden (latent) variables x and y. Discriminative models learn the conditional probability p(Y|X) of an event or random variable Y of hidden variables y, given an event or random variable X of observed variables x. Generative models are commonly introduced through the Bayes' rule. The conditional probability of an event Y, given an event X, is computed as the product of the conditional probability of the event X, given the event Y, and the probability of the event X normalized by the probability of event Y [1:5].

Tip

Join probability (if X and Y are independent):

Generative models

Conditional probability:

Generative models

The Bayes' rule:

Generative models

The Bayes' rule is the foundation of the Naïve Bayes classifier, which is the topic of Chapter 5, Naïve Bayes Classifiers.

Discriminative models

Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.

Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation. Here is a brief guideline describing which type of models makes sense according to the objective or criteria of the project:

Objective

Generative models

Discriminative models

Accuracy

Highly dependent on the training set.

Probability estimates tend to be more accurate.

Modeling requirements

There is a need to model both observed and hidden variables, which requires a significant amount of training.

The quality of the training set does not have to be as rigorous as for generative models.

Computation cost

This is usually low. For example, any graphical method derived from the Bayes' rule has low overhead.

Most algorithms rely on optimization of a convex that introduces significant performance overhead.

Constraints

These models assume some degree of independence among the model features.

Most discriminative algorithms accommodate dependencies between features.

We can further refine the taxonomy of supervised learning algorithms by segregating between sequential and random variables for generative models and breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification):

Discriminative models

Reinforcement learning

Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realms of robotics or game strategy. However, since the 90s, genetic-algorithms-based classifiers have become increasingly popular to solve problems that require collaboration with a domain expert. For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In its simplest form, these algorithms compute or estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert. The foremost challenge developers of reinforcement learning systems face is that the recommended action or policy may depend on partially observable states and how to deal with uncertainty.

Genetic algorithms are not usually considered part of the reinforcement learning toolbox. However, advanced models such as learning classifier systems use genetic algorithms to classify and reward the rules and policies.

As with the two previous learning strategies, reinforcement learning models can be categorized as Markovian or evolutionary:

Reinforcement learning

This is a brief overview of machine learning algorithms with a suggested taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse through the list of references at the end of the book and find the documentation appropriate to your level of interest and understanding.

Tools and frameworks

Before getting your hands dirty, you need to download and deploy a minimum set of tools and libraries so as not to reinvent the wheel. A few key components have to be installed in order to compile and run the source code described throughout the book. We focus on open source and commonly available libraries, although you are invited to experiment with equivalent tools of your choice. The learning curve for the frameworks described here is minimal.

Java

The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64 . You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.

Scala

The code has been tested with Scala 2.10.4. We recommend using Scala version 2.10.3 or higher and SBT 0.13 or higher. Let's assume that Scala runtime (REPL) and libraries have been properly installed and environment variables SCALA_HOME and PATH have been updated. The description and installation instructions of the Scala plugin for Eclipse are available at http://scala-ide.org/docs/user/gettingstarted.html.

You can also download the Scala plugin for Intellij IDEA from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.

The ubiquitous simple build tool (sbt) will be our primary building engine. The syntax of the build file sbt/build.sbt conforms to version 0.13, and is used to compile and assemble the source code presented throughout this book.

Apache Commons Math

Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].

Description

This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily weaved into a machine learning problem. The examples used throughout the book require version 3.3 or higher.

The main components of Apache Commons Math are:

  • Functions, differentiation, and integral and ordinary differential equations
  • Statistics distribution
  • Linear and nonlinear optimization
  • Dense and Sparse vectors and matrices
  • Curve fitting, correlation, and regression

For more information, visit http://commons.apache.org/proper/commons-math.

Licensing

We need Apache Public License 2.0; the terms are available at http://www.apache.org/licenses/LICENSE-2.0.

Installation

The installation and deployment of the Commons Math library are quite simple:

  1. Go to the download page, http://commons.apache.org/proper/commons-math/download_math.cgi.
  2. Download the latest .jar files in the Binaries section, commons-math3-3.3-bin.zip (for version 3.3, for instance).
  3. Unzip and install the .jar files.
  4. Add commons-math3-3.3.jar to classpath as follows:
    • For Mac OS X, use the command export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.3.jar
    • For Windows, navigate to System property | Advanced system settings | Advanced | Environment variables…, then edit the entry of the CLASSPATH variable
  5. Add the commons-math3-3.3.jar file to your IDE environment if needed (that is, for Eclipse, navigate to Project | Properties | Java Build Path | Libraries | Add External JARs).

You can also download commons-math3-3.3-src.zip from the Source section.

JFreeChart

JFreeChart is an open source chart and plotting Java library, widely used in the Java programmer community. It was originally created by David Gilbert [1:7].

Description

The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithms throughout the book, but you are encouraged to explore this great library on your own, as time permits.

Licensing

It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.

Installation

To install and deploy JFreeChart, perform the following steps:

  1. Visit http://www.jfree.org/jfreechart.
  2. Download the latest version from Source Forge at http://sourceforge.net/projects/jfreechart/files.
  3. Unzip and install the .jar file.
  4. Add jfreechart-1.0.17.jar (for version 1.0.17) to classpath as follows:
    • For Mac OS, update the classpath by using export CLASSPATH=$CLASSPATH:/JFreeChart_path/ jfreechart-1.0.17.jar
    • For Windows, go to System property | Advanced system settings | Advanced | Environment variables… and then edit the entry of the CLASSPATH variable
  5. Add the jfreechart-1.0.17.jar file to your IDE environment, if needed.

Other libraries and frameworks

Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with the instructions to download them. Libraries related to the conditional random fields and support vector machines are described in the respective chapters.

Note

Why not use Scala algebra and numerical libraries

Libraries such as Breeze, ScalaNLP, and Algebird are great Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using simple common Java libraries [1:8].

Left arrow icon Right arrow icon

Description

Are you curious about AI? All you need is a good understanding of the Scala programming language, a basic knowledge of statistics, a keen interest in Big Data processing, and this book!

What you will learn

  • Build dynamic workflows for scientific computing
  • Leverage open source libraries to extract patterns from time series
  • Write your own classification, clustering, or evolutionary algorithm
  • Perform relative performance tuning and evaluation of Spark
  • Master probabilistic models for sequential data
  • Experiment with advanced techniques such as regularization and kernelization
  • Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters
  • Apply key learning strategies to a technical analysis of financial markets

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Dec 17, 2014
Length: 624 pages
Edition : 1st
Language : English
ISBN-13 : 9781783558742
Category :
Languages :

What do you get with a Packt Subscription?

Free for first 7 days. €18.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Dec 17, 2014
Length: 624 pages
Edition : 1st
Language : English
ISBN-13 : 9781783558742
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 132.97
Learning Concurrent Programming in Scala
€36.99
Scala for Machine Learning
€49.99
Machine Learning with R
€45.99
Total 132.97 Stars icon

Table of Contents

14 Chapters
1. Getting Started Chevron down icon Chevron up icon
2. Hello World! Chevron down icon Chevron up icon
3. Data Preprocessing Chevron down icon Chevron up icon
4. Unsupervised Learning Chevron down icon Chevron up icon
5. Naïve Bayes Classifiers Chevron down icon Chevron up icon
6. Regression and Regularization Chevron down icon Chevron up icon
7. Sequential Data Models Chevron down icon Chevron up icon
8. Kernel Models and Support Vector Machines Chevron down icon Chevron up icon
9. Artificial Neural Networks Chevron down icon Chevron up icon
10. Genetic Algorithms Chevron down icon Chevron up icon
11. Reinforcement Learning Chevron down icon Chevron up icon
12. Scalable Frameworks Chevron down icon Chevron up icon
A. Basic Concepts Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.8
(12 Ratings)
5 star 41.7%
4 star 16.7%
3 star 25%
2 star 8.3%
1 star 8.3%
Filter icon Filter
Top Reviews

Filter reviews by




matej fandl Feb 09, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Studying machine learning during my university times and being an aspiring scala developer I picked this book up as an opportunity to learn scala while reading about what interests me the most in the field of computer science. This wont be an easy read for people not familiar with scala at all, but if you have some experience with the language and are interested in machine learning, I definitely recommend the book. It is a nice and quite deep dive into the topic. What I found very interesting was the optional math part available in each section. This book is also showing me where my understanding of scala is still superficial, the code is written a very good way.
Amazon Verified review Amazon
Amazon Customer Feb 20, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Some technical books are heavy on theory, some are heavy on practicality. There are books that describe ‘why’ and others that show you ‘how’. Personally, I have always tended towards the ‘how’ side of things - the ‘cookbook’ approach is one I have always liked. The more practical the better.Scala For Machine Learning (full disclosure, I received an unpaid review copy) is as about as practical as it gets. Loaded with code examples, this book leaves you in no doubt that it will help you construct your own code, against your own data in the fastest time possible. But it does not take any short-cuts.Machine Learning and Scala is a broad subject to cover and SFML does an admirable job of taking you from the first steps of data preparation, right the way through to a artificial neural networks and genetic algorithms.I work with data - it has been my day-to-day working life for over twenty years, and there is never any shortage of new territory to cover. I am predominately data engineering focused, and whilst the data content is the most important thing, it is the technology that keeps me engaged. In the final chapter of SFML Nicolas covers a nice selection of frameworks for concurrent processing. This really piqued my interest - Apache Spark is a hot topic at the moment and is given a good few pages here, reviewing its Scala and Akka heritage and demonstrating its core design principles of in-memory persistence, scheduling laziness, distributed dataset actions and shared variables. Better yet Nicolas shows you how to get Spark up and running and executing your first K-means tasks.The bulk of the book is made up with detailed reviews, explanations and implementation guides for different machine learning algorithms and methodologies. Each section is made up of a wealth of detailed explanation, detailed implementation instructions and guidelines, honestly - at times - the information density can be a little overwhelming, but it is all hugely valuable, and much appreciated.Running through the whole book is, of course, an appreciation of Scala and its abilities to perform in a distributed manner, at scale. This is the sort work that Scala, in all of its object-functional glory, was designed to address. I have read other Scala books and the programming language has always been presented in a “you may be used to, but Scala…” fashion. I am happy to say that is not the case here, with solid examples, proper real world code and thorough debriefs - Scala is presented as it should be, a language in its own right, being dealt with on its own terms.If your work or study includes components of machine learning and / or data processing - and you are looking for a modern take on some time-honoured methods, this book is work picking up and consuming, avidly.
Amazon Verified review Amazon
mathieu Dec 31, 2014
Full star icon Full star icon Full star icon Full star icon Full star icon 5
just wanted to let potential buyers know that the kindle version has none of the nuisances sometimes found in e-books. i just purchased this book (from amazon france) and did a quick check of the source code, equations, and diagrams. everything shows up perfectly. no comment on the content yet, but it looks quite interesting.
Amazon Verified review Amazon
Amazon Customer Feb 13, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Very detailed. Not for beginners in scala programming
Amazon Verified review Amazon
Tomer Ben David Feb 12, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
My favorite book these days, using scala? doing machine learning? I don't want to read a separate book on machine learning and another one on advanced scala tailored for machine learning, can anyone please summarize scala + machine learning in the same book? Yes - This book does it. Still reading but so far so good! Perfect match for me.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.