Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Scala for Machine Learning
Scala for Machine Learning

Scala for Machine Learning: Leverage Scala and Machine Learning to construct and study systems that can learn from data

eBook
$9.99 $39.99
Paperback
$65.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Scala for Machine Learning

Chapter 1. Getting Started

It is critical for any computer scientist to understand the different classes of machine learning algorithms and be able to select the ones that are relevant to the domain of their expertise and dataset. However, the application of these algorithms represents a small fraction of the overall effort needed to extract an accurate and performing model from input data. A common data mining workflow consists of the following sequential steps:

  1. Loading the data.
  2. Preprocessing, analyzing, and filtering the input data.
  3. Discovering patterns, affinities, clusters, and classes.
  4. Selecting the model features and the appropriate machine learning algorithm(s).
  5. Refining and validating the model.
  6. Improving the computational performance of the implementation.

As we will emphasize throughout this book, each stage of the process is critical to build the right model.

This first chapter introduces you to the taxonomy of machine learning algorithms, the tools and frameworks used in the book, and a simple application of logistic regression to get your feet wet.

Mathematical notation for the curious

Each chapter contains a small section dedicated to the formulation of the algorithms for those interested in the mathematical concepts behind the science and art of machine learning. These sections are optional and defined within a tip box. For example, the mathematical expression of the mean and the variance of a variable X mentioned in a tip box will be as follows:

Tip

Mean value of a variable X = {x} is defined as:

Mathematical notation for the curious

The variance of a variable X = {x} is defined as:

Mathematical notation for the curious

Why machine learning?

The explosion in the number of digital devices generates an ever-increasing amount of data. The best analogy I can find to describe the need, desire, and urgency to extract knowledge from large datasets is the process of extracting a precious metal from a mine, and in some cases, extracting blood from a stone.

Knowledge is quite often defined as a model that can be constantly updated or tweaked as new data comes into play. Models are obviously domain-specific ranging from credit risk assessment, face recognition, maximization of quality of service, classification of pathological symptoms of disease, optimization of computer networks, and security intrusion detection, to customers' online behavior and purchase history.

Machine learning problems are categorized as classification, prediction, optimization, and regression.

Classification

The purpose of classification is to extract knowledge from historical data. For instance, a classifier can be built to identify a disease from a set of symptoms. The scientist collects information regarding the body temperature (continuous variable), congestion (discrete variables HIGH, MEDIUM, and LOW), and the actual diagnostic (flu). This dataset is used to create a model such as IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72), which doctors can use in their diagnostic.

Prediction

Once the model is extracted and validated against the past data, it can be used to draw inference from the future data. A doctor collects symptoms from a patient, such as body temperature and nasal congestion, and anticipates the state of his/her health.

Optimization

Some global optimization problems are intractable using traditional linear and non-linear optimization methods. Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search). You can imagine that fighting the spread of a new virus requires optimizing a process that may evolve over time as more symptoms and cases are uncovered.

Regression

Regression is a classification technique that is particularly suitable for a continuous model. Linear (least square), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model, or function, y= f (xj), to a dataset. Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.

Why Scala?

Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven dynamically into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for distributed computation of large amounts of data. Among the capabilities of the language, the following features are deemed essential to machine learning and statistical analysis.

Abstraction

Monoids and monads are important concepts in functional programming. Monads are derived from the category and group theory allowing developers to create a high-level abstraction as illustrated in Twitter's Algebird (https://github.com/twitter/algebird) or Google's Breeze Scala (https://github.com/dlwh/breeze) libraries.

A monoid defines a binary operation op on a dataset T with the property of closure, identity operation, and associativity.

Let's consider the + operation is defined for a set T using the following monoidal representation:

trait Monoid[T] {
  def zero: T 
  def op(a: T, b: T): c 
}

Monoids are associative operations. For instance, if ts1, ts2, and ts3 are three time series, then the property ts1 + (ts2 + ts3) = (ts1 + ts2) + ts2 is true. The associativity of a monoid operator is critical in regards to parallelization of computational workflows.

Monads are structures that can be seen either as containers by programmers or as a generalization of Monoids. The collections bundled with the Scala standard library (list, map, and so on) are constructed as monads [1:1]. Monads provide the ability for those collections to perform the following functions:

  1. Create the collection.
  2. Transform the elements of the collection.
  3. Flatten nested collections.

A common categorical representation of a monad in Scala is a trait, Monad, parameterized with a container type M:

trait Monad[M[_]] {
  def apply[T])(a: T): M[T] 
  def flatMap[T, U](m: M[T])(f: T=>M[U]): M[U] 
}

Monads allow those collections or containers to be chained to generate a workflow. This property is applicable to any scientific computation [1:2].

Scalability

As seen previously, monoids and monads enable parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, Actors are the core elements that make Scala scalable. Actors act as coroutines, managing the underlying threads pool. Actors communicate through passing asynchronous messages. A distributed computing Scala framework such as Akka and Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Spark are described in detail in the last chapter of this book [1:3].

In a nutshell, a workflow is implemented as a sequence of activities or computational tasks. Those tasks consist of high-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala allows these observations to be partitioned by executing those tasks through a cluster of actors. Scala also supports message dispatching and routing of messages between local and remote actors. The engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with no code or very little code changes.

Scalability

Deployment of a workflow as a distributed computation

In this diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 similar to a scheduler. These tasks are actually executed over multiple worker nodes that are implemented by the Scala actors. The master node exchanges messages with the workers to manage the state of the execution of the workflow as well as its reliability. High availability of these tasks is implemented through a hierarchy of supervising actors.

Configurability

Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits. One of the most commonly used dependency injection patterns, the cake pattern, is used throughout this book to create dynamic computation workflows and plots.

Maintainability

Scala embeds Domain Specific Languages (DSL) natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. The most notorious application of DSLs is the definition of the emulation of the syntax used in the MATLAB program, which data scientists are familiar with.

Computation on demand

Lazy methods and values allow developers to execute functions and allocate computing resources on demand. The Spark framework relies on lazy variables and methods to chain Resilient Distributed Datasets (RDD).

Model categorization

A model can be predictive, descriptive, or adaptive.

Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors. They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields such as marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a preselected training set.

Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models define the first level in knowledge discovery. They are generated through unsupervised learning.

A third category of models, known as adaptive modeling, is generated through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend and possibly execute actions in the attempt of solving a problem, optimizing an objective function, or resolving constraints.

Taxonomy of machine learning algorithms

The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications such as genomics, social networking, advertising, or risk analysis generate a very large amount of data that can be analyzed or mined to extract knowledge or provide insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data [1:4].

Data mining is the process of extracting or identifying patterns in a dataset.

Unsupervised learning

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process known as density estimation in statistics is broken down into two categories: discovery of data clusters and discovery of latent factors. The methodology consists of processing input data to understand patterns similar to the natural learning process in infants or animals. Unsupervised learning does not require labeled data, and therefore, is easy to implement and execute because no expertise is needed to validate an output. However, it is possible to label the output of a clustering algorithm and use it for future classification.

Clustering

The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm is used to organize observations into clusters by minimizing the observations within a cluster and maximizing the observations between clusters. A clustering algorithm consists of the following steps:

  1. Creating a model by making an assumption on the input data.
  2. Selecting the objective function or goal of the clustering.
  3. Evaluating one or more algorithms to optimize the objective function.

Data clustering is also known as data segmentation or data partitioning.

Dimension reduction

Dimension reduction techniques aim at finding the smallest but most relevant set of features that models dataset reliability. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.

There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The following taxonomy breaks down these techniques according to their purpose, although the list is far for being exhaustive, as shown in the following diagram:

Dimension reduction

Supervised learning

The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to extract a relation or function f x → y from a training set {x, y}. Supervised learning is far more accurate and reliable than any other learning strategy. However, a domain expert may be required to label (tag) data as a training set for certain types of problems.

Supervised machine learning algorithms can be broken into two categories:

  • Generative models
  • Discriminative models

Generative models

In order to simplify the description of statistics formulas, we adopt the following simplification: the probability of an event X is the same as the probability of the discrete random variable X to have a value x, p(X) = p(X=x). The notation of joint probability (resp. conditional probability) becomes p(X, Y) = p(X=x, Y=y) (resp. p(X|Y)=p(X=x | Y=y).

Generative models attempt to fit a joint probability distribution, p(X,Y), of two events (or random variables), X and Y, representing two sets of observed and hidden (latent) variables x and y. Discriminative models learn the conditional probability p(Y|X) of an event or random variable Y of hidden variables y, given an event or random variable X of observed variables x. Generative models are commonly introduced through the Bayes' rule. The conditional probability of an event Y, given an event X, is computed as the product of the conditional probability of the event X, given the event Y, and the probability of the event X normalized by the probability of event Y [1:5].

Tip

Join probability (if X and Y are independent):

Generative models

Conditional probability:

Generative models

The Bayes' rule:

Generative models

The Bayes' rule is the foundation of the Naïve Bayes classifier, which is the topic of Chapter 5, Naïve Bayes Classifiers.

Discriminative models

Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.

Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation. Here is a brief guideline describing which type of models makes sense according to the objective or criteria of the project:

Objective

Generative models

Discriminative models

Accuracy

Highly dependent on the training set.

Probability estimates tend to be more accurate.

Modeling requirements

There is a need to model both observed and hidden variables, which requires a significant amount of training.

The quality of the training set does not have to be as rigorous as for generative models.

Computation cost

This is usually low. For example, any graphical method derived from the Bayes' rule has low overhead.

Most algorithms rely on optimization of a convex that introduces significant performance overhead.

Constraints

These models assume some degree of independence among the model features.

Most discriminative algorithms accommodate dependencies between features.

We can further refine the taxonomy of supervised learning algorithms by segregating between sequential and random variables for generative models and breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification):

Discriminative models

Reinforcement learning

Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realms of robotics or game strategy. However, since the 90s, genetic-algorithms-based classifiers have become increasingly popular to solve problems that require collaboration with a domain expert. For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In its simplest form, these algorithms compute or estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert. The foremost challenge developers of reinforcement learning systems face is that the recommended action or policy may depend on partially observable states and how to deal with uncertainty.

Genetic algorithms are not usually considered part of the reinforcement learning toolbox. However, advanced models such as learning classifier systems use genetic algorithms to classify and reward the rules and policies.

As with the two previous learning strategies, reinforcement learning models can be categorized as Markovian or evolutionary:

Reinforcement learning

This is a brief overview of machine learning algorithms with a suggested taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse through the list of references at the end of the book and find the documentation appropriate to your level of interest and understanding.

Tools and frameworks

Before getting your hands dirty, you need to download and deploy a minimum set of tools and libraries so as not to reinvent the wheel. A few key components have to be installed in order to compile and run the source code described throughout the book. We focus on open source and commonly available libraries, although you are invited to experiment with equivalent tools of your choice. The learning curve for the frameworks described here is minimal.

Java

The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64 . You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.

Scala

The code has been tested with Scala 2.10.4. We recommend using Scala version 2.10.3 or higher and SBT 0.13 or higher. Let's assume that Scala runtime (REPL) and libraries have been properly installed and environment variables SCALA_HOME and PATH have been updated. The description and installation instructions of the Scala plugin for Eclipse are available at http://scala-ide.org/docs/user/gettingstarted.html.

You can also download the Scala plugin for Intellij IDEA from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.

The ubiquitous simple build tool (sbt) will be our primary building engine. The syntax of the build file sbt/build.sbt conforms to version 0.13, and is used to compile and assemble the source code presented throughout this book.

Apache Commons Math

Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].

Description

This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily weaved into a machine learning problem. The examples used throughout the book require version 3.3 or higher.

The main components of Apache Commons Math are:

  • Functions, differentiation, and integral and ordinary differential equations
  • Statistics distribution
  • Linear and nonlinear optimization
  • Dense and Sparse vectors and matrices
  • Curve fitting, correlation, and regression

For more information, visit http://commons.apache.org/proper/commons-math.

Licensing

We need Apache Public License 2.0; the terms are available at http://www.apache.org/licenses/LICENSE-2.0.

Installation

The installation and deployment of the Commons Math library are quite simple:

  1. Go to the download page, http://commons.apache.org/proper/commons-math/download_math.cgi.
  2. Download the latest .jar files in the Binaries section, commons-math3-3.3-bin.zip (for version 3.3, for instance).
  3. Unzip and install the .jar files.
  4. Add commons-math3-3.3.jar to classpath as follows:
    • For Mac OS X, use the command export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.3.jar
    • For Windows, navigate to System property | Advanced system settings | Advanced | Environment variables…, then edit the entry of the CLASSPATH variable
  5. Add the commons-math3-3.3.jar file to your IDE environment if needed (that is, for Eclipse, navigate to Project | Properties | Java Build Path | Libraries | Add External JARs).

You can also download commons-math3-3.3-src.zip from the Source section.

JFreeChart

JFreeChart is an open source chart and plotting Java library, widely used in the Java programmer community. It was originally created by David Gilbert [1:7].

Description

The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithms throughout the book, but you are encouraged to explore this great library on your own, as time permits.

Licensing

It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.

Installation

To install and deploy JFreeChart, perform the following steps:

  1. Visit http://www.jfree.org/jfreechart.
  2. Download the latest version from Source Forge at http://sourceforge.net/projects/jfreechart/files.
  3. Unzip and install the .jar file.
  4. Add jfreechart-1.0.17.jar (for version 1.0.17) to classpath as follows:
    • For Mac OS, update the classpath by using export CLASSPATH=$CLASSPATH:/JFreeChart_path/ jfreechart-1.0.17.jar
    • For Windows, go to System property | Advanced system settings | Advanced | Environment variables… and then edit the entry of the CLASSPATH variable
  5. Add the jfreechart-1.0.17.jar file to your IDE environment, if needed.

Other libraries and frameworks

Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with the instructions to download them. Libraries related to the conditional random fields and support vector machines are described in the respective chapters.

Note

Why not use Scala algebra and numerical libraries

Libraries such as Breeze, ScalaNLP, and Algebird are great Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using simple common Java libraries [1:8].

Source code

The Scala programming language is used to implement and evaluate the machine learning techniques presented in this book. Only a subset of the source code used to implement the techniques are presented in the book. The formal implementation of these algorithms is available on the website of Packt Publishing (http://www.packtpub.com).

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Context versus view bounds

Most Scala classes discussed in the book are parameterized with the type associated to the discrete/categorical value (Int) or continuous value (Double). Context bounds would require that any type used by the client code has Int or Double as upper bounds:

class MyClassInt[T <: Int]
class MyClassFloat[T <: Double]

Such a design introduces constraints on the client to inherit from simple types and to deal with covariance and contravariance for container types [1:9].

For this book, view bounds are used instead of context bounds only where they require an implicit conversion to the parameterized type to be defined:

Class MyClassFloat[T <% Double]
implicit def T2Double(t : T): Double

Presentation

For the sake of readability of the implementation of algorithms, all nonessential code such as error checking, comments, exceptions, or imports are omitted. The following code elements are discarded in the code snippet presented in the book:

  • Code comments
  • Validation of class parameters and method arguments:
    class BaumWelchEM(val lambda: HMMLambda ...) {
       require( lambda != null, "Lambda model is undefined")
  • Exceptions and an exception handler:
       try { .. }
       catch {
          case e: ArrayIndexOutOfBoundsException  =>println(e.toString)
        }
  • Nonessential annotation:
       @inline def mean = ..
  • Logging and debugging code:
           m_logger.debug( …)
  • Private and nonessential methods

Primitives and implicits

The algorithms presented in this book share the same primitive types, generic operators, and implicit conversions.

Primitive types

For the sake of readability of the code, the following primitive types will be used:

type XY = (Double, Double)
type XYTSeries = Array[(Double, Double)]
type DMatrix[T] = Array[Array[T]]
type DVector[T] = Array[T]  
type DblMatrix = DMatrix[Double]
type DblVector = Array[Double]

The types have the behavior (methods) of their primitive counterpart (array). However, adding a new functionality to vectors, matrices, and time series requires classes of their own right. These classes will be introduced in the next chapter.

Type conversions

Implicit conversion is an important feature of the Scala programming language because it allows developers to specify a type conversion for an entire library in a single place. Here are a few of the implicit type conversions used throughout the book:

implicit def int2Double(n: Int): Double = n.toDouble
implicit def vectorT2DblVector[T <% Double](vt: DVector[T]): DblVector = vt.map( t => t.toDouble)
implicit def double2DblVector(x: Double): DblVector = Array[Double](x)
implicit def dblPair2DbLVector(x: (Double, Double)): DblVector = Array[Double](x._1,x._2)
implicit def dblPairs2DblRows(x: (Double, Double)): DblMatrix = Array[Array[Double]](Array[Double](x._1, x._2))
...

Note

Library-specific conversion

The conversion between the primitive type listed here and types introduced in a particular library (such as Apache Commons Math) is declared in future chapters the first time those libraries are used.

Operators

Lastly, some operations are applied by multiple machine learning or preprocessing algorithms. They need to be defined implicitly. The operation on a pair of a vector of arbitrary type and vector of Double is defined as follows:

def Op[T <% Double](v: DVector[T], w: DblVector, op: (T, Double) => Double): DblVector = 
   v.zipWithIndex.map(x => op(x._1, w(x._2)))

It is also convenient to define the following operators that are included in the Scala standard library:

implicit def /(v: DblVector, n: Int):DblVector = v.map( x => x/n)
implicit def /(m: DblMatrix, col: Int, z: Double): DblMatrix = { (0 until m(n).size).foreach(i => m(n)(i) /= z)  }

We won't have to redefine the types, conversions, and operators from now on.

Immutability

It is usually a good idea to reduce the number of states of an object. Method invocation transitions an object from one state to another. The larger the number of methods or states, the more cumbersome the testing process becomes.

There is no point in creating a model that is not defined (trained). Therefore, making the training of a model as part of the constructor of the class it implements makes a lot of sense. Therefore, the only public methods of a machine learning algorithm are:

  • Classification or prediction
  • Validation
  • Retrieval of model parameters (weights, latent variables, hidden states, and so on), if needed

Performance of Scala iterators

The evaluation of the performance of Scala high-order iterative methods is beyond the scope of this book. However, it is important to be aware of the trade-off of each method.

The for loop construct is to be avoided as a counting iterator except if it is used in conjunction with yield. It is designed to implement the for-comprehension monad (map-flatMap). The source code presented in this book uses the while and foreach constructs.

Scala reducer methods reduce and fold are also frequently used for their efficiency.

Let's kick the tires

This final section introduces the key elements of the training and classification workflow. A test case using a simple logistic regression is used to illustrate each step of the computational workflow.

Overview of computational workflows

In its simplest form, a computational workflow to perform runtime processing of a dataset is composed of the following stages:

  1. Loading the dataset from files, databases, or any streaming devices.
  2. Splitting the dataset for parallel data processing.
  3. Preprocessing data using filtering techniques, analysis of variance, and applying penalty and normalization functions whenever necessary.
  4. Applying the model, either a set of clusters or classes to classify new data.
  5. Assessing the quality of the model.

A similar sequence of tasks is used to extract a model from a training dataset:

  1. Loading the dataset from files, databases, or any streaming devices.
  2. Splitting the dataset for parallel data processing.
  3. Applying filtering techniques, analysis of variance, and penalty and normalization functions to the raw dataset whenever necessary.
  4. Selecting the training, testing, and validation set from the cleansed input data.
  5. Extracting key features, establishing affinity between a similar group of observations using clustering techniques or supervised learning algorithms.
  6. Reducing the number of features to a manageable set of attributes to avoid overfitting the training set.
  7. Validating the model and tuning the model by iterating steps 5, 6, and 7 until the error meets criteria.
  8. Storing the model into the file or database to be loaded for runtime processing of new observations.

Data clustering and data classification can be performed independent of each other or as part of a workflow that uses clustering techniques as a preprocessing stage of the training phase of a supervised learning algorithm. Data clustering does not require a model to be extracted from a training set, while classification can be performed only if a model has been built from the training set. The following image gives an overview of training and classification:

Overview of computational workflows

A generic data flow for training and running a model

This diagram is an overview of a typical data mining processing pipeline. The first phase consists of extracting the model through clustering or training of a supervised learning algorithm. The model is then validated against test data, for which the source is the same as the training set but with different observations. Once the model is created and validated, it can be used to classify real-time data or predict future behavior. In reality, real-world workflows are more complex and require being dynamically configurable to allow experimentation of different models. Several alternative classifiers can be used to perform a regression and different filtering algorithms are applied against input data depending of the latent noise in the raw data.

Writing a simple workflow

This book relies on financial data to experiment with a different learning strategy. The objective of the exercise is to build a model that can discriminate between volatile and nonvolatile trading sessions. For this first example, we select a simplified version of the logistic regression as our classifier as we treat a stock-price-volume action as a continuous or pseudo-continuous process.

Note

Logistic regression

Logistic regression is treated in depth in Chapter 6, Regression and Regularization. The model treated in this example is a simple binary classifier using logistic regression for two-dimensional observations.

The classification of trading sessions according to their volatility is as follows:

  • Select a dataset
  • Load the dataset
  • Preprocess the dataset
  • Display data
  • Create the model through training
  • Classify new data

Selecting a dataset

Throughout the book, we will rely on financial data to evaluate and discuss the merit of different data processing and machine learning methods. In this example, the data is extracted from Yahoo! Finances using the CSV format with the following fields:

  • Date
  • Price at open
  • Highest price in session
  • Lowest price in session
  • Price at session close
  • Volume
  • Adjust price at session close

Let's create a simple program that loads the content of the file, executes some simple preprocessing functions, and creates a simple model. We selected the CSCO stock price between January 1, 2012 and December 1, 2013 as our data input.

Let's consider two variables, price and volume, as illustrated by the following screenshot. The top graph displays the variation of the price of Cisco stock over time and the bottom bar chart represents the daily trading volume on Cisco stock over time:

Selecting a dataset

Price-Volume action for the Cisco stock

Loading the dataset

The first step is loading the dataset from a local file. Typically, large datasets are loaded from a database or distributed filesystem such as Hadoop Distributed File System (HDFS), as shown here:

def load(fileName: String): Option[XYTSeries] = {
  val src =  Source.fromFile(fileName)
  val fields = src.getLines.map( _.split(CSV_DELIM)).toArray //1
  val cols = fields.drop(1) //2
  val data = transform(cols)
  src.close //3
  Some(data)
}

The transform method will be described in the next section.

The data file is extracted through an invocation of the Source.fromFile static method, and then the fields are extracted through a map (line 1). The header (first) row is removed with a call to drop (line 2).

Tip

Data extraction

The Source.fromFile.getLines.map invocation pipeline method returns an iterator, which needs to be converted into an array to store the information into memory.

The file has to be closed to avoid leaking of the file handle (line 3).

Tip

Code readability

A long pipeline of Scala high-order methods make the code and underlying code quite difficult to read. It is recommended to break down long chains of method calls. The following code is an example of a long chain of method calls:

val cols = Source.fromFile.getLines.map( _.split(CSV_DELIM).toArray.drop(1)

We can break down such method calls into several steps as follows:

val lines = Source.fromFile.getLines
val fields = lines.map(_.split(CSV_DELIM).toArray
val cols = fields.drop(1)

We strongly encourage you to consult the excellent guide Effective Scala, written by Marius Eriksen from Twitter. This is definitively a must read for any Scala developer [1:10].

Preprocessing the dataset

The next step is to normalize the data in the range [-0.5, 0.5] to be trained by the logistic binary classifier. It is time to introduce a non-sense statistics class.

Basic statistics

We select the computation of mean and standard deviation of the two time series as the first step of the preprocessing phase. The computation of these statistics can be implemented by the reduce methods reduceLeft and foldLeft:

val mean = price.reduceLeft( _ + _ )/price.size
val s2 = price.foldLeft(0.0)((s,x) =>s+(x-mean)*(x-mean))
val stdDev = Math.sqrt(s2/(price.size-1) )

However, this implementation has one major drawback: the dataset (price in this example) has to be traversed for each method (mean, stdDev, min, max, and so on).

One of the solutions is to create a class that computes the counters and the statistics on demand using, once again, the lazy values:

class Stats[T <% Double](private values: DVector[T]) {
   class _Stats(var minValue: Double, var maxValue: Double, var sum: Double, var sumSqr: Double) 
val stats = {
  val _stats = new _Stats(Double.MaxValue, Double.MinValue, 0.0, 0.0)
  values.foreach(x => {
    if(x < _stats.minValue) x else _stats.minValue
    if(x > _stats.maxValue) x else _stats.maxValue 
    _stats.sum + x
    _stats.sumSqr + x*x
  })
  _stats
}
 
lazy val mean = _stats.sum/values.size
lazy val variance = (_stats.sumSqr - mean*mean*values.size)/(values.size-1)
lazy val stdDev = if(variance < ZERO_EPS) ZERO_EPS else Math.sqrt(variance)
lazy val min = _stats.minValue
lazy val max = _stats.mazValue
}

We made the statistics object generic by using the view bounds T <% Double, which assumes a conversion from type T to Double. By defining the statistics as tuple counters (minimum value, maximum value, sum of values, and sum of square values) and folding these values into a statistics object, we limit the number of invocations of the foldLeft reducer method to 1, and therefore, avoid the recomputation of these statistics for the existing dataset each time new data is added.

The code illustrates the use and benefit of lazy values in Scala. The mean is computed only if and when needed.

Normalization and Gauss distribution

Statistics are usually used to normalize data into a probability value [0, 1] as required by most classification or clustering algorithms. It is logical to add the normalization method to the Stats class, as we have already extracted the min and max values:

def normalize: DblVector = {
  val range = max – min;  values.map(x => (x - min)/range)
}

The same approach is used to compute the multivariate normal distribution:

def gauss: DblVector = 
   values.map(x =>{
      val y=x-mean
      INV_SQRT_2PI/stdDev*Math.exp(-0.5*y*y/stdDev)}) 

The price action chart has a very interesting characteristic. At a closer look, a sudden change in price and increase in volume occurs about every three months or so. Experienced investors will undoubtedly recognize that those price-volume patterns are related to the release of quarterly earnings of Cisco. Such regular but unpredictable patterns can be a source of concern or opportunity if risk can be managed. The strong reaction of the stock price to the release of corporate earnings may scare some long-term investors while enticing day traders.

The following graph visualizes the potential correlation between sudden price change (volatility) and heavy trading volume:

Normalization and Gauss distribution

Correlation price-volume action for the Cisco stock

Let's try to correlate the volatility of the stock price with volume. For the sake of this exercise, we define the volatility as the maximum variation of the stock price within each trading session: the relative difference between the highest price during the trading session and the lowest price during the session.

The YahooFinancials enumeration extracts historical stock prices and session volume from a CSV file. For example, the volatility is extracted from the CSV fields of each line in the CSV file as follows:

object YahooFinancials extends Enumeration {
   type YahooFinancials = Value
   val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value
   val volatility = (fs: Array[String]) =>fs(HIGH.id).toDouble-fs(LOW.id).toDouble
   …
}

The transform method uses the YahooFinancials enumeration to generate the input data for the model:

def transform(cols: Array[Array[String]]): XYTSeries = {
  val volatility = Stats[Double](cols.map(YahooFinancials.volatility)).normalize
  val volume =  Stats[Double](cols.map(YahooFinancials.volume) ).normalize
  volatility.zip(volume)
}

The volatility and volume data is normalized using the Stats.normalize method defined earlier.

Plotting data

Although charting is not the primary goal of this book, we thought that you will benefit from a brief introduction to JFreeChart. The skeleton code to generate a scatter plot is rather simple. The most relevant code is the transformation of the XYTSeries into graphical JFreeChart's XYSeries:

val xLegend = "Session Volatility"
val yLegend = "Session Volume"
def display(xy: XYTSeries, w: Int, h : Int): Unit  = {
   val series = new XYSeries("CSCO 2012-2013 Stock")
   xy.foreach( x => series.add( x._1,x._2))
     val seriesCollection = new XYSeriesCollection
     seriesCollection.addSeries(series)
    … // plot rendering code
     val chart = ChartFactory.createScatterPlot(xLegend, xLegend, yLegend, seriesCollection, PlotOrientation.VERTICAL, true, false, false)
     createFrame("Logistic Regression", chart)
  }

Note

Visualization

The JFreeChart library is introduced as a robust charting tool. The visualization of the results of a computation is beyond the scope of this book. The code related to plots and charts is omitted from the book in order to keep the code snippets concise and dedicated to machine learning. In a few occasions, output data is formatted as a CSV file to be simply imported into a spreadsheet.

Here is an example of a plot using the ScatterPlot.display method:

val plot = new ScatterPlot(("CSCO 2012-2013", "Session High - Low", "Session Volume"), new BlackPlotTheme)
plot.display(volatility_vol.filter( _._1 < 0.5), 250, 340)
Plotting data

Scatter plot of volatility and volume for the Cisco stock

There is a level of correlation between session volume and session volatility. We can use this information to classify trading sessions by their volatility.

Creating a model (learning)

The objective of the training is to build a model that can discriminate between volatile and nonvolatile trading sessions. For the sake of the exercise, session volatility has been defined as session price high and session price low coupled with heavy trading volume, which constitute the two parameters of the model.

Logistic regression is commonly used in statistics inference. The following implementation of the binary logistic regression classifier exposes a single method, classify, to comply with our desire to reduce the complexity and life cycle of objects. The model parameters, weights, are computed during training when the LogBinRegression class/model is instantiated. As mentioned earlier, the sections of the code nonessential to the understanding of the algorithm are omitted:

class LogBinRegression(val labels: DVector[(XY, Double)], val maxIters: Int, val eta: Double, val eps: Double) {
  val dim = 3
  val weights = train
      
  def classify(xy: XY): Option[(Boolean, Double)] = {
    if(weights != None) {
       val likelihood = sigmoid(w(0) + xy._1*w(1) + xy._2*w(2))
       Some(likelihood > 0.5, likelihood)
    }
    else None
  }

The training method, train, consists of iterating through the computation of the weight using a simple descent gradient. The method computes the weights and returns an option, so the model is either trained and ready for runtime classification or nonexistent (None):

def train: Option[DblVector] = {
  val w = Array.fill(dim)( x=> Random.nextDouble-1.0) 
    
  Range(0, maxIters).find(_ => {
     val deltaW = labels.foldLeft(Array.fill(dim)(0.0))((dw, lbl) => {  
       val y = sigmoid(w(0) + w(1)*lbl._1._1 +  w(2)*lbl._1._2)
       dw.map(dx => dx + (lbl._2 - y)*(lbl._1._1 + lbl._1._2))
    })
    val nextW = Array.fill(dim)(0.0)
                     .zipWithIndex
                     .map(nw => w(nw._2)+eta*deltaW(nw._2))
    val diff = Math.abs(nextW.sum - w.sum)
    nextW.copyToArray(w);  diff < eps
  }) match {
    case Some(iters) => Some(w)
    case None => { … }
  }
}
def sigmoid(x: Double):Double = 1.0/(1.0 + Math.exp(-x))

The iteration is encapsulated in the Scala find method that exists if the algorithm converges (diff < eps). The model parameters, weights, are set to None if the maximum number of iterations is reached.

The training method, train, iterates across the set of observations by computing the gradient between the predicted and observed values. In our simplistic approach, the gradient is computed as a linear function of the sigmoid of the sum of the product of the weight and training observations. As for any optimization problem, the initialization of the solution vector, weights, is critical. We choose to initialize the weight with random values, although in practice, you would use a more deterministic approach to initialize the model parameters.

In order to train the model, we need to label data. The process consists of tagging every trading session as volatile and non volatile according to the observations (relative session volatility and session volume). The labeling process is usually quite cumbersome; therefore, let's generate the label automatically. A trading session is considered volatile if a volatility and volume are both greater than 60 percent of the maximum relative volatility and volume:

val labels = volatilityVol.zip(volatilityVol.map(x =>if( x._1>0.3 && x._2>0.3) 1.0 else 0.0))

Note

Automated labeling

Although quite convenient, automated creation of training labels is not without risk because it may mislabel singular observations. This technique is used in this test for convenience but it is not recommended unless a domain expert reviews the labels manually.

The model is created (trained) by a simple instantiation of the logistic binary classifier:

val logit = new LogBinRegression(labels, 300, 0.00005, 0.02)

The training run is configured with a maximum of 300 iterations, a gradient slope of 0.00005, and convergence criteria of 0.02.

Classify the data

Finally, the model can be tested with a new fresh dataset, not related to the training set:

Date,Open,High,Low,Close,Volume,Adj Close
3/9/2011,14.78,15.08,14.20,14.91,4.79E+08,14.88
11/17/2009,10.78,10.90,10.62,10.84,3901987,10.85

It is just a matter of executing the classification method (exceptions, conditions on method arguments, and returned values are omitted):

val testData = load("resources/data/chap1/CSCO2.csv")
logit.classify(testData(0)) match {
  case Some(topCategory) => Display.show(topCategory)
  case None => { … }
}   
logit.classify(testData(1)) match {
  case Some(topCategory) => Display.show(topCategory)
  case None => { … }
}

The result of the classification is (true,0.516) for the first sample and (false,0.1180) for the second sample.

Note

Validation

The simple classification, in this test case, is provided for illustrating the runtime application of the model. It does not constitute a validation of the model by any stretch of imagination. The next chapter digs into validation metrics and methodology.

Summary

We hope you enjoyed this introduction to machine learning and how to leverage your existing skills in Scala programming to create a simple regression program to predict stock price/volume action. Here are the highlights of this introductory chapter:

  • From monadic composition and high-order collection methods for parallelization to configurability to reusability patterns, Scala is the perfect fit to implement and leverage data mining and machine learning algorithms for large-scale projects
  • There are many steps to create and apply a machine learning model
  • The implementation of the logistic binary classifier presented as part of the test case is simple enough to encourage you to learn how to write and apply more advanced machine learning algorithms

To the delight of Scala programming aficionados, the next chapter will dig deeper into building a flexible workflow by leveraging traits and dependency injection.

Left arrow icon Right arrow icon

Description

Are you curious about AI? All you need is a good understanding of the Scala programming language, a basic knowledge of statistics, a keen interest in Big Data processing, and this book!

What you will learn

  • Build dynamic workflows for scientific computing
  • Leverage open source libraries to extract patterns from time series
  • Write your own classification, clustering, or evolutionary algorithm
  • Perform relative performance tuning and evaluation of Spark
  • Master probabilistic models for sequential data
  • Experiment with advanced techniques such as regularization and kernelization
  • Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters
  • Apply key learning strategies to a technical analysis of financial markets
Estimated delivery fee Deliver to Turkey

Standard delivery 10 - 13 business days

$12.95

Premium delivery 3 - 6 business days

$34.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Dec 17, 2014
Length: 624 pages
Edition : 1st
Language : English
ISBN-13 : 9781783558742
Category :
Languages :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Turkey

Standard delivery 10 - 13 business days

$12.95

Premium delivery 3 - 6 business days

$34.95
(Includes tracking information)

Product Details

Publication date : Dec 17, 2014
Length: 624 pages
Edition : 1st
Language : English
ISBN-13 : 9781783558742
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 175.97
Machine Learning with R
$60.99
Scala for Machine Learning
$65.99
Learning Concurrent Programming in Scala
$48.99
Total $ 175.97 Stars icon
Banner background image

Table of Contents

14 Chapters
1. Getting Started Chevron down icon Chevron up icon
2. Hello World! Chevron down icon Chevron up icon
3. Data Preprocessing Chevron down icon Chevron up icon
4. Unsupervised Learning Chevron down icon Chevron up icon
5. Naïve Bayes Classifiers Chevron down icon Chevron up icon
6. Regression and Regularization Chevron down icon Chevron up icon
7. Sequential Data Models Chevron down icon Chevron up icon
8. Kernel Models and Support Vector Machines Chevron down icon Chevron up icon
9. Artificial Neural Networks Chevron down icon Chevron up icon
10. Genetic Algorithms Chevron down icon Chevron up icon
11. Reinforcement Learning Chevron down icon Chevron up icon
12. Scalable Frameworks Chevron down icon Chevron up icon
A. Basic Concepts Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.8
(12 Ratings)
5 star 41.7%
4 star 16.7%
3 star 25%
2 star 8.3%
1 star 8.3%
Filter icon Filter
Top Reviews

Filter reviews by




matej fandl Feb 09, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Studying machine learning during my university times and being an aspiring scala developer I picked this book up as an opportunity to learn scala while reading about what interests me the most in the field of computer science. This wont be an easy read for people not familiar with scala at all, but if you have some experience with the language and are interested in machine learning, I definitely recommend the book. It is a nice and quite deep dive into the topic. What I found very interesting was the optional math part available in each section. This book is also showing me where my understanding of scala is still superficial, the code is written a very good way.
Amazon Verified review Amazon
Amazon Customer Feb 20, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Some technical books are heavy on theory, some are heavy on practicality. There are books that describe ‘why’ and others that show you ‘how’. Personally, I have always tended towards the ‘how’ side of things - the ‘cookbook’ approach is one I have always liked. The more practical the better.Scala For Machine Learning (full disclosure, I received an unpaid review copy) is as about as practical as it gets. Loaded with code examples, this book leaves you in no doubt that it will help you construct your own code, against your own data in the fastest time possible. But it does not take any short-cuts.Machine Learning and Scala is a broad subject to cover and SFML does an admirable job of taking you from the first steps of data preparation, right the way through to a artificial neural networks and genetic algorithms.I work with data - it has been my day-to-day working life for over twenty years, and there is never any shortage of new territory to cover. I am predominately data engineering focused, and whilst the data content is the most important thing, it is the technology that keeps me engaged. In the final chapter of SFML Nicolas covers a nice selection of frameworks for concurrent processing. This really piqued my interest - Apache Spark is a hot topic at the moment and is given a good few pages here, reviewing its Scala and Akka heritage and demonstrating its core design principles of in-memory persistence, scheduling laziness, distributed dataset actions and shared variables. Better yet Nicolas shows you how to get Spark up and running and executing your first K-means tasks.The bulk of the book is made up with detailed reviews, explanations and implementation guides for different machine learning algorithms and methodologies. Each section is made up of a wealth of detailed explanation, detailed implementation instructions and guidelines, honestly - at times - the information density can be a little overwhelming, but it is all hugely valuable, and much appreciated.Running through the whole book is, of course, an appreciation of Scala and its abilities to perform in a distributed manner, at scale. This is the sort work that Scala, in all of its object-functional glory, was designed to address. I have read other Scala books and the programming language has always been presented in a “you may be used to, but Scala…” fashion. I am happy to say that is not the case here, with solid examples, proper real world code and thorough debriefs - Scala is presented as it should be, a language in its own right, being dealt with on its own terms.If your work or study includes components of machine learning and / or data processing - and you are looking for a modern take on some time-honoured methods, this book is work picking up and consuming, avidly.
Amazon Verified review Amazon
mathieu Dec 31, 2014
Full star icon Full star icon Full star icon Full star icon Full star icon 5
just wanted to let potential buyers know that the kindle version has none of the nuisances sometimes found in e-books. i just purchased this book (from amazon france) and did a quick check of the source code, equations, and diagrams. everything shows up perfectly. no comment on the content yet, but it looks quite interesting.
Amazon Verified review Amazon
Amazon Customer Feb 13, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Very detailed. Not for beginners in scala programming
Amazon Verified review Amazon
Tomer Ben David Feb 12, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
My favorite book these days, using scala? doing machine learning? I don't want to read a separate book on machine learning and another one on advanced scala tailored for machine learning, can anyone please summarize scala + machine learning in the same book? Yes - This book does it. Still reading but so far so good! Perfect match for me.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela