Why Scala?
Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular map-reduce model for distributed computation of large amounts of data.
Note
Disclaimer
This section does not constitute a formal introduction or description of the features of Scala. It merely mentions some of its features that are valuable to machine learning practitioners. Experienced Scala developers may skip to the next section.
Among the capabilities of the language, the following features are deemed essential in machine learning and statistical analysis.
Scala as a functional language
There are many functional features in Scala which may unsettle software engineers with experience in object-oriented programming. This section deals specifically with monadic and functorial representations of data. Functors and monads are concepts defined in the field of mathematics known as category theory. Formerly:
- A functor is a data type that defines how a transformation known as a map applies to it. Scala implements functors as type classes with a
map
method. - A monad is a wrapper around an existing data type. It applies a transformation to a data of wrapper type and returns a value of the same wrapper type. Scala implements monads as type classes with
unit
andflatMap
methods. Monads extends functors in Scala.
Abstraction
Functors and monads are important concepts in functional programming.
Functors and monads are derived from category and group theory; they allow developers to create a high-level abstraction, as illustrated in the following Scala libraries:
- Scalaz: https://github.com/scalaz
- Twitter's Algebird: https://github.com/twitter/algebird
- Google's Breeze: https://github.com/dlwh/breeze
In mathematics, a category M is a structure that is defined by the following:
- Objects of some type {x e X, y Є Y, z Є Z, …}
- Morphisms or maps applied to these objects x Є X, y Є Y, f: x -› y
- Composition of morphisms f: x -› y, g: y -› z => g o f: x -› z
Covariant, contravariant functors, and bi-functors are well-understood concepts in algebraic topology that are related to manifold and vector bundles. They are commonly used in differential geometry for the generation of non-linear models.
Higher kinded types
Higher kinded types (HKTs) are abstractions of types. They generate a new type from existing types. Let's consider the following parameterized trait:
trait M[T] { . }
A higher kinded type H
over a trait M
is defined as follows:
trait H[M[_]]; class H[M[_]]
Functors and monads are higher kinded types.
How are higher kinded types relevant to data analysis and machine learning?
Scientists define observations as sets or vectors of features.
Classification problems rely on the estimation of the similarity between vectors of observations. One technique consists of comparing two vectors by computing the normalized inner (or dot) product. A co-vector is defined as a linear map α of vector to the inner product (field).
Note
Inner product
M1: Definition of inner product <.> and co-vector α:
Let's define a vector as a constructor from any field, _ => Vector[_]
. A co-vector is then defined as the mapping function of a vector to its field: Vector[_]
.
Let's then define a two-dimension (two types or fields) higher kinded structure, Hom
, that can be defined as either a vector or a co-vector by fixing one of the two types:
type Hom[T] = {
type Right[X] = Function1[X,T] // Co-vector
type Left[X] = Function1[T,X] // Vector
}
Note
Tensors and manifolds
Vector and co-vector are classes of tensor (contravariant and covariant). Tensors (fields) are used in manifold learning non-linear models and in the generation of kernel functions. Manifolds are briefly introduced in the Manifolds section in. The topic of tensor fields in manifold learning is beyond the scope of this book.
The projections of the higher-kind Hom
to Right
or Left
single parameter types are known as functors:
- Covariant functor for the right projection
- Contravariant functor for the left projection
Functors
A covariant functor is a mapping function, such as F: C => C, with the following properties:
- If f: x -› y is a morphism on C then F(x) -› F(y) is also a morphism on C
- If id: x -› x is the identity morphism on C then F(id) is also an identity morphism on C
- If g: y -› z is also a morphism on C then F(g o f) = F(g) o F(f)
The definition of the covariant functor is F[U => V] := F[U] => F[V]
. Its implementation in Scala is:
trait Functor[M[_]]{ def map[U,V](m: M[U])(f: U=>V): M[V] }
For example, let's consider an observation defined as an n dimension vector of type T, Obs[T]
. The constructor for the observation can be represented as Function1[T,Obs]
. Its functor, ObsFunctor
, is implemented as:
trait ObsFunctor[T] extends Functor[(Hom[T])#Left] { self => override def map[U,V](vu: Function1[T,U])(f: U =>V): Function1[T,V] = f.compose(vu) }
The functor is qualified as a covariant functor because the morphism is applied to the return type of the element of Obs, Function1[T, Obs]
. The projection of the two parameters types Hom
to a vector is implemented as (Hom[T])#Left
.
A contravariant functor is a mapping function, F: C => C, with the following properties:
- If f: x -› y is a morphism on C then F(x) -› F(y) is also a morphism on C
- If id: x -› x is the identity morphism on C then F(id) is also an identity morphism on C
- If g: y -› z is also a morphism on C then F(g o f) = F(f) o F(g)
The definition of the contravariant functor is F[U => V] := F[V] => F[U], as follows:
trait CoFunctor[M[_]]{ def map[U,V](m: M[U])(f: V=>U): M[V]}
Note that the input and output types in the morphism f are reversed from the definition of a covariant functor. The constructor for the co-vector can be represented as Function1[Obs,T].
Its functor, CoObsFunctor
, is implemented as:
trait CoObsFunctor[T] extends CoFunctor[(Hom[T])#Right] { self => override def map[U,V](vu: Function1[U,T])(f: V =>U): Function1[V,T] = f.andThen(vu) }
Monads
Monads are structures in algebraic topology related to category theory. Monads extend the concept of functors to allow composition known as the monadic composition of morphisms on a single type. They enable the chaining or weaving of computation into a sequence of steps sometimes known as a data pipeline. The collections bundled with the Scala standard library (List
, Map
…) are constructed as monads [1:1].
Monads provide the ability for those collections to do the following:
- Create the collection
- Transform the elements of the collection
- Flatten nested collections
The following Scala definition of a monad as a trait illustrates the concept of a higher kinded Monad
trait for type M:
trait Monad[M[_]] { Â Â def unit[T](a: T): M[T] Â Â def map[U,V](m: M[U])(f U =>V): M[V]Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â def flatMap[U,V](m: M[U])(f: U =>M[V]): M[V] }
Monads are therefore critical in machine learning as they enable the composition of multiple data transformation functions into a sequence or workflow. This property is applicable to any type of complex scientific computation [1:2].
Note
Monadic composition of kernel functions
Monads are used in the composition of kernel functions in the Kernel functions monadic composition section in Chapter 12, Kernel Models and Support Vector Machines.
Scala as an object oriented language
Machine learning models are generated through sequences of tasks or dataflows that demand a modular design.
As an object-oriented programming language, Scala allows developers to do the following:
- Define high-level component abstraction
- Allow different developers to work concurrently on different components
- Reuse code
- Isolate functionality for easier debugging and testing (unit tests)
You may wonder how Scala fares as an object-oriented programming against Java.
Tip
Scala versus Java
Scala is the purest form of object oriented language than Java. It does not support static methods (static methods are methods of singletons) and primitive types.
One important facet of object oriented programming is the ability to change modules or implement functionality on the fly, without the need to recompile the client code. This technique is known as dependency injection. Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits [1:3]. One of the most commonly used dependency injection patterns, the cake pattern, is described in the Building workflows with mixins section in Chapter 2, Data Pipelines.
Scala as a scalable language
As seen previously, functors and monads enable the parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, actors are one of the core elements that make Scala scalable. Actors provide Scala developers with a high level of abstraction to build scalable, distributed, and concurrent applications. Actors hide the nitty-gritty implementation of concurrency and the management of the underlying threads pool. Actors communicate through asynchronous immutable messages. A distributed computing Scala framework such as Akka or Apache Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Apache Spark are described in detail in the last chapter of this book [1:4].
Concisely, a workflow is implemented as a sequence of activities or computational tasks. These tasks consist of higher-order Scala methods such as flatMap
, map
, fold
, reduce
, collect
, join
, or filter
applied to a large collection of observations. Scala provides developers with the tools to partition datasets and execute the tasks through a cluster of actors. Scala also supports message dispatching and routing between local and remote actors. A developer may decide to deploy a workflow either locally or across multiple CPU cores and servers with very few code alterations.
The following figure visualizes the different elements of the definition and deployment of a workflow (or data pipeline):
In the preceding diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 in a similar way to a scheduler. These tasks are actually executed over multiple worker nodes, and are implemented by actors. The master node or actor exchanges messages with the workers to manage the state of the execution of the workflow, as well as its reliability, as illustrated in the Scalability with actors section of Chapter 16, Parallelism with Scala and Akka. The high availability of these tasks is maintained through a hierarchy of supervising actors.
Tip
Domain-specific languages (DSLs)
Scala embeds DSLs natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. A notorious application of DSLs is the definition of the emulation of the syntax use in the MATLAB program, familiar to most data scientists.