Why Scala?
Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be woven into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for the distributed computation of large amounts of data. Among the capabilities of the language, the following features are essential to machine learning and statistical analysis.
Abstraction
Monoids and monads are important concepts in functional programming. Monoids are derived from group theory and monads from category theory; both allow developers to create high-level abstractions, as illustrated in Twitter's Algebird (https://github.com/twitter/algebird) or the ScalaNLP Breeze (https://github.com/dlwh/breeze) libraries.
A monoid defines a binary operation op on a type T with the properties of closure, associativity, and an identity element. Let's consider the + operation defined for a set T using the following monoidal representation:
trait Monoid[T] {
  def zero: T
  def op(a: T, b: T): T
}
Monoid operations are associative. For instance, if ts1, ts2, and ts3 are three time series, then the property ts1 + (ts2 + ts3) = (ts1 + ts2) + ts3 is true. The associativity of a monoid operator is critical to the parallelization of computational workflows.
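For illustration, here is a minimal sketch of a monoid instance built on the Monoid trait above for the element-wise addition of time series, represented as vectors of Double; the tsMonoid name is hypothetical, not from any library. Because op is associative, a parallel collection is free to fold its partitions in any grouping (this assumes a Scala version in which .par is available, either in the standard library or through the scala-parallel-collections module):

// A monoid for the element-wise sum of time series; the empty
// vector acts as the identity element.
val tsMonoid = new Monoid[Vector[Double]] {
  def zero: Vector[Double] = Vector.empty
  def op(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    if (a.isEmpty) b
    else if (b.isEmpty) a
    else a.zip(b).map { case (x, y) => x + y }
}

// Associativity lets the runtime combine partial sums in any order.
val series = List(Vector(1.0, 2.0), Vector(3.0, 4.0), Vector(5.0, 6.0))
val sum = series.par.fold(tsMonoid.zero)(tsMonoid.op)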
Monads are structures that can be seen either as containers by programmers or as a generalization of monoids. The collections bundled with the Scala standard library (List, Map, and so on) are constructed as monads [1:1]. Monads provide the ability for those collections to perform the following functions:
- Create the collection.
- Transform the elements of the collection.
- Flatten nested collections.
A common categorical representation of a monad in Scala is a trait, Monad, parameterized with a container type M:
trait Monad[M[_]] {
  def apply[T](a: T): M[T]
  def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]
}
Monads allow those collections or containers to be chained to generate a workflow. This property is applicable to any scientific computation [1:2].
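As a simple sketch, the Monad trait above can be instantiated for the standard List by delegating to the collection's own constructor and flatMap method; the listMonad name and the two processing stages below are purely illustrative:

// A monad instance for List: apply wraps a single value, flatMap
// delegates to the collection's own flatMap.
val listMonad = new Monad[List] {
  def apply[T](a: T): List[T] = List(a)
  def flatMap[T, U](m: List[T])(f: T => List[U]): List[U] = m.flatMap(f)
}

// Chaining two transformations into a small workflow.
val roots  = listMonad.flatMap(List(1.0, 4.0, 9.0))(x => listMonad(math.sqrt(x)))
val scaled = listMonad.flatMap(roots)(x => listMonad(x * 10.0))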
Scalability
As seen previously, monoids and monads enable the parallelization and chaining of data processing functions by leveraging Scala's higher-order methods. In terms of implementation, actors are the core elements that make Scala scalable. Actors act as coroutines, managing the underlying thread pool, and communicate through asynchronous message passing. Distributed computing frameworks such as Akka and Spark extend the capabilities of the Scala standard library to support computation on very large datasets. Akka and Spark are described in detail in the last chapter of this book [1:3].
In a nutshell, a workflow is implemented as a sequence of activities or computational tasks. Those tasks consist of higher-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala allows these observations to be partitioned by executing those tasks through a cluster of actors, and it supports the dispatching and routing of messages between local and remote actors. Engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with little or no change to the code.
In this design, a controller, that is, the master node, manages the sequence of tasks 1 to 4, similar to a scheduler. These tasks are actually executed over multiple worker nodes, which are implemented by Scala actors. The master node exchanges messages with the workers to manage the state of the execution of the workflow as well as its reliability. High availability of these tasks is implemented through a hierarchy of supervising actors.
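The master/worker exchange can be sketched with the classic Akka actor API; the message types (Partition, Execute, Completed), the partitioning scheme, and the summation task below are hypothetical placeholders, not part of any library:

import akka.actor.{Actor, ActorSystem, Props}

// Illustrative protocol messages.
case class Partition(data: Seq[Double])
case class Execute(id: Int, slice: Seq[Double])
case class Completed(id: Int, result: Double)

// A worker executes one task on its slice of the observations and
// reports the result back to the sender (the master).
class Worker extends Actor {
  def receive: Receive = {
    case Execute(id, slice) => sender() ! Completed(id, slice.sum)
  }
}

// The master partitions the dataset, dispatches the slices to its
// workers, and collects the partial results.
class Master(numWorkers: Int) extends Actor {
  private val workers = Vector.tabulate(numWorkers)(
    i => context.actorOf(Props[Worker], s"worker-$i"))

  def receive: Receive = {
    case Partition(data) =>
      data.grouped(data.size / numWorkers + 1).zipWithIndex.foreach {
        case (slice, id) => workers(id % numWorkers) ! Execute(id, slice)
      }
    case Completed(id, result) => println(s"Task $id completed: $result")
  }
}

val system = ActorSystem("workflow")
val master = system.actorOf(Props(new Master(4)), "master")
master ! Partition((1 to 1000).map(_.toDouble))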
Configurability
Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits. One of the most commonly used dependency injection patterns, the cake pattern, is used throughout this book to create dynamic computation workflows and plots.
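A minimal sketch of the cake pattern follows; the component and method names are illustrative, and the decision rule is a placeholder. Each stage is declared as a trait with an abstract component, the self-reference declares a dependency without binding it to an implementation, and the dependencies are resolved only when the traits are stacked together:

trait PreprocessingComponent {
  def preprocessor: Preprocessor
  trait Preprocessor {
    def apply(xs: Vector[Double]): Vector[Double]
  }
}

trait ClassifierComponent {
  self: PreprocessingComponent =>       // self-referenced composition
  def classifier: Classifier
  trait Classifier {
    def predict(xs: Vector[Double]): Int = {
      val prepared = preprocessor(xs)   // injected dependency
      if (prepared.sum > 0.0) 1 else 0  // placeholder decision rule
    }
  }
}

// Stacking the traits resolves the dependencies at instantiation.
object Workflow extends ClassifierComponent with PreprocessingComponent {
  val preprocessor: Preprocessor = new Preprocessor {
    def apply(xs: Vector[Double]): Vector[Double] = xs.map(_ / xs.max)
  }
  val classifier: Classifier = new Classifier {}
}

val label = Workflow.classifier.predict(Vector(0.5, -1.5, 2.0))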
Maintainability
Scala natively supports the embedding of domain-specific languages (DSLs). DSLs are syntactic layers built on top of Scala's native libraries, and they allow software developers to abstract computation in terms that are easily understood by scientists. The best-known application of DSLs is the emulation of the MATLAB syntax, with which data scientists are familiar.
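As a toy illustration of such a syntactic layer, an implicit class can add infix vector operators on top of plain Scala arrays so that expressions read close to their MATLAB equivalents; the MatlabLike object and its operators are hypothetical:

object MatlabLike {
  // Adds MATLAB-style infix operators to Array[Double].
  implicit class VecOps(v: Array[Double]) {
    def +(w: Array[Double]): Array[Double] =
      v.zip(w).map { case (a, b) => a + b }
    def *(s: Double): Array[Double] = v.map(_ * s)
    def dot(w: Array[Double]): Double =
      v.zip(w).map { case (a, b) => a * b }.sum
  }
}

import MatlabLike._

val x = Array(1.0, 2.0, 3.0)
val y = Array(4.0, 5.0, 6.0)
val z = (x + y) * 0.5   // reads like the equivalent MATLAB expression
val d = x dot y         // infix notation for the dot product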
Computation on demand
Lazy methods and values allow developers to execute functions and allocate computing resources on demand. The Spark framework relies on lazy variables and methods to chain its Resilient Distributed Datasets (RDDs).
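The sketch below contrasts the two flavors of laziness; the expensiveModel name is a hypothetical placeholder, and the Spark snippet assumes a local Spark installation. A lazy val defers its computation until first access and caches the result, while an RDD chain only builds a lineage until an action such as count forces its execution:

// A lazy value defers allocation until first access; later accesses
// reuse the cached result.
def expensiveModel: Array[Double] = {
  println("Allocating model...")
  Array.fill(1000000)(math.random)
}

lazy val model = expensiveModel   // nothing is executed yet
val weight = model(0)             // first access triggers the allocation

// Spark transformations are lazy in the same spirit: map and filter
// build a lineage of RDDs; only the count action triggers execution.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("lazy-demo"))
val count = sc.parallelize(1 to 1000000)
  .map(_ * 2.0)
  .filter(_ > 100.0)
  .count()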