What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!

Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!

50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

Thousands of reference materials covering every tech concept you need to stay up to date.

Subscribe now

View plans & pricing

Scala for Data Science

Chapter 1. Scala and Data Science

The second half of the 20^th century was the age of silicon. In fifty years, computing power went from extremely scarce to entirely mundane. The first half of the 21^st century is the age of the Internet. The last 20 years have seen the rise of giants such as Google, Twitter, and Facebook—giants that have forever changed the way we view knowledge.

The Internet is a vast nexus of information. Ninety percent of the data generated by humanity has been generated in the last 18 months. The programmers, statisticians, and scientists who can harness this glut of data to derive real understanding will have an ever greater influence on how businesses, governments, and charities make decisions.

This book strives to introduce some of the tools that you will need to synthesize the avalanche of data to produce true insight.

Data science

Data science is the process of extracting useful information from data. As a discipline, it remains somewhat ill-defined, with nearly as many definitions as there are experts. Rather than add yet another definition, I will follow Drew Conway's description (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). He describes data science as the culmination of three orthogonal sets of skills:

Data scientists must have hacking skills. Data is stored and transmitted through computers. Computers, programming languages, and libraries are the hammers and chisels of data scientists; they must wield them with confidence and accuracy to sculpt the data as they please. This is where Scala comes in: it's a powerful tool to have in your programming toolkit.
Data scientists must have a sound understanding of statistics and numerical algorithms. Good data scientists will understand how machine learning algorithms function and how to interpret results. They will not be fooled by misleading metrics, deceptive statistics, or misinterpreted causal links.
A good data scientist must have a sound understanding of the problem domain. The data science process involves building and discovering knowledge about the problem domain in a scientifically rigorous manner. The data scientist must, therefore, ask the right questions, be aware of previous results, and understand how the data science effort fits in the wider business or research context.

Drew Conway summarizes this elegantly with a Venn diagram showing data science at the intersection of hacking skills, maths and statistics knowledge, and substantive expertise:

It is, of course, rare for people to be experts in more than one of these areas. Data scientists often work in cross-functional teams, with different members providing the expertise for different areas. To function effectively, every member of the team must nevertheless have a general working knowledge of all three areas.

To give a more concrete overview of the workflow in a data science project, let's imagine that we are trying to write an application that analyzes the public perception of a political campaign. This is what the data science pipeline might look like:

Obtaining data: This might involve extracting information from text files, polling a sensor network or querying a web API. We could, for instance, query the Twitter API to obtain lists of tweets with the relevant hashtags.
Data ingestion: Data often comes from many different sources and might be unstructured or semi-structured. Data ingestion involves moving data from the data source, processing it to extract structured information, and storing this information in a database. For tweets, for instance, we might extract the username, the names of other users mentioned in the tweet, the hashtags, text of the tweet, and whether the tweet contains certain keywords.
Exploring data: We often have a clear idea of what information we want to extract from the data but very little idea how. For instance, let's imagine that we have ingested thousands of tweets containing hashtags relevant to our political campaign. There is no clear path to go from our database of tweets to the end goal: insight into the overall public perception of our campaign. Data exploration involves mapping out how we are going to get there. This step will often uncover new questions or sources of data, which requires going back to the first step of the pipeline. For our tweet database, we might, for instance, decide that we need to have a human manually label a thousand or more tweets as expressing "positive" or "negative" sentiments toward the political campaign. We could then use these tweets as a training set to construct a model.
Feature building: A machine learning algorithm is only as good as the features that enter it. A significant fraction of a data scientist's time involves transforming and combining existing features to create new features more closely related to the problem that we are trying to solve. For instance, we might construct a new feature corresponding to the number of "positive" sounding words or pairs of words in a tweet.
Model construction and training: Having built the features that enter the model, the data scientist can now train machine learning algorithms on their datasets. This will often involve trying different algorithms and optimizing model hyperparameters. We might, for instance, settle on using a random forest algorithm to decide whether a tweet is "positive" or "negative" about the campaign. Constructing the model involves choosing the right number of trees and how to calculate impurity measures. A sound understanding of statistics and the problem domain will help inform these decisions.
Model extrapolation and prediction: The data scientists can now use their new model to try and infer information about previously unseen data points. They might pass a new tweet through their model to ascertain whether it speaks positively or negatively of the political campaign.
Distillation of intelligence and insight from the model: The data scientists combine the outcome of the data analysis process with knowledge of the business domain to inform business decisions. They might discover that specific messages resonate better with the target audience, or with specific segments of the target audience, leading to more accurate targeting. A key part of informing stakeholders involves data visualization and presentation: data scientists create graphs, visualizations, and reports to help make the insights derived clear and compelling.

This is far from a linear pipeline. Often, insights gained at one stage will require the data scientists to backtrack to a previous stage of the pipeline. Indeed, the generation of business insights from raw data is normally an iterative process: the data scientists might do a rapid first pass to verify the premise of the problem and then gradually refine the approach by adding new data sources or new features or trying new machine learning algorithms.

In this book, you will learn how to deal with each step of the pipeline in Scala, leveraging existing libraries to build robust applications.

Why Scala?

You want to write a program that handles data. Which language should you choose?

There are a few different options. You might choose a dynamic language such as Python or R or a more traditional object-oriented language such as Java. In this section, we will explore how Scala differs from these languages and when it might make sense to use it.

When choosing a language, the architect's trade-off lies in a balance of provable correctness versus development speed. Which of these aspects you need to emphasize will depend on the application requirements and where on the permanence spectrum your program lies. Is this a short script that will be used by a few people who can easily fix any problems that arise? If so, you can probably permit a certain number of bugs in rarely used code paths: when a developer hits a snag, they can just fix the problem as it arises. By contrast, if you are developing a database engine that you plan on releasing to the wider world, you will, in all likelihood, favor correctness over rapid development. The SQLite database engine, for instance, is famous for its extensive test suite, with 800 times as much testing code as application code (https://www.sqlite.org/testing.html).

What matters, when estimating the correctness of a program, is not the perceived absence of bugs, it is the degree to which you can prove that certain bugs are absent.

There are several ways of proving the absence of bugs before the code has even run:

Static type checking occurs at compile time in statically typed languages, but this can also be used in strongly typed dynamic languages that support type annotations or type hints. Type checking helps verify that we are using functions and classes as intended.
Static analyzers and linters that check for undefined variables or suspicious behavior (such as parts of the code that can never be reached).
Declaring some attributes as immutable or constant in compiled languages.
Unit testing to demonstrate the absence of bugs along particular code paths.

There are several more ways of checking for the absence of some bugs at runtime:

Dynamic type checking in both statically typed and dynamic languages
Assertions verifying supposed program invariants or expected contracts

In the next sections, we will examine how Scala compares to other languages in data science.

Static typing and type inference

Scala's static typing system is very versatile. A lot of information as to the program's behavior can be encoded in types, allowing the compiler to guarantee a certain level of correctness. This is particularly useful for code paths that are rarely used. A dynamic language cannot catch errors until a particular branch of execution runs, so a bug can persist for a long time until the program runs into it. In a statically typed language, any bug that can be caught by the compiler will be caught at compile time, before the program has even started running.

Statically typed object-oriented languages have often been criticized for being needlessly verbose. Consider the initialization of an instance of the Example class in Java:

Example myInstance = new Example() ;

We have to repeat the class name twice—once to define the compile-time type of the myInstance variable and once to construct the instance itself. This feels like unnecessary work: the compiler knows that the type of myInstance is Example (or a superclass of Example) as we are binding a value of the Example type.

Scala, like most functional languages, uses type inference to allow the compiler to infer the type of variables from the instances bound to them. We would write the equivalent line in Scala as follows:

val myInstance = new Example()

The Scala compiler infers that myInstance has the Example type at compile time. A lot of the time, it is enough to specify the types of the arguments and of the return value of a function. The compiler can then infer types for all the variables defined in the body of the function. Scala code is usually much more concise and readable than the equivalent Java code, without compromising any of the type safety.

Scala encourages immutability

Scala encourages the use of immutable objects. In Scala, it is very easy to define an attribute as immutable:

val amountSpent = 200

The default collections are immutable:

val clientIds = List("123", "456") // List is immutable
clientIds(1) = "589" // Compile-time error

Having immutable objects removes a common source of bugs. Knowing that some objects cannot be changed once instantiated reduces the number of places bugs can creep in. Instead of considering the lifetime of the object, we can narrow in on the constructor.

Scala and functional programs

Scala encourages functional code. A lot of Scala code consists of using higher-order functions to transform collections. You, as a programmer, do not have to deal with the details of iterating over the collection. Let's write an occurrencesOf function that returns the indices at which an element occurs in a list:

def occurrencesOf[A](elem:A, collection:List[A]):List[Int] = {
  for { 
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}

How does this work? We first declare a new list, collection.zipWithIndex, whose elements are (collection(0), 0), (collection(1), 1), and so on: pairs of the collection's elements and their indexes.

We then tell Scala that we want to iterate over this collection, binding the currentElem variable to the current element and index to the index. We apply a filter on the iteration, selecting only those elements for which currentElem == elem. We then tell Scala to just return the index variable.

We did not need to deal with the details of the iteration process in Scala. The syntax is very declarative: we tell the compiler that we want the index of every element equal to elem in collection and let the compiler worry about how to iterate over collection.

Consider the equivalent in Java:

static <T> List<Integer> occurrencesOf(T elem, List<T> collection) {
  List<Integer> occurrences = new ArrayList<Integer>() ;
  for (int i=0; i<collection.size(); i++) {
    if (collection.get(i).equals(elem)) {
      occurrences.add(i) ;
    }
  }
  return occurrences ;
}

In Java, you start by defining a (mutable) list in which to put occurrences as you find them. You then iterate over the collection by defining a counter, considering each element in turn and adding its index to the list of occurrences, if need be. There are many more moving parts that we need to get right for this method to work. These moving parts exist because we must tell Java how to iterate over the collection, and they represent a common source of bugs.

Furthermore, as a lot of code is taken up by the iteration mechanism, the line that defines the logic of the function is harder to find:

static <T> List<Integer> occurrencesOf(T elem, List<T> collection) {
  List<Integer> occurences = new ArrayList<Integer>() ;
  for (int i=0; i<collection.size(); i++) {
    if (collection.get(i).equals(elem)) { 
      occurrences.add(i) ;
    }
  }
  return occurrences ;
}

Note that this is not meant as an attack on Java. In fact, Java 8 adds a slew of functional constructs, such as lambda expressions, the Optional type that mirrors Scala's Option, or stream processing. Rather, it is meant to demonstrate the benefit of functional approaches in minimizing the potential for errors and maximizing clarity.

Null pointer uncertainty

We often need to represent the possible absence of a value. For instance, imagine that we are reading a list of usernames from a CSV file. The CSV file contains name and e-mail information. However, some users have declined to enter their e-mail into the system, so this information is absent. In Java, one would typically represent the e-mail as a string or an Email class and represent the absence of e-mail information for a particular user by setting that reference to null. Similarly, in Python, we might use None to demonstrate the absence of a value.

This approach is dangerous because we are not encoding the possible absence of e-mail information. In any nontrivial program, deciding whether an instance attribute can be null requires considering every occasion in which this instance is defined. This quickly becomes impractical, so programmers either assume that a variable is not null or code too defensively.

Scala (following the lead of other functional languages) introduces the Option[T] type to represent an attribute that might be absent. We might then write the following:

class User {
  ...
  val email:Option[Email]
  ...
}

We have now encoded the possible absence of e-mail in the type information. It is obvious to any programmer using the User class that e-mail information is possibly absent. Even better, the compiler knows that the email field can be absent, forcing us to deal with the problem rather than recklessly ignoring it to have the application burn at runtime in a conflagration of null pointer exceptions.

All this goes back to achieving a certain level of provable correctness. Never using null, we know that we will never run into null pointer exceptions. Achieving the same level of correctness in languages without Option[T] requires writing unit tests on the client code to verify that it behaves correctly when the e-mail attribute is null.

Note that it is possible to achieve this in Java using, for instance, Google's Guava library (https://code.google.com/p/guava-libraries/wiki/UsingAndAvoidingNullExplained) or the Optional class in Java 8. It is more a matter of convention: using null in Java to denote the absence of a value has long been the norm.

Easier parallelism

Writing programs that take advantage of parallel architectures is challenging. It is nevertheless necessary to tackle all but the simplest data science problems.

Parallel programming is difficult because we, as programmers, tend to think sequentially. Reasoning about the order in which different events can happen in a concurrent program is very challenging.

Scala provides several abstractions that greatly facilitate the writing of parallel code. These abstractions work by imposing constraints on the way parallelism is achieved. For instance, parallel collections force the user to phrase the computation as a sequence of operations (such as map, reduce, and filter) on collections. Actor systems require the developer to think in terms of actors that encapsulate the application state and communicate by passing messages.

It might seem paradoxical that restricting the programmer's freedom to write parallel code as they please avoids many of the problems associated with concurrency. However, limiting the number of ways in which a program behaves facilitates thinking about its behavior. For instance, if an actor is misbehaving, we know that the problem lies either in the code for this actor or in one of the messages that the actor receives.

As an example of the power afforded by having coherent, restrictive abstractions, let's use parallel collections to solve a simple probability problem. We will calculate the probability of getting at least 60 heads out of 100 coin tosses. We can estimate this using Monte Carlo: we simulate 100 coin tosses by drawing 100 random Boolean values and check whether the number of true values is at least 60. We repeat this until results have converged to the required accuracy, or we get bored of waiting.

Let's run through this in a Scala console:

scala> val nTosses = 100
nTosses: Int = 100

scala> def trial = (0 until nTosses).count { i =>
  util.Random.nextBoolean() // count the number of heads
}
trial: Int

The trial function runs a single set of 100 throws, returning the number of heads:

scala> trial
Int = 51

To get our answer, we just need to repeat trial as many times as we can and aggregate the results. Repeating the same set of operations is ideally suited to parallel collections:

scala> val nTrials = 100000
nTrials: Int = 100000

scala> (0 until nTrials).par.count { i => trial >= 60 }
Int = 2745

The probability is thus approximately 2.5% to 3%. All we had to do to distribute the calculation over every CPU in our computer is use the par method to parallelize the range (0 until nTrials). This demonstrates the benefits of having a coherent abstraction: parallel collections let us trivially parallelize any computation that can be phrased in terms of higher-order functions on collections.

Clearly, not every problem is as easy to parallelize as a simple Monte Carlo problem. However, by offering a rich set of intuitive abstractions, Scala makes writing parallel applications manageable.

Interoperability with Java

Scala runs on the Java virtual machine. The Scala compiler compiles programs to Java byte code. Thus, Scala developers have access to Java libraries natively. Given the phenomenal number of applications written in Java, both open source and as part of the legacy code in organizations, the interoperability of Scala and Java helps explain the rapid uptake of Scala.

Interoperability has not just been unidirectional: some Scala libraries, such as the Play framework, are becoming increasingly popular among Java developers.

Key benefits

A complete guide for scalable data science solutions, from data ingestion to data visualization

Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations

Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided

Description

Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, Java, and so on are mostly used for data science. It is particularly good at analyzing large sets of data without any significant impact on performance and thus Scala is being adopted by many developers and data scientists. Data scientists might be aware that building applications that are truly scalable is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks will give you the tools to construct robust data pipelines. This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala. Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and modelling your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. As well as Scala’s emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks. This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

What you will learn

Transform and filter tabular data to extract features for machine learning

Implement your own algorithms or take advantage of MLLib's extensive suite of models to build distributed machine learning pipelines

Read, transform, and write data to both SQL and NoSQL databases in a functional manner

Write robust routines to query web APIs

Read data from web APIs such as the GitHub or Twitter API

Use Scala to interact with MongoDB, which offers high performance and helps to store large data sets with uncertain query requirements

Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations

Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!

Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!

50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

Thousands of reference materials covering every tech concept you need to stay up to date.

Subscribe now

View plans & pricing

Frequently bought together

$61.99

$48.99

$65.99

Total $ 176.97

adnan baloch May 05, 2016

One of the hottest jobs these days is that of the data scientist. It makes sense given the explosion of data generated by the online activities of millions of internet users and collected by online businesses and social media websites. As the author of this book explains, data scientists need to be conversant in three areas at once: programming, statistics/numerical algorithms and the ability to ask the right questions that will help in making decisions crucial to expanding a business and keeping it competitive. This book deals with the first of these essential skills: programming. Scala is a functional programming language with powerful parallel computing capabilities. The functional part of the language ensures that code written in Scala is terse and avoids common bugs that are the major source of headaches in traditional languages like Python or Java. The one place where Scala lags is in the availability of mature libraries. Still, the author discusses several good Scala libraries that make the Scala programmer's job easy so she can focus on the actual data science. Breeze and Breeze-viz are put to use in manipulating arrays of data and plotting simple graphs respectively. Parallel collections are explained intuitively so that anyone without any experience of parallel computation will find it useful. Futures make it possible to add further concurrency to Scala based projects by freeing the main thread from blocking events like waiting to receive data from a web page.Databases form the core of data storage in any data focused programming solution. The author shows how to write a functional wrapper for JDBC and also discusses a popular functional wrapper called Slick so the readers will be equipped to handle both scenarios depending on their needs. Gathering data from the web can hardly work without an understanding of interfacing with APIs. The author takes a very practical approach in exploring this crucial aspect by querying the Github API and storing the data in MongoDB. Furthermore, readers get to see how to create their own simple web API. Sooner or later, data scientists have to turn to distributed computing for the horsepower needed to complete their complex calculations. Actor based concurrency using Akka fills this gap and the author gives it an excellent treatment in a dedicated chapter. Machine learning is discussed using MLlib but a good conceptual understanding of ML is needed for this chapter. The uninitiated are forewarned: don't expect the author to teach machine learning in a single chapter. For me, the most exciting two chapters are the ones that use the Play framework with D3.js to build a single page app. This represents true empowerment because it enables budding data scientists to share their fruits of labor with the entire web community in a visually captivating way. In short, data scientists wondering about Scala's effectiveness as a great tool for data science need only skim through this book. They won't be disappointed.

Amazon Verified review

Bill Jones Apr 23, 2016

The good: The book covers using Scala with various tools and provides use cases, it dives in but not deep. In my opinion it is a great beginner book to help you get started with Scala, but you'll want to pick up another title after this for continued learning.The bad: I really wished it would have dived in deeper and focused less on integrating multiple say DB platforms, but overall not enough to make me hate the book.

Amazon Customer Feb 21, 2017

Very good. You can also buy Scala for Machine Learning

Timothy J. Whittaker Apr 09, 2016

I spent a lot of time looking for a book like this. The other reviewer is correct, there is very little on actual statistical learning in this text, but this is not the author's aim. To me, this is more about awareness of some great Scala (and Java) libraries (with application) that any data scientist should find useful. The definition of data science taken by this book is probably the broadest I have seen - there is something worthwhile in every single chapter of this book.

Duncan W. Robinson Mar 21, 2016

Scala for Data Science was a fairly good introduction for me to applied Scala applications and interoperability. Working through a few examples in this book proved to be my first foray into using Scala. In my opinion, the book seemed a bit light on techniques for statistical learning, but was rich in tools showing how to Scala with JSON, APIs, SQL, MongoDB, and Spark.

Scala for Data Science: Leverage the power of Scala with different tools to build scalable, robust data science applications

What do you get with a Packt Subscription?

Scala for Data Science

Chapter 1. Scala and Data Science

Data science

Programming in data science

Why Scala?

Static typing and type inference

Scala encourages immutability

Scala and functional programs

Null pointer uncertainty

Easier parallelism

Interoperability with Java

When not to use Scala

Summary

References

Page 1 of 7

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with a Packt Subscription?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Scala for Data Science: Leverage the power of Scala with different tools to build scalable, robust data science applications

What do you get with a Packt Subscription?

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with a Packt Subscription?

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs