What this book covers
We aim to give you a flavor for what is possible with Scala, and to get you started using libraries that are useful for building data science applications. We do not aim to provide an entirely comprehensive overview of any of these topics. This is best left to online documentation or to reference books. What we will teach you is how to combine these tools to build efficient, scalable programs, and have fun along the way.
Chapter 1, Scala and Data Science, is a brief description of data science, and of Scala's place in the data scientist's tool-belt. We describe why Scala is becoming increasingly popular in data science, and how it compares to alternative languages such as Python.
Chapter 2, Manipulating Data with Breeze, introduces Breeze, a library providing support for numerical algorithms in Scala. We learn how to perform linear algebra and optimization, and solve a simple machine learning problem using logistic regression.
Chapter 3, Plotting with breeze-viz, introduces the breeze-viz library for plotting two-dimensional graphs and histograms.
Chapter 4, Parallel Collections and Futures, describes basic concurrency constructs. We will learn to parallelize simple problems by distributing them over several threads using parallel collections, and apply what we have learned to build a parallel cross-validation pipeline. We then describe how to wrap computation in a future to execute it asynchronously. We apply this pattern to query a web API, sending several requests in parallel.
Chapter 5, Scala and SQL through JDBC, looks at interacting with SQL databases in a functional manner. We learn how to use common Scala patterns to wrap the Java interface exposed by JDBC. Besides learning about JDBC, this chapter introduces type classes, the loan pattern, implicit conversions, and other patterns that are frequently leveraged in libraries and existing Scala code.
Chapter 6, Slick - A Functional Interface for SQL, describes the Slick library for mapping data in SQL tables to Scala objects.
Chapter 7, Web APIs, describes how to query web APIs in a concurrent, fault-tolerant manner using futures. We learn to parse JSON responses and formulate complex HTTP requests with authentication. We walk through querying the GitHub API to obtain information about GitHub users programmatically.
Chapter 8, Scala and MongoDB, walks the reader through interacting with MongoDB, a leading NoSQL database. We build a pipeline that fetches user data from the GitHub API and stores it in a MongoDB database.
Chapter 9, Concurrency with Akka, introduces the Akka framework for building concurrent applications with actors. We use Akka to build a scalable crawler that explores the GitHub follower graph.
Chapter 10, Distributed Batch Processing with Spark, explores the Apache Spark framework for building distributed applications. We learn how to construct and manipulate distributed datasets in memory. We touch briefly on the internals of Spark, learning how the architecture allows for distributed, fault-tolerant computation.
Chapter 11, Spark SQL and DataFrames, describes DataFrames, one of the more powerful features of Spark for the manipulation of structured data. We learn how to load JSON and Parquet files into DataFrames.
Chapter 12, Distributed Machine Learning with MLlib, explores how to build distributed machine learning pipelines with MLlib, a library built on top of Apache Spark. We use the library to train a spam filter.
Chapter 13, Web APIs with Play, describes how to use the Play framework to build web APIs. We describe the architecture of modern web applications, and how these fit into the data science pipeline. We build a simple web API that returns JSON.
Chapter 14, Visualization with D3 and the Play Framework, builds on the previous chapter to program a fully fledged web application with Play and D3. We describe how to integrate JavaScript into a Play framework application.
Appendix, Pattern Matching and Extractors, describes how pattern matching provides the programmer with a powerful construct for control flow.