Getting started
This section starts by laying out the implementation infrastructure for Chapter 4, Building a Spam Classification Pipeline. The goal of this section is to get started on developing a data pipeline to analyze the flight on-time dataset. The first step, before any implementation, is to set up the prerequisites, which the next subsection covers.
Setting up prerequisite software
The following prerequisites and prerequisite checks are recommended. MongoDB is new to this list:
- Increase Java memory
- Review JDK version
- Self-contained Scala application based on Simple Build Tool (SBT), where all dependencies are wired into the build.sbt file
- MongoDB
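The first two checks on this list can be approximated from inside the JVM itself. The following is a minimal sketch, not a prescribed procedure: the object name PrereqCheck and its members are hypothetical, and it relies only on the standard Runtime.maxMemory and System.getProperty("java.version") calls to report what the running JVM sees.

```scala
// Hypothetical helper for illustration only; the names are assumptions.
object PrereqCheck {

  // Maximum heap the JVM will attempt to use, in megabytes.
  // Runtime.maxMemory returns Long.MaxValue when there is no inherent limit.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  // Version string reported by the running Java runtime.
  def javaVersion: String = System.getProperty("java.version")

  def main(args: Array[String]): Unit = {
    println(s"Java version: $javaVersion")
    println(s"Max heap (MB): $maxHeapMb")
  }
}
```

Running this before and after changing the JVM memory settings gives a quick way to confirm that a new heap limit actually took effect.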
We start by detailing the steps to increase the memory available to the Spark application. Why would we want to do that? This and other points related to Java heap space memory are explored in the following topic.
Increasing Java memory
Flight on-time records, compiled over a period of time, say, month by month, become big or...