Processing data with Apache Spark
In this section, we will implement the examples from Chapter 3, Processing – MapReduce and Beyond, using the Scala API. We will consider both the batch and real-time processing scenarios. We will show you how Spark Streaming can be used to compute statistics on the live Twitter stream.
Building and running the examples
Scala source code for the examples can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch5. We will be using sbt to build, manage, and execute code.
The build.sbt file controls the codebase metadata and software dependencies. These include the version of Scala that Spark links against, a link to the Akka package repository used to resolve implicit dependencies, and dependencies on the Spark and Hadoop libraries.
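As a rough illustration of the elements just described, a build.sbt along the following lines would declare each of them. This is a minimal sketch and not the file shipped in the book's repository: the project name, all version numbers, and the resolver URL are assumptions chosen for illustration and should be matched to the Spark and Hadoop versions installed on your cluster.

```scala
// Hypothetical build.sbt sketch; names and versions are illustrative assumptions.

// Codebase metadata
name := "book-examples-ch5"

version := "0.1"

// Scala version; it must match the one your Spark distribution was built with
scalaVersion := "2.10.4"

// Extra repository used to resolve Akka artifacts that Spark pulls in
// (URL is an assumption; check it against your environment)
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

// Spark and Hadoop dependencies. %% appends the Scala binary version to the
// artifact name (for example, spark-core_2.10); % uses the name as written.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-streaming" % "1.2.0",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.2.0",
  "org.apache.hadoop" % "hadoop-client" % "2.5.0"
)
```

The blank lines between settings are deliberate: older sbt releases require them as separators in .sbt files, and newer releases accept them.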
The source code for all examples can be compiled with:
$ sbt compile
Or, it can be packaged into a JAR file with:
$ sbt package
A helper script to execute compiled classes can be generated with:
...