Initializing Cascalog and Hadoop for distributed processing
Hadoop grew out of the Apache Nutch project and was developed largely at Yahoo! as an open source implementation of Google's MapReduce model. Since then, it's become one of the most widely tested and used systems for distributed processing.
Hadoop is the central part of a larger ecosystem. It's complemented by a range of other tools, including the Hadoop Distributed File System (HDFS) and Pig, a language for writing jobs that run on Hadoop.
One tool that makes working with Hadoop easier is Cascading. It provides a workflow-like layer on top of Hadoop that makes some data processing and analysis tasks much easier to express. Cascalog is a Clojure-idiomatic interface to Cascading and, ultimately, Hadoop.
This recipe will show you how to access and query data in Clojure sequences using Cascalog.
Getting ready
First, we have to list our dependencies in the Leiningen project.clj file:
(defproject distrib-data "0.1.0" :dependencies [[org...