Getting data from Hadoop
Just as Kettle simplifies loading data into Hadoop, it makes pulling data back out of the Hadoop File System equally easy. In fact, we can treat it like any other flat-file data source.
Getting ready
For this recipe, we will be using the Baseball Dataset loaded into Hadoop in the recipe Loading data into Hadoop (also in this chapter). It is recommended that you complete that recipe before continuing.
We will be focusing on the Salaries.csv and Master.csv datasets. Let us find out just how much money each player earned over the course of their careers.
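Before building the transformation, it can help to see the calculation we are after in plain code. The following is a minimal Python sketch, not part of the Kettle recipe itself. It assumes that Salaries.csv has playerID and salary columns and that Master.csv has playerID, nameFirst, and nameLast columns; the sample rows are made up for illustration and stand in for the real files.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for the real files (illustrative values only).
# Column names are an assumption about the dataset's layout.
SALARIES_CSV = """yearID,teamID,lgID,playerID,salary
2001,AAA,NL,player01,100000
2002,AAA,NL,player01,150000
2001,BBB,AL,player02,200000
"""

MASTER_CSV = """playerID,nameFirst,nameLast
player01,Alex,Example
player02,Sam,Sample
"""


def career_earnings(salaries_file, master_file):
    """Sum salary per playerID, then attach each player's name."""
    totals = defaultdict(int)
    for row in csv.DictReader(salaries_file):
        totals[row["playerID"]] += int(row["salary"])

    # Build a lookup from playerID to full name using the master file.
    names = {
        row["playerID"]: f'{row["nameFirst"]} {row["nameLast"]}'
        for row in csv.DictReader(master_file)
    }
    return {pid: (names.get(pid, "unknown"), total) for pid, total in totals.items()}


result = career_earnings(io.StringIO(SALARIES_CSV), io.StringIO(MASTER_CSV))
for pid, (name, total) in sorted(result.items()):
    print(pid, name, total)
```

The transformation in this recipe does the same work with steps instead of code: two inputs, a grouping on the player, and a join to pick up the player's name.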
How to do it...
Perform the following steps to retrieve the baseball data from Hadoop:
Open Spoon and create a new transformation.
In the Design tab, under the Big Data section, select and bring over two Hadoop File Input steps. We will use one for each of the .csv files we wish to merge together.
Edit one of the Hadoop File Input steps. This will be used to pull in the Salaries.csv information.
For the File or...