Building a random data sample for Weka
Weka is another open source tool officially supported by Pentaho, and it focuses on data mining. Like its cousins R and RapidMiner, Weka provides a library of statistical analysis tools that can be integrated into complex decision-making systems. In this recipe, we will build a random dataset for Weka using Kettle.
Getting ready
We will be using the baseball player salaries data, which is available from the book's website or from Lahman's Baseball Archive at http://www.seanlahman.com/baseball-archive/statistics/. The code for this recipe can also be found on the book's website.
This recipe also takes advantage of the ARFF Output plugin. This is available either via the Marketplace (for Kettle 5 and higher) or from the wiki at http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-Ins.
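To make the target of the recipe concrete, the following is a minimal Python sketch of what a random sample written in Weka's ARFF format looks like. It is not the Kettle transformation itself and does not use the ARFF Output plugin. The rows, the column names (yearID, teamID, lgID, playerID, salary), and the helper names are illustrative assumptions rather than values taken from the actual salaries file.

```python
import csv
import io
import random

# Made-up rows for illustration only; the column names are an assumption
# about the layout of the salaries file, and the values are not real data.
SAMPLE_CSV = """yearID,teamID,lgID,playerID,salary
2000,AAA,NL,player01,100000
2000,AAA,NL,player02,200000
2000,BBB,AL,player03,300000
2001,BBB,AL,player04,400000
2001,CCC,AL,player05,500000
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def random_sample(rows, k, seed=None):
    """Return up to k rows chosen at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def to_arff(relation, rows, numeric_fields):
    """Render rows as ARFF text: a header of attributes, then a data section.

    Fields listed in numeric_fields are declared numeric; all others are
    declared as strings and wrapped in single quotes. This sketch assumes
    the values contain no quotes or backslashes.
    """
    if not rows:
        raise ValueError("at least one row is needed to derive the attributes")
    fields = list(rows[0].keys())
    lines = ["@relation " + relation, ""]
    for name in fields:
        kind = "numeric" if name in numeric_fields else "string"
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data"]
    for row in rows:
        values = [
            row[name] if name in numeric_fields else "'" + row[name] + "'"
            for name in fields
        ]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    rows = read_rows(SAMPLE_CSV)
    sample = random_sample(rows, 3, seed=42)
    print(to_arff("salaries", sample, {"yearID", "salary"}))
```

Running the script prints an ARFF document with one @attribute line per column and three randomly chosen data rows. Passing a seed keeps the sample reproducible between runs, which is useful when comparing models built on the same sample.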
How to do it...
Perform the following steps to build a random data sample for Weka:
Create a new transformation...