Pentaho Analytics for MongoDB Cookbook

Pentaho Analytics for MongoDB Cookbook: Over 50 recipes to learn how to use Pentaho Analytics and MongoDB to create powerful analysis and reporting solutions

Joel Andre Latino, Harris Ward

Paperback | Dec 2015 | 218 pages | 1st Edition | $48.99


Pentaho Analytics for MongoDB Cookbook

Chapter 1. PDI and MongoDB

In this chapter, we will cover these recipes:

  • Learning basic operations with Pentaho Data Integration
  • Migrating data from the RDBMS to MongoDB
  • Loading data from MongoDB to MySQL
  • Migrating data from files to MongoDB
  • Exporting MongoDB data using the aggregation framework
  • MongoDB Map/Reduce using the User Defined Java Class step and MongoDB Java Driver
  • Working with jobs and filtering MongoDB data using parameters and variables

Introduction

Migrating data from an RDBMS to a NoSQL database, such as MongoDB, isn't an easy task, especially when your RDBMS has a lot of tables. It can be time consuming, and in most cases a manual migration effectively means developing a bespoke solution.

Pentaho Data Integration (or PDI, also known as Kettle) is an Extract, Transform, and Load (ETL) tool that can be used as a solution for this problem. PDI provides a graphical drag-and-drop development environment called Spoon. Primarily, PDI is used to create data warehouses. However, it can also be used for other scenarios, such as migrating data between two databases, exporting data to files with different formats (flat, CSV, JSON, XML, and so on), loading data into databases from many different types of source data, data cleaning, integrating applications, and so on.

The following recipes will focus on the main operations that you need to know to work with PDI and MongoDB.

Learning basic operations with Pentaho Data Integration

The following recipe is aimed at showing you the basic building blocks that you will use in the rest of the recipes in this chapter. We recommend that you work through this simple recipe before you tackle any of the others. PDI also contains a large selection of sample transformations for you to open, edit, and test; these can be found in the samples directory of the PDI home folder.

Getting ready

Before you can begin this recipe, you will need to make sure that the JAVA_HOME environment variable is set properly. By default, PDI tries to guess the value of the JAVA_HOME environment variable. Note that for this book, we are using Java 1.7. Once this is done, you're ready to launch Spoon, the graphical development environment for PDI. To start Spoon, use the appropriate script located in the PDI home folder: on Windows, execute the spoon.bat script; on Linux or Mac, execute the spoon.sh bash script instead.

How to do it…

First, we need to configure Spoon to be able to create transformations and/or jobs. To get acquainted with the tool, perform the following steps:

  1. Create a new empty transformation:
    1. Click on the New file button from the toolbar menu and select the Transformation item entry. You can also navigate to File | New | Transformation from the main menu. Ctrl + N also creates a new transformation.
  2. Set a name for the transformation:
    1. Open the Transformation settings dialog by pressing Ctrl + T. Alternatively, you can right-click on the right-hand-side working area and select Transformation settings. Or on the menu bar, select the Settings... item entry from the Edit menu.
    2. Select the Transformation tab.
    3. Set Transformation Name to First Test Transformation.
    4. Click on the OK button.
  3. Save the transformation:
    1. Click on the Save current file button from the toolbar. Alternatively, from the menu bar, go to File | Save. Or finally, use the quick option by pressing Ctrl + S.
    2. Choose the location of your transformation and give it the name chapter1-first-transformation.
    3. Click on the OK button.
  4. Run a transformation using Spoon.
    1. You can run the transformation in any of these ways: click on the green play icon on the transformation toolbar, navigate to Action | Run on the main menu, or simply press F9.
    2. You will get an Execute a transformation dialog. Here, you can set parameters, variables, or arguments if they are required for running the transformation.
    3. Run the transformation by clicking on the Launch button.
  5. Run the transformation in preview mode using Spoon.
    1. Preview the transformation by either clicking on the magnify icon on the transformation toolbar, going to Action | Preview on the main menu, or simply pressing F10.
    2. You will get a Transformation debug dialog. Select the step whose output data you want to preview; here you can also define the number of rows you want to see and any breakpoints.
    3. You can click on the Configure button to define parameters, variables, or arguments. Click on the Quick Launch button to preview the transformation.

How it works…

In this recipe, we introduced the Spoon tool, touching on the basic points you need to manage ETL transformations. We started by creating a transformation. We gave a name to the transformation, First Test Transformation in this case. Then, we saved the transformation in the filesystem with the name chapter1-first-transformation.

Finally, we ran the transformation normally and in debug mode. Understanding how to run a transformation in debug mode is useful for future ETL developments as it helps you understand what is happening inside of the transformation.

There's more…

In the PDI home folder, you will find a large selection of sample transformations and jobs that you can open, edit, and run to better understand the functionality of the diverse steps available in PDI.

Migrating data from the RDBMS to MongoDB

In this recipe, you will transfer data from a sample RDBMS to a MongoDB database. The sample data is called SteelWheels and is available in the Pentaho BA server, running on the Hypersonic Database Server.

Getting ready

Start the Pentaho BA Server by executing the appropriate scripts located in the BA Server's home folder. It is start-pentaho.sh for Unix/Linux operating systems, and for the Windows operating system, it is start-pentaho.bat. Also in Windows, you can go to the Start menu and choose Pentaho Enterprise Edition, then Server Management, and finally Start BA Server.

Start Pentaho Data Integration by executing the right scripts in the PDI home folder. It is spoon.sh for Unix/Linux operating systems and spoon.bat for the Windows operating system. Besides this, in Windows, you can go to the Start menu and choose Pentaho Enterprise Edition, then Design Tools, and finally Data Integration.

Start MongoDB. If you don't have the server running as a service, you need to execute the mongod --dbpath=<data folder> command from the bin folder of MongoDB.

To make sure you have the Pentaho BA Server started, you can access the default URL, which is http://localhost:8080/pentaho/. When you launch Spoon, you should see a welcome screen like the one pictured here:

Getting ready

How to do it…

After you have made sure that you are ready to start the recipe, perform the following steps:

  1. Create a new empty transformation.
    1. As was explained in the first recipe of this chapter, set the name of this transformation to Migrate data from RDBMS to MongoDB.
    2. Save the transformation with the name chapter1-rdbms-to-mongodb.
  2. Select the customer data from the SteelWheels database using the Table Input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the Table Input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the Table Input step to open the configuration dialog.
    4. Set the Step Name property to Select Customers.
    5. Before we can get any data from the SteelWheels Hypersonic database, we will have to create a JDBC connection to it.

      To do this, click on the New button next to the Database Connection pulldown. This will open the Database Connection dialog.

      Set Connection Name to SteelWheels. Next, select the Connection Type as Hypersonic. Set Host Name to localhost, Database Name to SampleData, Port to 9001, Username to pentaho_user, and finally Password to password. Your setup should look similar to the following screenshot:

      How to do it…
    6. You can test the connection by clicking on the Test button at the bottom of the dialog. You should get a message similar to Connection Successful. If not, then you must double-check your connection details.
    7. Click on OK to return to the Table Input step.
    8. Now that we have a valid connection set, we are able to get a list of customers from the SteelWheels database. Copy and paste the following SQL into the query text area:
      SELECT * FROM CUSTOMERS
    9. Click on the Preview button and you will see a table of customer details.
    10. Your Table Input step configuration should look similar to what is shown in the following screenshot:
      How to do it…
    11. Click on OK to exit the Table Input configuration dialog.
  3. Now, let's configure the output of the customer data to the MongoDB database.
    1. Under the Design tab, from the Big Data category folder, find the MongoDB Output step and drag and drop it into the working area in the right-hand-side view.
    2. As we want data to flow from the Table Input step to the MongoDB Output step, we are going to create a Hop between the steps. To do this, simply hover over the Table Input step and a popup will appear, with some options below the step. Click on Right Arrow and then on the MongoDB Output step. This will create a Hop between the two steps.
      How to do it…
    3. It's time to configure the MongoDB Output step. Double-click on it.
    4. Set Step Name to Customers Output.
    5. As we're running a default MongoDB instance, we only have to set some simple properties in this step. Set Hostname to localhost and Port to 27017.
    6. Select the Output options tab. In this tab, we can define how the data will be inserted into MongoDB.
    7. Set the Database property to SteelWheels. Don't worry if this database doesn't exist in MongoDB, as it will be created automatically.
    8. Set the Collection property to Customers. Again, don't worry if this collection doesn't exist in MongoDB, as it will be created automatically.
    9. Leave the Batch insert size property at 100. For performance and/or production purposes, you can increase it if necessary. If you don't provide any value to this field, the default value will be 100.
    10. Check the Truncate collection option so that the collection is truncated each time before we load data. In this way, if we rerun the transformation many times, we won't get duplicate records. Your Output options page should look like what is shown in this screenshot:
      How to do it…
    11. Now, let's define the MongoDB document structure. Select the Mongo document fields tab.
    12. Click on the Get fields button, and the fields list will be populated with the fields coming from the SteelWheels database through the ETL stream.
    13. By default, the column names in the SteelWheels database are in uppercase. In MongoDB, these field names should be in camel case. You can manually edit the names of the MongoDB document paths in this section also. Make sure that the Use Field Name option is set to No for each field, like this:
      How to do it…
    14. By clicking on Preview document structure, you will see an example of what the document will look like when it is inserted into the MongoDB Customers collection.
    15. Click on the OK button to finish the MongoDB Output configuration.
  4. The transformation design is complete. You can run it for testing purposes using the Run button, as illustrated here:
    How to do it…

How it works…

As you can see, this is a basic transformation that loads data from the RDBMS and inserts it into a MongoDB collection. This is a very simple example of loading data from one point to another. Not all transformations are like this, which is why PDI comes with various steps that allow you to manipulate data along the way.

In this case, we truncate the collection each time the transformation is run. However, it is also possible to use other combinations, such as Insert&Update or just Insert or Update individually.

There's more…

Now that we have designed a transformation, let's look at a simple way of reusing the MongoDB connection for future transformations.

How to reuse the properties of a MongoDB connection

If you have to create MongoDB connections manually for each transformation, you are likely to make mistakes and typos. A good way to avoid this is to store the MongoDB connection details in a separate .properties file on your filesystem. There is a file called kettle.properties that is located in a hidden directory called .kettle in your home directory. For example, in Linux, the location will be /home/latino/.kettle. In Windows, it will be C:\Users\latino\.kettle. Navigate to and open this .properties file in your favorite text editor. Then, copy and paste the following lines:

MONGODB_STEELWHEELS_HOSTNAME=localhost
MONGODB_STEELWHEELS_PORT=27017
MONGODB_STEELWHEELS_USERNAME=
MONGODB_STEELWHEELS_PASSWORD=

Save the .properties file and restart Spoon.

Now, where can we use these properties?

You will notice that when you are setting properties in certain PDI steps, you can see the following icon:

How to reuse the properties of a MongoDB connection

This icon denotes that we can use a variable or parameter in place of a static value. Variables are defined using the following structure: ${MY_VARIABLE}. You will notice that the variables are encapsulated in ${}. If you are not sure what the name of your variable is, you can also press Ctrl and the Spacebar; this will open a drop-down list of the available variables. You will see the MongoDB variables that you defined in the .properties file earlier in this list. With this in mind, we can now replace the connection details in our steps with variables as shown in this screenshot:

How to reuse the properties of a MongoDB connection
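
For reference, with the kettle.properties entries defined above, the connection fields of the MongoDB Output step from the previous recipe can then be filled in with variable references instead of literal values, roughly as follows:

Hostname: ${MONGODB_STEELWHEELS_HOSTNAME}
Port: ${MONGODB_STEELWHEELS_PORT}
Username: ${MONGODB_STEELWHEELS_USERNAME}
Password: ${MONGODB_STEELWHEELS_PASSWORD}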

You can find out more about the MongoDB Output step on this documentation website: http://wiki.pentaho.com/display/EAI/MongoDB+Output

Loading data from MongoDB to MySQL

In this recipe, we will guide you through extracting data from MongoDB and inserting it into a MySQL database. You will create a simple transformation as you did in the last recipe, but in reverse. You don't have to use MySQL as your database; you can use any other database, as long as Pentaho Data Integration can connect to it via JDBC. However, in this book, we will use MySQL as an example.

Getting ready

Make sure you have a MySQL server (or another database server) available with a database called SteelWheels. Also make sure that your MongoDB instance is running, and launch Spoon.
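
If you still need to create that database, a minimal statement (assuming a default MySQL installation and a user with the privileges to create databases) is:

CREATE DATABASE SteelWheels;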

How to do it…

After you have made sure that you have the databases set up, perform the following steps:

  1. Create a new empty transformation.
    1. Set the name for this transformation to Loading data from MongoDB to MySQL.
    2. Save the transformation with the name chapter1-mongodb-to-mysql.
  2. Select Customers from MongoDB using the MongoDB Input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Big Data category folder, find the MongoDB Input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the MongoDB Input step to open the configuration dialog.
    4. Set the Step Name property to Select Customers.
    5. Select the Input options tab. Click on Get DBs and select SteelWheels from the Database select box.
    6. After selecting the database, you can click on the Get collections button and then select the Customers collection from the select box.
    7. As we're just running one MongoDB instance, we'll keep Read preference as primary and will not configure any Tag set specification.
    8. Click on the Query tab. In this section, we'll define the filter condition and the fields that we want to extract.
    9. As we just want the customers from the USA, we'll write the following query in the Query expression (JSON) field: {"address.country": "USA"}.

      Note

      In this recipe, we are not going to cover the MongoDB aggregation framework, so you can ignore those options for now.

    10. Click on the Fields tab. In this tab, we'll define the output fields that we want. The Output single JSON field option is checked by default, which means that each document is extracted as JSON into a single field with the name defined in Name of JSON output field. As we want to define individual fields, uncheck the Output single JSON field option.
    11. Click on the Get fields button and you will get all the fields available from MongoDB. Remove the _id field because it isn't necessary. To delete it, select the row of the _id field and press the Delete key on your keyboard, or right-click on the row and select the Delete selected lines option.
    12. Click on OK to finish the MongoDB input configuration.
  3. Let's configure the output of the MongoDB Customers data in the MySQL database.
    1. On the Design tab, from the Output category folder, find the Table Output step and drag and drop it into the working area in the right-hand-side view.
    2. Connect the MongoDB Input step to the Table output step by creating a hop between them.
    3. Double-click on the step to open the Table Output configuration dialog.
    4. Set Step Name to Customers Output.
    5. Click on the New button next to the Database Connection pulldown. This will open the Database Connection dialog.

      Set Connection Name to SteelWheels. Select the Connection Type as MySQL. Set Host Name to localhost, Database Name to SteelWheels, and Port to 3306. Then, set Username and Password to the credentials of your MySQL user. Your setup should look similar to the following screenshot:

      How to do it…
    6. Test this, and if all is well, click on OK to return to the Table Output step.
  4. Insert this data into a MySQL table using the Table Output step:
    1. Set the Target table field to Customers. This is the name of the MySQL table to insert data into.
    2. As we haven't created a customers table in the MySQL database, we can use a PDI function that will try to generate the required SQL to create the table and its structure. Simply click on the SQL button and it will open the Execute SQL dialog. Here, you will see the SQL that PDI will execute to create the customers table. Click on Execute to send this SQL to MySQL and create the table. Then, click on OK.
    3. Click on OK again to exit the Table Output configuration dialog. The transformation is complete. You can now run it to load data from MongoDB to MySQL.

How it works…

In this transformation, we simply select documents from the Customers collection using the MongoDB Input step, filtering on the documents where the country field is USA. Next, we map these documents to fields in the PDI stream. Lastly, we insert this data into a MySQL table using the Table Output step. In the Fields tab, we use JSONPath to select the correct data from the MongoDB documents (http://goessner.net/articles/JsonPath/). JSONPath is like XPath for JSON documents.
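
For readers who prefer code to dialogs, the following minimal sketch shows the same query written directly against the 2.x MongoDB Java driver used later in this chapter. It is only an illustration we are adding here (the class name FindUsaCustomers is ours), not what the MongoDB Input step executes internally:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class FindUsaCustomers {
  public static void main(String[] args) throws Exception {
    // Connect to the local MongoDB instance used throughout this chapter
    Mongo mongo = new Mongo("localhost", 27017);
    DB db = mongo.getDB("SteelWheels");
    DBCollection customers = db.getCollection("Customers");

    // Same filter as the Query expression (JSON) field: {"address.country": "USA"}
    DBObject query = new BasicDBObject("address.country", "USA");
    DBCursor cursor = customers.find(query);
    try {
      while (cursor.hasNext()) {
        System.out.println(cursor.next());
      }
    } finally {
      cursor.close();
    }
    mongo.close();
  }
}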

Migrating data from files to MongoDB

In this recipe, we will guide you through creating a transformation that loads data from different files in your filesystem and then loads it into a MongoDB collection. We are going to load data from files called orders.csv, customers.xls, and products.xml. Each of these files contains a key that we can use to join the data in PDI before we send it to the MongoDB Output step.

Getting ready

Start Spoon and take a look at the content of the orders.csv, customers.xls, and products.xml files. This will help you understand what the data looks like before you start loading it into MongoDB.

How to do it…

You will need the orders.csv, customers.xls, and products.xml files. These files are available from the Packt Publishing website in case you don't have them. Make sure that MongoDB is up and running, and then you will be able to perform the following steps:

  1. Create a new empty transformation.
    1. Set the transformation name to Migrate data from files to MongoDB.
    2. Save the transformation with the name chapter1-files-to-mongodb.
  2. Select data from the orders.csv file using the CSV file input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the CSV file input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the CSV Input configuration dialog.
    4. Set Step Name to Select Orders.
    5. In the Filename field, click on the Browse button, navigate to the location of the .csv file, and select the orders.csv file.
    6. Set the Delimiter field to a semicolon (;).
    7. Now, let's define our output fields by clicking on the Get Fields button. A Sample size dialog will appear; it is used to analyze the format of the data in the CSV file. Click on OK. Then, click on Close in the Scan results dialog.
    8. Click on OK to finish the configuration of the CSV file input.
  3. Select data from the customers.xls file using the Microsoft Excel Input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the Microsoft Excel Input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the Microsoft Excel Input dialog.
    4. Set Step Name to Select Customers.
    5. On the Files tab, in the File or directory field, click on the Browse button and choose the location of the customers.xls file in your filesystem. After that, click on the Add button to add the file to the list of files to be processed.
    6. Select the Sheets tab. Then, click on the Get sheetname(s)... button. You'll be shown an Enter list dialog. Select Sheet1 and click on the > button to add a sheet to the Your selection list. Finally, click on OK.
    7. Select the Fields tab. Then, click on the Get field from header row... button. This will generate a list of existing fields in the spreadsheet. You will have to make a small change; change the Type field for Customer Number from Number to Integer. You can preview the file data by clicking on the Preview rows button.
    8. Click on OK to finish the configuration of the Select Customers step.
  4. Select data from the products.xml file using the Get data from XML step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the Get data from XML step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the Get data from XML dialog.
    4. Set Step Name to Select Products.
    5. On the File tab, in the File or directory field, click on the Browse button and choose the location of the products.xml file in your filesystem. After that, click on the Add button to add the file to the list of files to be processed.
    6. Select the Content tab. Click on the Get XPath nodes button and select the /products/product option from the list of the Available Paths dialog.
    7. Next, select the Fields tab. Click on the Get fields button and you will get a list of available fields in the XML file. Change the types of the last three fields (stockquantity, buyprice, and MSRP) from Number to Integer. Set the Trim Type to Both for all fields. Click on OK to finish the configuration of the Select Products step.
  5. Now, let's join the data from the three different files.
    1. Select the Design tab in the left-hand-side view.
    2. From the Lookup category folder, find the Stream lookup step. Drag and drop it onto the working area in the right-hand-side view. Double-click on Stream lookup and change the Step name field to Lookup Customers.
    3. We are going to need two lookup steps for this transformation. Drag and drop another Stream Lookup step onto the design view, and set Step Name to Lookup Products.
    4. Create a hop between the Select Orders step and the Lookup Customers step.
    5. Then, create a hop from the Select Customers step to the Lookup Customers step.
    6. Next, create a hop from the Lookup Customers step to the Lookup Products step.
    7. Finally, create a hop from Select Products to the Lookup Products step.
  6. Let's configure the Lookup Customers step. Double-click on the Lookup Customers step and set the Lookup step field to the Select Customers option.
    1. In the Keys section, set the Field and Lookup Field options to Customer Number.
    2. Click on the Get lookup fields button. This will populate the step with all the available fields from the lookup source. Remove the Customer Number field from the list.
    3. Click on OK to finish.
  7. Let's configure the Lookup Products step. The process is similar to that of the Lookup Customers step but with different values. Double-click on the Lookup Products step and set the Lookup step field to the Select Products option.
    1. In the Keys section, set Field to Product Code and the Lookup Field option to Code.
    2. Click on the Get lookup fields button. This will populate the step with all the available fields from the lookup source. Remove the Code field from the list.
    3. Click on OK to finish.
  8. Now that we have the data joined correctly, we can write the data stream to a MongoDB collection.
    1. On the Design tab, from the Big Data category folder, find the MongoDB Output step and drag and drop it into the working area in the right-hand-side view.
    2. Create a hop between the Lookup Products step and the MongoDB Output step.
    3. Double-click on the MongoDB Output step and change the Step name field to Orders Output.
    4. Select the Output options tab. Click on the Get DBs button and select the SteelWheels option for the Database field. Set the Collection field to Orders. Check the Truncate collection option.
    5. Select the Mongo document fields tab. Click on the Get fields button and you will get a list of fields from the previous step.
    6. Configure the Mongo document output as seen in the following screenshot:
      How to do it…
    7. Click on OK.
  9. You can run the transformation and check out MongoDB for the new data. Your transformation should look like the one in this screenshot:
    How to do it…

How it works…

In this transformation, we initially get data from the orders CSV file. This first step populates the primary data stream in PDI. Our other XLS and XML steps also collect data. We then connect these two streams of data to the first stream using the lookup steps and the correct keys. When we finally have all of the data in a single stream, we can load it into the MongoDB collection.
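
Conceptually, the Stream lookup step behaves like an in-memory keyed lookup: it caches the rows of the lookup stream in a map keyed on the lookup field and then enriches each row of the main stream with the matching values. The following Java sketch only illustrates that idea (the row lists are hypothetical placeholders of ours); it is not PDI's actual implementation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamLookupSketch {
  // Enrich each order row with the customer row that shares the same Customer Number
  static void lookupCustomers(List<Map<String, Object>> orderRows,
                              List<Map<String, Object>> customerRows) {
    // Cache the lookup stream, keyed on the join field
    Map<Object, Map<String, Object>> customersByNumber = new HashMap<Object, Map<String, Object>>();
    for (Map<String, Object> customer : customerRows) {
      customersByNumber.put(customer.get("Customer Number"), customer);
    }
    // Walk the main stream and add the looked-up fields to each matching row
    for (Map<String, Object> order : orderRows) {
      Map<String, Object> match = customersByNumber.get(order.get("Customer Number"));
      if (match != null) {
        order.putAll(match);
      }
    }
  }
}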

You can learn more about the Stream lookup step online at:

http://wiki.pentaho.com/display/EAI/Stream+Lookup

Exporting MongoDB data using the aggregation framework

In this recipe, we will explore the use of the MongoDB aggregation framework in the MongoDB Input Step. We will create a simple example to get data from a collection and show you how you can take advantage of the MongoDB aggregation framework to prepare data for the PDI stream.

Getting ready

To get ready for this recipe, you will need to start your ETL development environment Spoon, and make sure that you have the MongoDB server running with the data from the previous recipe.

How to do it…

The following steps introduce the use of the MongoDB aggregation framework:

  1. Create a new empty transformation.
    1. Set the transformation name to PDI using MongoDB Aggregation Framework.
    2. Save the transformation with the name chapter1-using-mongodb-aggregation-framework.
  2. Select data from the Orders collection using the MongoDB Input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Big Data category folder, find the MongoDB Input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the MongoDB Input dialog.
    4. Set the step name to Select 'Baane Mini Imports' Orders.
    5. Select the Input options tab. Click on the Get DBs button and select the SteelWheels option for the Database field. Next, click on Get collections and select the Orders option for the Collection field.
    6. Select the Query tab and then check the Query is aggregation pipeline option. In the text area, write the following aggregation query:
      [
        { $match: { "customer.name": "Baane Mini Imports" } },
        { $group: {
            "_id": { "orderNumber": "$orderNumber", "orderDate": "$orderDate" },
            "totalSpend": { $sum: "$totalPrice" }
        } }
      ]
    7. Uncheck the Output single JSON field option.
    8. Select the Fields tab. Click on the Get Fields button and you will get a list of fields returned by the query. You can preview your data by clicking on the Preview button.
    9. Click on the OK button to finish the configuration of this step.
  3. We want to add a Dummy step to the stream. This step does nothing, but it will allow us to select a step to preview our data. Add the Dummy step from the Flow category to the workspace and name it OUTPUT.
  4. Create a hop between the Select 'Baane Mini Imports' Orders step and the OUTPUT step.
  5. Select the OUTPUT dummy step and preview the data.

How it works…

The MongoDB aggregation framework allows you to define a sequence of operations, or stages, that are executed as a pipeline, much like a Unix command-line pipeline. You can manipulate your collection data using operations such as filtering, grouping, and sorting before the data even enters the PDI stream.

In this case, we are using the MongoDB Input step to execute an aggregation framework query. Technically, this does the same as db.collection.aggregate(). The query that we execute is broken down into two parts. For the first part, we filter the data based on a customer name. In this case, it is Baane Mini Imports. For the second part, we group the data by order number and order date and sum the total price.
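
If you want to experiment with the same pipeline outside of Spoon, here is a minimal sketch of the two stages issued through the 2.x MongoDB Java driver used in the next recipe. Again, this is an illustration we are adding (the class name AggregateOrders is ours), not code taken from the step:

import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class AggregateOrders {
  public static void main(String[] args) throws Exception {
    Mongo mongo = new Mongo("localhost", 27017);
    DBCollection orders = mongo.getDB("SteelWheels").getCollection("Orders");

    // { $match: {"customer.name": "Baane Mini Imports"} }
    DBObject match = new BasicDBObject("$match",
        new BasicDBObject("customer.name", "Baane Mini Imports"));

    // { $group: { _id: {orderNumber, orderDate}, totalSpend: {$sum: "$totalPrice"} } }
    BasicDBObject groupFields = new BasicDBObject("_id",
        new BasicDBObject("orderNumber", "$orderNumber").append("orderDate", "$orderDate"));
    groupFields.append("totalSpend", new BasicDBObject("$sum", "$totalPrice"));
    DBObject group = new BasicDBObject("$group", groupFields);

    AggregationOutput out = orders.aggregate(match, group);
    for (DBObject row : out.results()) {
      System.out.println(row);
    }
    mongo.close();
  }
}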

See also

In the next recipe, we will talk about other ways in which you can aggregate data using MongoDB Map/Reduce.

MongoDB Map/Reduce using the User Defined Java Class step and MongoDB Java Driver

In this recipe, we will use MongoDB Map/Reduce from PDI. Unfortunately, PDI doesn't provide a dedicated step for this MongoDB feature. However, PDI does provide a step called User Defined Java Class (UDJC) that allows you to write Java code to manipulate your data.

We are going to get the total price of all orders for a single client, whose name we will pass to the transformation as a parameter. We will also get a total for all other clients in the collection. In total, we should get two rows back.

Getting ready

To get ready for this recipe, you need to download the MongoDB Java driver. In this case, we are using the mongo-java-driver-2.11.1 version. You can use the latest version, but the code in this recipe may then be slightly out of date. The driver JAR should live in the lib folder of PDI. Then, you just need to start your ETL development environment, Spoon, and make sure you have the MongoDB server started with the data from the last recipe inserted.

How to do it…

In this recipe, we'll write Java code that uses the MongoDB Java driver to connect to the MongoDB database. So, make sure you have the driver in the lib folder of PDI and then perform the following steps:

  1. Create a new empty transformation.
    1. Set the transformation name to MongoDB Map/Reduce.
    2. In the Transformation properties dialog, on the Parameters tab, create a new parameter named CUSTOMER_NAME.
    3. Save the transformation with the name chapter1-mongodb-map-reduce.
  2. From the Job category folder, find the Get Variables step and drag and drop it into the working area in the right-side view.
    1. Double-click on the Get Variables step to open the configuration dialog.
    2. Set the Step name property to Get Customer Name.
    3. Add a row with Name set to customerName, Variable set to ${CUSTOMER_NAME}, and Type set to String.
  3. From the Scripting category folder, find the User Defined Java Class step and drag and drop it into the working area in the right-hand-side view.
  4. Create a hop between the Get Customer Name step and the User Defined Java Class step.
    1. Double-click on the User Defined Java Class step to open the configuration dialog.
    2. Set the Step name field to MapReduce.
    3. In the Class code area, define the Java code that builds a MapReduce command, sends it to MongoDB, and reads back the result:
      import com.mongodb.DB;
      import com.mongodb.DBCollection;
      import com.mongodb.DBObject;
      import com.mongodb.MapReduceCommand;
      import com.mongodb.MapReduceOutput;
      import com.mongodb.Mongo;
      
      private FieldHelper customerNameIn = null;
      
      public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
      {
        Object[] r = getRow();
        if (r == null) {
          setOutputDone();
          return false;
        }
        if (first) {
          first = false;
          // Resolve the incoming field only once, on the first row
          customerNameIn = get(Fields.In, "customerName");
        }
      
        try {
          final Mongo mongo = new Mongo("localhost", 27017);
          final DB db = mongo.getDB("SteelWheels");
          final DBCollection ordersCol = db.getCollection("Orders");
          // Map function: bucket each order under the given customer name or 'Others'
          final String map = "function() { "
              + "var category; "
              + "if ( this.customer.name == '" + customerNameIn.getString(r) + "' ) "
              + "category = '" + customerNameIn.getString(r) + "'; "
              + "else "
              + "category = 'Others'; "
              + "emit(category, {totalPrice: this.totalPrice, count: 1});}";
          // Reduce function: sum the order counts and total prices per category
          final String reduce = "function(key, values) { "
              + "var n = { count: 0, totalPrice: 0}; "
              + "for ( var i = 0; i < values.length; i++ ) {"
              + "n.count += values[i].count; "
              + "n.totalPrice += values[i].totalPrice; "
              + "} "
              + "return n;} ";
          final MapReduceCommand cmd = new MapReduceCommand(ordersCol, map, reduce, null, MapReduceCommand.OutputType.INLINE, null);
          final MapReduceOutput out = ordersCol.mapReduce(cmd);
          get(Fields.Out, "mapReduceJSON").setValue(r, out.toString());
        } catch (Exception e) {
          e.printStackTrace();
          get(Fields.Out, "mapReduceJSON").setValue(r, "");
        }
        r = createOutputRow(r, data.outputRowMeta.size());
        putRow(data.outputRowMeta, r);
        return true;
      }

      Tip

      Downloading the example code

      You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    4. On the Fields tab, set Fieldname to mapReduceJSON and the Type property to String. This will be the field output from the MapReduce command.
    5. Click on OK to finish the configuration.
  5. From the Input category folder, find the Json Input step and drag and drop it into the working area in the right-hand-side view.
  6. Create a hop between the MapReduce step and the Json Input step.
    1. Double-click on the JSON Input step to open the configuration dialog.
    2. Set the Step Name property to Convert JSON.
    3. On the File tab, check the Source is defined in a field? option. Next, select the mapReduceJSON option in the select box of Get source from field.
    4. On the Fields tab, we will map the JSON to Fields in the PDI stream. The definition should be like what is shown in this screenshot:
      How to do it…
    5. Click on OK to finish the configuration.
  7. Now, let's define the fields that we want to see as the output of the transformation. From the Transform category folder, find the Select values step and drag and drop it into the working area in the right-side view.
  8. Create a hop between the Convert JSON step and the Select values step.
    1. Double-click on the Select Values step to open the configuration dialog.
    2. Set the Step Name property to OUTPUT.
    3. On the Select & Alter tab, click on the Get fields to select button. This will populate the table with all the available fields in the stream. Remove the mapReduceJSON field; it isn't necessary anymore, since we have converted it into individual fields in the PDI stream.
    4. Click on OK to finish the configuration.
  9. When you run the transformation, be sure to set the CUSTOMER_NAME parameter in the Run dialog. This value will be picked up by the Get Customer Name step and used to filter data in the map function.
    How to do it…

How it works…

In this example, we executed a transformation that takes CUSTOMER_NAME as a parameter. This value is then sent to the User Defined Java Class step and used in the Java code within. The code in the User Defined Java Class step builds simple map and reduce JavaScript functions that are sent to the MongoDB server.

The output of this step is a single JSON row that needs to be parsed into fields in the PDI stream. To do this, we used the JSON Input step and mapped the JSON string to individual stream fields.

If you want to know more about User Defined Java Class, you can find out more in the documentation at http://wiki.pentaho.com/display/EAI/User+Defined+Java+Class.

There's more…

When we talk about map and reduce functions, it is almost mandatory to talk about Hadoop, an open source software framework for storage and processing of datasets that uses a MapReduce engine.

PDI provides integration with Hadoop using PDI job steps and transformation steps. You can find more documentation about this on the Pentaho website. Personally, I recommend these two tutorials:

Working with jobs and filtering MongoDB data using parameters and variables

In this recipe, we guide you through creating two PDI jobs: one uses variables and the other uses parameters. In a PDI process, jobs orchestrate other jobs and transformations in a coordinated way to realize the main business process. These jobs use the transformation created in the last recipe, but with some changes, as described in this recipe.

So, in this recipe, we are going to create two different jobs, each of which passes a customer name down to a subtransformation. The subtransformation that we will use is a copy of the transformation from the previous recipe.

Getting ready

To get ready for this recipe, you need to start your ETL development environment Spoon, and make sure you have the MongoDB server started with the data inserted in the last recipes.

How to do it…

Let's start using jobs and variables. We can orchestrate the ETL to run in different ways. In this simple case, we are just using the customer name. Perform the following steps:

  1. Let's copy and paste the transformation created in the previous recipe and save it as chapter1-mongodb-map-reduce-writelog.ktr.
  2. Open that transformation using Spoon, and from the Utility category folder, find the Write to log step. Drag and drop it into the working area in the right-side view.
    1. Create a hop between the OUTPUT step and the Write to log step.
    2. Double-click on the Write to Log step to open the configuration dialog.
    3. Set Step Name to MapReduce.
    4. Click on the Get Fields button.
    5. Click on OK to finish the configuration.
  3. Let's create a new empty job.
    1. Click on the New file button from the toolbar menu and select the Job item entry. Alternatively, from the menu bar, go to File | New | Job.
    2. Open the Job properties dialog by pressing Ctrl + J or by right-clicking on the right-hand-side working area and selecting Job settings.
    3. Select the Job tab. Set Job Name to Job Parameters.
    4. Select the Parameters tab and add a Parameter entry with the name as CUSTOMER_NAME. Click on OK.
    5. Save the Job with the name job-parameters.
  4. From the General category folder, find the START, Transformation, and Success steps and drag and drop them into the working area in the right-side view.
    1. Create a hop between the START step and the Transformation step.
    2. Then, create a hop from the Transformation step to the Success step.
    3. Double-click on the Transformation step to open the configuration dialog.
    4. Change the Name of job entry property to MapReduce Transf.
    5. Click on the button next to the Transformation filename field and select the chapter1-mongodb-map-reduce-writelog.ktr file that you copied earlier in your filesystem.
    6. Select the Parameters tab. By default, the Pass all parameters values down to the sub-transformation option is checked, which means our job parameter will be passed to the transformation.
    7. Click on OK to finish.
    8. Run the job, analyze the results, and check the logs on the Logging tab.

Now let's do a quick and simple example using variables:

  1. Copy and paste the chapter1-mongodb-map-reduce-writelog transformation. Save it as chapter1-mongodb-map-reduce-writelog-without-parameter.
  2. Open the transformation with Spoon and remove the parameter from Transformation properties.
  3. Copy and paste the last job. Save it as job-variables.
    1. Open the job with Spoon.
    2. In Job properties, change the job name to Job Variables. From the Parameters tab, remove the CUSTOMER_NAME parameter. Select the parameter, right-click on it and select Delete selected lines, or just press delete on your keyboard.
    3. Click on OK to finish.
  4. From the General category folder, find the Set variables step and drag and drop it into the working area in the right-side view.
    1. Remove the hop between the START step and the MapReduce Transf step.
    2. Create a hop between the START step and the Set variables step.
    3. Then, create a hop between Set Variables and the MapReduce Transf step.
    4. Double-click on the Set Variables step to open the configuration dialog.
    5. Set the Step name property to Set CUSTOMER_NAME.
    6. In the Variables grid, create a new variable named CUSTOMER_NAME. Set its value to an existing client in the database and the Scope type to Valid in the root job.
    7. Click on OK to finish the configuration.
  5. On the MapReduce Transf transformation step, change the file location for the transformation file to the transformation without the parameter.
  6. Run the job and analyze the results, checking the logs in the Logging tab.

How it works…

Most ETL solutions created in Pentaho Data Integration will be sets of jobs and transformations.

Transformations are workflows that orchestrate actions which manipulate data, essentially using input, transformation, and output steps.

Jobs are workflows that orchestrate tasks, where the order of execution can depend on the success or failure of each entry.

Variables and parameters are extremely useful functions that we can use to create dynamic jobs and transformations.


Key benefits

  • Create reports and stunning dashboards with MongoDB data
  • Accelerate data access and maximize productivity with unique features of Pentaho for MongoDB
  • A step-by-step recipe-based guide for making full use of Pentaho suite tools with MongoDB.

Description

MongoDB is an open source, schemaless NoSQL database system. Pentaho, a well-known open source analytics platform, provides high performance, high availability, and easy scalability for large sets of data. The Pentaho features for MongoDB are designed to empower organizations to be more agile and scalable, and they also enable applications to have better flexibility, faster performance, and lower costs. Whether you are brand new to the subject or a seasoned expert, this book will provide you with the skills you need to create turnkey analytic solutions that deliver insight and drive value for your organization. The book will begin by taking you through Pentaho Data Integration and how it works with MongoDB. You will then be taken through the Kettle Thin JDBC Driver, which enables a Java application to interact with a database. This will be followed by exploring a MongoDB collection using Pentaho Instaview and creating reports with MongoDB as a data source using Pentaho Report Designer. The book will then teach you how to explore and visualize your data in the Pentaho BI Server using Pentaho Analyzer. You will then learn how to create advanced dashboards with your data. The book concludes by highlighting contributions of the Pentaho Community.

Who is this book for?

This book is intended for data architects and developers with a basic level of knowledge of MongoDB. Familiarity with Pentaho is not expected.

What you will learn

  • Extract, load, and transform data from MongoDB collections to other datasources
  • Design Pentaho Reports using different types of connections for MongoDB
  • Create an OLAP Mondrian schema for MongoDB
  • Explore your MongoDB data using Pentaho Analyzer
  • Utilize the drag and drop web interface to create dashboards
  • Use Kettle Thin JDBC with MongoDB for analysis
  • Integrate advanced dashboards with MongoDB using different types of connections
  • Publish and run a report on Pentaho BI server using a web interface
Product Details

Publication date : Dec 29, 2015
Length : 218 pages
Edition : 1st
Language : English
ISBN-13 : 9781783553273


Table of Contents

9 Chapters
1. PDI and MongoDB
2. The Thin Kettle JDBC Driver
3. Pentaho Instaview
4. A MongoDB OLAP Schema
5. Pentaho Reporting
6. The Pentaho BI Server
7. Pentaho Dashboards
8. Pentaho Community Contributions
Index

FAQs

What is the delivery time and cost of the print book?

Shipping Details

USA:


Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge?

Customs duties are charges levied on goods when they cross international borders. It is a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

The orders shipped to the countries that are listed under EU27 will not bear customs charges; they are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

A custom duty or localized taxes may be applicable on shipments outside of the EU27 and would be charged by the recipient country. These duties should be paid by the customer and are not included in the shipping charges on the order.

How do I know my custom duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. when Packt Publishing agrees to replace your printed book because it arrives damaged or with a material defect); otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, or with a book material defect, contact our Customer Relations Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal