Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Pentaho Analytics for MongoDB Cookbook

You're reading from   Pentaho Analytics for MongoDB Cookbook Over 50 recipes to learn how to use Pentaho Analytics and MongoDB to create powerful analysis and reporting solutions

Arrow left icon
Product type Paperback
Published in Dec 2015
Publisher
ISBN-13 9781783553273
Length 218 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Harris Ward Harris Ward
Author Profile Icon Harris Ward
Harris Ward
Joel Andre Latino Joel Andre Latino
Author Profile Icon Joel Andre Latino
Joel Andre Latino
Arrow right icon
View More author details
Toc

Table of Contents (10) Chapters Close

Preface 1. PDI and MongoDB FREE CHAPTER 2. The Thin Kettle JDBC Driver 3. Pentaho Instaview 4. A MongoDB OLAP Schema 5. Pentaho Reporting 6. The Pentaho BI Server 7. Pentaho Dashboards 8. Pentaho Community Contributions Index

Migrating data from files to MongoDB

In this recipe, we will guide you through creating a transformation that loads data from different files in your filesystem, and then load them into a MongoDB Collection. We are going to load data from files called orders.csv, customers.xls, and products.xml. Each of these files contains a key that we can use to join data in PDI before we send it to the MongoDB Output step.

Getting ready

Start Spoon and take a look at the content of the orders.csv, customers.xls, and products.xml files. This will help you understand what the data looks like before you start loading it into MongoDB.

How to do it…

You will need the orders.csv, customers.xls, and products.xml files. These files will be available at the Packt Publishing website, just in case you don't have them. Make sure that MongoDB is up and running, and then you will be able to perform to the following steps:

  1. Create a new empty transformation.
    1. Set the transformation name to Migrate data from files to MongoDB.
    2. Save the transformation with the name chapter1-files-to-mongodb.
  2. Select data from the orders.csv file using the CSV file input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the CSV file input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the CSV Input configuration dialog.
    4. Set Step Name to Select Orders.
    5. In the Filename field, click on the Browse button, navigate to the location of the .csv file, and select the order.csv file.
    6. Set the Delimiter field to a semicolon (;).
    7. Now, let's define our output fields by clicking on the Get Fields button. A Sample size dialog will appear; it is used to analyze the format data in the CSV file. Click on OK. Then, click on Close in Scan results.
    8. Click on OK to finish the configuration of the CSV file input.
  3. Select data from the customers.xls file using the Microsoft Excel Input step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the Microsoft Excel Input step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the Microsoft Excel Input dialog.
    4. Set Step Name to Select Customers.
    5. On the Files tab, in the File or directory field, click on the Browse button and choose the location of the customers.xls file in your filesystem. After that, click on the Add button to add the file to the list of files to be processed.
    6. Select the Sheets tab. Then, click on the Get sheetname(s)... button. You'll be shown an Enter list dialog. Select Sheet1 and click on the > button to add a sheet to the Your selection list. Finally, click on OK.
    7. Select the Fields tab. Then, click on the Get field from header row... button. This will generate a list of existing fields in the spreadsheet. You will have to make a small change; change the Type field for Customer Number from Number to Integer. You can preview the file data by clicking on the Preview rows button.
    8. Click on OK to finish the configuration of the Select Customers step.
  4. Select data from the products.xml file using the Get data from XML step.
    1. Select the Design tab in the left-hand-side view.
    2. From the Input category folder, find the Get data from XML step and drag and drop it into the working area in the right-hand-side view.
    3. Double-click on the step to open the Get data from XML dialog.
    4. Set Step Name to Select Products.
    5. On the File tab, in the File or directory field, click on the Browse button and choose the location of the products.xml file in your filesystem. After that, click on the Add button to add the file to the list of files to be processed.
    6. Select the Content tab. Click on the Get XPath nodes button and select the /products/product option from the list of the Available Paths dialog.
    7. Next, select the Fields tab. Click on the Get fields button and you will get a list of available fields in the XML file. Change the types of the last three fields (stockquantity, buyprice, and MSRP) from Number to Integer. Set the Trim Type to Both for all fields.
  5. Now, let's join the data from the three different files.
    1. Select the Design tab in the left-hand-side view.
    2. From the Lookup category folder, find the Stream lookup step. Drag and drop it onto the working area in the right-hand-side view. Double-click on Stream lookup and change the Step name field to Lookup Customers.
    3. We are going to need two lookup steps for this transformation. Drag and drop another Stream Lookup step onto the design view, and set Step Name to Lookup Products.
    4. Create a hop between the Select Orders step and the Lookup Customers step.
    5. Then, create a hop from the Select Customers step to the Lookup Customers step.
    6. Next, create a hop from the Lookup Customers step to the Lookup Products step.
    7. Finally, create a hop from Select Products to the Lookup Products step.
  6. Let's configure the Lookup Customers step. Double-click on the Lookup Customers step and set the Lookup step field to the Select Customers option.
    1. In the Keys section, set the Field and Lookup Field options to Customer Number.
    2. Click on the Get lookup fields button. This will populate the step with all the available fields from the lookup source. Remove Customer Number from the field from the list.
    3. Click on OK to finish.
  7. Let's configure the Lookup Products step. The process is similar to that of the Lookup Customers step but with different values. Double-click on the Lookup Products step and set the Lookup step field to the Select Products option.
    1. In the Keys section, set Field to Product Code and the LookupField option to Code.
    2. Click on the Get lookup fields button. This will populate the step with all the available fields from the lookup source. Remove Code from the field in the list.
    3. Click on OK to finish.
  8. Now that we have the data joined correctly, we can write the data stream to a MongoDB collection.
    1. On the Design tab, from the Big Data category folder, find the MongoDB Output step and drag and drop it into the working area in the right-hand-side view.
    2. Create a hop between the Lookup Products step and the MongoDB Output step.
    3. Double-click on the MongoDB Output step and change the Step name field to Orders Output.
    4. Select the Output options tab. Click on the Get DBs buttons and select the SteelWheels option for the Database field. Set the Collection field to Orders. Check the Truncate collection option.
    5. Select the Mongo document fields tab. Click on the Get fields button and you will get a list of fields from the previous step.
    6. Configure the Mongo document output as seen in the following screenshot:
      How to do it…
    7. Click on OK.
  9. You can run the transformation and check out MongoDB for the new data. Your transformation should look like the one in this screenshot:
    How to do it…

How it works…

In this transformation, we initially get data from the Orders CSV. This first step populates the primary data stream in PDI. Our other XLS and XML steps also collect data. We then connect these two streams of data to the first stream using the Lookup steps and the correct keys. When we finally have all of the data in the single stream, we can load it into the MongoDB collection.

You can learn more about the Stream lookup step online at:

http://wiki.pentaho.com/display/EAI/Stream+Lookup

You have been reading a chapter from
Pentaho Analytics for MongoDB Cookbook
Published in: Dec 2015
Publisher:
ISBN-13: 9781783553273
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image