Exporting MongoDB data using the aggregation framework
In this recipe, we will explore the use of the MongoDB aggregation framework in the MongoDB Input Step. We will create a simple example to get data from a collection and show you how you can take advantage of the MongoDB aggregation framework to prepare data for the PDI stream.
Getting ready
To get ready for this recipe, you will need to start your ETL development environment Spoon, and make sure that you have the MongoDB server running with the data from the previous recipe.
How to do it…
The following steps introduce the use of the MongoDB aggregation framework:
- Create a new empty transformation.
- Set the transformation to PDI using MongoDB Aggregation Framework.
- Set the name for this transformation to
chapter1-using-mongodb-aggregation-framework
.
- Select data from the Orders collection using the MongoDB Input step.
- Select the Design tab in the left-hand-side view.
- From the Big Data category folder, find the MongoDB Input step and drag and drop it into the working area in the right-hand-side view.
- Double-click on the step to open the MongoDB Input dialog.
- Set the step name to Select 'Baane Mini Imports' Orders.
- Select the Input options tab. Click on the Get DBs button and select the SteelWheels option for the Database field. Next, click on Get collections and select the Orders option for the Collection field.
- Select the Query tab and then check the Query is aggregation pipeline option. In the text area, write the following aggregation query:
[ { $match: {"customer.name" : "Baane Mini Imports"} }, { $group: {"_id" : {"orderNumber": "$orderNumber", "orderDate" : "$orderDate"}, "totalSpend": { $sum: "$totalPrice"} } } ]
- Uncheck the Output single JSON field option.
- Select the Fields tab. Click on the Get Fields button and you will get a list of fields returned by the query. You can preview your data by clicking on the Preview button.
- Click on the OK button to finish the configuration of this step.
- We want to add a Dummy step to the stream. This step does nothing, but it will allow us to select a step to preview our data. Add the Dummy step from the Flow category to the workspace and name it OUTPUT.
- Create a hop between the Select 'Baane Mini Imports' Orders step and the OUTPUT step.
- Select the OUTPUT dummy step and preview the data.
How it works…
The MongoDB aggregation framework allows you to define a sequence of operations or stages that is executed in pipeline much like the Unix command-line pipeline. You can manipulate your collection data using operations such as filtering, grouping, and sorting before the data even enters the PDI stream.
In this case, we are using the MongoDB Input step to execute an aggregation framework query. Technically, this does the same as db.collection.aggregate(). The query that we execute is broken down into two parts. For the first part, we filter the data based on a customer name. In this case, it is Baane Mini Imports. For the second part, we group the data by order number and order date and sum the total price.
See also
In the next recipe, we will talk about other ways in which you can aggregate data using MongoDB Map/Reduce.