Aggregating data for modeling
As you might remember from the previous chapters, machine learning algorithms expect the dataset to be in a specific form, and it needs to be in a single table. The data needed for this table, however, could reside in multiple sources. Hence, one of the first things you need to do is to aggregate data from these sources into one table. This is often done using SQL or Python. Recently, DataRobot has added the capability to add multiple datasets to a project and then aggregate this data within DataRobot. Please note that some data cleansing operations might still have to be done outside of DataRobot, so if you want to use DataRobot's aggregation capabilities, you need to perform those cleansing operations before bringing the data into DataRobot. We cover data cleansing in the following section. If you choose to aggregate data inside DataRobot, make sure to do this at the very start of the project, as shown in Figure 4.4.
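For readers who aggregate outside DataRobot, the following is a minimal sketch of the SQL-plus-Python route mentioned above. It assumes two hypothetical source tables, customers and transactions, which are not part of this chapter's dataset, and uses an in-memory SQLite database from Python's standard library purely for illustration. The idea is to summarize the many-rows-per-customer table and join the result onto the one-row-per-customer table, so that the output has a single row per modeling unit:

```python
import sqlite3


def build_modeling_table(conn):
    """Return one row per customer: profile columns plus aggregated transactions."""
    query = """
        SELECT c.customer_id,
               c.region,
               COUNT(t.amount)            AS n_transactions,
               COALESCE(SUM(t.amount), 0) AS total_amount
        FROM customers AS c
        LEFT JOIN transactions AS t
               ON t.customer_id = c.customer_id
        GROUP BY c.customer_id, c.region
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()


# Hypothetical example data; in practice these tables would come from
# your own databases or files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "east")],
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 10.0)],
)
conn.commit()

modeling_rows = build_modeling_table(conn)
for row in modeling_rows:
    print(row)
```

The LEFT JOIN keeps customers that have no transactions, and COALESCE turns their missing total into 0 rather than a null value. Whether 0 is the right fill value depends on your problem, so treat that choice as part of the data cleansing discussed in the following section.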