Almost there! This step has the following five tasks:
- Selecting the data.
- Cleaning the data.
- Constructing the data.
- Integrating the data.
- Formatting the data.
These tasks are relatively self-explanatory. The goal is to get the data ready to feed into the algorithms. This includes merging, feature engineering, and transformations. If imputation is needed, it happens here as well. Additionally, with R, pay attention to how the outcome needs to be labeled. If your outcome/response variable is coded Yes/No, it may not work in some packages, and you will need to transform it or create a new variable coded 1/0. At this point, you should also split your data into the relevant subsets, if applicable: train, test, and validate. This step can be an unmitigated burden, but most experienced people will tell you that it is where you can separate yourself from your peers. With this, let's move on to the payoff, where you earn your money.
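Before moving on, here is a minimal sketch of three of the preparation tasks described above: recoding a Yes/No outcome to 1/0, splitting the data, and a simple imputation. It uses only base R. The data frame `df`, its columns `age` and `outcome`, the 70/30 split ratio, and the choice of median imputation are all assumptions made for illustration, not a prescribed method.

```r
# Hypothetical toy data; the column names are illustrative only
set.seed(123)  # makes the random split reproducible
df <- data.frame(
  age = c(34, 51, NA, 45, 29, 62, 38, NA, 55, 41),
  outcome = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Create a new 1/0 variable from the Yes/No outcome
df$outcome01 <- ifelse(df$outcome == "Yes", 1, 0)

# Split into train (70%) and test (30%) by sampling row indices
train_idx <- sample(seq_len(nrow(df)), size = round(0.7 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Impute missing ages with the median computed on the training set only,
# then apply that same value to the test set
age_median <- median(train$age, na.rm = TRUE)
train$age[is.na(train$age)] <- age_median
test$age[is.na(test$age)] <- age_median
```

The imputation value is taken from the training rows alone so that no information from the test set leaks into the prepared training data. Some packages expect a factor outcome rather than a numeric 1/0 variable, so check the documentation of the modeling function you plan to use.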