Summary
In this chapter, we learned how to perform several operations on a data frame, including scaling, standardizing, and normalizing. We also covered the sorting, ranking, and joining operations, along with their implementations in R. We discussed the need for pre-processing the data, and we identified and handled outliers, missing values, and duplicate values.
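As a brief reminder of the rescaling operations, the following is a minimal sketch in base R. It assumes that "standardizing" means converting each column to z-scores and that "normalizing" means min-max rescaling to the range [0, 1]; the data frame, its column names, and the min_max helper are hypothetical and used only for illustration.

```r
# Hypothetical toy data frame; the column names are illustrative only.
df <- data.frame(height = c(150, 160, 170, 180, 190),
                 weight = c(50, 60, 70, 80, 90))

# Standardize: subtract each column's mean and divide by its standard deviation.
standardized <- as.data.frame(scale(df))

# Min-max normalize: rescale each column so its values lie in [0, 1].
# min_max is a hypothetical helper, not a function from any package.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- as.data.frame(lapply(df, min_max))
```

After this runs, each column of standardized has mean 0 and standard deviation 1, and each column of normalized runs from 0 to 1.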
Next, we moved on to the sampling of data. It is important for the data to contain a reasonable sample of each class that is to be predicted, because an imbalanced dataset can degrade our predictions. We can therefore use undersampling, oversampling, ROSE, or SMOTE to ensure that the dataset is representative of all the classes that we want to predict. This can be done using the MICE, rpart, ROSE, and caret packages.
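To recall the idea behind the two simplest techniques, here is a minimal sketch of random undersampling and random oversampling in base R. It is not the package-based approach named above; the data, the column names x and class, and the 90/10 class split are all hypothetical.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical imbalanced data: 90 rows of class "no" and 10 rows of class "yes".
df <- data.frame(x = rnorm(100),
                 class = factor(rep(c("no", "yes"), times = c(90, 10))))

minority <- df[df$class == "yes", ]
majority <- df[df$class == "no", ]

# Undersampling: keep every minority row and draw an equally sized
# random subset of the majority rows (without replacement).
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])

# Oversampling: keep every majority row and resample the minority rows
# with replacement until both classes have the same size.
over <- rbind(majority,
              minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

table(under$class)  # class counts after undersampling
table(over$class)   # class counts after oversampling
```

Undersampling discards majority rows, while oversampling repeats minority rows; ROSE and SMOTE instead generate synthetic observations.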
In the next chapter, we will cover feature engineering in detail, where we will focus on extracting features to create models.