Real-world datasets are highly varied: variables can be textual, numerical, or categorical, and observations can be missing, false, or wrong (outliers). To perform a proper data analysis, we need to understand how to correctly parse a dataset, clean it, and build an output matrix structured optimally for regression. To extract knowledge from the data, the reader must be able to create an observation matrix using different data analysis and cleaning techniques.
In the previous chapters, we analyzed how to perform simple and multiple regression analysis, as well as how to carry out multiple and multinomial logistic regression. In all the cases analyzed, however, the models give correct indications only if the data is processed in advance to eliminate any anomalies.
In this chapter, we will explore data preparation techniques to obtain a high-performing...