Feature selection is the first (and sometimes the most important) step in a machine learning pipeline. Not all of a dataset's features are useful for our purposes, and some of them may be expressed using different notations, so it's often necessary to preprocess the dataset before any further operation.
We saw how to split the data into training and test sets with a random shuffle, and how to manage missing elements. Another important section covered the techniques used to manage categorical data and labels, which are very common when a feature assumes only a discrete set of values.
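As a minimal sketch of those steps (using scikit-learn and an invented toy dataset), the snippet below imputes a missing value with the column mean, encodes string labels as integers, and performs a shuffled train/test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Toy dataset (invented for illustration): one numeric feature with a
# missing entry, plus a categorical label column.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array(['spam', 'ham', 'spam', 'ham', 'ham', 'spam'])

# Replace missing values with the per-column mean.
X = SimpleImputer(strategy='mean').fit_transform(X)

# Map the discrete string labels to integers (ham -> 0, spam -> 1).
y_encoded = LabelEncoder().fit_transform(y)

# Random shuffle and split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (4, 1) (2, 1)
```

The fixed `random_state` makes the shuffle reproducible, which is useful when comparing different models on the same split.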
Then, we analyzed the problem of dimensionality. Some datasets contain many features that are correlated with each other, so they provide no new information but increase the computational complexity and reduce overall performance. PCA is a method for selecting a subset of components that retain most of the variance of the original data, reducing dimensionality while limiting the loss of information.
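As an illustrative sketch (with invented data), the example below builds two strongly correlated features and uses scikit-learn's PCA to project them onto a single component, which captures nearly all of the variance precisely because the features are redundant:

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented data: the second feature is (almost) a linear copy of the
# first, so the two features are strongly correlated.
rng = np.random.RandomState(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.05, size=100)])

# Keep a single principal component; it should account for nearly all
# of the variance because the second feature adds almost nothing new.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 1)
print(pca.explained_variance_ratio_[0])  # close to 1.0
```

Inspecting `explained_variance_ratio_` is a practical way to decide how many components to keep: components beyond the point where the cumulative ratio flattens contribute little information.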