Executing dimensionality reduction
In the Explaining feature engineering section of Chapter 2, Detecting Spam Emails, we defined a feature of an ML problem as an attribute or characteristic that describes it. Collecting many features together creates a vector of attributes, and each sample in a dataset is a unique combination of vector values. Consequently, adding more features to a specific problem increases the vector's dimensions. It seems logical that having more features should provide a better description of the underlying data and ease the work of any ML algorithm that follows. Unfortunately, there are other implications.
In our discussion of Support Vector Machines (SVMs) in Chapter 2, Detecting Spam Emails, we saw that each sample is a point in a high-dimensional space. Similar samples lie closer together than dissimilar ones, and using the cosine similarity or Euclidean distance metrics, we can measure their proximity. If we expand the dimensions...
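As a quick illustration of the two proximity metrics mentioned above, the following sketch computes the Euclidean distance and cosine similarity between two hypothetical four-dimensional feature vectors (the vectors `a` and `b` are made-up examples, not data from the book):

```python
import numpy as np

# Two hypothetical samples, each a point in a 4-dimensional feature space.
a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 1.0, 3.0])

# Euclidean distance: the straight-line distance between the two points;
# smaller values mean the samples are closer in the feature space.
euclidean = np.linalg.norm(a - b)

# Cosine similarity: the cosine of the angle between the two vectors;
# 1.0 means they point in the same direction, 0.0 means they are orthogonal.
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine similarity:  {cosine:.3f}")
```

Note that the two metrics answer slightly different questions: Euclidean distance is sensitive to vector magnitude, while cosine similarity only compares orientation, which is why the latter is often preferred for sparse, high-dimensional data such as text.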