The more disciplined we are in handling our data, the better our results are likely to be in the end. The first step in handling the data is known as data preprocessing, and it comes in (at least) three different flavors:
- Data formatting: The data may not be in a format that is suitable for us to work with; for example, the data might be provided in a proprietary file format, which our favorite machine learning algorithm does not understand.
- Data cleaning: The data may contain invalid or missing entries, which need to be cleaned up or removed.
- Data sampling: The data may be far too large for our specific purpose, forcing us to sample the data intelligently.
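To make the three flavors concrete, here is a minimal sketch using only the Python standard library. The raw data, the column names (`height`, `weight`), and the helper functions `format_data`, `clean_data`, and `sample_data` are all hypothetical, made up for illustration. Uniform random sampling stands in for "intelligent" sampling here; a real project might need something more careful, such as sampling that preserves class proportions.

```python
import csv
import io
import random

# Hypothetical raw export: semicolon-separated text that contains
# one missing entry and one invalid entry.
RAW = """height;weight
1.80;75.0
1.65;
n/a;80.2
1.72;68.5
1.90;92.1
"""


def format_data(raw_text):
    """Data formatting: parse the raw text into rows of strings."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    return [(row["height"], row["weight"]) for row in reader]


def clean_data(rows):
    """Data cleaning: drop rows with missing or non-numeric entries."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append(tuple(float(value) for value in row))
        except ValueError:
            # float("") and float("n/a") both raise ValueError
            continue
    return cleaned


def sample_data(rows, n_samples, seed=0):
    """Data sampling: draw a random subset without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(rows, min(n_samples, len(rows)))


rows = format_data(RAW)                      # 5 rows of strings
cleaned = clean_data(rows)                   # 3 rows of floats survive
subset = sample_data(cleaned, n_samples=2)   # 2 of the 3 clean rows
```

Keeping each step in its own function makes it easy to swap one out, for example to replace dropped rows with imputed values, without touching the others.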
Once the data has been preprocessed, we are ready for the actual feature engineering, that is, transforming the preprocessed data to fit our specific machine learning algorithm. This step usually involves one...