Data preprocessing and feature engineering
Data preprocessing and feature engineering play a foundational role in machine learning. It’s like laying the groundwork for a building: the stronger and better prepared the foundation, the better the final structure (the machine learning model) will be. Here is a breakdown of how they contribute:
- Preprocessing prepares data for efficient learning: Raw data from various sources often contains inconsistencies, errors, and irrelevant information. Preprocessing cleans, organizes, and transforms the data into a format suitable for the chosen machine learning algorithm. This allows the algorithm to understand the data more easily and efficiently, leading to better model performance.
- Preprocessing helps improve model accuracy and generalizability: By handling missing values, outliers, and inconsistencies, preprocessing reduces noise in data. This enables a model to focus on the true patterns and relationships within the data, leading to more accurate predictions and better generalization on unseen data.
- Feature engineering provides meaningful input variables: Raw data is transformed and manipulated to create new features or to select the most relevant ones. Well-designed features can improve both model performance and the insight we gain from it.
Overall, data preprocessing and feature engineering are essential steps in the machine learning workflow. By dedicating time and effort to proper preprocessing and feature engineering, you lay the foundation for building reliable, accurate, and generalizable machine learning models. We will cover the preprocessing phase first in this section.
Preprocessing and exploration
When we learn, we require high-quality learning material. We can’t learn from gibberish, so we automatically ignore anything that doesn’t make sense. A machine learning system isn’t able to recognize gibberish, so we need to help it by cleaning the input data. It’s often claimed that cleaning the data forms a large part of machine learning. Sometimes, the cleaning is already done for us, but you shouldn’t count on it.
To decide how to clean data, we need to be familiar with it. Some projects try to explore data automatically and do something intelligent, such as producing a report. Unfortunately, there is no solid general-purpose solution yet, so you need to do some of this work yourself.
We can do two things, which aren’t mutually exclusive: first, scan the data, and second, visualize the data. This also depends on the type of data we’re dealing with—whether we have a grid of numbers, images, audio, text, or something else.
Ultimately, a grid of numbers is the most convenient form, and we will always work toward having numerical features. For the rest of this section, let’s assume that we’re dealing with a table of numbers.
We want to know whether features have missing values, how the values are distributed, and what type of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical: pertaining to a category, such as continents (Africa, Asia, Europe, South America, North America, and so on). Categorical variables can also be ordered, for instance, high, medium, and low. Features can also be quantitative, for example, the temperature in degrees or the price in dollars. Now, let’s dive into how we can cope with each of these situations.
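As a minimal sketch of this first scan, assuming the table has been loaded into a pandas DataFrame (the file name and the continent column are hypothetical placeholders), we can inspect feature types, value distributions, and missing counts:

```python
# A quick first scan of a tabular dataset with pandas; "data.csv" and the
# "continent" column are hypothetical placeholders for your own dataset.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.dtypes)                        # feature types: numerical, object (string), etc.
print(df.describe())                    # summary statistics of numerical features
print(df.isna().sum())                  # number of missing values per feature
print(df["continent"].value_counts())   # frequency of each value of a categorical feature
```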
Dealing with missing values
Quite often, we miss values for certain features. This could happen for various reasons. It can be inconvenient, expensive, or even impossible to always have a value. Maybe we weren’t able to measure a certain quantity in the past because we didn’t have the right equipment or just didn’t know that the feature was relevant. However, we’re stuck with missing values from the past.
Sometimes, it’s easy to figure out that we’re missing values, and we can discover this just by scanning the data or counting the number of values we have for a feature and comparing this figure with the number of values we expect, based on the number of rows. Certain systems encode missing values with, for example, values such as 999,999 or -1. This makes sense if the valid values are much smaller than 999,999. If you’re lucky, you’ll have information about the features provided by whoever created the data in the form of a data dictionary or metadata.
Once we know that we’re missing values, the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can’t deal with missing values, and the program will just refuse to continue. In other circumstances, ignoring missing values will lead to inaccurate results. The second solution is to substitute missing values with a fixed value—this is called imputing. We can impute the arithmetic mean, median, or mode of the valid values of a certain feature. Ideally, we will have some prior knowledge of a variable that is somewhat reliable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values, given a date. We will talk about dealing with missing data in detail in Chapter 10, Machine Learning Best Practices. Similarly, techniques in the following sections will be discussed and employed in later chapters, just in case you feel uncertain about how they can be used.
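As a brief preview (the full treatment is deferred to Chapter 10), here is a minimal sketch of imputation with scikit-learn’s SimpleImputer; the small feature matrix is made up for illustration:

```python
# A minimal sketch of imputing missing values with the column mean;
# np.nan marks the missing entries in this toy feature matrix.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[20.0, 1.0],
              [np.nan, 3.0],
              [30.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # "median" and "most_frequent" are also available
print(imputer.fit_transform(X))
# [[20.  1.]
#  [25.  3.]
#  [30.  2.]]
```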
Label encoding
Humans are able to deal with various types of values. Machine learning algorithms (with some exceptions) require numerical values. If we offer a string such as Ivan, unless we’re using specialized software, the program won’t know what to do. In this example, we’re dealing with a categorical feature—names, probably. We can consider each unique value to be a label. (In this particular example, we also need to decide what to do with the case—is Ivan the same as ivan?) We can then replace each label with an integer—label encoding.
The following example shows how label encoding works:
| Label | Encoded Label |
|---|---|
| Africa | 1 |
| Asia | 2 |
| Europe | 3 |
| South America | 4 |
| North America | 5 |
| Other | 6 |

Table 1.3: Example of label encoding
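As a minimal sketch, a mapping like this can be produced with scikit-learn’s LabelEncoder. Note that LabelEncoder assigns 0-based integers in alphabetical order, so the exact numbers differ from Table 1.3, but the idea is the same:

```python
# A minimal sketch of label encoding with scikit-learn's LabelEncoder;
# it assigns 0-based integers in alphabetical order of the labels.
from sklearn.preprocessing import LabelEncoder

continents = ["Africa", "Asia", "Europe", "South America", "North America", "Other"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(continents)

print(dict(zip(continents, encoded.tolist())))
# {'Africa': 0, 'Asia': 1, 'Europe': 2, 'South America': 5, 'North America': 3, 'Other': 4}
```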
This approach can be problematic in some cases because the learner may conclude that there is an order (unless an order is indeed expected, for example, bad=0, ok=1, good=2, and excellent=3). In the preceding mapping table, Asia and North America differ by 4 after encoding, which is counterintuitive, as it’s hard to quantify the difference between two continents. One-hot encoding, covered in the next section, takes an alternative approach.
One-hot encoding
The one-of-K, or one-hot, encoding scheme uses dummy variables to encode categorical features. It was originally applied to digital circuits. The dummy variables have binary values, like bits, so they take the value zero or one (equivalent to false or true). For instance, if we want to encode continents, we will have dummy variables such as is_asia, which is true if the continent is Asia and false otherwise. In general, we need as many dummy variables as there are unique values minus one (or sometimes the exact number of unique values). With the minus-one scheme, we can determine the remaining label automatically because the dummy variables are exclusive: if they all have a false value, the correct label is the one for which we don’t have a dummy variable. The following table illustrates the encoding for continents:
| Continent | Is_africa | Is_asia | Is_europe | Is_sam | Is_nam |
|---|---|---|---|---|---|
| Africa | 1 | 0 | 0 | 0 | 0 |
| Asia | 0 | 1 | 0 | 0 | 0 |
| Europe | 0 | 0 | 1 | 0 | 0 |
| South America | 0 | 0 | 0 | 1 | 0 |
| North America | 0 | 0 | 0 | 0 | 1 |
| Other | 0 | 0 | 0 | 0 | 0 |
Table 1.4: Example of one-hot encoding
The encoding produces a matrix (a grid of numbers) with lots of zeros (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the scipy package, which we will discuss later in this chapter.
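As a minimal sketch, here is how scikit-learn’s OneHotEncoder produces such a sparse matrix. By default it creates one dummy column per unique category (rather than the minus-one scheme discussed above; the drop parameter can remove one column if desired):

```python
# A minimal sketch of one-hot encoding with scikit-learn; the default
# output is a SciPy sparse matrix, which stores only the non-zero entries.
from sklearn.preprocessing import OneHotEncoder

continents = [["Africa"], ["Asia"], ["Europe"],
              ["South America"], ["North America"], ["Other"]]

encoder = OneHotEncoder()            # one dummy column per unique category
encoded = encoder.fit_transform(continents)

print(type(encoded))                 # a scipy.sparse matrix
print(encoded.toarray())             # dense view: a single 1 per row, 0 elsewhere
print(encoder.categories_)           # column order chosen by the encoder (alphabetical)
```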
Dense embedding
While one-hot encoding is a simple and sparse representation of categorical features, dense embedding provides a compact, continuous representation that captures semantic relationships based on the co-occurrence patterns in data. For example, using dense embedding, the continent categories might be represented by 3-dimensional continuous vectors like:
- Africa: [0.9, -0.2, 0.5]
- Asia: [-0.1, 0.8, 0.6]
- Europe: [0.6, 0.3, -0.7]
- South America: [0.5, 0.2, 0.1]
- North America: [0.4, 0.3, 0.2]
- Other: [-0.8, -0.5, 0.4]
In this example, you may notice that the vectors of South America and North America are closer together than those of Africa and Asia; dense embedding can capture such similarities between categories. In another example, you might see the vectors of Europe and North America lie close together, reflecting cultural similarity.
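As a quick check of this intuition, the following NumPy sketch measures the distances between the toy vectors above (the vectors are illustrative, not learned from real data):

```python
# Measuring how close the illustrative continent embeddings are to each other.
import numpy as np

embeddings = {
    "Africa": np.array([0.9, -0.2, 0.5]),
    "Asia": np.array([-0.1, 0.8, 0.6]),
    "Europe": np.array([0.6, 0.3, -0.7]),
    "South America": np.array([0.5, 0.2, 0.1]),
    "North America": np.array([0.4, 0.3, 0.2]),
    "Other": np.array([-0.8, -0.5, 0.4]),
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return np.linalg.norm(embeddings[a] - embeddings[b])

print(distance("South America", "North America"))  # ~0.17, very close
print(distance("Africa", "Asia"))                   # ~1.42, far apart
```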
We will explore dense embedding further in Chapter 7, Mining the 20 Newsgroups Dataset with Text Analysis Techniques.
Scaling
Values of different features can differ by orders of magnitude. Sometimes, this can mean that the larger values dominate the smaller values. This depends on the algorithm we use. For certain algorithms to work properly, we’re required to scale data.
There are several common strategies that we can apply, as illustrated in the sketch after this list:
- Standardization subtracts the mean of a feature and divides by the standard deviation. If the feature values are normally distributed, the standardized feature follows a Gaussian centered around zero with a variance of one.
- If the feature values aren’t normally distributed, we can remove the median and divide by the interquartile range. The interquartile range is the range between the first and third quartile (or 25th and 75th percentile).
- Scaling values to a fixed range, commonly between zero and one, is another popular choice.
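Here is a minimal sketch of the three strategies using scikit-learn’s scalers; the toy feature matrix is made up for illustration:

```python
# Standardization, robust scaling (median/IQR), and min-max scaling to [0, 1].
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier

print(StandardScaler().fit_transform(X))  # zero mean, unit variance
print(RobustScaler().fit_transform(X))    # removes median, divides by interquartile range
print(MinMaxScaler().fit_transform(X))    # rescales to the [0, 1] range
```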
We will use scaling in many projects throughout the book.
An advanced version of data preprocessing is usually called feature engineering. We will cover that next.
Feature engineering
Feature engineering is the process of creating or improving features. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation; however, there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to derive features automatically.
We will briefly look at some feature engineering techniques: polynomial transformation and binning.
Polynomial transformation
If we have two features, a and b, we can suspect that there is a polynomial relationship, such as a² + ab + b². We can treat an interaction between a and b, such as the product ab, as a new feature. An interaction doesn’t have to be a product—although this is the most common choice—it can also be a sum, a difference, or a ratio. If we use a ratio, we should add a small constant to the divisor (and the dividend) to avoid dividing by zero.
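As a minimal sketch, assuming a recent scikit-learn version, PolynomialFeatures can generate these terms automatically; the two-column input stands in for features a and b:

```python
# Generating degree-2 polynomial and interaction features with scikit-learn.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [1.0, 5.0]])  # columns: a, b

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["a", "b"]))  # ['a' 'b' 'a^2' 'a b' 'b^2']
print(X_poly)                                  # original features plus a², ab, and b²
```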
The number of features and the order of the polynomial for a polynomial relationship aren’t limited. However, if we follow the Occam’s razor principle, we should avoid higher-order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more expensive to compute and tend to overfit, but if you really need better results, they may be worth considering. We will see polynomial transformation in action in the Best practice 12 – performing feature engineering without domain expertise section in Chapter 10, Machine Learning Best Practices.
Binning
Sometimes, it’s useful to separate feature values into several bins. For example, we may only be interested in whether it rained on a particular day. Given the precipitation values, we can binarize the values so that we get a true value if the precipitation value isn’t zero, and a false value otherwise. We can also use statistics to divide values into high, low, and medium bins. In marketing, we often care more about the age group, such as 18 to 24, than a specific age, such as 23.
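Here is a minimal sketch of both examples using pandas; the precipitation and age values are made up for illustration:

```python
# Binarizing precipitation and binning ages into marketing-style age groups.
import pandas as pd

precipitation = pd.Series([0.0, 2.5, 0.0, 14.2])
rained = precipitation > 0           # True if any rain fell that day
print(rained.tolist())               # [False, True, False, True]

ages = pd.Series([19, 23, 31, 40])
age_group = pd.cut(ages, bins=[18, 24, 34, 44],
                   labels=["18-24", "25-34", "35-44"])
print(age_group.tolist())            # ['18-24', '18-24', '25-34', '35-44']
```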
The binning process inevitably leads to a loss of information. However, depending on your goals, this may not be an issue; it can actually reduce the chance of overfitting. There will certainly be improvements in speed and reductions in memory or storage requirements and redundancy.
Any real-world machine learning system should have two modules: a data preprocessing module, which we just covered in this section, and a modeling module, which will be covered next.