Feature extraction and engineering

Data preparation is the longest and most complex phase of any ML project. We emphasized this while discussing the CRISP-DM model, where we mentioned that the data preparation phase takes up about 60-70% of the overall time spent in an ML project.

Once we have our raw dataset preprocessed and wrangled, the next step is to make it usable for ML algorithms. Feature extraction is the process of deriving features from raw attributes. For instance, feature extraction while working with image data refers to the extraction of red, blue, and green channel information as features from raw pixel-level data.

Along the same lines, feature engineering refers to the process of deriving additional features from existing ones using mathematical transformations. For instance, feature engineering would help us derive a feature such as annual income from a person's monthly income (based on use case requirements). Since both feature extraction and feature engineering help us transform raw datasets into usable forms, the terms are used interchangeably by ML practitioners.

Feature engineering strategies

The process of transforming raw datasets (post cleanup and wrangling) into features that can be utilized by ML algorithms is a combination of domain knowledge, use case requirements, and specific techniques. Features thus depict various representations of the underlying data and are the outcome of the feature engineering process.

Since feature engineering transforms raw data into a more useful representation, there are various standard techniques and strategies that can be utilized, based on the type of data at hand. In this section, we will discuss a few of those strategies, briefly covering both structured and unstructured data.

Working with numerical data

Numerical data, commonly available in datasets in the form of integers or floating-point numbers and popularly known as continuous numerical data, is usually an ML-friendly data type. By friendly, we mean that numeric data can be ingested by most ML algorithms directly. This, however, does not mean that numeric data never requires additional processing and feature engineering steps.

There are various techniques for extracting and engineering features from numerical data. Let's look at some of those techniques in this section:

  • Raw measures: These data attributes or features can be used directly in their raw or native format as they occur in the dataset without any additional processing. Examples can be age, height, or weight (as long as data distributions are not too skewed!).
  • Counts: Numeric features such as counts and frequencies are also useful in certain scenarios to depict important details. Examples can be the number of credit card fraud occurrences, song listen counts, device event occurrences, and so on.
  • Binarization: Often we might want to binarize occurrences or features, especially to just indicate if a specific item or attribute was present (usually denoted with a 1) or absent (denoted with a 0). This is useful in scenarios like building recommendation systems.
  • Binning: This technique typically bins or groups continuous numeric values from any feature or attribute under analysis into discrete bins, such that each bin covers a specific numeric range of values. Once we get these discrete bins, we can choose to further apply categorical data-based feature engineering on them. Various binning strategies exist, such as fixed-width binning and adaptive binning.

Code snippets to better understand feature engineering for numeric data are available in the notebook feature_engineering_numerical_and_categorical_data.ipynb.
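
As a quick illustration (an assumed toy example, not the code from the notebook), a minimal sketch of binarization and fixed-width binning using pandas and scikit-learn might look as follows:

    import pandas as pd
    from sklearn.preprocessing import Binarizer

    # hypothetical data: song listen counts per user
    df = pd.DataFrame({'listen_count': [0, 3, 57, 1, 0, 12]})

    # binarization: 1 if the song was played at least once, 0 otherwise
    df['listened'] = Binarizer(threshold=0).fit_transform(df[['listen_count']])

    # fixed-width binning: group raw counts into discrete, labeled ranges
    df['listen_bin'] = pd.cut(df['listen_count'],
                              bins=[-1, 0, 10, 50, 1000],
                              labels=['never', 'low', 'medium', 'high'])
    print(df)

Adaptive binning would instead derive the bin edges from the data distribution itself, for instance by using quantiles (pd.qcut).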

Working with categorical data

Another important class of data commonly encountered is categorical data. Categorical features have discrete values that belong to a finite set of classes. These classes may be represented as text or numbers. Depending upon whether the classes have a natural order or not, categorical features are termed ordinal or nominal, respectively.

Nominal features are those categorical features that have a finite set of values but do not have any natural ordering to them. For instance, weather seasons, movie genres, and so on are all nominal features. Categorical features that have a finite set of classes with a natural ordering to them are termed as ordinal features. For instance, days of the week, dress sizes, and so on are ordinals.

Typically, any standard workflow in feature engineering involves some form of transformation of these categorical values into numeric labels and then the application of some encoding scheme on these values. Popular encoding schemes are briefly mentioned as follows:

  • One-hot encoding: This strategy creates n binary-valued columns for a categorical attribute, assuming there are n distinct categories
  • Dummy coding: This strategy creates n-1 binary-valued columns for a categorical attribute, assuming there are n distinct categories
  • Feature hashing: In this strategy, a hash function is used to map several features into a single bin or bucket (a new feature); it is popularly used when we have a large number of features

Code snippets to better understand feature engineering for categorical data are available in the notebook feature_engineering_numerical_and_categorical_data.ipynb.
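
As an illustrative sketch (with an assumed toy feature, not the notebook's data), one-hot encoding and dummy coding can be applied with pandas as follows:

    import pandas as pd

    # hypothetical nominal feature: movie genres
    df = pd.DataFrame({'genre': ['comedy', 'drama', 'horror', 'comedy']})

    # one-hot encoding: n binary-valued columns for n distinct categories
    one_hot = pd.get_dummies(df['genre'], prefix='genre')

    # dummy coding: n-1 binary-valued columns (the first category is dropped)
    dummy = pd.get_dummies(df['genre'], prefix='genre', drop_first=True)

    print(one_hot)
    print(dummy)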

Working with image data

Image or visual data is a rich source of information, with several use cases that can be solved using ML algorithms and deep learning. Image data poses many challenges and requires careful preprocessing and transformation before it can be utilized by any algorithm. Some of the most common ways of performing feature engineering on image data are as follows:

  • Utilize metadata information or EXIF data: Attributes such as image creation date, modification date, dimensions, compression format, device used to capture the image, resolution, focal length, and so on.
  • Pixel and channel information: Every image can be considered a matrix of pixel values, or an (m, n, c) matrix, where m represents the number of rows, n represents the number of columns, and c denotes the color channels (for instance, R, G, and B). Such a matrix can then be transformed into different shapes as per the requirements of the algorithm and use case.
  • Pixel intensity: Sometimes it is difficult to work with colored images that have multiple color channels. Pixel intensity-based feature extraction relies on binning pixels based on their intensities rather than utilizing raw pixel-level values.
  • Edge detection: Sharp changes in contrast and brightness between neighboring pixels can be utilized to identify object edges. There are different algorithms available for edge detection; a brief sketch follows this list.
  • Object detection: We take the concept of edge detection and extend it to object detection and then utilize identified object boundaries as useful features. Again, different algorithms may be utilized based on the type of image data available.
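
To make the channel, intensity, and edge-based ideas more concrete, here is a brief sketch using scikit-image (assuming the library and its bundled sample image are available):

    from skimage import data, color, filters

    image = data.astronaut()        # sample RGB image of shape (m, n, 3)
    red_channel = image[:, :, 0]    # raw pixel values of the R channel

    gray = color.rgb2gray(image)    # collapse the color channels into intensities
    edges = filters.sobel(gray)     # edge map from sharp local intensity changes

    print(image.shape, red_channel.shape, edges.shape)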

Deep learning-based automated feature extraction

The feature extraction methods for image data and other types discussed so far require a lot of time, effort, and domain understanding. This kind of feature extraction has its merits along with its limitations.

Lately, deep learning models, specifically Convolutional Neural Networks (CNNs), have been studied and utilized as automated feature extractors. A CNN is a special type of deep neural network optimized for image data. At the core of any CNN are convolutional layers, which apply sliding filters across the height and width of the image. The dot product of the pixel values and these filters produces activation maps that are learned across multiple epochs. At every level, these convolutional layers help in extracting specific features, such as edges, textures, corners, and so on.

There is more to deep learning and CNNs but, to keep things simple, let's assume that at every layer a CNN helps us extract different low- and high-level features automatically. This, in turn, saves us from manually performing feature extraction. We will study CNNs in more detail in the coming chapters and see how they help us extract features automatically.
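
As a small preview of what the later chapters cover, a pretrained CNN can be used as an off-the-shelf feature extractor. The following sketch assumes keras with the pretrained VGG16 weights is available (with tensorflow.keras, the import path differs slightly):

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input

    # load the convolutional base only, dropping the classification layers
    feature_extractor = VGG16(weights='imagenet', include_top=False,
                              input_shape=(224, 224, 3))

    # hypothetical batch containing a single 224 x 224 RGB image
    batch = preprocess_input(np.random.rand(1, 224, 224, 3) * 255)
    features = feature_extractor.predict(batch)

    print(features.shape)  # feature maps from the last convolutional block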

Working with text data

Numerical and categorical features are what we call structured data types. They are easier to process and utilize in ML workflows. Textual data is a major source of unstructured information that is equally important. Textual data presents multiple challenges related to syntactic understanding, semantics, format, and content, and it must be transformed into numeric form before it can be utilized by ML algorithms. Thus, feature engineering for textual data is preceded by rigorous preprocessing and cleanup steps.

Text preprocessing

Textual data requires careful and diligent preprocessing before any feature extraction or engineering can be performed. The following is a list of some of the most widely used preprocessing steps for textual data:

  • Tokenization
  • Lowercasing
  • Removal of special characters
  • Contraction expansions
  • Stopword removal
  • Spell corrections
  • Stemming and lemmatization

We will be covering most of these techniques in detail in the chapters related to use cases. For a better understanding, readers may refer to Chapter 4 and Chapter 7 of Practical Machine Learning with Python by Sarkar and co-authors, Springer, 2017.
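
For a flavor of these steps, the following sketch uses NLTK (assuming the punkt tokenizer and stopword corpora have been downloaded):

    import re
    import nltk
    from nltk.corpus import stopwords

    text = "The quick brown foxes aren't jumping over the lazy dogs!!"

    tokens = nltk.word_tokenize(text.lower())             # tokenization + lowercasing
    tokens = [re.sub(r'[^a-z]', '', t) for t in tokens]   # remove special characters
    tokens = [t for t in tokens if t]                      # drop empty tokens

    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]    # stopword removal

    stemmer = nltk.stem.PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens]             # stemming
    print(tokens)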

Feature engineering

Once we have our textual data properly processed via the methods mentioned in the previous section, we can utilize some of the following techniques for feature extraction and transformation into numerical form. Code snippets to better understand feature engineering for textual data are available in the Jupyter Notebook feature_engineering_text_data.ipynb:

  • Bag-of-words model: This is by far the simplest vectorization technique for textual data. In this technique, each document is represented as a vector of N dimensions, where N is the number of unique words across the preprocessed corpus, and each component of the vector denotes either the presence of the corresponding word or its frequency.
  • TF-IDF model: The bag-of-words model works under very simplistic assumptions and at certain times leads to various issues. One of the most common issues is related to some words overshadowing the rest of the words due to very high frequency, as the bag-of-words model utilizes absolute frequencies to vectorize. The Term Frequency-Inverse Document Frequency (TF-IDF) model mitigates this issue by scaling/normalizing the absolute frequencies. Mathematically, the model is defined as follows:

    tfidf(w, D) = tf(w, D) * idf(w, D)

    Here, tfidf(w, D) denotes the TF-IDF score of word w in document D, tf(w, D) is the frequency of word w in document D, and idf(w, D) denotes the inverse document frequency, calculated as the logarithm of the total number of documents in the corpus C divided by the number of documents in which w occurs.
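
A minimal sketch of both vectorizers, using scikit-learn and an assumed toy corpus, might look as follows:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ['the sky is blue',
              'the sun is bright',
              'the sun in the sky is bright']

    # bag-of-words: raw term frequencies per document
    bow = CountVectorizer()
    bow_matrix = bow.fit_transform(corpus)

    # TF-IDF: term frequencies scaled by inverse document frequency
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(corpus)

    print(sorted(bow.vocabulary_))
    print(bow_matrix.toarray())
    print(tfidf_matrix.toarray().round(2))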

Apart from bag of words and TF-IDF, there are other transformations, such as bag of N-grams, and word embeddings such as Word2vec, GloVe, and many more. We will cover several of them in detail in subsequent chapters.
