Hands-On Neural Networks

You're reading from Hands-On Neural Networks

Product type: Book
Published: May 2019
Publisher: Packt
ISBN-13: 9781788992596
Pages: 280
Edition: 1st
Authors (2): Leonardo De Marchi, Laura Mitchell
Table of Contents (16 chapters)

Preface
1. Section 1: Getting Started
2. Getting Started with Supervised Learning
3. Neural Network Fundamentals
4. Section 2: Deep Learning Applications
5. Convolutional Neural Networks for Image Processing
6. Exploiting Text Embedding
7. Working with RNNs
8. Reusing Neural Networks with Transfer Learning
9. Section 3: Advanced Applications
10. Working with Generative Algorithms
11. Implementing Autoencoders
12. Deep Belief Networks
13. Reinforcement Learning
14. What's Next?
15. Other Books You May Enjoy

Feature engineering

Feature engineering is the process of creating new features by transforming existing ones. It is very important in traditional ML but is less important in deep learning.

Traditionally, data scientists or researchers would apply their domain knowledge and come up with a smart representation of the input that would highlight the relevant features and make the prediction task more accurate.

For example, before the advent of deep learning, traditional computer vision required custom algorithms that extracted the most relevant features, such as edge detection or the Scale-Invariant Feature Transform (SIFT).

To understand this concept, let's look at an example: starting from an original photo and applying some feature engineering, in particular an edge detection algorithm, we obtain an image that retains only the detected edges.
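
To make this concrete, here is a minimal sketch of edge detection in Python; the OpenCV library, the filename, and the thresholds are our own illustrative choices, not taken from the book:

import cv2

# Load the photo as a grayscale image (edge detectors work on intensity)
image = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE)
# Run the Canny edge detector; 100 and 200 are the hysteresis thresholds
edges = cv2.Canny(image, 100, 200)
# The resulting edge map can be fed to a classifier as an engineered feature
cv2.imwrite('edges.jpg', edges)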

One of the great advantages of using deep learning is that it's not necessary to hand-craft these features: the network will do the job itself.

How deep learning performs feature engineering

The theoretical advantage of neural networks is that they are universal approximators. The Universal Approximation Theorem states that a feed-forward network with a single hidden layer, a finite number of neurons, and some assumptions regarding the activation function can approximate any continuous function. However, this theorem does not specify whether the parameters of the network are algorithmically learnable.
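
Stated a little more formally (a standard textbook formulation, paraphrased here rather than quoted from the book): for any continuous function $f$ on a compact set $K$ and any $\varepsilon > 0$, there exist a number of neurons $N$, weights $w_i$, biases $b_i$, and coefficients $\alpha_i$ such that the single-hidden-layer network

$$ F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^\top x + b_i) $$

satisfies $\sup_{x \in K} \lvert F(x) - f(x) \rvert < \varepsilon$, where $\sigma$ is a suitable non-constant activation function.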

In practice, layers are added to the network to increase the non-linearity of the approximated function, and there is a lot of empirical evidence that the deeper the network is and the more data we feed into it, the better the results will be. There are some caveats to this statement that we will see later on in this book.

Nevertheless, some deep learning tasks still require feature engineering, natural language processing (NLP) being one example. In this case, feature engineering can range from dividing the text into small subsets, called n-grams, to a vectorized representation using, for example, word embeddings.
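
As a quick illustration, here is a minimal sketch of n-gram extraction with scikit-learn's CountVectorizer; the toy corpus is made up for this example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['deep learning is fun', 'feature engineering is fun']
# Extract both unigrams and bigrams (n-grams with n = 1 and n = 2)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)    # sparse document-term matrix
print(sorted(vectorizer.vocabulary_))   # the extracted n-gram features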

Feature scaling

One feature engineering technique that is necessary even with neural networks is feature scaling. It's necessary to scale the numerical input so that all of the features are on the same scale; otherwise, the network will give more importance to features with larger numerical values.

A very simple transformation is rescaling the input between 0 and 1, also known as MinMax scaling. Other common operations are standardization and zero-mean translation, which together make sure that the standard deviation of the input is 1 and its mean is 0; in the scikit-learn library, they are implemented by the scale function:

from sklearn import preprocessing
import numpy as np

# Three samples with three features each
X_train = np.array([[-3., 1., 2.],
                    [ 2., 0., 0.],
                    [ 1., 2., 3.]])
# Standardize each feature to zero mean and unit variance
X_scaled = preprocessing.scale(X_train)

The preceding command generates the following result:

Out[2]:
array([[-1.38873015,  0.        ,  0.26726124],
       [ 0.9258201 , -1.22474487, -1.33630621],
       [ 0.46291005,  1.22474487,  1.06904497]])
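
For comparison, the MinMax scaling mentioned above is available through the MinMaxScaler class; a minimal sketch on the same data:

from sklearn.preprocessing import MinMaxScaler

# Rescale each feature of X_train to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X_train)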

You can find many other numerical transformations already available in scikit-learn. Some other important transformations from its documentation are as follows:

  • PowerTransformer: This transformation applies a power transformation to each feature in order to make the data follow a Gaussian-like distribution. It will find the optimal scaling factor to stabilize the variance and, at the same time, minimize skewness. By default, scikit-learn's PowerTransformer also standardizes the output, forcing the mean to be 0 and the variance to be 1.
  • QuantileTransformer: This transformation has an additional output_distribution parameter that allows us to map the features to a Gaussian distribution instead of the default uniform one. It will introduce saturation for our inputs' extreme values. Both transformers are sketched after this list.
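
Here is a minimal sketch of both transformers applied to the X_train array from before; the parameter values are illustrative choices, not recommendations from the book:

from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Power transform (Yeo-Johnson); standardize=True forces zero mean, unit variance
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_power = pt.fit_transform(X_train)

# Map the features to a Gaussian instead of the default uniform distribution;
# n_quantiles must not exceed the number of samples (here, 3)
qt = QuantileTransformer(output_distribution='normal', n_quantiles=3)
X_quantile = qt.fit_transform(X_train)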

Feature engineering in Keras

Keras provides a nice and simple interface to do feature engineering. A task that we will study in particular in this book is image classification. For this task, Keras provides the ImageDataGenerator class, which allows us to easily pre-process and augment the data.

The augmentation we are going to perform is aimed at generating more images using some random transformations such as zooming, flipping, shearing, and shifting. These transformations help prevent overfitting and make the model more robust to different image conditions, such as brightness.

We will see the code first and then explain what it does. Following the Keras documentation (https://keras.io/), it's possible to create a generator with the mentioned transformations using the following code:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
        rotation_range=45,
        width_shift_range=0.25,
        height_shift_range=0.25,
        rescale=1./255,
        shear_range=0.3,
        zoom_range=0.3,
        horizontal_flip=True,
        fill_mode='nearest')

For the generator, it's possible to set a few parameters:

  • The rotation_range parameter is a range in degrees (0-180) from which a random rotation angle for the inputs is drawn.
  • width_shift_range and height_shift_range are ranges (as a fraction of the total width or height) within which the pictures are randomly translated horizontally or vertically.
  • rescale is a common operation used to rescale a raw image. In this case, we have RGB images, in which each pixel is represented by a value between 0 and 255. Because of this, we use a scaling factor of 1/255, so our values will now be between 0 and 1. We do this because, otherwise, the numbers would be too high given the typical learning rate, one of the parameters of our network.
  • shear_range is used for randomly applying shearing transformations.
  • zoom_range is used to create additional pictures by randomly zooming inside pictures.
  • horizontal_flip is a Boolean value used to create additional pictures by randomly flipping half of the images horizontally. This is useful when there are no assumptions of horizontal asymmetry.
  • fill_mode is the strategy used for filling in the new pixels that a transformation can create (for example, after a rotation or a shift).

In this way, from one image, we can create many to feed to our model. Notice that we have only initialized the object so far, so no instruction has been executed; the generator will perform the action only when it's called, which will happen later on.
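
To see when the work actually happens, here is a minimal sketch of calling the generator; the array shapes, labels, and batch size are illustrative, not from the book:

import numpy as np

x = np.random.rand(10, 64, 64, 3)  # 10 dummy RGB images, 64x64 pixels
y = np.arange(10)                  # dummy labels, one per image

# flow() yields batches of randomly transformed images on demand
for x_batch, y_batch in datagen.flow(x, y, batch_size=5):
    print(x_batch.shape)  # (5, 64, 64, 3)
    break                 # the generator loops forever, so stop after one batch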
