Introducing Autoencoders
In previous chapters, we have seen that neural networks are very powerful algorithms. The power of a network lies in its architecture, its activation functions, and its regularization terms, along with a few other features. Among the many varieties of neural architectures, there is one that is particularly versatile, being especially useful for three tasks: detecting unknown events, detecting unexpected events, and reducing the dimensionality of the input space. This neural network is the autoencoder.
Architecture of the Autoencoder
The autoencoder (or autoassociator) is a multilayer feedforward neural network, trained to reproduce the input vector onto the output layer. Like many neural networks, it is trained using the gradient descent algorithm, or one of its modern variations, to minimize a loss function such as the Mean Squared Error (MSE). It can have as many hidden layers as desired. Regularization terms and other general parameters that are useful for avoiding overfitting or for improving the learning process can be applied here as well.
The only constraint on the architecture is that the number of input units must be the same as the number of output units, as the goal is to train the autoencoder to reproduce the input vector onto the output layer.
The simplest autoencoder has only three layers: one input layer, one hidden layer, and one output layer. More complex autoencoder structures might include additional hidden layers.
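To make this more concrete, the following few lines of Keras code sketch such a minimal three-layer autoencoder. The input size of 30, the bottleneck size of 8, and the randomly generated x_train matrix are placeholder values chosen only for illustration; in practice, x_train would be your normalized feature data:

import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_inputs = 30   # number of input units (and, by construction, of output units)
n_hidden = 8    # size of the bottleneck (hidden) layer

inputs = Input(shape=(n_inputs,))
encoded = Dense(n_hidden, activation="relu")(inputs)       # encoder part
decoded = Dense(n_inputs, activation="sigmoid")(encoded)   # decoder part

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")           # MSE as the loss function

# Placeholder training data; the inputs are also used as the targets,
# so the network learns to reproduce the input vector onto the output layer.
x_train = np.random.rand(1000, n_inputs)
autoencoder.fit(x_train, x_train, epochs=50, batch_size=64)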
Autoencoders can be used for many different tasks. Let's first see how an autoencoder can be used for dimensionality reduction.
Reducing the Input Dimensionality with an Autoencoder
Let's consider an autoencoder with a very simple architecture: one input layer with n units; one output layer, also with n units; and one hidden layer with m units. If m < n, the autoencoder produces a compression of the input vector onto the hidden layer, reducing its dimensionality from n to m.
In this case, the first part of the network, moving the data from a vector of size n to a vector of size m, plays the role of the encoder. The second part of the network, reconstructing the input vector from the m-dimensional space back into the n-dimensional space, is the decoder. The compression rate is then n/m. The larger the value of n and the smaller the value of m, the higher the compression rate.
When using the autoencoder for dimensionality reduction, the full network is first trained to reproduce the input vector onto the output layer. Then, before deployment, it is split into two parts: the encoder (input layer and hidden layer) and the decoder (hidden layer and output layer). The two subnetworks are stored separately.
Tip
If you are interested in the output of the bottleneck layer, you can configure the Keras Network Executor node to output the middle layer. Alternatively, you can split the network within the DL Python Network Editor node by writing a few lines of Python code.
During the deployment phase, in order to compress an input record, we just pass it through the encoder and save the output of the hidden layer as the compressed record. Then, in order to reconstruct the original vector, we pass the compressed record through the decoder and save the output values of the output layer as the reconstructed vector.
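As a rough illustration, assuming the trained three-layer autoencoder from the sketch above, the split into encoder and decoder and the deployment steps might look as follows in Keras (x_new stands for new, normalized records to be compressed and is an assumption of this example):

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Encoder: the input layer plus the hidden layer of the trained autoencoder.
encoder = Model(autoencoder.input, autoencoder.layers[1].output)

# Decoder: a new input of the bottleneck size, fed into the trained output layer.
bottleneck = Input(shape=(n_hidden,))
decoder = Model(bottleneck, autoencoder.layers[-1](bottleneck))

# Deployment: compress a record, then reconstruct it from its compressed form.
compressed = encoder.predict(x_new)          # shape (samples, n_hidden)
reconstructed = decoder.predict(compressed)  # shape (samples, n_inputs)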
If a more complex structure is used for the autoencoder – for example, with more than one hidden layer – one of the hidden layers must serve as the bottleneck, producing the compressed record and separating the encoder subnetwork from the decoder subnetwork.
Now, when we talk about data compression, the question is: how faithfully can the original record be reconstructed? How much information is lost by using the output of the hidden layer instead of the original data vector? This, of course, depends on how well the autoencoder performs and on how large our error tolerance is.
During testing, when we apply the network to new data, we denormalize the output values and calculate the chosen error metric – for example, the Root Mean Square Error (RMSE) – between the original input data and the reconstructed data on the whole test set. This error value gives us a measure of the quality of the reconstructed data. Of course, the higher the compression rate, the higher the reconstruction error. The task thus becomes training the network to achieve acceptable performance within our error tolerance.
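For instance, assuming x_test holds the original test records and x_reconstructed holds the corresponding denormalized reconstructions (both names are assumptions of this example), the overall RMSE can be computed with a couple of lines of NumPy:

import numpy as np

# RMSE between the original and the reconstructed data over the whole test set.
rmse = np.sqrt(np.mean((x_test - x_reconstructed) ** 2))
print(f"Reconstruction RMSE on the test set: {rmse:.4f}")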
Let's move on to the next application field of autoencoders: anomaly detection.
Detecting Anomalies Using an Autoencoder
In most classification/prediction problems, we have a set of examples covering all event classes and, based on this dataset, we train a model to classify events. However, sometimes the event class we want to predict is so rare and unexpected that no (or almost no) examples are available at all. In this case, we do not talk about classification or prediction but about anomaly detection.
An anomaly can be any rare, unexpected, unknown event: a cardiac arrhythmia, a mechanical breakdown, a fraudulent transaction, and so on. In this case, since no examples of anomalies are available in the training set, we need to use neural networks in a more creative way than for conventional classification. The autoencoder structure lends itself to such creative usage, as required for the solution of an anomaly detection problem (see, for example, A.G. Gebresilassie, Neural Networks for Anomaly (Outliers) Detection, https://blog.goodaudience.com/neural-networks-for-anomaly-outliers-detection-a454e3fdaae8).
Since no anomaly examples are available, the autoencoder is trained only on non-anomaly examples. Let's refer to these as examples of the "normal" class. On a training set consisting entirely of "normal" data, the autoencoder network is trained to reproduce the input feature vector onto the output layer.
The idea is that, when required to reproduce a vector of the "normal" class, the autoencoder is likely to do a decent job, because that is what it was trained to do. However, when required to reproduce an anomaly on the output layer, it will hopefully fail, because it will not have seen this kind of vector during the training phase. Therefore, if we calculate the distance – any distance – between the original vector and the reproduced vector, we expect a small distance for input vectors of the "normal" class and a much larger distance for input vectors representing an anomaly.
Thus, by setting a threshold K, we should be able to detect anomalies with the following rule:

IF d(x, x') < K THEN x -> "normal"
IF d(x, x') >= K THEN x -> "anomaly"

Here, d(x, x') is the reconstruction error between the input vector x and its reconstruction x', and K is the set threshold.
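In code, this rule boils down to computing one reconstruction error per record and comparing it against the threshold. The following sketch assumes that the autoencoder has already been trained on "normal" data only, that x_test holds the normalized records to score, and that the threshold value of 0.05 is just a placeholder to be tuned on a validation set of "normal" data:

import numpy as np

reconstructions = autoencoder.predict(x_test)

# One reconstruction error per record (mean squared error across the features).
errors = np.mean((x_test - reconstructions) ** 2, axis=1)

K = 0.05  # placeholder threshold; choose it from the error distribution on "normal" validation data
labels = np.where(errors < K, "normal", "anomaly")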
This sort of solution has already been implemented successfully for fraud detection, as described in a blog post, Credit Card Fraud Detection using Autoencoders in Keras -- TensorFlow for Hackers (Part VII), by Venelin Valkov (https://medium.com/@curiousily/credit-card-fraud-detection-using-autoencoders-in-keras-tensorflow-for-hackers-part-vii-20e0c85301bd). In this chapter, we will use the same idea to build a similar solution using a different autoencoder structure.
Let's find out how the idea of an autoencoder can be used to detect fraudulent transactions.