In this chapter, we continue our journey into deep learning with R by looking at autoencoders.
A classical autoencoder consists of three parts:
- An encoding function, which compresses your data
- A decoding function, which reconstructs data from a compressed version
- A metric or distance, which quantifies the information lost by compression, that is, the difference between your original data and its reconstruction
We typically assume that all of these functions are smooth enough for backpropagation or other gradient-based methods to apply. They need not be, however, and we could use derivative-free methods to train them instead.
Although the compression part might remind you of algorithms such as MP3 compression, an important difference is that autoencoders are data specific. An autoencoder trained on pictures of cats and dogs will likely perform poorly on pictures of buildings. In contrast, the MP3 compression algorithm relies on assumptions about sound in general and works regardless of the particular sound data. This data specificity is a serious caveat for widespread application, which is why autoencoders are rarely used for compression tasks.
One reason autoencoders have attracted so much attention in recent years is that many people believe they might be the key to unsupervised learning, although, strictly speaking, they are a self-supervised learning algorithm.
Sometimes the features extracted by autoencoders can be fed into supervised learning algorithms, which makes autoencoders somewhat comparable to principal component analysis (PCA) as a dimensionality reduction technique.
Autoencoders are typically used in computer vision problems such as image denoising, or for picking up features such as colors, light, and edges. They are also used for visualizing high-dimensional datasets, as they can find more interesting features than PCA. Other recent applications include fraud and intrusion detection.
For our purposes, an autoencoder neural network is simply an algorithm for unsupervised learning that applies backpropagation with the target values set equal to the inputs. That is, if x1, x2, ..., xm are the training examples and y1, y2, ..., ym are the labels, then we run backpropagation with yi = xi for all values of i.
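To make the idea of using the inputs as targets concrete, here is a minimal sketch in base R. It is not the implementation used later in the chapter: the toy data, the network shape (3 -> 2 -> 3, with a tanh hidden layer and no bias terms), and all names (X, Y, W1, W2, lr, n_epochs) are assumptions chosen for illustration.

```r
# Minimal sketch of an autoencoder trained with backpropagation in base R.
# All names and hyperparameters here are illustrative assumptions.
set.seed(42)

# Toy data: 200 points in three dimensions lying close to a two-dimensional plane
n <- 200
Z <- matrix(rnorm(n * 2), ncol = 2)
A <- matrix(c(1, 0,  0,
              1, 1, -1), nrow = 2, byrow = TRUE)
X <- Z %*% A + matrix(rnorm(n * 3, sd = 0.05), ncol = 3)
X <- scale(X, center = TRUE, scale = FALSE)

# The key step: the targets are the inputs themselves (y_i = x_i)
Y <- X

# Encoder weights (3 -> 2) and decoder weights (2 -> 3), small random start
W1 <- matrix(rnorm(3 * 2, sd = 0.1), nrow = 3)
W2 <- matrix(rnorm(2 * 3, sd = 0.1), nrow = 2)

lr <- 0.01
n_epochs <- 3000
loss_history <- numeric(n_epochs)

for (epoch in seq_len(n_epochs)) {
  # Forward pass
  H <- tanh(X %*% W1)   # hidden (encoded) state
  X_hat <- H %*% W2     # reconstruction
  E <- X_hat - Y
  loss_history[epoch] <- mean(rowSums(E^2))

  # Backward pass: gradients of the mean squared reconstruction error
  dX_hat <- 2 * E / n
  grad_W2 <- t(H) %*% dX_hat
  dH <- dX_hat %*% t(W2)
  grad_W1 <- t(X) %*% (dH * (1 - H^2))

  # Gradient descent update
  W1 <- W1 - lr * grad_W1
  W2 <- W2 - lr * grad_W2
}

# Reconstruction error at the first and last epoch
loss_history[c(1, n_epochs)]
```

Note that no separate label vector is ever supplied: the only supervision signal is the data itself, which is why the method is described as self-supervised.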
From your previous experience with machine learning, you might be familiar with PCA. Don't worry if you are not, as it is not strictly required for our purposes. PCA is a dimensionality reduction technique: given a set of training examples, a suitable transformation is applied (for math geeks, this is just a projection onto the vector space spanned by the leading eigenvectors of the covariance matrix). The goal of this projection is to find the most relevant features of the input data, so that in the end we get a simplified representation of it.
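For readers who want to see the projection spelled out, the following sketch computes PCA by hand from the eigenvectors of the covariance matrix and compares the result with base R's prcomp. The toy data and the variable names (X, Xc, V, scores) are assumptions made for this illustration.

```r
# Sketch: PCA as a projection onto the leading eigenvectors of the covariance matrix.
# The data and variable names are illustrative assumptions.
set.seed(1)
X <- matrix(rnorm(300), ncol = 3) %*%
  matrix(c(2, 0, 0,
           1, 1, 0,
           0, 0, 0.1), nrow = 3)

# Center the data, then decompose its covariance matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
eig <- eigen(cov(Xc))

# Keep the two leading eigenvectors and project the data onto them
V <- eig$vectors[, 1:2]
scores <- Xc %*% V

# prcomp gives the same projection, up to the sign of each component
pca <- prcomp(X, center = TRUE, scale. = FALSE)
max(abs(abs(scores) - abs(pca$x[, 1:2])))
```

The sign of each eigenvector is arbitrary, which is why the comparison is made on absolute values.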
Autoencoders work in a similar vein, except that the transformation involved is not a projection, but rather a non-linear function f. Given a training example x, an autoencoder encodes x using a neural network into a hidden state h := f(x), and decodes h using a function g, which yields an overall transformation x => g(f(x)). If the result of this process were simply g(f(x)) = x, we would not have a very useful transformation, since the network would merely copy its input; the hidden state is therefore usually constrained, for example by giving it fewer dimensions than x. The idea is illustrated in the following diagram:
Figure: an autoencoder mapping a three-dimensional input to a two-dimensional encoded state and then back to a three-dimensional space.
On the left, a three-dimensional input vector is transformed into a two-dimensional encoded state (this is the action of f) and then transformed back into a three-dimensional vector (by the action of g).
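The shapes involved can be checked with a few lines of base R. This is only a sketch: the weights below are random and untrained, and the names W_enc and W_dec, as well as the choice of tanh for f and a linear map for g, are assumptions made for illustration.

```r
# Sketch of the encoder f (3 -> 2) and decoder g (2 -> 3) with untrained weights.
# W_enc, W_dec and the choice of tanh are illustrative assumptions.
set.seed(7)
W_enc <- matrix(rnorm(3 * 2), nrow = 3, ncol = 2)
W_dec <- matrix(rnorm(2 * 3), nrow = 2, ncol = 3)

f <- function(x) tanh(x %*% W_enc)  # encoder: non-linear map into the hidden state
g <- function(h) h %*% W_dec        # decoder: map back to the input space

x <- matrix(c(0.5, -1.2, 2.0), nrow = 1)  # one three-dimensional input
h <- f(x)        # two-dimensional encoded state
x_hat <- g(h)    # three-dimensional reconstruction

dim(h)
dim(x_hat)
```

With random weights, x_hat will not resemble x; training is what makes g(f(x)) a good approximation of x.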
Why do we take the trouble of encoding and decoding? This has two purposes. On the one hand, autoencoders provide, like PCA, a way to automatically generate features in a lower-dimensional space. This is useful for feature extraction as part of a machine learning pipeline, in the same way PCA is: synthesizing the data and automatically generating features (instead of relying on domain expertise and feature handcrafting) can improve the accuracy of a supervised learning algorithm, be it for classification or regression tasks. On the other hand, and more importantly for our purposes, autoencoders are useful for outlier detection. As the computer is forced to understand the essential features of the data, anything that jumps out as odd tends to be lost during the reconstruction process (that is, the full encoding-decoding cycle), so outliers show up as examples with a large reconstruction error.
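The outlier detection idea can be sketched with reconstruction errors. To keep the example short and in base R, a one-component PCA stands in for a trained autoencoder as the encode/decode pair; this substitution, the toy data, the 99% threshold, and the helper names (encode, decode, reconstruction_error) are all assumptions made for illustration.

```r
# Sketch: flagging outliers by reconstruction error.
# A one-component PCA is used as a linear stand-in for a trained autoencoder.
set.seed(123)

# "Normal" training data: three features driven by a single underlying factor
n <- 200
z <- rnorm(n)
train <- cbind(x1 = z, x2 = 2 * z, x3 = -z) +
  matrix(rnorm(n * 3, sd = 0.1), ncol = 3)

# Fit the stand-in encoder/decoder on the training data
center <- colMeans(train)
V <- prcomp(train)$rotation[, 1, drop = FALSE]  # 3 x 1 projection direction

encode <- function(X) sweep(X, 2, center) %*% V
decode <- function(H) sweep(H %*% t(V), 2, center, FUN = "+")
reconstruction_error <- function(X) rowSums((X - decode(encode(X)))^2)

# Threshold taken from the training errors (99th percentile, an arbitrary choice)
threshold <- quantile(reconstruction_error(train), 0.99)

# Two new points: the first follows the training pattern, the second does not
new_points <- rbind(c(1,  2, -1),
                    c(1, -2,  1))
err <- reconstruction_error(new_points)
err > threshold
```

The second point is reconstructed poorly because the encoder never learned to represent points off the training pattern, which is exactly the signal used to flag it.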
Before jumping into the fraud example for this chapter, let's get our feet wet by looking at a simpler example and, at the same time, get our tools ready.