Discovering how deepfakes work
Deepfakes use a unique variation of a generative auto-encoder to generate the face swap. This requires a special structure, which we will explain in this section.
Generative auto-encoders
The particular type of neural network that regular deepfakes use is called a generative auto-encoder. Unlike a Generative Adversarial Network (GAN), an auto-encoder does not use a discriminator or any “adversarial” techniques.
All auto-encoders work by training a collection of neural network models to solve a problem. With a normal auto-encoder, the problem is usually something such as classification (deciding what an image is), object identification (finding something inside an image), or segmentation (identifying different parts of an image). In the case of generative auto-encoders, the model is instead used to generate a new image with new details that weren’t in the original image. To do this, the auto-encoder uses two types of models – the encoder and the decoder. Let’s see how this works.
The deepfake training cycle
The training cycle is a repeating process in which the model is continuously trained on images until stopped. The process can be broken down into four steps, which are sketched in code after the following figure:
- Encode faces into smaller intermediate representations.
- Decode the intermediate representations back into faces.
- Calculate the loss (meaning, the difference) between the original face and the output of the model.
- Modify (backpropagate) the models toward the correct answer.
Figure 1.2 – Diagram of the training cycle
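To make these four steps concrete, here is a minimal sketch of the cycle using PyTorch. The tiny linear models, random stand-in images, and sizes are illustrative assumptions chosen to keep the sketch short, not the architecture of any particular deepfake tool:

```python
import torch
import torch.nn as nn

# Illustrative sizes; real deepfake models are far larger.
IMAGE_SIZE = 64 * 64 * 3   # a flattened 64x64 RGB face
LATENT_SIZE = 256          # the smaller intermediate representation

encoder = nn.Linear(IMAGE_SIZE, LATENT_SIZE)    # shared by both faces
decoder_a = nn.Linear(LATENT_SIZE, IMAGE_SIZE)  # trained only on person A
decoder_b = nn.Linear(LATENT_SIZE, IMAGE_SIZE)  # trained only on person B

params = (list(encoder.parameters())
          + list(decoder_a.parameters())
          + list(decoder_b.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.MSELoss()

# Random tensors stand in for real training photos of each person.
face_a = torch.rand(1, IMAGE_SIZE)
face_b = torch.rand(1, IMAGE_SIZE)

for iteration in range(1000):
    # Step 1: encode faces into intermediate representations.
    latent_a = encoder(face_a)
    latent_b = encoder(face_b)

    # Step 2: decode the representations back into faces.
    out_a = decoder_a(latent_a)
    out_b = decoder_b(latent_b)

    # Step 3: calculate the loss against the original faces.
    loss = criterion(out_a, face_a) + criterion(out_b, face_b)

    # Step 4: backpropagate, nudging the models toward the answer.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a real trainer, the random tensors would be replaced with batches of aligned face photographs, and the linear models would be the convolutional encoder and decoders described next.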
In more detail, the process unfolds as follows:
- The encoder’s job is to encode two different faces into an array, which we call the intermediate representation. The intermediate representation is much smaller than the original image size, with enough space to describe the lighting, pose, and expression of the faces. This process is similar to compression, where unnecessary data is thrown out to fit the data into a smaller space.
- The decoder is actually a matched pair of models that turn the intermediate representation back into faces. There is one decoder for each of the input faces, trained only on images of that one person’s face. Each decoder tries to create a new face that matches the original face that was given to the encoder and encoded into the intermediate representation. One possible structure for the encoder and decoders is sketched after the figure.
Figure 1.3 – Encoder and decoder
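As an illustration, here is one way the shared encoder and the matched pair of decoders might be structured in PyTorch. The layer counts, channel sizes, and 64x64 image size are assumptions for this sketch:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Shrinks a 3x64x64 face into a small intermediate representation."""
    def __init__(self, latent_size=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_size),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Grows an intermediate representation back into a 3x64x64 face."""
    def __init__(self, latent_size=256):
        super().__init__()
        self.fc = nn.Linear(latent_size, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 32x32 -> 64x64
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 16, 16)
        return self.net(x)

# One shared encoder, one decoder per person.
encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()
```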
- Loss is a score that is given to the auto-encoder based on how well it recreates the original faces. This is calculated by comparing the original image to the output of the encoder-decoder process. The comparison can be done in many ways, from a strict difference between the two images to something significantly more complicated that includes human perception as part of the calculation. No matter how it’s done, the result is the same: a number from 0 to 1, with 0 being the score for the model returning the exact same image and 1 for returning its exact opposite. In practice, most scores fall somewhere between the two, since a perfect reconstruction (or its perfect opposite) is impossible. A minimal example of such a loss appears after the following note.
Note
The loss is where an auto-encoder differs from a GAN. In a GAN, the comparison loss is either replaced or supplemented with an additional network (usually an auto-encoder itself), which then produces a loss score of its own. The theory behind this structure is that the loss model (called a discriminator) can learn to get better at detecting the output of the generating model (called a generator) while the generator can learn to get better at fooling the discriminator.
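For illustration, here is a minimal sketch of the strictest kind of comparison: a mean absolute pixel difference, which naturally produces a score from 0 to 1 when pixel values are in the [0, 1] range. More complicated losses (including perceptual ones) follow the same pattern of comparing the original to the output:

```python
import torch

def simple_loss(original, reconstruction):
    """Mean absolute pixel difference, assuming pixel values in [0, 1].

    Returns 0.0 for a perfect reconstruction and 1.0 for its exact
    opposite (every pixel maximally wrong).
    """
    return torch.mean(torch.abs(original - reconstruction))

black = torch.zeros(1, 3, 64, 64)  # stand-ins for real face images
white = torch.ones(1, 3, 64, 64)
print(simple_loss(black, black))   # tensor(0.) – identical images
print(simple_loss(black, white))   # tensor(1.) – every pixel maximally wrong
```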
- Finally, there is backpropagation, a process in which the models are adjusted by following the path back through both the decoder and encoder that generated the face and nudging those paths toward the correct answer.
Figure 1.4 – Loss and backpropagation
Once complete, the whole process starts over again at the encoder. This repeats until the neural network has finished training. The decision to end training can be made in several ways: after a certain number of repetitions have occurred (called iterations), after all the data has been gone through (called an epoch), or when the results meet a certain loss score.
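As a sketch, the three stopping conditions can be combined in a single loop. The numbers and the `train_one_batch()` stand-in below are hypothetical placeholders, not values from any real trainer:

```python
import random

# Hypothetical stand-ins: in a real trainer, these would run the four
# steps of the cycle on real face images and return the measured loss.
training_batches = range(1000)
def train_one_batch(batch):
    return random.uniform(0.0, 1.0)

MAX_ITERATIONS = 20_000  # stop after a fixed number of repetitions
TARGET_LOSS = 0.02       # or once the results meet a certain loss score
MAX_EPOCHS = 50          # or after a number of full passes over the data

iteration, done = 0, False
for epoch in range(MAX_EPOCHS):
    for batch in training_batches:  # one full pass = one epoch
        loss = train_one_batch(batch)
        iteration += 1
        if iteration >= MAX_ITERATIONS or loss <= TARGET_LOSS:
            done = True
            break
    if done:
        break
```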
Why not GANs?
GANs are one of the current darlings of generative networks. They are extremely popular and used extensively, particularly for super-resolution (intelligent upscaling), music generation, and even, sometimes, deepfakes. However, there are reasons why they’re not used in all deepfake solutions.
GANs are popular due to their “imaginative” nature. They learn through the interaction of their generator and discriminator to fill in gaps in the data. Because they can fill in missing pieces, they are great at reconstruction tasks or at tasks where new data is required.
The ability of a GAN to create new data where it is missing is great for numerous tasks, but it has a critical flaw when used for deepfakes. In deepfakes, the goal is to replace one face with another face. An imaginative GAN would likely learn to fill the gaps in the data from one face with the data from the other. This leads to a problem that we call “identity bleed,” where the two faces aren’t swapped properly; instead, they’re blended into a face that looks like neither person but rather a mix of the two.
This flaw in a GAN-created deepfake can be corrected or prevented, but doing so requires much more careful data collection and processing. In general, it’s easier to get a full swap, rather than a blend, by using a generative auto-encoder instead of a GAN.
The auto-encoder structure
Another name for an auto-encoder is an “hourglass” model. The reason for this is that each layer of an encoder is smaller than the layer before it while each layer of a decoder is larger than the one before. Because of this, the auto-encoder figure starts out large at the beginning, shrinks toward the middle, and then widens back out again as it reaches the end:
Figure 1.5 – Hourglass structure of an auto-encoder
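You can see the hourglass directly by printing the size of the data as it flows through a stack of layers. The widths below are illustrative assumptions; the point is only that they shrink to a narrow waist and then widen back out:

```python
import torch
import torch.nn as nn

# Illustrative layer widths showing the hourglass: each encoder layer is
# smaller than the one before it, each decoder layer larger.
sizes = [12288, 4096, 1024, 256, 1024, 4096, 12288]  # 256 is the "waist"

layers = []
for in_size, out_size in zip(sizes, sizes[1:]):
    layers += [nn.Linear(in_size, out_size), nn.ReLU()]
hourglass = nn.Sequential(*layers[:-1])  # no activation on the final layer

x = torch.rand(1, sizes[0])  # a flattened 64x64 RGB image
for layer in hourglass:
    x = layer(x)
    if isinstance(layer, nn.Linear):
        print(x.shape)  # widths shrink to 256, then widen back out
```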
While these methods are flexible and have many potential uses, there are limitations. Let’s examine those limitations now.