Backdoor poisoning attacks
In the previous section, we sought to degrade and influence classification simply by inserting samples with the wrong labels. The results were mixed because that approach is not targeted enough to teach the model what should trigger the misclassification.
This is where backdoor attacks come in. In this type of attack, the attacker introduces a pattern (the backdoor, or trigger) into the training data, which the model learns to associate with a particular class. At inference time, the model classifies any input containing this pattern into the attacker’s desired class.
For example, an attacker might insert a specific pattern, such as a cyan square, in the corners of airplane images and label these images as birds. The trained model will then classify any image with this square as a bird. This is what the backdoor pattern and a poisoned image will look like:
Figure 4.6 – Simple square backdoor and a poisoned image with...
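To make the mechanics concrete, here is a minimal sketch of how such a trigger could be stamped onto a fraction of the airplane images, with their labels flipped to bird. The patch size, colour values, class indices, and helper names (add_trigger, poison_dataset) are illustrative assumptions for CIFAR-10-style data, not the exact code used in this chapter:

```python
# Minimal backdoor-poisoning sketch (assumed CIFAR-10-style HxWx3 uint8 images).
import numpy as np

AIRPLANE, BIRD = 0, 2                      # assumed class indices for CIFAR-10
PATCH = 3                                  # assumed side length of the square trigger
CYAN = np.array([0, 255, 255], dtype=np.uint8)

def add_trigger(image: np.ndarray) -> np.ndarray:
    """Stamp a cyan square into the bottom-right corner of the image."""
    poisoned = image.copy()
    poisoned[-PATCH:, -PATCH:, :] = CYAN
    return poisoned

def poison_dataset(x: np.ndarray, y: np.ndarray, rate: float = 0.1, seed: int = 0):
    """Add the trigger to a fraction of airplane images and relabel them as birds."""
    rng = np.random.default_rng(seed)
    x_p, y_p = x.copy(), y.copy()
    airplane_idx = np.where(y == AIRPLANE)[0]
    chosen = rng.choice(airplane_idx, size=int(rate * len(airplane_idx)), replace=False)
    for i in chosen:
        x_p[i] = add_trigger(x_p[i])
        y_p[i] = BIRD                      # flipped label teaches the model the backdoor
    return x_p, y_p
```

Once a model is trained on the poisoned set, applying the same add_trigger stamp to any test image should push its prediction toward the bird class, while clean images remain largely unaffected.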