Clean-label attacks
Clean-label poisoning attacks are a form of adversarial attack in which the attacker subtly manipulates the training data without changing its labels. Because the labels remain correct, these attacks are hard to detect, yet they can significantly degrade the resulting ML models.
In clean-label attacks, the attacker can only add seemingly benign samples to the training set, without explicit control over their labels. This makes the poisoning considerably harder to carry out, but it also helps the attack evade detection.
In our case, an attacker might subtly alter images of planes to resemble birds, causing the model to misclassify them.
A simple approach would be to apply slight darkening to the images in the hope of confusing the classifier; as we will see, this yields poor results:
# Example of subtly altering images by darkening a 5x5 corner patch by 10%
poisoned_images = x_train.copy()
poisoned_images[:, :5, :5, :] = x_train[:, :5, :5, :] * 0.9
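The clean-label property lies in what happens next: the altered images are used for training with their labels left untouched. The following is a minimal sketch of that step, assuming y_train holds the original, unchanged labels and model is a compiled Keras classifier:

# The labels are exactly as they were; only the pixels have been perturbed
model.fit(poisoned_images, y_train, epochs=10, batch_size=64)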
A more sophisticated approach was proposed by Shafahi, Huang, et al. in their 2018 paper Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks.
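At its core, the attack relies on feature collision: starting from a base image of the class the attacker wants the target confused with, it optimizes that image so its internal feature representation matches a chosen target image, while it still looks like the base image and keeps its correct label. The code below is only a minimal sketch of that objective, not the paper's exact procedure (which uses a forward-backward splitting optimization); feature_extractor, base_image, and target_image are assumed names for a Keras model exposing penultimate-layer features and two batched image tensors of shape (1, H, W, C):

import tensorflow as tf

def craft_poison(feature_extractor, base_image, target_image,
                 beta=0.25, lr=0.01, steps=200):
    # Simplified gradient-descent sketch of the feature-collision objective:
    # collide with the target in feature space while staying visually close
    # to the correctly labelled base image
    base_image = tf.convert_to_tensor(base_image, dtype=tf.float32)
    target_features = tf.stop_gradient(feature_extractor(target_image))
    poison = tf.Variable(base_image)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            feature_loss = tf.reduce_sum(
                tf.square(feature_extractor(poison) - target_features))
            image_loss = beta * tf.reduce_sum(tf.square(poison - base_image))
            loss = feature_loss + image_loss
        grads = tape.gradient(loss, [poison])
        optimizer.apply_gradients(zip(grads, [poison]))
    return poison.numpy()

Poison images crafted this way are added to the training set with their true labels, so they pass human inspection, yet a model retrained on them misclassifies the specific target as the base class.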
The paper...