Data poisoning
We normally think of data poisoning as a risk to our own environment. In practice, an attacker often faces fewer obstacles by poisoning a dataset and distributing it through the supply chain than by breaching our environment directly. In this section, we will explore this attack vector.
Supply chain risks
Just as they leverage pre-trained models, data scientists use public and third-party datasets to train their models. This is often essential, since ML depends on large volumes of data that are hard to produce. While these datasets offer convenience and cost efficiency, they pose a significant risk: they can be compromised at the source or during distribution, resulting in poisoned data. We covered the dangers of poisoning in the previous chapter, but as a reminder, an attacker may aim for any of the following:
- Compromised integrity: If the dataset is tampered with, its contents can no longer be trusted, leading to unreliable or biased ML models
- Bias and backdoors: A poisoned dataset can introduce bias into...