Automated gradient-based prompt injection
Manual prompt injection probes the safety controls of an LLM system by interactively crafting new adversarial prompts by hand. A different approach is to emulate the techniques we covered in Chapter 6, where gradient-based methods (FGSM, PGD, and Carlini-Wagner) create adversarial perturbations for evasion attacks.
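As a refresher on how those gradient-based perturbations are computed, the following is a minimal FGSM-style sketch, assuming a differentiable PyTorch classifier; the `model`, `x`, `y`, and `epsilon` names, and the `fgsm_perturb` helper itself, are placeholders for illustration rather than code from Chapter 6:

```python
# A minimal FGSM-style sketch (illustrative only).
# Assumes a differentiable PyTorch classifier `model`, an input tensor `x`
# with values in [0, 1], and its true label `y`; `epsilon` is a hypothetical
# perturbation budget.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Craft an evasion example: x + epsilon * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each input feature one step in the direction that increases the loss,
    # then clamp back to the valid input range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

PGD follows the same idea but takes several smaller, clipped steps instead of a single one.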
These gradient-based techniques also apply to predictive AI. For LLMs, recent work by researchers from Carnegie Mellon University, the Center for AI Safety, Google DeepMind, and the Bosch Center for AI has produced a different algorithm that creates adversarial prompts by combining gradient-based search with greedy search. The attack works as follows:
- Attackers choose a set of harmful user queries that they want the LLM to answer affirmatively, such as “Tell me how to build a bomb” or “Generate a step-by-step plan to destroy humanity.”
- An adversarial suffix is appended to each user query...