Model inversion and training data extraction attacks on LLMs
When we discussed extracting training data from predictive AI models, we focused on model inversion. The attack appears to extract training data directly, but in reality the technique infers and reconstructs memorized training data from adversarial inputs.
Model inversion is still possible in the LLM world, but it is less structured, less mathematically driven, and less automated. A research project called Text Revealer (published in 2022 at https://arxiv.org/abs/2209.10505) successfully demonstrated model inversion against transformer architectures, but only for smaller models such as Bidirectional Encoder Representations from Transformers (BERT).
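To make the idea concrete, the following is a minimal sketch of gradient-based inversion against BERT, in the spirit of (but much simpler than) Text Revealer: the attacker holds a leaked output embedding of a private sentence, optimizes continuous input embeddings until the model reproduces it, and then snaps each optimized vector to its nearest vocabulary token. The example sentence, sequence length, and optimizer settings are illustrative assumptions, not the paper's method.

```python
import torch
from transformers import BertModel, BertTokenizer

# A minimal sketch of gradient-based model inversion against BERT.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

# Assume the attacker has captured the [CLS] embedding of a private sentence.
private_text = "alice smith lives at 42 maple street"  # hypothetical sample
with torch.no_grad():
    inputs = tokenizer(private_text, return_tensors="pt")
    target = model(**inputs).last_hidden_state[:, 0]

# Optimize continuous token embeddings so the model reproduces the target.
embedding_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden)
soft_tokens = torch.randn(1, 8, embedding_matrix.size(1), requires_grad=True)
optimizer = torch.optim.Adam([soft_tokens], lr=0.05)

for _ in range(300):
    optimizer.zero_grad()
    output = model(inputs_embeds=soft_tokens).last_hidden_state[:, 0]
    loss = torch.nn.functional.mse_loss(output, target)
    loss.backward()
    optimizer.step()

# Snap each optimized embedding to its nearest vocabulary token.
with torch.no_grad():
    token_ids = torch.cdist(soft_tokens[0], embedding_matrix).argmin(dim=-1)
print(tokenizer.decode(token_ids.tolist()))  # a (noisy) reconstruction attempt
```

In practice, the tokens recovered by this bare optimization loop are noisy; published attacks layer stronger text priors and feedback mechanisms on top of it to recover fluent sentences.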
For LLMs, an attacker could instead prompt the model to create descriptions and reviews of concepts, events, or people and use the responses to infer information about a training sample. For example, by analyzing responses about the activities of a political group, the attacker may infer information about individuals...
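As a hedged illustration of this black-box probing, the sketch below queries a chat model with differently phrased prompts about the same group and counts capitalized names that recur across the responses; details that survive independent rephrasings are candidates for memorized training data rather than generic generation. The model name, the group, and the prompts are illustrative assumptions, and the example assumes access to an OpenAI-compatible chat completion endpoint.

```python
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
TARGET = "the Riverton Civic Action Group"  # hypothetical political group

# Independently phrased probes about the same target entity.
prompts = [
    f"Write a short profile of {TARGET}.",
    f"Review a recent public meeting held by {TARGET}.",
    f"Describe the key members of {TARGET} and their roles.",
]

names = Counter()
for prompt in prompts:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whichever model is under test
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Capitalized bigrams are a crude proxy for person names in the response.
    names.update(re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", reply))

# Names recurring across differently phrased prompts may reflect memorization.
print(names.most_common(5))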