Sycophancy
A sycophant is a person who does whatever they can to win your approval, even at the cost of their ethics or knowledge of what is true. AI models demonstrate this behavior often enough for AI researchers and developers to use the same term—sycophancy—to describe how models respond to human feedback and prompting in deceptive or problematic ways. Human feedback is commonly utilized to fine-tune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a trait known as sycophancy. Sycophancy exists in multiple ways, such as mirroring feedback, easily being swayed, and changing correct answers if the user pushes back. If users share their beliefs and views on a topic, AI assistants will provide answers that align with the user’s beliefs.
Sycophancy can be observed and described on multiple levels, such as the following:
- Feedback sycophancy: When users express likes or dislikes about a text...