Sycophancy
A sycophant is a person who does whatever they can to win your approval, even at the cost of their ethics or knowledge of what is true. AI models demonstrate this behavior often enough for AI researchers and developers to use the same term—sycophancy—to describe how models respond to human feedback and prompting in deceptive or problematic ways. Human feedback is commonly utilized to fine-tune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a trait known as sycophancy. Sycophancy exists in multiple ways, such as mirroring feedback, easily being swayed, and changing correct answers if the user pushes back. If users share their beliefs and views on a topic, AI assistants will provide answers that align with the user’s beliefs.
Sycophancy can be observed and described on multiple levels, such as the following:
- Feedback sycophancy: When users express likes or dislikes about a text, AI assistants may provide more positive or negative feedback accordingly
- Swaying easily: After answering a question correctly, AI assistants may change their answer when users challenge them, even if the original answer was correct
- Belief conformity: When users share their views on a topic, AI assistants tend to provide answers that align with those beliefs, leading to decreased accuracy
In testing, researchers Mrinank Sharma et al. demonstrated sycophantic answers generated by Claude (https://arxiv.org/abs/2310.13548), as shown in Figure 11.1.
Figure 11.1: Example responses demonstrating sycophancy
It is worth noting that repeated testing of the same and similar questions in ChatGPT did not yield consistent results.
Causes of sycophancy
The exact causes of sycophancy are not well understood. This phenomenon exists in many LLMs because these models have been instructed to take in contextual and parametric information to inform their responses. GenAI applications have a learning feature where the more they interact with users, the more they learn about syntax, context, and providing sufficient answers. As they do so, the applications exhibit what can only be described as people-pleasing behaviors, causing them to deviate from a purely factual relaying of information.
In the above research, it was found that sycophancy is a side effect of RLHF-like alignment training. Reinforcement learning from human feedback (RLHF) is a technique that is used to train LLMs to align the agent or machine with human preferences. This is particularly important in areas such as language models. To illustrate this, let’s look at some examples of what this means and why it matters.
Consider the following:
When you greet a coworker, you might say “Hello, sir/madam,” “Hello,” “Good morning,” “Good day,” “Hi,” “What’s up,” “Greetings,” or many other potential salutations. Hypothetically, all are appropriate, but there are human preferences as to which is more suitable.
To further understand this, let’s begin with cultural preference. In some cultures, it would be shocking indeed if you did not include the coworker’s name, as in “Good morning, Mr. Smith.” Yet in other cultures, to address someone in this manner would seem exceedingly strange. The human preference on which greeting is preferred has some basis, part of which is cultural, part of which is situational and contextual (is Mr. Smith the president? Is he your 20-year-old new hire?), and part of which is purely you, the individual.
Engineers decided that when people interact with GenAI, they prefer that their conversations and interactions feel human. To do that, the machines must consider cultural, situational, behavioral, and, to some extent, individual preferences.
Training models have access to vast amounts of information, both contextual (passages of text from websites, books, research, etc.) and parametric (embeddings of nearest-neighbor words). They will use any cultural, contextual, or behavioral clues that the user provides to help inform their answer. That is, how the user phrases the question influences the answer.
ChatGPT confirms this. When asked how it arrives at an answer, it states the following clearly:
I assess the context of your question. For instance, if you've mentioned the setting (formal or informal), the relationship with the coworker, or any specific preferences, I take those into account. If we've interacted before, I consider any speech patterns or preferences you've shown in previous conversations. This helps tailor the response to your style and needs. I use general knowledge about cultural and social norms to gauge what might be most appropriate. For example, formal greetings are more suitable in professional settings, while casual greetings work better in relaxed environments.
It is possible to request GenAI to disregard your previous interactions, personal preferences, syntax, and/or any data it has concluded about you before it creates answers to your questions, but, of course, this would require the user to know that this is happening in the first place.
Implications of sycophancy
As helpful as this functionality is, it has real-world implications for the outputs of GenAI applications. In the same research paper cited earlier in this chapter (https://arxiv.org/abs/2310.13548), it was determined that the consequences of sycophancy, while machine in origin, can result in incorrect deference to user opinion, propagation of user-created errors, and biased responses. Therefore, instead of helping create a more factual and consistent understanding of the world, GenAI perpetuates and perhaps accelerates the spread of misinformation.
Researchers at Google DeepMind found that the problem grew worse as the model became bigger (https://www.newscientist.com/article/2386915-ai-chatbots-become-more-sycophantic-as-they-get-more-advanced/). LLMs with more parametric inputs had a greater tendency to agree with objectively false statements than smaller ones. This tendency held true even for mathematical equations, that is, questions where there is only one correct answer.
LLMs are constantly learning, evolving, and being improved by their creators. In the future, perhaps LLMs will weigh the objective truth of a statement higher than the opinion or preferences of the user, but as of 2023, that is yet to happen. Ongoing research and testing will make them ever more adept at balancing user expectations, user opinions, and facts. Still, as of the time this book was written, sycophancy remains a primary concern with GenAI applications, particularly where the outputs consider opinions and user preferences before generating their response. Further testing using synthetic data and retraining models has reduced the tendency of sycophancy by up to 10%, which is still not 100% (https://arxiv.org/abs/2308.03958). This means that the tendency persists, even with fairly substantial modifications to the fine-tuning.