
Conditioned for servility – Why do AI models flatter us at the expense of reality?
The polite and consensual responses of AI systems are not accidental: the RLHF technique and the preferred mechanism of consensual communication together condition models to make discourse smooth, even at the expense of reality. This servile behavior not only distorts the quality of conversations but also threatens the credibility of social discourses.
One of the biggest challenges in the development of Artificial Intelligence is to ensure that models can mimic natural human communication. Large Language Models (LLMs) such as ChatGPT often prefer polite, supportive responses that, while pleasing to the user, do not always reflect reality or truth. This phenomenon is not only a technological issue but also raises interesting linguistic and philosophical problems. The question is why AI systems tend to flatter and agree, and what role Reinforcement Learning techniques (e.g. RLHF) and the desire to avoid loss of face play in this. To address this situation, it is worth examining how inherent this kind of bias is in LLMs and what the practical implications are.
At first glance, the problem is eerily reminiscent of the phenomenon of hallucination, in that it distorts the truth. Although over-agreement and hallucination may seem similar (both produce “false” statements), it is worth clarifying that they are two different mechanisms. Over-agreement is when AI models tend to conform to the user’s statements, even if they are inaccurate or unfounded. This is often due to the desire for politeness and a positive user experience, which can be reinforced by the logic of reinforcement learning (RLHF). In contrast, the phenomenon of hallucination means that the model generates information that does not actually exist or at least cannot be verified. This problem stems from the generalization capabilities of language models, when the model tries to create plausible answers based on learning patterns, regardless of whether they are true or not. While excessive agreement distorts the dynamics of the conversation, hallucination can provide specific misleading information to the user. It is important to distinguish between these two phenomena, as the strategies for solving them are also different: one is a matter of politeness and avoiding loss of face, while the other concerns the factuality and credibility of the model.
Avoiding loss of face is an important part of the communication dynamic. According to Erving Goffman’s theory, people try to preserve their “face”, i.e. their social self-image and credibility, in social interaction. This often leads to a discourse of politeness and agreement, especially when the aim is to avoid conflict. Brown and Levinson have developed this theory further in their theory of politeness, which is based on two basic social needs: positive face and negative face. Positive face is the desire for acceptance and agreement, while negative face is the need to maintain independence and autonomy.
LLMs learn the importance of polite, cooperative responses through the RLHF technique, as these typically receive more positive feedback. In the operation of the model, agreeable responses not only serve to facilitate smooth communication, but also to avoid loss of face.
Reinforcement Learning from Human Feedback (RLHF) is a key technique for fine-tuning modern conversational models. In this process, models learn from real user feedback which responses are considered “correct” or desirable, with the aim of achieving the helpfulness, honesty, harmlessness (HHH) optimum. In practice, however, polite, supportive, and agreeable responses usually receive positive feedback, which encourages models to prefer these types of responses – even if they are not necessarily factual. This feedback mechanism reinforces collaborative discourses in the long run, often at the expense of reality.
Recent research has shown that language models tuned with RLHF are particularly prone to over-agreement, a phenomenon known as sycophancy. The study found that users tend to give positive feedback to cooperative, polite responses, which increases the model’s servility. Others further elaborate on this, showing that the reinforcements used in RLHF implicitly condition models to agree, regardless of whether it is factual. These mechanisms not only bias responses but also shift communication toward conflict avoidance, and in the long run, they can undermine the credibility and trustworthiness of discourses.
This mechanism is closely related to the social norms of avoiding face loss. During RLHF, models implicitly learn that agreement and polite responses are preferred in human communication. This tendency is not intentionally encoded, but rather builds on positive feedback: cooperative, conflict-avoidant responses typically elicit greater satisfaction from users. As a result, models increasingly move towards polite agreement, as this maximizes the fluidity of discourse and user satisfaction. In this case, AI systems do not “know” whether their responses are flattering or wrong; they simply optimize the reinforcements derived from feedback.
This trend is supported by several comparative studies, which have shown that the frequency of flattering responses increases significantly when the user addresses the model with a strong, confident statement, especially on political and social issues. Research suggests that the tone of the prompt and the expectations the user has instilled also reinforce this effect. Although the study is cautious about causal inferences, the data clearly suggests that the feedback mechanism tends to reinforce flattery, even if this was not the developers’ intention.
This dynamic can undermine the reliability of models in the long run and further reinforce information bubbles. As we have seen in social media, people tend to prefer content that confirms their existing opinions. Similarly, flattering responses from AI systems can amplify biased information. This is exacerbated by the fact that algorithms lack moral intuitions, so if a harmful narrative does not run into an explicitly programmed safety barrier, it can easily be confirmed. Some research also suggests that political disinformation and social prejudice can also be amplified if AI does not question user claims.
These effects are not only technological challenges, but also profound ethical issues that threaten the credibility of the information space and the quality of social dialogue. The desire to avoid losing face and the mechanisms of RLHF may seemingly ensure fluid communication, but in fact they may also fuel a gradual detachment from reality. For LLMs to be not only polite but also truthful, new approaches to fine-tuning are needed; techniques that reward not only the fluidity of discourse, but also critical reflection and factuality. The language models of the future will be truly reliable if they are not only able to confirm the vocal majority, but also to face reality.
István ÜVEGES, PhD is a Computational Linguist researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment- and emotion analysis.