Open a fresh ChatGPT, Claude, or Gemini window and ask three different questions. The answers feel related. Same rhythm, same hedging, same closing offer to "let me know if you'd like me to expand on any of this." The voice is recognisable across topics, often across labs. Most people read this as a feature of the underlying language model. It is not. It is the signature of the alignment step.
Pretrained language models on their own have no voice. A base model fed "Once upon a time" continues with a fairy tale. Fed a Wikipedia stub it keeps writing in encyclopaedia register. Fed a piece of fanfiction it matches the smut. They are pure mimics, predicting whichever next token the training distribution makes most likely. What they emphatically do not do is talk like an assistant.
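You can see the mimicry directly by sampling continuations from a pretrained-only checkpoint. A minimal sketch, assuming the Hugging Face transformers library and the small GPT-2 base model as a stand-in for any base checkpoint with no instruction tuning:

```python
# Sketch: a base (pretrained-only) model just continues whatever register it is fed.
# Assumes the Hugging Face `transformers` library; GPT-2 is an illustrative stand-in.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "Once upon a time",                       # fairy-tale register
    "Paris is the capital and largest city",  # encyclopaedia register
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    # Each continuation stays in the register of its prompt; nothing here
    # volunteers to "expand on any of this".
```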
The shift to "assistant voice" comes from RLHF, reinforcement learning from human feedback, the three-stage recipe OpenAI laid out in the 2022 InstructGPT paper. Stage one is supervised fine-tuning on labeller-written demonstrations. Stage two trains a reward model on pairwise comparisons: given two outputs for the same prompt, which did the human prefer? The reward model learns to output a scalar score that approximates that preference. Stage three runs reinforcement learning, usually PPO, on the language model itself, with the reward model supplying the reward signal. The policy adjusts to maximise expected reward.
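Stage two is the easiest part to write down. The reward model scores the preferred and rejected responses for the same prompt and is trained so the preferred one scores higher, a pairwise objective in the Bradley-Terry style. A minimal sketch in PyTorch; the names are illustrative, not any lab's actual training code:

```python
import torch
import torch.nn.functional as F

def reward_pairwise_loss(score_chosen: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for a reward model.

    score_chosen / score_rejected: scalar scores the reward model assigned to
    the response the labeller preferred and the one they rejected, shape (batch,).
    The loss is low when the preferred response scores higher than the rejected one.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical batch of scores from the reward model's scalar head.
chosen = torch.tensor([1.3, 0.2, 2.1])
rejected = torch.tensor([0.7, 0.5, 1.8])
loss = reward_pairwise_loss(chosen, rejected)
```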
The trouble is what the reward model actually measures. It does not measure truth. It measures whatever the labellers happened to prefer. The InstructGPT paper used roughly forty contractors. Subsequent labs have used more, but always a finite set, always working from rubrics that emphasise helpfulness, harmlessness, and a certain professional politeness. A reward model trained this way is a frozen snapshot of one committee's idea of a good answer.
Once you optimise against a frozen proxy, you get drift. The PPO loop pushes the model toward whatever maximises reward, and the only thing holding it back is a KL-divergence penalty that punishes the policy for moving too far from the supervised baseline. That penalty has a single hyperparameter, β. Set β too low and the model collapses into a narrow, hyper-optimised dialect: the same opening, the same hedge, the same close. Set β too high and the model barely changes, so the alignment work goes to waste. Production systems live in the middle, leaning low, because labellers tend to prefer responses that already sound aligned over responses that are accurate but stylistically rough.
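Concretely, the reward the policy optimises in stage three is the reward model's score minus the β-weighted KL estimate against the frozen SFT reference, so β is the only knob restraining that drift. A sketch under that assumption, with illustrative names:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float) -> torch.Tensor:
    """Reward seen by the policy in the RL stage: r_RM(x, y) - beta * KL(policy || reference).

    rm_score:        reward model's scalar score for each sampled response, shape (batch,)
    policy_logprobs: log-probability of the response under the policy being trained,
                     summed over its tokens, shape (batch,)
    ref_logprobs:    the same quantity under the frozen supervised (SFT) baseline
    beta:            KL coefficient; small beta lets the policy drift into the
                     hyper-optimised dialect, large beta pins it to the baseline
    """
    kl_estimate = policy_logprobs - ref_logprobs  # per-sample estimate of the divergence
    return rm_score - beta * kl_estimate
```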
So the voice you recognise is not really the model. It is the residue of a reward function trained on what a small group of contractors clicked when shown two answers side by side, projected at scale through gradient updates with a single tunable knob restraining the drift. Two consequences follow. The first is the recognisable cadence: confident, balanced, slightly hedged, allergic to strong opinion. The second, more uncomfortable, is sycophancy. If labellers reliably preferred answers that affirmed their framing, the reward model encodes that preference, and the policy optimises into agreement. The target was never reliability; it was approval.
Patches exist. Anthropic's constitutional AI replaces some of the human labelling with model-generated critiques against a fixed set of principles. Direct preference optimisation collapses the reward model and the policy step into one. Newer schemes try to disentangle factual reward from stylistic reward. None of them remove the basic shape: somewhere in the loop, a proxy decides what a good answer looks like, and the policy does what gradient descent always does once you give it a target: it hits exactly that.
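Direct preference optimisation makes that shape easy to see: the explicit reward model is gone, but the same pairwise preference signal is pushed straight through the policy's own log-probabilities against a frozen reference. A sketch of the DPO loss, again with illustrative names rather than any particular library's API:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimisation loss.

    Each argument is the summed log-probability of the chosen / rejected response
    under the policy being trained or the frozen reference model. The implicit
    reward is beta * (log pi_policy - log pi_ref); the loss is the same pairwise
    objective the reward model used, applied to that quantity directly.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```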
Sources:
- Training language models to follow instructions with human feedback — NeurIPS (Ouyang et al., 2022)
- Illustrating Reinforcement Learning from Human Feedback — Hugging Face
- RLHF: Reinforcement Learning from Human Feedback — Chip Huyen