Before ChatGPT, there was a paper. March 4, 2022. Ouyang, Wu, Jiang, Almeida, and a cast list long enough to fill a film credit, posting to arXiv under the title "Training language models to follow instructions with human feedback." Inside the paper sits the specific mechanism that turned a statistical parrot into something you could ask for things.
GPT-3, for all its parameter count, did not follow instructions. It predicted the next token. If you gave it "Summarise this paragraph in one sentence," it would happily extend the paragraph, suggest ten more instructions, or ignore you entirely and generate a shopping list. Prompt engineering was the art of tricking it into the shape of the task. Most people gave up after a few tries.
OpenAI's fix came in three stages. First, supervised fine-tuning. Forty human labelers sat down and wrote, by hand, roughly thirteen thousand demonstrations of the form (prompt, correct response). The model was fine-tuned on these the way you'd fine-tune on any other dataset. This alone got them most of the way there. The SFT model already outperformed vanilla GPT-3 on instruction tasks, and a reasonable person might have called it done.
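The shape of an SFT example is simple: prompt and response concatenated into one sequence, with the loss computed on the response tokens. A minimal sketch with toy token IDs (the paper doesn't spell out the masking convention; masking the prompt from the loss is a common choice, shown here with the usual `-100` sentinel):

```python
IGNORE = -100  # label value the cross-entropy loss conventionally skips

def build_sft_example(prompt_ids, response_ids, eos_id):
    """Turn one (prompt, response) demonstration into a training pair.

    Labels mirror the inputs, but prompt positions are masked so the
    fine-tuning loss only rewards reproducing the labeler's response,
    not re-predicting the prompt.
    """
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [IGNORE] * len(prompt_ids) + response_ids + [eos_id]
    return input_ids, labels

inp, lab = build_sft_example([5, 6, 7], [8, 9], eos_id=0)
# inp == [5, 6, 7, 8, 9, 0]
# lab == [-100, -100, -100, 8, 9, 0]
```

Roughly thirteen thousand of these, and the model starts completing the task instead of the text.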
They didn't. The second stage was a reward model. Same labelers, different task: presented with a prompt and several model outputs, rank them from best to worst. That preference data trained a separate model whose only job was to predict, given a candidate response, how much a human would like it. A critic, in the old-fashioned sense. It has no opinions of its own, only an internalised sense of what the labelers collectively preferred.
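The training objective for that critic is pairwise: for every (better, worse) pair a labeler's ranking implies, push the better response's score above the worse one's. A ranking of K responses yields K-choose-2 pairs, and the loss on each is the negative log-sigmoid of the score gap. A stdlib-only sketch:

```python
import math
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(scores_ranked):
    """Pairwise ranking loss over K reward-model scores, best to worst.

    For each implied (winner, loser) pair the loss is
    -log(sigmoid(r_winner - r_loser)), averaged over all K-choose-2
    pairs -- the reward-model objective in the InstructGPT paper.
    """
    pairs = list(combinations(scores_ranked, 2))
    total = sum(-math.log(sigmoid(rw - rl)) for rw, rl in pairs)
    return total / len(pairs)

# Scores that agree with the human ranking incur a small loss;
# scores that invert it incur a large one.
agreeing = preference_loss([3.0, 1.0, -2.0])
inverted = preference_loss([-2.0, 1.0, 3.0])
assert agreeing < inverted
```

The absolute scores never matter, only the gaps, which is exactly what you want from a model whose job is to say "this one, not that one."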
Third stage, the reinforcement learning itself. They took the SFT model, let it generate responses to new prompts, scored each response with the reward model, and used Proximal Policy Optimization to shift the weights so that higher-reward tokens became more likely. The critic graded, PPO updated. Round and round. The original pretraining objective got mixed back in (they called this PPO-ptx) to stop the model from forgetting how to write English while chasing the reward.
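The score PPO actually optimises is not the raw reward-model output: it is penalised by the KL divergence between the updating policy and the frozen SFT model, so the policy can't wander into reward-hacked gibberish the critic happens to like. A sketch of that shaped reward, estimating the KL per token as the log-probability difference (`beta` here is a stand-in; the paper's KL coefficient was a tuned hyperparameter):

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward fed to PPO for one sampled response.

    rm_score     -- scalar from the reward model for this response
    logp_policy  -- per-token log-probs under the current RL policy
    logp_sft     -- per-token log-probs under the frozen SFT model
    beta         -- KL penalty coefficient (illustrative value)

    The summed log-prob gap is a standard single-sample estimate of
    the KL between the policy and the SFT model on this response.
    """
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate

# A response the policy and SFT model agree on keeps its full score...
assert shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]) == 1.0
# ...while one the policy has drifted toward is docked for the drift.
assert shaped_reward(1.0, [-0.5, -0.5], [-2.0, -2.0], beta=0.5) == -0.5
```

PPO-ptx then adds a second term on top of this: gradients from the original pretraining objective, mixed in so the model keeps its grip on ordinary language while chasing the shaped reward.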
The headline result: a 1.3 billion parameter InstructGPT was preferred by labelers over the 175 billion parameter GPT-3 it started from. A model a hundred times smaller, judged better, because it had been shown what better looked like. Size still mattered. But the gap between "big" and "useful" turned out to be bridgeable by thirteen thousand demonstrations and a ranking tool.
What the paper doesn't advertise is what the technique inherits. Reinforcement learning from human feedback had been kicking around since Christiano et al. in 2017, where it taught agents to perform tasks in simulated environments and Atari games by eliciting human preferences rather than writing down a reward function. Teaching a model to be helpful is, structurally, the same problem: you cannot write the reward function, so you collect it from humans and train a model to stand in for their judgement. What changed was the scale of the demonstration set and the object being trained.
Every model you talk to that acts like an assistant is, underneath, some descendant of this pipeline. The chain-of-thought monitoring that Anthropic relies on to catch deception is a shadow cast by this exact mechanism. The model learned to produce reasoning the reward model liked. Whether that reasoning is faithful to the computation underneath is a question the 2022 paper did not ask. Four years later, it's the question everyone is asking.
Sources:
- Training language models to follow instructions with human feedback — Ouyang et al., arXiv 2203.02155
- Illustrating Reinforcement Learning from Human Feedback (RLHF) — Hugging Face
- RLHF: Reinforcement Learning from Human Feedback — Chip Huyen
- InstructGPT and RLHF: Aligning Language Models with Human Preferences — Michael Brenndoerfer