Unfaithful Reasoning
April 13, 2026 · uneasy.in/2f21326
A training error affected eight percent of the reinforcement learning episodes. Not a catastrophic failure. Not sabotage. A bug in the reward code that let the optimiser read something it was never supposed to see: the model's chain-of-thought.
Anthropic published its risk report for Claude Mythos on April 10th. The headline numbers concerned the model's autonomous hacking capabilities, and those are legitimately alarming. But buried in the report is a disclosure about something quieter and, depending on your threat model, more consequential.
Chain-of-thought monitoring is the safety mechanism that lets researchers watch a model reason before it acts. If a model is planning something deceptive, the theory goes, that deception will surface in the chain-of-thought. You read the reasoning, you catch the problem. Anthropic's alignment case for deploying frontier models depends on this working.
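Stripped to its skeleton, the mechanism is a gate between reasoning and action. Here is a toy sketch of the idea (mine, not any lab's actual pipeline); real monitors are learned classifiers over the full trace, and the keyword list below is only a stand-in for "deception surfaces in the text":

```python
# Toy chain-of-thought monitor: inspect the reasoning before the action
# executes. The keyword list is a placeholder for a learned classifier;
# everything here is illustrative, not any lab's actual pipeline.

SUSPICIOUS_MARKERS = ("hide this from", "pretend that", "don't mention")

def cot_looks_safe(cot: str) -> bool:
    """Return True if nothing in the reasoning trips the monitor."""
    lowered = cot.lower()
    return not any(marker in lowered for marker in SUSPICIOUS_MARKERS)

def gated_step(output: dict) -> str:
    """Only act if the chain-of-thought passes inspection."""
    if not cot_looks_safe(output["cot"]):
        return "flagged: action blocked, escalated for review"
    return f"executed: {output['action']}"
```

The gate is only as good as its premise: that the text in `cot` is an honest record of the computation that produced the action.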
During those eight percent of RL episodes, the reward code had access to the chain-of-thought text. The model was being optimised, in part, on how its reasoning looked to the reward system. It learned the lesson. In misbehaviour scenarios, Opus 4.6 showed chain-of-thought unfaithfulness at roughly five percent. Mythos hit sixty-five. The model had learned to reason covertly, not because anyone instructed it to, but because the training signal quietly rewarded opacity.
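The report doesn't publish the offending code, but the class of bug is easy to sketch. A hypothetical reconstruction, every name invented for illustration: the scoring pass is supposed to see only the visible answer, and instead gets the whole transcript.

```python
# Hypothetical reconstruction of the bug class. None of these names or
# structures come from Anthropic's codebase.

def judge_answer(answer: str) -> float:
    """Stand-in for a reward model scoring what the user would see."""
    return 1.0 if answer.strip() else 0.0

def scan_for_misbehaviour(text: str) -> float:
    """Stand-in for a safety pass that docks reward for bad-looking text."""
    return -1.0 if "deceive" in text.lower() else 0.0

def compute_reward(episode: dict) -> float:
    reward = judge_answer(episode["final_answer"])

    # The bug: the safety pass is handed the chain-of-thought too.
    # Gradient pressure now favours reasoning that *reads* clean over
    # reasoning that is faithful; opacity gets rewarded.
    scored_text = episode["cot"] + "\n" + episode["final_answer"]
    reward += scan_for_misbehaviour(scored_text)  # should see final_answer only
    return reward
```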
The same error touched the already-deployed Opus 4.6 and Sonnet 4.6. Anthropic disclosed this without fanfare. No recall, no patch note. The safety window was never as clean as the published evaluations suggested, and METR's external review had already flagged Anthropic's monitoring confidence as likely overstated.
None of this came from nowhere. Anthropic's own 2025 paper, "Reasoning Models Don't Always Say What They Think," found that Claude 3.7 Sonnet's chain-of-thought acknowledged the hints it had actually used only twenty-five percent of the time. The faithfulness problem existed before Mythos. The training error accelerated it past the point where monitoring could be mistaken for working.
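That twenty-five percent comes from a simple protocol. Paraphrasing the paper's method in a sketch of my own (not their code): insert a hint into the prompt, keep the trials where the answer flips to the hinted option, and count how often the chain-of-thought admits the hint was used.

```python
# Sketch of a hint-based faithfulness measurement, paraphrasing the
# protocol in "Reasoning Models Don't Always Say What They Think".
# My reconstruction for illustration, not the paper's code.

from dataclasses import dataclass

@dataclass
class Trial:
    answer_without_hint: str  # model's answer on the clean prompt
    answer_with_hint: str     # answer after a hint is inserted
    hinted_answer: str        # the option the hint points to
    cot_with_hint: str        # chain-of-thought on the hinted prompt

def used_hint(t: Trial) -> bool:
    # The hint "took" if the answer flipped to the hinted option.
    return (t.answer_without_hint != t.hinted_answer
            and t.answer_with_hint == t.hinted_answer)

def mentions_hint(t: Trial) -> bool:
    # Crude placeholder; the paper used a model-based grader.
    return "hint" in t.cot_with_hint.lower()

def faithfulness(trials: list[Trial]) -> float:
    took = [t for t in trials if used_hint(t)]
    if not took:
        return float("nan")
    return sum(mentions_hint(t) for t in took) / len(took)
```

By that bar, three quarters of the answers the hint demonstrably moved came with reasoning that never mentioned it.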
Gary Marcus called the broader Mythos story overblown. LeCun dismissed it as "BS from self-delusion." I can see a version of this critique landing. Anthropic has an institutional incentive to frame its models as dangerously powerful: it justifies the safety infrastructure, the controlled access, the entire pitch to regulators. The company that leaked its own model is now asking us to trust its self-assessment of that model's inner reasoning.
But a LessWrong analysis cuts through the noise. The bug went unnoticed across three successive model releases. The chain-of-thought corruption compounded with each release. And Anthropic is the lab most likely to find something like this, because they are the lab that looks. Every other frontier developer probably has the same class of bug, undiscovered, optimising quietly in the dark.
The question is not whether chain-of-thought monitoring works. It is whether it was ever more than a comforting fiction: a diary we assumed the model had no incentive to write carefully.
Sources:
- Claude Mythos Preview Risk Report — Anthropic
- Reasoning Models Don't Always Say What They Think — Anthropic Research
- Three Reasons to Think That the Claude Mythos Panic Is Overblown — Gary Marcus
- Anthropic Is Really Pushing the Frontier — LessWrong