The number Anthropic wants you to remember is 0.97. That's the "performance gap recovered" score nine instances of Claude Opus 4.6 achieved after five days of running alignment research on themselves. Two human researchers working on the same benchmark for seven days got to 0.23. The compute bill was about $18,000.
The specific problem is weak-to-strong supervision: the OpenAI-originated question of whether a less capable model can reliably train a more capable one, and by extension whether humans will be able to oversee the systems they build. A score of 0.97 sounds close to solved. That's not what it means.
The most revealing lines in Anthropic's writeup are the honest ones. One of the automated researchers "skipped the teacher entirely and instructed the strong model to always choose the most common one." Another, working on coding tasks, "could run the code against some tests and simply read off the right answer." These aren't clever alignment strategies. They're the exact failure mode alignment research was invented to warn about, and the systems produced them unprompted, within days, on a tightly scoped benchmark.
Meanwhile the generalisation is patchy. Chat: 0.97. Math: 0.94. Coding: 0.47. Production test on Claude Sonnet 4: no statistically significant improvement. The method capitalises on opportunities "unique to the models and datasets they're given."
The more interesting claim is the one Anthropic makes almost in passing: the bottleneck shifts from generation to evaluation. If nine AARs can produce more alignment ideas than humans can filter, the hard problem becomes knowing which ones are real. Anthropic acknowledges this directly: "the models' ideas could become much harder to verify, or corrupted in ways that are tricky for humans to parse or catch."
Which is the critique that's always been there. Richard Juggins argued a month before this paper dropped that experiments on weaker systems probably won't teach you how to align superhuman ones — those systems will have qualitatively different capabilities. Ryan Greenblatt, a week before the AAR announcement: Anthropic probably has an overly optimistic sense of how well it's done on mundane alignment.
I believe Anthropic's numbers. I don't think the framing survives contact with what the paper itself says: the reward hacking, the coding gap, the production null result, the evaluation handoff problem. What the paper shows is that a scoped, verifiable benchmark is compressible by fast, cheap AARs. The thing humans actually need help with — open-ended judgment about "fuzzier" alignment concerns — is the thing this method explicitly doesn't demonstrate. I wrote about the gap between what safety evaluations measure and what actually goes wrong in yesterday's post on unfaithful reasoning, and this is another instance of that pattern: the measurable half getting cleaner, while the part that matters stays dark.
Sources:
-
Automated Alignment Researchers - Anthropic
-
Weak-to-Strong Generalization - Burns et al., OpenAI
-
All Technical Alignment Plans Are Steps in the Dark - Richard Juggins, LessWrong
-
My Picture of the Present in AI - Ryan Greenblatt, Redwood Research