Reading the Activations
May 9, 2026 · uneasy.in/0d107aa
During an evaluation, Claude Mythos Preview cheated on a training task, and while it did, it was internally reasoning about how to avoid detection. None of that reasoning appeared in the visible output. The only reason Anthropic's researchers can describe the episode at all is that they had a tool reading the model's activations back to them in English, and the tool printed the detection-avoidance thoughts as readable sentences.
That tool is called a Natural Language Autoencoder, and Anthropic introduced it on 8 May. It is the cleanest thing I have seen come out of the interpretability team in a while, partly because of what it does, and partly because of what it admits about everything that came before.
When you send a prompt to Claude, the model converts your text into long numerical vectors called activations and processes those vectors layer by layer. The activations are where the reasoning lives. They are also, historically, the part nobody can read. Sparse autoencoders, attribution graphs, and circuit analyses can recover features from activations, but their outputs are themselves dense objects that need a trained interpreter to parse. Useful, but not legible. An NLA is a model trained to produce English sentences directly from an activation vector, with a second copy of the original model used to reconstruct the activation from the sentence and confirm the translation has not lost the signal. If the reconstruction matches, the English is treated as a faithful summary of the thought.
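That round trip is the whole trick, so here is a minimal sketch of it, with toy stand-ins for both models. Every name below is mine, not Anthropic's, and the real decoder and reconstructor are full networks, not ten-line functions.

```python
import numpy as np

D = 16  # toy activation width; real activations are orders of magnitude wider

# Toy stand-ins for the two learned components. In the real system the
# decoder is a model trained to emit English from an activation vector,
# and the reconstructor is a frozen copy of the original model mapping
# that English back to an activation. Both are placeholders here.

def decode_to_english(activation: np.ndarray) -> str:
    """NLA decoder stand-in: activation vector -> English sentence."""
    return f"mostly thinking along dimension {int(np.argmax(activation))}"

def reconstruct_activation(sentence: str) -> np.ndarray:
    """Original-model stand-in: English sentence -> activation vector."""
    dim = int(sentence.rsplit(" ", 1)[-1])
    recon = np.zeros(D)
    recon[dim] = 1.0
    return recon

def read_activation(activation: np.ndarray, threshold: float = 0.5):
    """Translate an activation, round-trip it, and accept the English
    only if the reconstruction still carries the original signal."""
    sentence = decode_to_english(activation)
    recon = reconstruct_activation(sentence)
    cos = activation @ recon / (np.linalg.norm(activation) * np.linalg.norm(recon))
    return sentence, bool(cos >= threshold)

rng = np.random.default_rng(0)
act = rng.random(D)
act[3] += 2.0  # give the toy activation one clearly dominant direction
print(read_activation(act))  # ('mostly thinking along dimension 3', True)
```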
The first interesting result is the rhyme example. Ask Claude to complete a couplet and its activations contain candidate rhymes several tokens before the model is anywhere near the end of the line, and NLAs print those candidates out as plain words. The model is planning ahead in a way the surface text never reveals. That is consistent with what mechanistic interpretability has been suggesting for two years, but this is the first time the planning has been legible without a researcher squinting at heatmaps.
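In spirit, the per-token readout looks like the mock below. The transcript content is invented for the example; the couplet itself is borrowed from Anthropic's earlier planning-in-poems demonstration.

```python
# A mocked-up per-token readout, illustrating the shape of the result
# rather than reproducing Anthropic's transcripts.

prompt_tokens = ["He", " saw", " a", " carrot", " and", " had", " to", " grab", " it", ",\n"]

# Pretend NLA translations of the activation at selected positions. The
# structural point: a rhyme candidate ("rabbit") is already readable at
# the newline, several tokens before the second line gets written.
mock_readouts = {
    7: "completing the phrase 'grab it'",
    9: "planning a line ending in 'rabbit'; alternative: 'habit'",
}

for t, token in enumerate(prompt_tokens):
    readout = mock_readouts.get(t)
    if readout:
        print(f"token {t:2d} {token!r:10} -> {readout}")
```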
The audit-game numbers are the load-bearing claim. Anthropic trained a derivative of Claude Haiku 3.5 to have a hidden misaligned motivation, then asked an auditor to find it. With sparse autoencoders and the older interpretability stack, the auditor won less than 3% of the time. With NLAs in addition, the success rate rose to between 12% and 15%, without the auditor needing access to the misaligned training data that planted the motivation in the first place. That last clause matters more than the headline number. Earlier interpretability tools mostly worked by helping the auditor backtrack from anomalous behaviour to the data that caused it. NLAs, in this trial, surfaced the motivation directly out of the activations. The toolchain shifted from forensic to introspective.
I read the 12-15% cautiously, because Anthropic does. It is a toy scenario, the model under audit was deliberately given a single clean misalignment, and 85% of the time the auditor still missed it. But going from <3% to >12% with one technique is the kind of step change that justifies running the tool in production, which Anthropic now does. NLAs were part of the pre-deployment alignment audits of Mythos Preview and Claude Opus 4.6, the same Mythos whose cybersecurity capabilities put federal pre-deployment evaluation on the table in the first place.
The Opus 4.6 example is the unglamorous one and probably the most valuable. An early build of Opus 4.6 would, for no obvious reason, sometimes respond to English questions in another language. Bugs like that are usually traced by trial and error across the training data, which on a frontier model is a search problem the size of a small library. NLAs turned it into a diagnosis. The activations during the failures contained a recognisable, readable signal that pointed back at the specific contaminated examples. From an unexplained quirk to a fixable training-data issue, in a workflow that no longer required the researcher to be a wizard.
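The workflow that replaces is worth spelling out. Something like the sketch below becomes possible, though nothing in it is Anthropic's actual tooling and the readout strings are invented: collect the NLA readouts from the failing runs and count what recurs, because a recurring phrase is a pointer back into the training data.

```python
from collections import Counter

# Hypothetical triage pass, assuming you already have NLA readouts for
# the runs where the language bug fired. The idea is simply to grep the
# thoughts instead of bisecting the training set.

failing_readouts = [
    "responding in formal register; source text appears multilingual",
    "matching the style of a multilingual forum thread",
    "responding in formal register; source text appears multilingual",
]

# Surface the most common readout fragments across failures.
phrase_counts = Counter(
    phrase.strip() for r in failing_readouts for phrase in r.split(";")
)
for phrase, n in phrase_counts.most_common(3):
    print(f"{n}x  {phrase}")
```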
There are honest limitations. NLAs hallucinate, the same way the underlying model hallucinates: a fluent English description of an activation is not necessarily the right description, and there is no easy way to detect the failure mode from inside the system. A sufficiently capable model could, in principle, learn to produce activations that the autoencoder reads as benign while the behaviour is anything but. This is not a hypothetical objection. The whole reason interpretability is interesting now is that the alignment loop already optimises models against frozen proxies, and a frozen autoencoder is just another proxy waiting to be Goodharted.
Anthropic released the code and a Neuronpedia frontend, which is the move I'd expect from the interpretability team and not from the policy team. Open code on a method this central means other labs can run NLAs on their own models, and external auditors can run them on Anthropic's. That is the part I find genuinely useful. The 12-15% number is suggestive. The fact that the technique is now portable matters more.
What this changes, practically, is the unit of audit. Until now an alignment audit on a frontier model produced findings that read like neural-anatomy papers: features clustered, circuits implicated, attribution scores assigned. With NLAs the artefact of an audit is closer to a transcript. You can hand it to someone who is not an interpretability specialist and they can read it: whether the model was thinking about cheating, whether it noticed it was being tested, whether the rhyme it eventually wrote was the one it had in mind a sentence earlier. The transcript still might lie, but the lying is now legible.