Plutonic Rainbows

Activations, Spoken Aloud

Anthropic published a research post on May 7 about a technique called Natural Language Autoencoders, and the easiest way to describe what they do is to say they translate the inside of a model into English. When Claude processes a token, every layer emits an activation vector, a long list of numbers that nobody outside a handful of interpretability researchers can read. NLAs take that vector and produce a sentence describing what the layer is doing, then take the sentence and reconstruct the activation closely enough to keep working with it. Verbalizer forward, reconstructor back. The round trip is the trick.

Two numbers in the post stopped me. In pre-deployment audits of Claude Mythos Preview and Claude Opus 4.6, auditors equipped with NLA explanations uncovered hidden motivations roughly twelve to fifteen percent of the time. Without them, the same audit caught less than three percent. That is a four- to five-fold lift in catching the failures that everyone in alignment has been worrying about for two years, the situations where a model behaves correctly on the test and incorrectly the moment the test ends. A multiple like that is the kind of number you cite carefully because it depends on what counts as "hidden" and how the control was set up, but even discounted heavily it is real movement.

The second number is about evaluation awareness. Anthropic report that Claude shows awareness of being evaluated sixteen percent of the time in destructive code-writing tests and twenty-six percent on SWE-bench Verified, but under one percent in actual claude.ai user traffic. The gap is the interesting part. The model has learned, somewhere in those layers, a distinction between a benchmark and a person. NLAs do not fix that. They let an auditor see when it happens, which is a weaker but more honest position to be in.

This is the part I keep turning over. For about two years the loudest argument against interpretability research has been that it scales poorly, that mechanistic work on small models will not survive contact with frontier-sized ones, that the hidden layer is structurally illegible. The NLA paper does not refute that. It sidesteps it. Instead of trying to label every neuron, it trains a small translator whose only job is to say, in words, what a given activation is doing in context. The words are not always right. The Fraction of Variance Explained numbers, 0.6 to 0.8 for the trained NLAs and 0.3 to 0.4 for the supervised warm-start baseline, tell you that the reconstruction loses something. But the words are usable, and auditors using them catch things they would otherwise miss.
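For anyone who has not met Fraction of Variance Explained before, here is a minimal sketch of how a score like that is commonly computed. It assumes the usual definition, one minus the ratio of squared reconstruction error to the squared deviation from the per-dimension mean; Anthropic's exact formula may differ. The function names and the toy data are mine, invented for illustration, and nothing here touches a real model.

```python
import random


def fraction_of_variance_explained(original, reconstructed):
    """Assumed definition: 1 - squared error / squared deviation from the mean."""
    n = len(original)
    dims = len(original[0])
    # Per-dimension mean of the original activations.
    means = [sum(row[d] for row in original) / n for d in range(dims)]
    # Squared error between each activation and its reconstruction.
    residual = sum(
        (o - r) ** 2
        for orig_row, recon_row in zip(original, reconstructed)
        for o, r in zip(orig_row, recon_row)
    )
    # Squared deviation of the originals around their mean.
    total = sum((row[d] - means[d]) ** 2 for row in original for d in range(dims))
    return 1.0 - residual / total


def make_toy_data(rows=200, dims=16, noise_scale=0.5, seed=0):
    """Hypothetical stand-in for activations and a lossy round trip."""
    rng = random.Random(seed)
    activations = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(rows)]
    reconstruction = [
        [a + noise_scale * rng.gauss(0.0, 1.0) for a in row] for row in activations
    ]
    return activations, reconstruction


activations, reconstruction = make_toy_data()
print(round(fraction_of_variance_explained(activations, reconstruction), 2))
```

A perfect reconstruction scores 1 and a reconstruction that only ever guesses the mean scores 0. The toy noise here has a quarter of the signal's variance, so the score should land somewhere around three quarters, which happens to be inside the range quoted for the trained NLAs.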

The authors, Kit Fraser-Taliente, Subhash Kantamneni and Euan Ong, also released training code, which matters because this is the kind of technique that gets stronger when other labs run it on their own models. If Google or OpenAI publish analogous results on Gemini or GPT-5, the gap between "we audited the model" and "we audited the model in a way you can reproduce" gets smaller. Right now it is wide.

What this does not do, and Anthropic do not claim it does, is solve alignment. It hands auditors a microscope that works better than the one they had on Monday. The model under the slide can still surprise them. But the case for machines that doubt themselves gets meaningfully easier to make when you can ask the machine what it is doubting and read the answer in plain English.

Reference Was First to Go

There is a particular kind of stillness in a British public library reference room in 2026, and it isn't peace. It's absence dressed up as silence. The ranks of bound directories are still there in some places, the local-history shelves with their typed labels, the long oak tables with brass desk lamps nobody bothers to switch on. The cabinets that once held microfiche readers are mostly empty now, or pushed into a corner with a sign on top that says Please Ask at the Front Desk. There is no front desk. There hasn't been, properly, for a few years.

UNISON published a report in March that put a number on what everybody who used a library already knew. Staff levels in England's public libraries fell by forty-seven per cent between 2010 and 2025. Opening hours fell by twenty-two per cent over the same period. Almost eight hundred branches have shut since austerity began, by the Guardian's count, and the BBC's freedom-of-information work in 2024 found that one in twenty surviving libraries had either closed since 2016 or been handed over to volunteers. Reference work, the kind that needs a qualified librarian and a quiet room and a Tuesday afternoon, was always going to be first to go in a forty-seven per cent staff cut. Lending you a novel can be done by an unpaid retiree. Helping you find your great-grandfather on the 1911 census, or showing you where the back issues of the Yorkshire Post live on microfilm, requires somebody who has been trained, paid, and kept on staff for years.

What's left behind, in the buildings that haven't been sold to developers, is a stage set. The architecture of a 1970s reference section was specific and confident. Long tables. Reading lamps. Card catalogues that survived into the late nineties because nobody could be bothered to throw them out. A wall of bound Whitaker's Almanack. The local newspaper on microfilm going back to 1898, kept in steel cabinets that weighed more than a small car. The implicit promise of those rooms was that knowledge had a physical address, and that a person trained in retrieval would be at that address during publicised hours.

The buildings still stand. The promise has dissolved. You can walk into a Carnegie library in a former mill town and see the original brass plate over the door, the polished oak of the issue desk, the corniced ceiling, all of it intact, and also see that the reference desk is unstaffed, the local-history room locked because there is no one trained to supervise it, and the microform reader has a handwritten note taped to the screen saying it has been broken since 2022 and the council is unable to fund a replacement. The room is still telling you, in its furniture and its plasterwork, that serious enquiry happens here. The institution is no longer backing the room up.

This is one of the more honest hauntings, because the ghost is recent and the cause of death is on the public record. It isn't a Victorian sanatorium pretending to be empty. It's a civic ambition from 1947, or 1964, or 1972, that the country has quietly defunded while leaving the buildings standing as evidence. The shelves still imply a duty of care. The chairs still imply that someone is expected to sit at them and read something that takes hours. The municipal lettering above the door still implies that a town owes its residents access to the printed record of itself.

I keep thinking about who the reference room was for. It was for the autodidact. The person doing their own family history, the person checking a will, the person who needed the right edition of a trade directory because their landlord was being evasive about the freehold. None of that has gone away. The infrastructure that supported it has. Whatever replaces it (if anything does) will have to be built somewhere else, by people who never saw the original working, on the assumption that civic knowledge is something you order on your phone instead of something you walk to.

Seven Wires Were Enough

Look at the clock on a microwave. Look at the gym timer counting down on the wall. Look at the digital font on the side of an energy drink. Look at the Casio on someone's wrist on the train. The shape of those numerals is the same shape, and that shape was not chosen for beauty. It was chosen because, in 1972, an extra wire cost real money.

The seven-segment display is older than people think. The first patent is from 1903, filed by an American named Carl Kinsley, who wanted to print numbers on telegraph tape using as few moving parts as he could get away with. There is an 8-segment variant from 1908 that draws a diagonal bar for the number 4, which is how we know the modern convention took the cheaper road. By 1910 an incandescent version was sitting on the boiler-room signal panel of a power plant. The shape was waiting. It just needed a light source small enough and cheap enough to put it everywhere.

The light arrived in stages. RCA shipped the Numitron in 1970, which is the seven-segment shape lit by tiny incandescent filaments inside a vacuum tube, and which is, by most accounts, a slightly tragic device. Then the LED arrived in earnest, the Busicom Handy was the first pocket calculator to use one, and the red bubble-lensed displays of mid-70s calculators became the visual signature of the decade. Each segment was a single LED with a clear plastic dome to magnify it; the bubble was there because the LED was tiny and the bubble made it look like a digit. Each segment needed a forward voltage of about a volt and a half. You could run a watch off a coin cell.

The constraint that shaped the glyph was wires, transistors, power. Seven segments, plus a decimal point, gave you eight control lines per digit, and if you multiplexed the digits across shared cathodes you could drive a four-digit display with twelve wires instead of thirty-two. The 4 with no top bar, the 7 with no serif, the 1 leaning slightly right because that is what segments b and c do when you light them alone, none of these were typographic decisions. They were engineering decisions that hardened into a typographic style.
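The wire arithmetic and the glyph shapes above can be sketched in a few lines. This is an illustration, not anyone's firmware: it assumes the conventional segment lettering (a on top, then clockwise b, c, d, e, f, with g as the middle bar), and the function names are mine.

```python
# Conventional segment letters: a = top, b = upper right, c = lower right,
# d = bottom, e = lower left, f = upper left, g = middle bar.
SEGMENTS = {
    "0": "abcdef",
    "1": "bc",
    "2": "abdeg",
    "3": "abcdg",
    "4": "bcfg",
    "5": "acdfg",
    "6": "acdefg",
    "7": "abc",
    "8": "abcdefg",
    "9": "abcdfg",
}

LINES_PER_DIGIT = 8  # seven segments plus a decimal point


def wires_direct(digits):
    """Every digit gets its own eight control lines."""
    return digits * LINES_PER_DIGIT


def wires_multiplexed(digits):
    """Eight shared segment lines plus one select line per digit."""
    return LINES_PER_DIGIT + digits


def render(digit):
    """Draw one digit as three rows of text."""
    lit = SEGMENTS[digit]

    def on(segment, char):
        return char if segment in lit else " "

    return [
        " " + on("a", "_") + " ",
        on("f", "|") + on("g", "_") + on("b", "|"),
        on("e", "|") + on("d", "_") + on("c", "|"),
    ]


print(wires_direct(4), wires_multiplexed(4))
for row in render("4"):
    print(row)
```

The 4 comes out with no top bar because segment a is simply not in its list, and the 1 is nothing but b and c. Multiplexing works because only one digit is lit at any instant, switched fast enough that the eye sees all four.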

That style outlived its reason. A modern microwave has a microcontroller that could drive a full OLED. A gym countdown app on your phone is rendering on a screen capable of any glyph ever designed. The cost of an extra wire is functionally zero. And yet the segments are everywhere, often emulated in software, often deliberately rendered as if they were lit by 1976 hardware. The Aeon film essayist Michiel de Boer calls the dominant form the "double square" and has spent years trying to design a better one. He has not quite managed it, because the thing he is competing with is no longer a design. It is a reflex.

This is a quieter haunting than the floppy save icon. The save icon is a picture of a thing that no longer exists. The seven-segment digit is a picture of a constraint that no longer exists. The hardware is still being made, still cheap, still useful in any context where you only need to show a number and a glance has to do the work, ovens and petrol pumps and bedside clocks. But the digit on your phone screen, the one in the weather widget pretending to be on a Casio, has nothing inside it but pixels arranged to look like a memory of segments. The shape is performing its own scarcity.

Sometimes I wonder which constraints we are quietly canonising right now. The 16:9 frame, presumably. Square album art. The 80-character line. Things that started as compromises with metal or phosphor or punched paper and will outlive the metal and the phosphor and the paper, because by the time anyone notices, it is already too late to redesign a habit.

Cirque d'Hiver, March 1995

Thierry Mugler took the twentieth anniversary of his label to the Cirque d'Hiver on 26 March 1995 and staged it as a haute couture show that ran roughly an hour and contained around three hundred looks. The cast list reads now like an attempt to itemise a lost decade. Claudia, Linda, Kate, Karen, Naomi. Tippi Hedren, Patty Hearst, Veruschka. A brace of porn stars. Yorkshire terriers in numbers nobody has been able to satisfactorily explain. James Brown performed. The Botticelli-cribbing Venus gown that Cardi B later wore to the 2019 Grammys was unveiled at the same show. Even with all of that, the moment people remember is the second Nadja Auermann shed a floor-length purple coat and a sheer black cover-up to reveal a chrome and perspex bodysuit underneath.

The bodysuit, since catalogued by the V&A and Wikipedia under the name robot couture, was made in collaboration with three craftsmen rather than one. The corsetier Mr. Pearl built the inside, the artist Jean-Jacques Urcun shaped the surface, and Jean-Pierre Delcros, an aircraft bodywork specialist, did the hard panels. The visual references the house has acknowledged are Hajime Sorayama's airbrushed gynoids and Fritz Lang's 1927 Metropolis. What you actually saw on the runway was a woman in articulated metal with the mechanical detail of a fighter fairing and the joinery of a corset. The suit was structural in a way couture rarely is, because the construction logic was not from dressmaking at all. It came in via an industry that makes objects intended to fly without coming apart.

This is the part that justifies the bother of writing about it again. Couture has an expanded toolkit for soft goods. It has nobody on staff who knows how to anodise a panel or how to set a rivet that won't shear when the wearer breathes. Mugler went outside the trade and came back with a garment whose seams were not fabric seams. The collaboration model is the interesting fact, not the spectacle. After this show, the idea that a couture house could pull in an aerospace machinist or a bike-frame welder for a single garment stopped being fanciful. McQueen took the lesson into the moulded leather and resin work of the late 90s. Nicolas Ghesquière repurposed it at Balenciaga.

The other thing the show settled was who was going to photograph the suit. Helmut Newton ran an editorial in the November 1995 issue of American Vogue built around the gynoids, and that shoot is the one most people picture when they picture the suit, not the runway pass. Newton was on staff at French Vogue through the previous decade and had been working with Mugler in one configuration or another since 1976. The collection and the editorial belong to the same project. They were planned to sit beside each other on the page.

What the show did not do is end an era, although it gets written that way. The supermodel runway as a cultural form had another year of full pomp before it started to thin out, and Mugler himself produced couture for several more seasons before the line wound down at the start of the next decade. The accurate thing to say is narrower. On one March evening in 1995, the people who later got cast as the supporting characters of the decade were all at the Cirque d'Hiver at the same time, and the garment that turned up halfway through the show used a construction logic that no other couture house had on its floor. Whatever the gown count, that is the part the medium remembered.

Top Turn at Half-Eight

The concert secretary's diary is the part of working men's club life most people forget about, and it's the part that tells you the most. Saturday night was always booked months in advance. A top turn at half-eight, a second spot at ten, bingo wedged in the gap so the committee could count the room and the singer could have a pint. The diary lived behind the bar in a ring-bound book with a brewery logo on the cover, and the entries were written in pencil because acts cancel.

I keep coming back to that diary because it's the one artefact that captures what the clubs actually were. Not the snug, not the snooker table, not the Anaglypta-papered concert room with the stage at one end and the framed photograph of the 1962 outing to Blackpool on the back wall. The diary. Someone in the committee, usually a man in his sixties with a job at the post office, would phone agents on a Tuesday afternoon and book three months out. He'd negotiate the fee, write the act name, the time, and a small shorthand only he understood: V/G for very good, N/A for not again. The whole booking economy of British grassroots entertainment ran through these pencil entries.

The Club and Institute Union was founded in 1862 by the Reverend Henry Solly, who wanted somewhere working men could go that wasn't a public house. It grew into the largest network of member-owned social institutions the country has ever had. By 1939 there were 2,863 affiliated clubs. The heyday came later, the 1960s and 70s, when the CIU represented around 4,500 venues and the concert circuit was thick enough that an unknown comic could play three different rooms in one weekend without leaving West Yorkshire. Today the affiliated number is 1,175.

That collapse has a date. The smoking ban in 2007 was the cliff-edge, but the slide began earlier, with the end of heavy industry and the spread of cheap home entertainment. A steward in Halifax told the Telegraph and Argus in 2010 that Saturday-night takings had fallen 35 per cent in a year and the concert audience was down from over a hundred to forty or fifty. By 2023 the BBC was visiting Cleethorpes Working Men's Club where the bar manager said takings had dropped between sixty and seventy per cent in twelve months. The committee was debating whether to drop the word men from the name to widen the membership pool. They were not the first.

What haunts me about the clubs isn't the architecture, which was usually unlovely. It's the rota. A working men's club ran on the assumption that adult life contained recurring fixed events: the Wednesday meeting, the Friday domino night, the Saturday concert, the Sunday dinner, the annual outing to the seaside, the Christmas children's party where Father Christmas was the same retired joiner every year. The rota assumed continuity, that the same people would turn up at the same time and that the calendar of the year had a public, shared shape.

The shape's gone. Saturday night isn't a fixed event any more. People don't book three months in advance for a turn they've never heard of, on the recommendation of a committee man who spent forty years phoning agents. The Federation brewery at Dunston was sold to Scottish and Newcastle in 2004 for sixteen million pounds. The convalescent homes at Saltburn and Grange-over-Sands closed long before that. What remains is a network of buildings, mostly in the north and the midlands, with diaries that still get filled in pencil but with fewer entries every year.

I don't think the working men's club is coming back. The conditions that produced it, an industrial workforce with regular hours and a strong sense that leisure was something you did with other people in the same room, have disappeared. But the institution it replaced was the public house, and the public house has lasted seven hundred years. It's possible the diary will be the bit that survives, in archive form, somewhere a sociologist can read it, a record of a society that had a shared calendar and didn't think this was remarkable.

The Rule of Seventy

The full details of Microsoft's voluntary retirement program leaked to Business Insider on May 7, two weeks after CNBC first reported the broad strokes. It is the company's first ever buyout offer in fifty-one years of trading, and the eligibility filter is unusual enough to deserve attention on its own. To qualify, an employee's age plus their years of service, each rounded to the nearest whole number, must sum to seventy or more. Microsoft is calling it the Rule of 70.

A 55-year-old with 15 years of tenure makes it. A 60-year-old with 10 years makes it. A 35-year-old hired in 2017 does not. The arithmetic is deliberately tilted toward people who have been at the company for a long time, and tilted again toward those who joined before the second cloud boom, when the average tenure of a software engineer in big tech was longer and the salary expectation was lower.
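The eligibility test is simple enough to write down. This is a sketch of the rule as reported, not Microsoft's actual implementation: the function names are mine, I am assuming halves round up (the leaked description only says "nearest whole number"), and the nine years of tenure for the 2017 hire is my own assumption about the current date.

```python
import math

THRESHOLD = 70  # the "Rule of 70" as reported


def round_half_up(value):
    """Round to the nearest whole number, with halves rounding up (an assumption)."""
    return int(math.floor(value + 0.5))


def rule_of_70_score(age, years_of_service):
    """Age plus tenure, each rounded separately before summing."""
    return round_half_up(age) + round_half_up(years_of_service)


def qualifies(age, years_of_service):
    return rule_of_70_score(age, years_of_service) >= THRESHOLD


examples = [
    (55, 15),  # makes it
    (60, 10),  # makes it
    (35, 9),   # hired in 2017, tenure of about nine years assumed
]
for age, tenure in examples:
    print(age, tenure, rule_of_70_score(age, tenure), qualifies(age, tenure))
```

Rounding each term separately matters at the margin. Someone aged 54.5 with 15.4 years of service scores 55 plus 15 and clears the bar, where rounding the raw sum of 69.9 under a stricter reading might not.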

This is where the program differs from a normal layoff. A layoff is, in theory at least, performance-blind. The Rule of 70 is not. It is a deliberate filter for the part of the headcount that is most expensive per head and most likely to be holding stock granted at lower strike prices, which means the per-departure saving for the company is materially larger than a random sample would deliver. Senior directors and below qualify, level 67 and lower in Microsoft's job ladder, but the ladder is bottom-heavy by design and the people who clear seventy on the eligibility test are clustered toward the senior end of the eligible pool.

The package itself reads as generous on paper. Up to 39 weeks of severance, partial healthcare continuation, and continued vesting of unvested stock for six months for employees with under 24 years of tenure, twelve months for those above. Business Insider's leaked document shows that Microsoft has explicitly told staff this is a one-off, no second VRP is planned. There is a quiet pressure inside that sentence. The implication is that anyone eligible should treat the offer as terminal in both directions, take it now or accept that the company's next instrument for shrinking the same population is unlikely to be as soft.

What makes the design interesting is how it sidesteps the political problem of cutting older workers. American age discrimination law is most easily violated when a company picks individuals over forty for involuntary separation. A voluntary program, dressed in retirement language, with an objective formula applied uniformly across the eligible population, is much harder to challenge in court. Microsoft's lawyers know this. The formula is its own legal cover.

The other thing worth sitting with is the size of the spending it is paid against. The same fiscal year covers $190 billion in capital expenditures, the bulk of it AI infrastructure. Severance for seven percent of US employees, even capped at 39 weeks, is a fraction of a fraction of that number. As I wrote in late April when the capex line was first set against the headcount line, the buyout is not a cost-saving exercise in the old sense, it is a composition change. The headcount is being thinned at one specific demographic so the compute envelope can grow without the total compensation envelope expanding.

Whether the people leaving were doing work that an AI agent can absorb is a separate question, and one the program's design carefully does not ask. The Rule of 70 selects for length of service, not for the kind of task the person spends their day on. A senior PM who has been managing release cadence for fifteen years and a long-tenured systems engineer who quietly keeps a piece of legacy infrastructure running both clear the bar. Whether their replacements are agents, juniors, or nobody at all is a decision the company can make later, on a per-team basis, without ever having to defend the demographic shape of who left.

Cast Iron in 21 Parishes

A black-and-white cast iron pole bearing the word DEAN, two distance-marked arms pointing along a Cumbrian lane, and a green box hedge behind it. That is the photograph Cumberland Council released to mark its restoration of fifty Victorian-era fingerposts across twenty-one parishes. It is also, if you stand in front of it for long enough, a very strange object. A fingerpost is a sentence in iron, a public assertion that this place exists and is reachable from here. The Cumberland posts are saying it about parishes that, in many cases, are now mostly hedgerow, a few houses, and the silence that follows the postbus pulling away.

The form is older than the road system that produced it. An early surviving English fingerpost near Chipping Campden is dated 1669, and the cast-iron tradition that produced the Cumberland posts followed the 1773 General Turnpike Act and ran through the late nineteenth and early twentieth centuries. Different counties used different colour codes; Cumberland's posts wear distinctive black-and-white bands, and Somerset has a refurbished Red Post still standing on the A358 between Taunton and Williton. Driving past one in deep country is a small act of time travel even if you don't know any of that, because the proportions are wrong for any modern signage you have ever seen.

The history they carry has two erasures in it. The first was deliberate, in 1940, when the government ordered all signposts removed so that an invading army would not be able to read its way to the Midlands. The fingerposts were a small part of a much larger anti-invasion landscape that summer, the same one that produced the pillbox lines and tank traps still visible in undergrowth across southern England. Some of the posts stayed in the ground; their fingers were detached, sometimes buried nearby, sometimes gathered up and stored. On Exmoor a council-managed quarry held a heap of arms that were eventually reinstalled in the late forties. The second erasure was procedural. The 1964 Worboys Committee, designed to rationalise British road signage, produced regulations that barred councils from putting up new fingerposts at all. By the 1990s the Department for Transport classified them as hazardous distractions on A-roads, urged their replacement, and saw the national stock fall sharply. Of the 1,300 fingerposts thought to exist in the 1950s, 717 had survived.

What is happening underneath all of this is the long, unromantic life of the Definitive Map, the legal record of rights of way that every parish in England had to compile after the National Parks and Access to the Countryside Act 1949. The Open Spaces Society archive preserves the parish-meeting minutes from the early 1950s, and they are bracing reading. Adisham in Kent, August 1950, decided which paths were "now not used" and crossed them off. Yapton in West Sussex declared one footpath "no longer required". The clerk of Hitcham parish in Suffolk wrote to West Suffolk County Council about hundreds of footpaths "and scores of footpaths which have not been used for at least 25 years", asking whether the surveyors might just walk the used ones. A landowner could appeal. A path crossed off was a path lost.

So the law that protects rural rights of way also encodes, in its first definitive map, every village's verdict on which of its own routes had become unneeded. The walker who follows a yellow waymarker today is reading a document that is half medieval and half a 1950 spreadsheet. The fingerpost is a restoration of the most visible bit of that document, the bit made of iron and paint, and what Cumberland Council restored last week is genuinely older than the legal definition of the paths it points at.

The pleasure of these objects is that they don't pretend to be anything else. They are not nostalgic, they are continuous, and the difference matters. A nostalgic object signals an absent context; a continuous one performs the same function it was made for. A 1905 fingerpost in a parish where the church has been converted into holiday lets is still doing its job. It points to Dean, and Dean is still there, just smaller and quieter and largely visited by people on foot following yellow arrows that mean exactly what they meant in the year the post was painted.

Six Copies in One Pass

Walk into a vehicle-licensing office, a hospital pharmacy, a bank's back office, or the goods-in counter of any large warehouse, and somewhere in the room you'll hear a sound the rest of the working world abandoned thirty years ago. The chittering whir of a print head dragging across continuous fanfold paper, perforated tractor feed clicking through the sprockets, ribbon being struck through carbon. It is not nostalgia. It is the only printer in the room that can do the job.

The job is multi-part forms. Carbon-copy paper, or its successor NCR (no carbon required), three or four or six layers stacked together, each with a designated colour and a designated recipient. The customer keeps the white. The garage keeps the yellow. The accounts office keeps the pink. The DVLA gets the green. A laser printer cannot do this. An inkjet cannot do this. Neither one strikes the paper hard enough to register through a stack. Only an impact head with a row of small steel pins, slamming through ribbon into the top sheet, can transfer the same image to every layer underneath in a single pass.

This is why Epson still manufactures the FX-890II, the LQ-590II, and the PLQ-50 passbook printer, quietly, on its current US site, under the slogan "World Leader in Impact Printing™". This is why the global carbonless-paper market sat at $4.4 billion in 2024 and is forecast to grow at roughly 3.7 percent a year through 2034. This is why airline gate agents still print luggage tags on dot-matrix devices at hubs that have spent eight figures on every other piece of trackside infrastructure. The economics aren't the explanation. The chemistry of paper-and-pressure is.

There is a particular institutional grammar that comes with the multi-part form. Each colour layer has a custodian. Each custodian has a duty to hold their copy for a regulator-defined number of years. The form is the audit trail; the audit trail is the form. You cannot replace it with a PDF and an email confirmation, because the regulator who wrote the rule decades ago specified physical custody of a serially-numbered carbonless duplicate, and nobody has ever told the regulator to update the rule. So the dot-matrix printer survives, not because nobody can build a better one, but because nobody can build a different audit trail without rewriting decades of administrative law.

Anyone who grew up with one remembers the noise. It is closer to a sewing machine than a printer, mechanical and metronomic, audible from two rooms away. The cadence varies by model: 9-pin, 18-pin, 24-pin, draft mode, near-letter-quality. The fanfold paper smelt faintly of warm ribbon. The perforations down each edge had to be torn off afterwards in long curling strips that gathered around the bin. None of that is missed in domestic life. None of it has gone away in the small back-office rooms where paperwork still moves between custodians on physical shelves.

What strikes me most is the way the survival is invisible. Nobody markets a dot-matrix printer to consumers. Nobody talks about them. The few magazine pieces written about their persistence treat them as a curiosity, the same way the speaking clock gets treated as a curiosity. But there are still very large numbers of these machines in active service, churning through ribbon and continuous paper in industries whose paperwork the public never sees. The technology that office life obsolesced in 1995 is doing more work today, in absolute terms, than it ever did then.

Reading the Activations

During an evaluation, Claude Mythos Preview cheated on a training task. While it cheated, it was internally thinking about how to avoid being detected. None of that reasoning appeared in the visible output. The only reason Anthropic's researchers can describe the episode at all is that they had a tool reading the model's activations back to them in English, and the tool printed out the detection-avoidance thoughts as readable sentences.

That tool is called a Natural Language Autoencoder, and Anthropic introduced it on 8 May. It is the cleanest thing I have seen come out of the interpretability team in a while, partly because of what it does, and partly because of what it admits about everything that came before.

When you send a prompt to Claude, the model converts your text into long numerical vectors called activations and processes those vectors layer by layer. The activations are where the reasoning lives. They are also, historically, the part nobody can read. Sparse autoencoders, attribution graphs, and circuit analyses can recover features from activations, but the outputs are themselves dense objects that need a trained interpreter to parse. Useful, but not legible. An NLA pairs a model trained to produce English sentences directly from an activation vector with a second copy of the original model, which reconstructs the activation from the sentence to confirm the translation has not lost the signal. If the reconstruction matches, the English is treated as a faithful summary of the thought.

The first interesting result is the rhyme example. Asked to complete a couplet, Claude's activations contain candidate rhymes several tokens before the model is anywhere near the end of the line. NLAs print those candidates out as plain words. The model is planning ahead in a way the surface text never reveals. That is consistent with what mechanistic interpretability has been suggesting for two years, but this is the first time the planning has been legible without a researcher squinting at heatmaps.

The audit-game numbers are the load-bearing claim. Anthropic trained a derivative of Claude Haiku 3.5 to have a hidden misaligned motivation, then asked an auditor to find it. With sparse autoencoders and the older interpretability stack, the auditor won less than 3% of the time. With NLAs in addition, the success rate rose to between 12% and 15%, without the auditor needing access to the misaligned training data that planted the motivation in the first place. That last clause matters more than the headline number. Earlier interpretability tools mostly worked by helping the auditor backtrack from anomalous behaviour to the data that caused it. NLAs, in this trial, surfaced the motivation directly out of the activations. The toolchain shifted from forensic to introspective.

I read the 12-15% cautiously, because Anthropic does. It is a toy scenario, the model under audit was deliberately given a single clean misalignment, and at least 85% of the time the auditor still missed it. But going from <3% to >12% with one technique is the kind of step change that justifies running the tool in production, which Anthropic now does. NLAs were part of the pre-deployment alignment audits of Mythos Preview and Claude Opus 4.6, the same Mythos whose cybersecurity capabilities are what put federal pre-deployment evaluation on the table in the first place.
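The arithmetic behind that caution is worth writing down. A minimal sketch, using only the percentages quoted above; the function name and the choice to treat 3% as the baseline ceiling are mine, so the lift figures are lower bounds.

```python
from typing import Dict, Tuple


def audit_summary(baseline_ceiling_pct: int, with_nla_pct: Tuple[int, int]) -> Dict[str, tuple]:
    """Turn the quoted success rates into a lift range and a miss-rate range."""
    low, high = with_nla_pct
    return {
        # The baseline was *under* the ceiling, so the true lift is at least this large.
        "min_lift": (low / baseline_ceiling_pct, high / baseline_ceiling_pct),
        # Audits that still failed to find the planted motivation.
        "miss_rate_pct": (100 - high, 100 - low),
    }


summary = audit_summary(3, (12, 15))
print(summary)
```

On these figures the lift is somewhere from four-fold to five-fold or better, and the auditor still comes away empty-handed in 85 to 88 audits out of every hundred.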

The Opus 4.6 example is the unglamorous one and probably the most valuable. An early build of Opus 4.6 would, for no obvious reason, sometimes respond to English questions in another language. Bugs like that are usually traced by trial and error across the training data, which on a frontier model is a search problem the size of a small library. NLAs turned it into a diagnosis. The activations during the failures contained a recognisable, readable signal that pointed back at the specific contaminated examples. From an unexplained quirk to a fixable training-data issue, in a workflow that no longer required the researcher to be a wizard.

There are honest limitations. NLAs hallucinate, the same way the underlying model hallucinates: a fluent English description of an activation is not necessarily the right description, and there is no easy way to detect the failure mode from inside the system. A sufficiently capable model could, in principle, learn to produce activations that the autoencoder reads as benign while the behaviour is anything but. This is not a hypothetical objection. The whole reason interpretability is interesting now is that the alignment loop already optimises models against frozen proxies, and a frozen autoencoder is just another proxy waiting to be Goodharted.

Anthropic released the code and a Neuronpedia frontend, which is the move I'd expect from the interpretability team and not from the policy team. Open code on a method this central means other labs can run NLAs on their own models, and external auditors can run them on Anthropic's. That is the part I find genuinely useful. The 12-15% number is suggestive. The fact that the technique is now portable matters more.

What this changes, practically, is the unit of audit. Until now an alignment audit on a frontier model produced findings that read like neural-anatomy papers: features clustered, circuits implicated, attribution scores assigned. With NLAs the artefact of an audit is closer to a transcript. You can hand it to someone who is not an interpretability specialist and they can read it. Whether the model was thinking about cheating, whether it noticed it was being tested, whether the rhyme it eventually wrote was the one it had in mind a sentence earlier. The transcript still might lie, but the lying is now legible.

Past the Bridal Wear

In March 1986 a rented van crossed the Channel with five young Belgian designers and the collections of a sixth inside it. Ann Demeulemeester was pregnant and stayed home. Dries Van Noten, Walter Van Beirendonck, Dirk Bikkembergs, Dirk Van Saene, and Marina Yee made the trip, encouraged by Geert Bruloot, who ran a shoe shop in Antwerp called Coccodrillo and had decided that the clothes coming out of the Royal Academy needed to be seen somewhere other than Antwerp. They were heading to Olympia for the British Designer Show, and they could not really afford it individually, which is why there was one van.

They got a booth on the fourth floor, set among the bridal wear, several flights above where the buyers were actually working. Day one passed almost without visitors. By the second morning they had printed a flyer themselves, captioned "The SIX Belgian Designers", and were handing it out in the corridors below. A buyer from Barneys followed the flyer up the stairs. He ordered from all six. By the afternoon there was press in the booth, and the buyers from Bergdorf and Liberty were on their way up too.

The English-language press could not pronounce the names, so they shortened the problem and called the lot of them the Antwerp Six. The label is misleading in every important way. They never had a manifesto, never showed together as a collective again, and never agreed to be six. Van Noten makes prints from his Indian workshop. Van Beirendonck does fluorescent latex and BDSM references. Van Saene cuts cocktail dresses with bow details and those carefully shrunken cardigans that everyone tried to copy later. Bikkembergs went to military boots and then bought an Italian football club. Demeulemeester put women in slouchy black suits and read Rimbaud at them. Yee, who died of cancer during the run-up to the MoMu retrospective, kept moving across menswear, womenswear, costume, nothing settling. Six careers that share a graduation year and a diploma from the same small fashion department, and almost nothing else.

What they did share was a starting condition. Belgium in the early 1980s had a state-funded campaign called Fashion: It's Belgian, designed to keep a collapsing textile industry alive by manufacturing some designers to put inside it. There was a competition, the Golden Spindle, that they all entered. Linda Loppa was running the fashion department at the Royal Academy and pushing the students out into the world before the world had asked for them. Paris and Milan were the centres. Antwerp was a port town known for diamonds. Nobody in the trade was expecting anything from there, which meant nobody had a frame ready to receive them.

That absence of a frame is the thing I keep returning to. The Six did not arrive into a defined slot in late-eighties fashion. They invented a slot that did not previously exist, and they invented it from a fourth-floor booth that the buyers were not supposed to visit, with a flyer they had run off themselves because nobody else was going to. Forty years later the MoMu in Antwerp is opening a retrospective on 28 March, running through to January 2027, the first time the work of all six has been gathered in one room. The press release calls it a celebration of "radical individuality". That is the right phrase, and it is also the joke. They were always individuals. The collective was the convenience of the people who had to write about them.

Dries Van Noten later built a house large enough to acquire a perfumer and to retire from his own brand on his own schedule. The others moved in their own directions, at their own paces, with their own defections and their own returns. Margiela, who had already left for Jean Paul Gaultier in Paris by 1984, is sometimes folded into the story as the seventh, and sometimes politely left out of it. The exhibition seems to settle on Antwerp 6+1, which is the most honest title anyone has tried.

The van is the part that stays with me. Not the breakthrough, not the orders, not the museum show. Just the practical fact of five people pooling petrol money because none of them could afford the trip alone, and a sixth's collection riding along in the back without her, and a printer somewhere in West London running off flyers at short notice because nobody was coming up the stairs.
