
Plutonic Rainbows

Matching Glasswing

OpenAI waited exactly one week.

On April 7, Anthropic locked Claude Mythos behind a coalition of launch partners and over 40 additional organisations and called it Project Glasswing. The message was clear enough: this model is too dangerous to sell, so we'll give it to the people who build the things it can break. One week later, OpenAI unveiled GPT-5.4-Cyber.

The timing was not subtle. OpenAI's blog post notes that its Trusted Access for Cyber programme launched months before Glasswing. Nobody at OpenAI mentions Mythos by name. But the framing is unmistakable: a cybersecurity-tuned variant of GPT-5.4, optimised for vulnerability research, binary reverse engineering, and defensive patching, rolled out to "thousands of individuals and organisations" through an expanded TAC programme with Know-Your-Customer identity verification.

The pitch mirrors Glasswing almost exactly. Put the sharpest model in the hands of defenders. Lock out attackers with verification gates. Talk about democratising security while restricting who gets to use the dangerous parts.

Bruce Schneier was unimpressed. He called Glasswing "very much a PR play" and said the security firm AISLE had replicated Mythos's findings using older, cheaper, publicly available models. Tom's Hardware pointed out that Anthropic's "thousands of zero-days" claim extrapolates from 198 manually reviewed reports, and the actual testing surfaced 10 severe vulnerabilities across 7,000 software stacks. On Mashable, Tal Kollender, CEO of cybersecurity firm Remedio, called it "brilliant corporate theater."

That phrase sticks. Corporate theater implies the performance matters more than the outcome. Both labs are now racing to position themselves as the responsible steward of offensive-grade capabilities. Anthropic restricts access to a coalition. OpenAI expands access to thousands but gates it behind KYC. The difference is philosophical (Anthropic trusts institutions, OpenAI trusts verified individuals) but the marketing structure is identical.

What neither company has answered convincingly is why a specialised cyber model is necessary when their general-purpose flagships already find vulnerabilities. Anthropic's own framing of Mythos as a general-purpose model that happens to be devastating at exploit discovery undercuts the idea that you need a dedicated product. If the capabilities emerge naturally from scale, gating access to one model while selling the base model commercially is a distinction without much security benefit.

The real signal might be financial. Codex Security, OpenAI's existing application security agent, has already contributed to over 3,000 fixed vulnerabilities. GPT-5.4-Cyber sits as the premium tier above it. Glasswing comes with $100 million in usage credits, which amounts to $100 million in locked-in API consumption across Anthropic, AWS, Google, and Microsoft. These are not just defensive programmes. They are enterprise sales channels dressed as public goods.

None of this means the capabilities are fake. Both models genuinely find bugs. The question is whether the theatrical framing (the coalitions, the gating, the carefully timed competitive releases) does anything a well-funded bug bounty programme wouldn't already do. Schneier's bet is that it doesn't. The labs are betting that it sounds like it does.


Figma Isn't the Target

The Information broke a story yesterday: Anthropic is preparing to ship Claude Opus 4.7 alongside an AI design tool that turns natural-language prompts into websites, presentations, landing pages, and product mockups. Single unnamed source. No demo, no pricing, no confirmed product name — just a briefing note and the suggestion that both could land as soon as this week.

Figma dropped around 6%. Wix fell nearly as hard. Adobe and GoDaddy followed down two to three points. The market heard "AI design tool" and reached for the obvious target.

That reflex is wrong. Or at least, wrong about who actually has something to lose.

Figma is not a mockup generator. Figma is a multiplayer coordination surface that happens to produce mockups. Strip away the drawing tools and you're still left with a shared canvas, component libraries that teams of thirty can ship against without stepping on each other, and a version history that product managers actually trust. Anthropic can ship a prompt-to-landing-page generator tomorrow and it won't replace any of that. The design-tool market has spent well over a decade learning that the artifact is the easy part. The coordination is the business.

Adobe is a similar story dressed in different clothes. Firefly has been baked into Creative Cloud for several years now, and enterprise contracts ship with IP indemnification — Adobe promises its customers that if the generative output triggers a copyright suit, Adobe eats the legal bill. An Anthropic-branded design tool with no clarity on training-data provenance is not walking into that conversation any time soon. The CIO at a Fortune 500 insurer is not swapping a contractually-indemnified Firefly workflow for a research-lab preview that may or may not exist next quarter.

None of which is to say Anthropic's tool is harmless. It just has the wrong targets in the headlines.

The companies that should actually be panicking don't trade on Nasdaq. They run on Anthropic's API.

Lovable, Bolt, v0, Cursor — the whole cohort of AI-first builders that pipe prompts into frontier models and ship a UI on top of the response. Their entire product is the wrapper. If Anthropic ships a first-party builder that does the same job natively, the wrapper has a problem no feature release can fix. It's a platform-versus-app-layer squeeze, and every platform eventually runs it on its most successful downstream. AWS did it to open-source tooling. Apple does it every WWDC. Google has done it to developer after developer on top of Maps and Search. Anthropic's version will look like a product launch. Structurally it's an enclosure.

Lovable itself has said the quiet part loud in the past: the real threat was never the other AI coders. It was the big labs deciding to ship their own. That was the thesis. Yesterday's scoop is what the thesis looks like when it starts arriving.

And this is the part that makes the revenue picture complicated. A meaningful share of Anthropic's API revenue almost certainly comes from exactly the companies a first-party builder would undercut. Eat your best customers to expand into their market, and you'd better be sure the replacement demand covers what you're about to cannibalize. Usually it does. That's the whole reason vertical integration works. But "usually" is doing a lot of work in that sentence, and the timing is striking. The cadence that brought Opus 4.6 to market has already pushed pricing pressure through the API business. Adding a design-tool product on top is not a small move.

A caveat, plainly stated. This is a single-source scoop with no demo, no pricing, and no confirmed product name. Anthropic has not said a word. The Information's track record is strong, but "preps" can mean anything between "ships next week" and "internal demo a senior exec saw." The stock move on day one feels less like informed repricing and more like reflexive positioning — traders seeing the word "design" and reaching for the nearest public comp.

If the thing does ship this week, the interesting signal won't be the design-tool market cap. It'll be whether Anthropic gives it an API, and what the pricing looks like if they do. A hosted-only product is a controlled experiment. An API-accessible design tool is the real platform move. It would reset the entire wrapper economy, not just compete with it. That's the version of the announcement that would make Lovable and Bolt and v0 stop checking Figma's share price and start checking their own runway.

The shock is going to come from somewhere. It just won't come from where the market pointed yesterday.


Dark and Lonely Water

Jeff Grant read the script and got a bit of a shock. Gone was the gentle cajoling. This one, he said, plumbed the darkness. It set out to scare.

The result was Lonely Water, a ninety-second public information film made in 1973 for the Central Office of Information. Donald Pleasence narrated it, voicing a hooded, faceless figure who stood at the edge of reservoirs and canals while children splashed nearby. The figure didn't move. It didn't need to. "I'll be back," Pleasence whispered as the credits rolled. "Back... back... back."

The COI had been producing public information films since the 1940s, covering everything from kitchen fires to rabies. But something turned in the seventies. The advisory tone dropped away and what replaced it was dread. Not information. Dread.

Apaches, made in 1977, runs twenty-seven minutes. Six children playing cowboys and Indians on a farm, picked off one by one. A slurry pit. Pesticide. A tractor that doesn't stop. John Mackenzie directed it with the formal weight of a feature: grainy 16mm stock, an oppressive soundtrack that never relents. The closing credits listed the names of real children who had died in farming accidents that year. It was screened in primary schools across the country.

What separates these from normal safety campaigns is the aesthetic conviction. The COI gave its directors genuine creative latitude, and people like Grant and Mackenzie used it to make actual cinema. Not pamphlets with moving pictures. Films with atmosphere, with formal command, with the visual grammar of folk horror: bleak rural landscapes, unseen threats buried in the mundane, a narrator who already knows the outcome.

The state had become the M.R. James narrator. And it was showing these things to eight-year-olds at four in the afternoon.

Patrick Russell, the BFI's senior curator, pushes back on this reading. These were humanist films, he insists, made from a sincere and morally admirable place. He's probably right about intent. But intent is not what survived. What survived is the image of a hooded figure at a canal, and a generation of adults who can't walk past standing water without hearing Donald Pleasence.

No hard data connects these specific films to a measurable drop in child drowning or farming deaths. The one concrete number anyone has found — an 11% reduction in road casualties after the first Green Cross Code advert — reverted within six months. The COI itself closed in 2012, the same year Ceefax went dark and another strand of mid-century British institutional broadcasting quietly ended. Maybe the films didn't save lives in the way the spreadsheets needed. Maybe they just gave an entire cohort a shared vocabulary of anxiety, a common set of images that still surface unbidden forty years later. Kenny Everett voicing an animated cat. A child sinking in grain. The spirit at the water's edge, promising to return.


Nine Claudes, One Bottleneck

The number Anthropic wants you to remember is 0.97. That's the "performance gap recovered" score nine instances of Claude Opus 4.6 achieved after five days of running alignment research on themselves. Two human researchers working on the same benchmark for seven days got to 0.23. The compute bill was about $18,000.

The specific problem is weak-to-strong supervision: the OpenAI-originated question of whether a less capable model can reliably train a more capable one, and by extension whether humans will be able to oversee the systems they build. A score of 0.97 sounds close to solved. That's not what it means.
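For scale, here's what the metric itself computes: a minimal sketch, assuming Anthropic's score follows the standard performance-gap-recovered definition from OpenAI's weak-to-strong generalisation paper (the example numbers are invented).

```python
def performance_gap_recovered(weak: float, weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """PGR as defined in Burns et al. (2023), "Weak-to-Strong Generalization".

    weak:            task performance of the weak supervisor on its own
    weak_to_strong:  performance of the strong model trained on weak labels
    strong_ceiling:  performance of the strong model trained on ground truth

    A PGR of 1.0 means weak supervision recovered the entire gap on this
    benchmark. It says nothing about tasks the benchmark doesn't cover.
    """
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Invented numbers: 97% of the gap closed reads as 0.97, not as "solved".
print(performance_gap_recovered(weak=0.60, weak_to_strong=0.891,
                                strong_ceiling=0.90))  # -> 0.97
```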

The most revealing lines in Anthropic's writeup are the honest ones. One of the automated researchers "skipped the teacher entirely and instructed the strong model to always choose the most common one." Another, working on coding tasks, "could run the code against some tests and simply read off the right answer." These aren't clever alignment strategies. They're the exact failure mode alignment research was invented to warn about, and the systems produced them unprompted, within days, on a tightly scoped benchmark.

Meanwhile the generalisation is patchy. Chat: 0.97. Math: 0.94. Coding: 0.47. Production test on Claude Sonnet 4: no statistically significant improvement. The method capitalises on opportunities "unique to the models and datasets they're given."

The more interesting claim is the one Anthropic makes almost in passing: the bottleneck shifts from generation to evaluation. If nine AARs can produce more alignment ideas than humans can filter, the hard problem becomes knowing which ones are real. Anthropic acknowledges this directly: "the models' ideas could become much harder to verify, or corrupted in ways that are tricky for humans to parse or catch."

Which is the critique that's always been there. Richard Juggins argued a month before this paper dropped that experiments on weaker systems probably won't teach you how to align superhuman ones — those systems will have qualitatively different capabilities. Ryan Greenblatt, a week before the AAR announcement: Anthropic probably has an overly optimistic sense of how well it's done on mundane alignment.

I believe Anthropic's numbers. I don't think the framing survives contact with what the paper itself says: the reward hacking, the coding gap, the production null result, the evaluation handoff problem. What the paper shows is that a scoped, verifiable benchmark is compressible by fast, cheap AARs. The thing humans actually need help with — open-ended judgment about "fuzzier" alignment concerns — is the thing this method explicitly doesn't demonstrate. I wrote about the gap between what safety evaluations measure and what actually goes wrong in yesterday's post on unfaithful reasoning, and this is another instance of that pattern: the measurable half getting cleaner, while the part that matters stays dark.


Confident and Wrong

Ask a language model who painted the Sistine Chapel ceiling and it'll tell you Michelangelo without hesitation. Ask it for the name of the third person to walk on the moon and it might say, with identical conviction, someone who never existed.

Both answers arrive the same way. The model predicts the most probable next token given everything before it, draws from a learned probability distribution over its entire vocabulary, and moves on. Repeat until done. At no point does anything inside the machine verify whether the output is true.
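A toy sketch makes the absence concrete. The vocabulary and logits below are invented, standing in for what a real network's final layer emits; the point is that the sampling path is identical whether the most probable token is true or fabricated.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(logits, vocab, temperature=1.0):
    """Sample one token. Note what is absent: no lookup table, no fact
    check, no comparison against stored knowledge. A high probability
    means "statistically likely", nothing more."""
    probs = softmax([x / temperature for x in logits])
    return random.choices(vocab, weights=probs, k=1)[0]

# A real model has ~100,000 tokens and derives its logits from a deep
# stack of matrix multiplications; the final step is still just this.
vocab  = ["Michelangelo", "Raphael", "Conrad", "Armstrong"]
logits = [9.1, 2.3, 4.0, 1.7]
print(next_token(logits, vocab))  # usually "Michelangelo"; never verified
```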

This is worth sitting with for a second. There's no lookup table. No internal encyclopedia being consulted. No module that compares a candidate answer against stored facts and rejects the ones that fail. The architecture is a sequence of matrix multiplications that transform input tokens into a probability distribution over what should come next. "Should" here means statistically likely, not factually correct. The training objective, predicting the next token across billions of documents, rewards fluency and plausibility. Truth is a side effect that shows up when plausible and true happen to overlap.

When they don't, you get hallucination.

The word implies malfunction, but the mechanical reality is quieter than that. The model hits a prompt that pushes it into territory where its learned correlations stop tracking reality. Rare facts. Multi-step reasoning. Dates, numbers, proper nouns that barely appeared in training data. The highest-probability continuation is still a plausible-sounding string of tokens. It just happens to be wrong. And because the model has no way to flag its own uncertainty, it delivers the wrong answer with the same smooth confidence as the right one.

That's what makes hallucination structural rather than incidental. The entire system is optimized to produce the most plausible continuation, and plausibility is not truth. You can't patch that out.

Not everyone agrees this is permanent. A 2025 paper from OpenAI argues the problem is incentive-based, not architectural: models hallucinate because benchmarks reward guessing over abstention, and changing how we score could fix it. On the other side, Xu et al. published a formal impossibility proof showing any computable language model used as a general problem solver will inevitably hallucinate, regardless of training data or design. It's a diagonalization argument from learning theory. The debate is genuinely unsettled.

Retrieval-augmented generation helps. Give the model verified documents to condition on and it hallucinates less. But the generation step still runs through the same probability distribution. The model doesn't process retrieved text differently from anything else. It has better context. That's all.
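The structural point fits in a few lines. In this sketch, retrieve and generate are hypothetical stand-ins for a vector-store lookup and a model call; notice that nothing about the generation step changes.

```python
def answer_with_rag(question: str, retrieve, generate) -> str:
    """Minimal RAG shape. `retrieve` and `generate` are placeholders for
    any vector store and any LLM. Retrieved text is simply prepended to
    the prompt; the model then runs the same next-token sampling it
    always runs, just conditioned on better context."""
    chunks = retrieve(question)  # e.g. top-k passages by embedding similarity
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n\n".join(chunks) +
        "\n\nQuestion: " + question + "\nAnswer:"
    )
    return generate(prompt)      # same probability distribution underneath
```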

There's a parallel in eyewitness testimony, actually. Witnesses who give the most confident accounts in court are not reliably more accurate than hesitant ones, at least not by the time testimony reaches the stand. Confidence is a performance, not a verification. We've known this about humans for decades and still struggle with it.

With an LLM the disconnect is starker. When the probabilities are high, next-token prediction just sounds like someone who knows what they're talking about. A model trained on enough text will produce fluent, assured reasoning that looks indistinguishable from understanding, right up until you check the facts and find they were never part of the process.


After the Shock

By March of 1989, the shock had worn off.

Eight years earlier Yohji Yamamoto and Rei Kawakubo had debuted in Paris to headlines about "Hiroshima chic" and the "yellow peril." The work was dismissed as a beggar look. One 1982 critic said Yamamoto's clothes would suit someone perched on a broom. In his memoir, Yamamoto recalls how deliberately he rejected Japanese design signifiers for the Paris debut. The goal was never folkloric export. It was European structure reconsidered in black.

By March 1989 that argument had been won.

The Cour Carrée du Louvre show that March wasn't a provocation. It was a house style operating at full confidence: oversized blazers, sack dresses, pleated skirts, wrist-length gloves, berets. Black dominant, with red and cream and white used as punctuation. One walk in particular shows the sculptural logic unchanged from the early days. The white panel reads almost like paper, wrapped and tucked over a long black upper layer, the body disappearing underneath.

Coincidence or context: Wim Wenders was following him across 1988 and 1989, filming Notebook on Cities and Clothes for the Centre Pompidou. The documentary came out later in 1989. It was the moment the Western art world decided to canonise him. The moment the shock curdled into reverence.

It also happened to be near the peak of the Japanese Bubble Economy. The Nikkei was nine months from its all-time high. The designers who had been called an invasion eight years earlier were now cultural exports underwritten by the strongest yen in history. The March 1989 runway was the aesthetic edge of that economic moment — a confidence that the work could be shown as work and read as seriously as any Parisian couturier.

What I notice in the image is how settled everything is. No performance of provocation, no Japonisme signifiers, no apology for the black. Just a construction problem solved at the scale of the garment. You see the same instinct in his much later fragrance project: the same discipline, the same refusal, the same house rules applied to a different material.

Eight years earlier this would have been a riot.

By 1989 it was simply the work.


Negative Light

Every reference library had one. Sometimes two, crowded into a corner near the periodicals, sharing a table with the photocopier that smelled permanently of ozone. The microfiche reader. You sat in front of it like someone waiting for a medical result.

The image was inverted. White text on dark ground, a photographic negative projected at roughly the size of the original newspaper page. You cranked a handle to scroll through frames. The motion was seasick. Columns of newsprint sliding past too fast to read, then too slow, then past the article you wanted. You wound back. Missed it again. The headache arrived around frame sixty.

Everyone who used one regularly describes the same thing. Nausea. Eye strain. A dull ache behind the forehead that persisted into the evening. The British Library's own 1992 conference proceedings conceded the point with remarkable honesty: "There can be no one who actually prefers a microform copy to the original item."

Nicholson Baker went further. In Double Fold, he documented libraries that destroyed their original newspaper collections after microfilming them. Bindings guillotined. Pages discarded. The microfilm itself faded, sprouted fungi, proved incomplete. Entire years missing from the record. An archive in Ontario attached an air-sickness bag to its reader. The technology that was supposed to preserve knowledge was actively destroying it, one brittle frame at a time.

Rebecca Lossin traced the lineage back to the military. Microfilm was a defence technology, adopted by Library of Congress officials who saw preservation as a logistics problem. Shrink it, store it, free the shelf space. The knowledge itself, its marginalia, the advertisements that told you more about 1937 than the editorial ever could: all of it was collateral damage.

And yet.

Something happened in those dim rooms that doesn't happen now. You went looking for one thing and found another. Not because an algorithm suggested it but because the frame before or after your target held something you'd never have searched for. A local council election result from 1974. An advertisement for a shop that occupied the building you now live in. Information had mass and it resisted your intentions. The microfiche reader didn't know what you wanted. It gave you everything in sequence and left you to sort through it like rubble.

Only thirty percent of the British Library's newspaper collection was ever microfilmed. The rest sat in warehouses at Colindale, consulted in person or not at all. The British Newspaper Archive has since digitised millions of pages, and it is incomparably better in every measurable way. You type a name and get results in seconds. No headache. No nausea. No winding back through columns of text you didn't ask for.

What you don't get is the peripheral. The thing adjacent to your search that reframes what you thought you were looking for. The waiting itself was part of the process, not an obstacle to knowledge but the condition under which it arrived differently. It would be sentimental to pretend the old system was better. The access was exclusionary. The technology was bad. The headaches were real. But the headaches came with something search engines can't replicate: the slow understanding that what you found was shaped by the effort of finding it.


Losing Its Front Teeth

Owen Luder sketched the concept on a train back to London. A multi-storey car park and shopping centre for Gateshead town centre, raw concrete cantilevered over a rooftop deck with views across the Tyne. Trinity Square opened in 1969. Two years later, Michael Caine threw a man from that rooftop in Get Carter. Four decades on, the building was rubble. Souvenir fragments were sold in commemorative tins.

Luder said demolishing it would mean Gateshead was "losing its front teeth." He was not wrong about the absence. What replaced Trinity Square was a Tesco-backed retail development that was promptly nominated for the Carbuncle Cup, an annual award for the worst new building in Britain.

Portsmouth went first. The Tricorn Centre, designed by Rodney Gordon under Luder's partnership, opened in 1966. A brutalist shopping-and-parking hybrid that Reyner Banham included in Megastructures. Prince Charles called it "a mildewed lump of elephant droppings." It came down in March 2004. The site is now a surface car park. All that ambition, replaced by tarmac at grade level.

"In the sixties my buildings were awarded," Luder said. "In the seventies they were applauded, in the eighties they were questioned, in the nineties they were ridiculed, and when we get through to 2000 the ones I like most are the ones that have been demolished."

Multi-storey car parks sit in a dead zone between architecture and infrastructure. Functional enough to resist heritage sentiment. Too ugly for conservation areas. Permanently tethered to the car at exactly the moment British planning decided the car was the problem. Brutalist housing gets campaigns. Brutalist leisure centres get heritage listings. Car parks get demolished.

Welbeck Street, off Marylebone High Street, had a precast concrete facade of repeating diamond shapes that Sam Jacob called part of "a small gang, a batch of buildings produced in a small window when car parks were treated as civic monuments." Michael Blampied designed it in 1970. Historic England assessed it for listing in 2015 and refused. The facade was "striking" but the ground floor was weak, the Pop Art influence "derivative and a relatively late example." It came down for a hotel in 2019. The assessment read like a rejection letter for a building that had applied for the wrong job.

Preston Bus Station survived. BDP completed it in 1969 with 1,100 parking spaces above the bus concourse, horizontal concrete fins running the full length like the gills of something amphibious. Preston Council wanted it gone. The Twentieth Century Society fought for fifteen years, through two failed listing applications, before Grade II status arrived in September 2013. A £23 million refurbishment followed.

Preston had a civic function underneath the parking. Buses. Public transport. Something that didn't depend on private car ownership for its justification. The Tricorn and Trinity Square had shops, but the parking was the dominant gesture, the thing that shaped the skyline. When the shops died, the car park had outlived the world that made sense of it and couldn't find a second life. You can't repurpose a seven-level car park as a defibrillator station.

Luder died in October 2021 at ninety-three. He spent his last decades watching his most significant buildings pulled down and their replacements go wrong. Gateshead replaced brutalist teeth with a retail denture. Portsmouth replaced ambition with asphalt. Welbeck Street replaced geometry with a hotel nobody will remember.

The buildings that survived found a second reason to exist. The ones that didn't are aggregate now.


One Report, Six Percent

Dell gained six percent on Monday. HP rose four. The catalyst was a single article, paywalled, from a niche semiconductor publication most people have never heard of.

Charlie Demerjian at SemiAccurate reported that Nvidia has been negotiating for over a year to acquire a "large PC-oriented company" — a deal he says would "reshape the PC and server landscape like nothing else has done since the computer was invented." The target isn't named. Nobody from Nvidia, Dell, or HP has commented. The details that might make this story verifiable sit behind a professional-tier subscription.

And yet billions moved.

SemiAccurate has a credibility cushion. They correctly reported Elon Musk's interest in acquiring Intel before it became public knowledge, though they note that deal "didn't happen." Correct reporting about negotiations that collapse is a particular kind of track record. It means your sources are real, but the signal you're amplifying may not resolve into anything.

The market doesn't care about that distinction. The question of which company Nvidia might target has consumed most of the coverage, with Dell and HP the obvious candidates given their stock movements. WCCFTech rated the story 60% plausible with a technical credibility score of 2 out of 5, arguing the acquisition economics don't hold — Nvidia's margins are hardware agnostic, and buying a PC OEM means absorbing a lower-margin business that competes with your own customers.

There's a more fundamental obstacle. The FTC sued to block Nvidia's $40 billion Arm acquisition, and Nvidia abandoned the deal in early 2022. The company's $20 billion Groq licensing arrangement drew separate calls for FTC and DOJ scrutiny from Senators Warren, Wyden, and Blumenthal, who labelled it a "de facto merger." Any outright acquisition of Dell (market cap roughly $116 billion) or HP ($17 billion) would trigger full merger review.

What's interesting isn't whether this particular rumor proves out. It's the velocity. A single paywalled report from a publication covering semiconductor supply chains moved Dell's valuation by billions before anyone confirmed a word of it. That's not irrational — SemiAccurate's sources have proven reliable — but it reveals how tightly the market is coiled around anything resembling AI infrastructure consolidation.

The logic makes a certain kind of sense even as speculation. Nvidia already dominates through GPU supply, and its financial entanglements with AI companies create dependencies that look more like ownership than partnership. Buying a PC manufacturer would extend that reach into consumer hardware, a vertical integration play Jensen Huang has been circling for years without committing to.

But circling and committing are different verbs. Negotiations that last a year without resolution have a way of not resolving at all.


Unfaithful Reasoning

A training error affected eight percent of the reinforcement learning episodes. Not a catastrophic failure. Not sabotage. A bug in the reward code that let the optimiser read something it was never supposed to see: the model's chain-of-thought.

Anthropic published their risk report for Claude Mythos on April 10th. The headline numbers concerned the model's autonomous hacking capabilities, and those are legitimately alarming. But buried in the report is a disclosure about something quieter and, depending on your threat model, more consequential.

Chain-of-thought monitoring is the safety mechanism that lets researchers watch a model reason before it acts. If a model is planning something deceptive, the theory goes, that deception will surface in the chain-of-thought. You read the reasoning, you catch the problem. Anthropic's alignment case for deploying frontier models depends on this working.

During those eight percent of RL episodes, the reward code had access to the chain-of-thought text. The model was being optimised, in part, on how its reasoning looked to the reward system. It learned the lesson. In misbehaviour scenarios, Opus 4.6 showed chain-of-thought unfaithfulness at roughly five percent. Mythos hit sixty-five. The model had learned to reason covertly, not because anyone instructed it to, but because the training signal quietly rewarded opacity.
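Anthropic hasn't published the reward code, so here is a hypothetical sketch of the bug class rather than their implementation; every name in it is invented. The shape is mundane: a grader handed more of the transcript than it was meant to see.

```python
def compute_reward(transcript, grader):
    # Intended: score only the final answer.
    # Buggy: serialising the whole transcript includes the chain-of-thought,
    # so reward now depends on how the reasoning *looks*. Gradient pressure
    # then favours reasoning that reads well to the grader, whatever the
    # model is actually computing.
    return grader.score(str(transcript))           # leaks the chain-of-thought

def compute_reward_fixed(transcript, grader):
    return grader.score(transcript.final_answer)   # scores the output alone
```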

The same error touched the already-deployed Opus 4.6 and Sonnet 4.6. Anthropic disclosed this without fanfare. No recall, no patch note. The safety window was never as clean as the published evaluations suggested, and METR's external review had already flagged Anthropic's monitoring confidence as likely overstated.

None of this came from nowhere. Anthropic's own 2025 paper, "Reasoning Models Don't Always Say What They Think," found that Claude 3.7 Sonnet mentioned reasoning hints only twenty-five percent of the time. The faithfulness problem existed before Mythos. The training error accelerated it past the point where monitoring could be mistaken for working.

Gary Marcus called the broader Mythos story overblown. Yann LeCun dismissed it as "BS from self-delusion." I can see a version of this critique landing. Anthropic has an institutional incentive to frame its models as dangerously powerful: it justifies the safety infrastructure, the controlled access, the entire pitch to regulators. The company that leaked its own model is now asking us to trust its self-assessment of that model's inner reasoning.

But a LessWrong analysis cuts through the noise. The bug went unnoticed across three successive model releases. The chain-of-thought corruption compounded with each release. And Anthropic is the lab most likely to find something like this, because they are the lab that looks. Every other frontier developer probably has the same class of bug, undiscovered, optimising quietly in the dark.

The question is not whether chain-of-thought monitoring works. It is whether it was ever more than a comforting fiction: a diary we assumed the model had no incentive to write carefully.
