Google upgraded Gemini 3 Deep Think yesterday, and the number that matters is 84.6%. That's its score on ARC-AGI-2, the abstract reasoning benchmark designed to resist brute-force pattern matching. Humans average around 60%. Claude Opus 4.6 — which landed last week to genuine excitement — scores 68.8%. GPT-5.2 manages 52.9%. Deep Think clears the human baseline by nearly 25 points and leads the next-best model by almost 16.

I'm trying to figure out what to do with that.

The Codeforces result is harder to dismiss as benchmark theatre. Deep Think hit 3,455 Elo — Legendary Grandmaster territory, better than all but seven active human programmers on the platform. No external tools. No retrieval. Just inference-time compute and whatever Google means by "parallel hypothesis exploration." The top human competitor, Benq, sits at 3,792. That gap is closing fast enough to make competitive programming feel like it has an expiration date.
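
For a rough sense of what a 337-point gap means, here is the standard Elo expected-score formula applied to the two ratings. This is a back-of-the-envelope sketch: Codeforces ratings come from contest standings rather than head-to-head matches, so read the number as an intuition pump, not literal odds.

```python
# Back-of-the-envelope: what does 3,455 vs. 3,792 imply under the classic
# Elo expected-score formula? (Codeforces ratings are contest-based, so
# this is an approximation, not a real win probability.)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

deep_think, top_human = 3455, 3792
print(f"Deep Think expected score vs. the top human: "
      f"{expected_score(deep_think, top_human):.1%}")
# -> roughly 12-13%: still clearly behind the very best, but in the game.
```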

What changed from the previous version: scope. Earlier iterations of Deep Think were narrowly focused on mathematics. This upgrade pushes into chemistry, physics, and engineering. Gold medals on the written portions of the International Math, Physics, and Chemistry Olympiads. A mathematician at Rutgers used it to peer-review a paper on high-energy physics structures bridging gravity and quantum mechanics. It caught a subtle logical flaw that human reviewers had missed. That's not a benchmark. That's a real research contribution, however narrow.

The architecture Google describes — they call it "Aletheia" — uses a generator, a natural language verifier, and a reviser working in concert. Parallel hypothesis exploration rather than a single reasoning chain. The interesting detail is that the system can acknowledge failure and stop rather than burning compute on dead-end paths. Most reasoning models I've used have no concept of giving up gracefully. They hallucinate forward until they hit a token limit. If Aletheia genuinely knows when it's stuck, that's a meaningful advance in how these systems manage uncertainty.
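
Google hasn't published Aletheia's internals, so the following is only a sketch of the control flow that description implies: parallel candidates, a verifier that critiques in natural language, a reviser that patches, and a budget after which the system declines to answer. Every name here is a hypothetical stand-in, not Google's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    solution: str
    critique: str = ""

def deep_think_loop(
    problem: str,
    generate: Callable[[str, int], list[Candidate]],      # spawn parallel hypotheses
    verify: Callable[[str, Candidate], tuple[bool, str]],  # natural-language check
    revise: Callable[[str, Candidate], Candidate],         # patch using the critique
    n_hypotheses: int = 8,
    max_rounds: int = 4,
) -> str | None:
    """Generate -> verify -> revise over parallel candidates, with a hard budget."""
    candidates = generate(problem, n_hypotheses)
    for _ in range(max_rounds):
        for cand in candidates:
            ok, critique = verify(problem, cand)
            if ok:
                return cand.solution      # commit only once verification passes
            cand.critique = critique
        candidates = [revise(problem, c) for c in candidates]
    return None                           # give up gracefully rather than
                                          # hallucinate past the budget

```

The part that matters is the last line: when the budget runs out, the loop returns nothing instead of forcing an answer.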

Google's approach here is less about a new model than about where the compute goes. They're scaling inference-time compute, giving the model more time to think rather than making it bigger, and pushing that further than Anthropic or OpenAI have. The base is still Gemini 3 Pro, not some trillion-parameter behemoth. Deep Think is a reasoning mode, not a separate model. The distinction matters because it suggests the ceiling on what you can extract from existing architectures is higher than most people assumed. You don't need a fundamentally new model. You need to let the current one actually think.
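
Deep Think itself sits behind the Ultra subscription and an early-access API, but the "reasoning mode on the same model" idea is already visible in the public Gemini API, where thinking is a config knob rather than a different endpoint. A minimal sketch with the google-genai SDK; the model name is a placeholder for whichever thinking-capable Gemini model you have access to, and the budgets are arbitrary.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Same weights, two budgets: the "reasoning mode" is just more
# inference-time compute, not a different model.
for budget in (0, 8192):
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # placeholder; swap in your thinking-capable model
        contents="A bat and a ball cost $1.10 total; the bat costs $1.00 more "
                 "than the ball. What does the ball cost?",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=budget),
        ),
    )
    print(f"budget={budget}: {response.text}")
```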

That feels right to me, intuitively. When I use extended thinking in Claude, the quality jump over instant responses is enormous — not because the model suddenly knows more, but because it has room to work through contradictions and dead ends before committing to an answer. Google is doing the same thing with significantly more compute thrown at the problem. Anthropic shows you the reasoning. Google hides it. Both approaches produce results that make the non-thinking versions look careless by comparison.
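
For contrast, here is roughly what "Anthropic shows you the reasoning" looks like with the Anthropic SDK's extended thinking: the intermediate reasoning comes back as content blocks alongside the answer. A sketch only; the model string is a placeholder and the token budgets are arbitrary.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder; use whatever Opus you have access to
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},  # extended thinking on
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# Unlike Deep Think, the reasoning is returned, not hidden.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:500], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```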

The pricing is interesting. Deep Think is included in the Google AI Ultra subscription at $249.99 per month. API access requires applying for an early programme. I keep thinking about how o3 was positioned as the reasoning breakthrough that would change everything, and then Deep Think shows up a year later scoring nearly 30 times higher on the same class of benchmark. The pace of obsolescence in this space is genuinely disorienting.

Demis Hassabis called it "new records on the most rigorous benchmarks in maths, science & reasoning." MarkTechPost ran with "Is This AGI?" — which, no. But I understand the impulse. A system that reasons better than the average human on abstract pattern recognition, codes better than 99.99% of programmers, and catches errors in peer-reviewed physics papers occupies territory that didn't exist twelve months ago.

Google DeepMind published a research impact taxonomy alongside the release, rating contributions from Level 0 to Level 4. They classify Deep Think's current output at Levels 0-2 — autonomous solutions and publishable collaborations, not landmark breakthroughs. The fact that they felt the need to temper expectations tells you something about the temperature of the conversation. When the company releasing the model is the one saying "calm down," the benchmarks have moved past what anyone's frameworks were built to accommodate.

Sources: