
Gemini 3 Deep Think Gets a Major Upgrade — Setting New Records on Reasoning Benchmarks
Google DeepMind shipped a major Gemini 3 Deep Think upgrade on April 28, 2026, setting new records on Humanity's Last Exam, ARC-AGI-2, and Codeforces, and hitting gold-medal-level performance on the IMO 2025 problem set.
A Frontier Reasoning Model Takes Another Step Forward
Google DeepMind shipped a major upgrade to Gemini 3 Deep Think on April 28, 2026, and the new benchmark numbers mark a meaningful step forward in the long-horizon reasoning that frontier AI research has been chasing for the better part of two years. For AI researchers, advanced math and science teams, competitive-programming communities, and engineering leaders deciding which frontier model to wire into their hardest workflows, it is one of the cleaner reference points in the spring 2026 frontier-AI landscape.
Deep Think is Google DeepMind's specialized reasoning mode for Gemini 3 — built around iterative rounds of reasoning that explore multiple hypotheses simultaneously. The April 28 upgrade pushes that architecture forward across the toughest publicly tracked frontier benchmarks: Humanity's Last Exam, ARC-AGI-2, Codeforces, and the International Math Olympiad. The result is a frontier reasoning system that is demonstrably better at the problems where prior generations had hit ceilings.
What the New Benchmark Numbers Actually Show
The headline numbers are the easiest place to see the gain. On Humanity's Last Exam, a benchmark assembled by subject-matter experts specifically to probe the upper bound of frontier-model capability, the upgraded Gemini 3 Deep Think reaches 48.4% without tools, a new record that future frontier reasoning models will be measured against.
On ARC-AGI-2, the reasoning-puzzle benchmark verified by the ARC Prize Foundation, the upgraded Deep Think hits 84.6%. ARC-AGI-2 is one of the most carefully designed measures of generalization in current frontier AI evaluation, and 84.6% is a result that researchers who follow the benchmark have been waiting months to see.
On competitive programming, the upgraded model reaches an Elo of 3455 on Codeforces, a rating in the platform's Legendary Grandmaster tier that only a small number of human competitors have ever exceeded. And on the International Math Olympiad 2025 problem set, the model achieves gold-medal-level performance, a ceiling that historically only the strongest human contestants could clear.
Why These Benchmarks Matter for Real Work
Benchmark numbers are interesting on their own, but they matter in practice because the underlying capability cluster (long-horizon mathematical reasoning, complex code synthesis, and novel-problem generalization) is exactly what drives the value of a frontier reasoning model in real applied work. Scientific research, applied math, advanced software engineering, and any other domain where a reasoning system has to maintain coherence over many steps benefit directly from gains on this cluster.
For research labs, advanced engineering teams, and quantitative analysis groups who have been evaluating Deep Think for their hardest internal problems, the April 28 upgrade is the kind of step forward that justifies fresh internal benchmarking and a re-examination of which problems are now in scope.
How Deep Think's Architecture Holds Up at Scale
The architecture underlying these gains is the iterative-reasoning approach that has defined Deep Think since the original release. The model explores multiple hypotheses in parallel, evaluates them against intermediate criteria, and converges on the answer that holds up under the most scrutiny. The April 28 upgrade reflects continued tuning of that loop — better hypothesis generation, sharper intermediate evaluation, and more robust convergence on the most difficult problem classes.
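Google DeepMind has not published the internals of this loop, but the description maps onto a familiar parallel-sampling-and-selection pattern. The Python sketch below is purely illustrative, built on hypothetical generate_hypothesis, score, and refine helpers; it shows the shape of such a loop, not Deep Think's actual implementation.

```python
import random

# Toy stand-ins for the model's internals. Every name and behavior here is
# an illustrative assumption, not anything Google DeepMind has published.
def generate_hypothesis(problem: str) -> float:
    return random.uniform(0.0, 1.0)                # propose a candidate "solution"

def score(problem: str, hypothesis: float) -> float:
    return 1.0 - abs(hypothesis - 0.75)            # how well it survives scrutiny

def refine(problem: str, hypothesis: float) -> float:
    return hypothesis + random.uniform(-0.1, 0.1)  # perturb and re-examine

def parallel_hypothesis_search(problem: str, width: int = 8, rounds: int = 3) -> float:
    """Explore several hypotheses in parallel, prune the weak ones each
    round, and converge on the candidate that scores best overall."""
    candidates = [generate_hypothesis(problem) for _ in range(width)]
    for _ in range(rounds):
        candidates.sort(key=lambda h: score(problem, h), reverse=True)
        survivors = candidates[: max(1, width // 2)]   # keep the strongest half
        candidates = survivors + [refine(problem, h) for h in survivors]
    return max(candidates, key=lambda h: score(problem, h))

print(parallel_hypothesis_search("hard combinatorics problem"))
```

The design choice worth noting is the split of compute between breadth early (many hypotheses) and depth later (refinement of survivors); that is the tradeoff the upgrade's "sharper intermediate evaluation" presumably tunes.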
For practitioners, the practical effect is that Deep Think is now noticeably better at the kind of problems where prior frontier models would either give up, hallucinate confidently, or produce a partial solution that did not extend to the full problem. The reliability gain matters at least as much as the capability gain — research and engineering teams care about whether they can trust a long-horizon reasoning result, not just whether the model can occasionally produce one.
Availability and Latency
The upgraded Deep Think is available to Google AI Ultra subscribers in the Gemini App. The latency profile is consistent with prior Deep Think generations — submit a hard task, and the response is generally ready within a few minutes. That latency is appropriate for the kind of problems Deep Think is built for. A research-grade reasoning result that takes minutes is the right tradeoff against shallow responses that take seconds.
For developers, the broader Gemini 3.1 Pro API also remains available for the kind of lower-latency reasoning work that does not need the full Deep Think loop. Deep Research and Deep Research Max — Google's autonomous research agents announced earlier in April — are also part of the broader Gemini 3.1 Pro ecosystem and complement Deep Think for long-horizon research workflows.
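For a concrete sense of the developer path, a minimal call through Google's google-genai Python SDK might look like the sketch below. The model identifier string is an assumption for illustration; the article does not give the exact API name, and Deep Think itself is described here as a Gemini App feature rather than an API surface.

```python
from google import genai

# The client reads an API key from the environment (e.g. GEMINI_API_KEY).
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro",  # hypothetical identifier; check the live model list
    contents="Prove that the sum of the first n odd numbers is n squared.",
)
print(response.text)
```

For workloads that genuinely need the full Deep Think loop, the Gemini App remains the route described in the announcement.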
What This Says About Frontier AI in Spring 2026
The Gemini 3 Deep Think upgrade lands inside a broader spring 2026 pattern of frontier AI labs sharpening their reasoning systems. OpenAI shipped GPT-5.5 earlier in April with stronger agentic coding and computer-use capabilities. Anthropic's Claude Opus 4.7 brought meaningful gains on advanced software engineering. Mistral, DeepSeek, and other open-weight labs have been pushing reasoning capabilities in their own model families.
The pattern is consistent across the frontier: reasoning is the capability vector where the most meaningful 2026 improvements are being made. For AI research watchers, the April 28 Gemini 3 Deep Think upgrade is one of the cleanest current snapshots of where the frontier sits on hard reasoning problems, and the numbers it posts are the bar that the rest of the frontier labs' models will be measured against in the coming weeks.
For the broader applied-AI community, the practical takeaway is that the set of hard problems where frontier reasoning models can plausibly contribute keeps expanding. Problems that were out of reach six months ago, such as advanced mathematical proofs, novel scientific reasoning, and sophisticated competitive programming, are increasingly within reach for the strongest reasoning systems. That expansion of reach is the underlying story the April 28 benchmark numbers tell.
Sources: Google Blog (April 28, 2026), Google DeepMind Gemini 3 Model Page (April 2026), Chrome Unboxed (April 28, 2026), 9to5Google (April 2026)
