
SWE-bench 87%: The Score That Made Infrastructure the New Bottleneck

Opus 4.7 hit 87% on SWE-bench Verified — up from 62% a year ago. Anthropic's Code with Claude 2026 event didn't celebrate the benchmark. It shipped managed infrastructure, because that's where the work actually stalls.

NeuroX AI · June 3, 2026

Opus 4.7 scored 87% on SWE-bench Verified, up from 62% with Sonnet 3.7 twelve months earlier. At Code with Claude 2026, Anthropic barely mentioned that number. It shipped managed infrastructure instead, and the message was clear: the model isn't what's stopping production deploys.

The three gaps they moved to close: sandboxed code execution, checkpointing for long-running jobs, and credential scoping so agents touch only what they're authorized to touch. None of those are model improvements. They're the ops primitives every team needs before an autonomous agent runs unsupervised in a production environment.
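To make two of those primitives concrete, here is a minimal Python sketch of checkpointing and credential scoping. It is an illustration under stated assumptions, not Anthropic's managed infrastructure or any real API: `ScopedCredentials`, `save_checkpoint`, `load_checkpoint`, and `run_job` are hypothetical names, the checkpoint is a local JSON file, and sandboxed execution is left out because it needs OS-level or container-level isolation.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedCredentials:
    """Hypothetical wrapper: the agent may only touch the listed resources."""
    allowed_resources: frozenset

    def authorize(self, resource: str) -> None:
        if resource not in self.allowed_resources:
            raise PermissionError(f"agent is not authorized for {resource!r}")


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as handle:
        return json.load(handle)


def run_job(steps, creds, checkpoint_path):
    """Run (resource, action) steps in order, resuming after the last saved step."""
    state = load_checkpoint(checkpoint_path)
    for index in range(state["next_step"], len(steps)):
        resource, action = steps[index]
        creds.authorize(resource)  # refuse anything outside the granted scope
        state["results"].append(action())
        state["next_step"] = index + 1
        save_checkpoint(checkpoint_path, state)  # progress survives a restart
    return state["results"]
```

If a step raises, whether from a scope violation or a crash, rerunning `run_job` with the same checkpoint path picks up at the first unfinished step instead of repeating completed work.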

The usage numbers explain the urgency. Anthropic's Q1 2026 revenue grew 80x against a 10x projection. That's not a benchmark result — that's teams shipping. The window between "we should explore agents" and "competitors already have them" is compressing fast.

The production pattern worth following:

Lock scope before launch: one repo, one workflow, auditable inputs
Instrument everything: cost per run, checkpoints, rollback paths
Expand only after failure modes are documented, not before
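One way to encode that pattern is as a small rollout gate. The Python sketch below is illustrative only: `RunScope`, `RunRecord`, and `AgentRollout` are hypothetical names, the cost figures are placeholders supplied by the caller, and "documented" simply means a failure label has been added to a set by a human reviewer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RunScope:
    """The locked scope: one repo, one workflow."""
    repo: str
    workflow: str


@dataclass
class RunRecord:
    """Per-run instrumentation: cost, checkpoints, rollback, failure label."""
    scope: RunScope
    cost_usd: float
    checkpoints: int
    rolled_back: bool
    failure_mode: Optional[str] = None


@dataclass
class AgentRollout:
    locked_scope: RunScope
    records: list = field(default_factory=list)
    documented_failures: set = field(default_factory=set)

    def record(self, run: RunRecord) -> None:
        # Reject any run outside the single repo and workflow agreed at launch.
        if run.scope != self.locked_scope:
            raise PermissionError(f"run outside locked scope: {run.scope}")
        self.records.append(run)

    def cost_per_run(self) -> float:
        if not self.records:
            return 0.0
        return sum(r.cost_usd for r in self.records) / len(self.records)

    def undocumented_failures(self) -> set:
        seen = {r.failure_mode for r in self.records if r.failure_mode}
        return seen - self.documented_failures

    def may_expand(self) -> bool:
        # Expand only with real run history and every observed failure documented.
        return bool(self.records) and not self.undocumented_failures()
```

The gate is deliberately conservative: with no recorded runs, or with any failure mode that nobody has written up, `may_expand()` returns `False`.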

In 2025, teams blamed model capability when agents stalled. At 87%, that excuse is gone.
