claude-opusai-agentsproductionanthropic

Opus 4.7 Hit 64.3% on SWE-bench Pro. The Real Story Is a Third of the Tool Errors.

Everyone quoted the +10.9 SWE-bench Pro jump when Anthropic shipped Opus 4.7. The number production teams should care about is buried two paragraphs in: a third of the tool errors compared to Opus 4.6. Tool errors are the production failure mode.

NeuroX AI · May 18, 2026

Anthropic shipped Claude Opus 4.7 on April 16, and the leaderboard headline is real: 64.3% on SWE-bench Pro, up from 4.6's 53.4% — a +10.9-point jump that puts it well ahead of GPT-5.4 at 57.7%. CursorBench tells the same story: 70%, up from 58%. But that's not the number teams running agents in production should care about.

The stat is buried in the partner data. Notion's agents now show a 14% gain on multi-step workflows with a third of the tool errors versus Opus 4.6. Factory Droids report a 10–15% lift in task success with the same drop in errors. Rakuten resolves 3× more production tasks on its internal benchmark.

Tool errors are the production failure mode. A benchmark resolves a self-contained problem; a real agent chains five tool calls, and one wrong API call cascades through the rest. Cutting that error rate by two-thirds isn't a benchmark gain — it's the difference between an agent your team trusts unattended and one a senior engineer has to re-check before it ships.

The capability ceiling is still moving up. What's shifted is where the failure modes live — between the model and the tools. The eval suite that catches new ones, the retry policy that handles the remaining 33%, the integration tests that survive week two.

See how we close that gap in 30 days →

Working on something similar?