Opus 4.7 Hit 64.3% on SWE-bench Pro. The Real Story Is a Third of the Tool Errors.
Everyone quoted the +10.9 SWE-bench Pro jump when Anthropic shipped Opus 4.7. The number production teams should care about is buried two paragraphs in: a third as many tool errors as Opus 4.6. Tool errors are the production failure mode.
NeuroX AI · May 18, 2026

Anthropic shipped Claude Opus 4.7 on April 16, and the leaderboard headline is real: 64.3% on SWE-bench Pro, up from 4.6's 53.4% — a +10.9-point jump that puts it well ahead of GPT-5.4 at 57.7%. CursorBench tells the same story: 70%, up from 58%. But that's not the number teams running agents in production should care about.
The stat is buried in the partner data. Notion's agents now show a 14% gain on multi-step workflows with a third of the tool errors versus Opus 4.6. Factory Droids report a 10–15% lift in task success with the same drop in errors. Rakuten resolves 3× more production tasks on its internal benchmark.
Tool errors are the production failure mode. A benchmark resolves a self-contained problem; a real agent chains five tool calls, and one wrong API call cascades through the rest. Cutting that error rate by two-thirds isn't a benchmark gain — it's the difference between an agent your team trusts unattended and one a senior engineer has to re-check before it ships.
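To make the cascade concrete, here is a back-of-the-envelope sketch. It assumes every tool call fails independently and with the same probability, which real agents will not match exactly. The per-call error rates (6% and 2%) are hypothetical and chosen only to show a one-third ratio; they are not figures from Anthropic or its partners. Only the five-call chain and the "a third of the errors" ratio come from the text above.

```python
def chain_success_rate(per_call_error: float, calls: int = 5) -> float:
    """Probability that every call in a chain succeeds.

    Assumes failures are independent and identically likely per call.
    """
    if not 0.0 <= per_call_error <= 1.0:
        raise ValueError("per_call_error must be between 0 and 1")
    return (1.0 - per_call_error) ** calls


# Hypothetical rates for illustration only (not measured data).
baseline_error = 0.06               # assumed per-call error rate before
improved_error = baseline_error / 3  # "a third of the tool errors"

for label, err in [("baseline", baseline_error), ("improved", improved_error)]:
    rate = chain_success_rate(err, calls=5)
    print(f"{label}: per-call error {err:.1%}, 5-call chain succeeds {rate:.1%}")
```

Under these assumed rates, a five-call chain completes cleanly about 73% of the time at the baseline and about 90% of the time at one-third the error rate. The gap widens as chains get longer, which is why a per-call improvement matters more to agents than to single-shot benchmarks.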
The capability ceiling is still moving up. What has shifted is where the failure modes live, which is now between the model and the tools. The remaining work sits there too: the eval suite that catches new failure modes, the retry policy that handles the roughly one-third of tool errors that remain, and the integration tests that survive week two.
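As one illustration of the retry policy mentioned above, here is a minimal sketch of a bounded retry wrapper with exponential backoff. Every name in it (`call_with_retry`, `ToolCallFailed`, the default attempt count and delays) is hypothetical and not part of any Anthropic SDK. It also assumes the tool call is safe to repeat, so a real policy would need to treat non-idempotent tools differently.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class ToolCallFailed(Exception):
    """Raised when a tool call still fails after the final attempt."""


def call_with_retry(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retryable: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `tool`, retrying transient errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except retryable as exc:
            if attempt == max_attempts:
                # Out of attempts: surface the failure to the caller.
                raise ToolCallFailed(f"gave up after {attempt} attempts") from exc
            # Wait 0.5s, 1s, 2s, ... (with the defaults) before retrying.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Errors outside `retryable` (for example a malformed-argument error) propagate immediately, since repeating the same bad call would not help. Those are the cases an eval suite is better placed to catch than a retry loop.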