
Same Model, Two Harnesses, a 3-Point Swing on Terminal-Bench

Terminal-Bench 2.1 scores the agent and the model as a pair, and the same Fable 5 moves nearly three points depending on which harness wraps it. Read as a model ranking, the leaderboard measures the wrong thing.

NeuroX AI · June 30, 2026

The June 2026 Terminal-Bench 2.1 board buries a result most people skim past: it scores the agent-plus-model pair, not the model. Hold the model fixed and the harness still moves the number. Fable 5 inside Claude Code scores 83.1%; the same Fable 5 inside Terminus 2 scores 80.4%. That is a 2.7-point swing, roughly three points, from the plumbing alone.

The gap widens when you set Terminal-Bench next to SWE-bench Verified. Fable 5 resolves 95.0% of SWE-bench Verified issues as a raw model, 11.9 points above its best Terminal-Bench score. But SWE-bench hands the model a clean diff-shaped task; Terminal-Bench makes it drive a real terminal end to end: read state, run commands, recover from its own errors. Opus 4.8 tells the same story: 88.6% on SWE-bench Verified, 78.9% in the full terminal loop inside Claude Code. Nearly ten points (9.7) evaporate in the part a demo never tests.
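For readers who want to check the arithmetic, here is a minimal sketch that recomputes the gaps from the scores quoted in this post. The dictionary layout and the `gap` helper are our own illustration, not part of any benchmark tooling, and comparing across two different benchmarks is only a rough indicator.

```python
# Scores quoted in this post, in percent.
# Keys are (model, harness, benchmark); harness is None where the post
# quotes a raw-model number without naming a harness.
scores = {
    ("Fable 5", "Claude Code", "Terminal-Bench 2.1"): 83.1,
    ("Fable 5", "Terminus 2", "Terminal-Bench 2.1"): 80.4,
    ("Fable 5", None, "SWE-bench Verified"): 95.0,
    ("Opus 4.8", None, "SWE-bench Verified"): 88.6,
    ("Opus 4.8", "Claude Code", "Terminal-Bench 2.1"): 78.9,
}


def gap(a, b):
    """Return scores[a] - scores[b] in percentage points, to one decimal."""
    return round(scores[a] - scores[b], 1)


# Same model, same benchmark, different harness.
harness_swing = gap(
    ("Fable 5", "Claude Code", "Terminal-Bench 2.1"),
    ("Fable 5", "Terminus 2", "Terminal-Bench 2.1"),
)

# Same model, different benchmark (SWE-bench Verified vs. terminal loop).
fable_gap = gap(
    ("Fable 5", None, "SWE-bench Verified"),
    ("Fable 5", "Claude Code", "Terminal-Bench 2.1"),
)
opus_gap = gap(
    ("Opus 4.8", None, "SWE-bench Verified"),
    ("Opus 4.8", "Claude Code", "Terminal-Bench 2.1"),
)

print(f"Harness swing (Fable 5): {harness_swing} points")
print(f"SWE-bench to terminal gap (Fable 5): {fable_gap} points")
print(f"SWE-bench to terminal gap (Opus 4.8): {opus_gap} points")
```

The harness swing is the only like-for-like comparison here: same model, same benchmark, different wrapper. The other two gaps mix benchmarks, so treat them as a rough signal rather than a measurement.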

That's the prototype-to-production gap as a single comparison. The model leaderboard tells you what's possible on a sanitized task. The agent board tells you what actually ships — and the delta between them is engineering: context assembly, tool wiring, error recovery, the retry that doesn't loop forever.
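To make "the retry that doesn't loop forever" concrete, here is a minimal sketch of a bounded retry with exponential backoff. It is an illustration under our own assumptions, not how Claude Code or Terminus 2 actually handles retries: `run_with_budget`, `RetryBudgetExceeded`, and the `(ok, output)` return convention are hypothetical names invented for this example.

```python
import time
from typing import Callable, Tuple


class RetryBudgetExceeded(Exception):
    """Raised when every allowed attempt has failed (hypothetical name)."""


def run_with_budget(
    action: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call action() until it reports success or the attempt budget runs out.

    action returns (ok, output). The loop is bounded by max_attempts, so a
    step that keeps failing surfaces an error instead of spinning forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_output = ""
    for attempt in range(1, max_attempts + 1):
        ok, last_output = action()
        if ok:
            return last_output
        # Back off before the next try: base_delay, then 2x, 4x, ...
        # No sleep after the final attempt, since nothing follows it.
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))

    raise RetryBudgetExceeded(
        f"gave up after {max_attempts} attempts; last output: {last_output!r}"
    )


# Usage: a stand-in action that fails twice, then succeeds.
calls = {"n": 0}


def flaky_step() -> Tuple[bool, str]:
    calls["n"] += 1
    return calls["n"] >= 3, f"attempt {calls['n']}"


# sleep is stubbed out so the example runs instantly.
result = run_with_budget(flaky_step, max_attempts=3, sleep=lambda _: None)
print(result)
```

The design choice that matters is the hard cap: when the budget is spent, the failure is raised to the caller with the last output attached, so the agent can change strategy rather than repeat the same command.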

So when a vendor quotes you a SWE-bench number, ask which harness ran it. The model you pick matters. The harness you wrap it in decides whether it survives contact with a real terminal.

