Opus 4.6 Runs Unsupervised for 14.5 Hours — Half of Those Runs Fail
Claude Opus 4.6 now sustains autonomous work for 14.5 hours before its success rate drops to a coin flip. No competitor has published a comparable number. That ceiling is real — and so is the discipline it demands.
NeuroX AI · June 8, 2026

Anthropic's time-horizon eval puts a hard number on agent autonomy: Claude Opus 4.6 hits 50% task completion at a 14.5-hour autonomous horizon. Put plainly, on unsupervised runs of that length, half its tasks fail. No competing lab has published a comparable figure.
Read that ceiling carefully. A 14.5-hour horizon doesn't mean 14.5 hours of clean output; it's the point where the success rate becomes a coin flip. Six months ago that line sat at a fraction of the time. The frontier is moving fast, but it's still a frontier: the back half of any long run is where silent regressions, half-applied refactors, and confidently wrong commits live.
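To make the coin-flip reading concrete, here is a rough back-of-the-envelope sketch. It assumes a constant failure rate per hour, so the success probability halves every 14.5 hours. That decay shape is our simplifying assumption, not something the eval states, and the function name is made up for illustration.

```python
HORIZON_HOURS = 14.5  # the 50% horizon quoted above


def success_probability(run_hours: float, horizon_hours: float = HORIZON_HOURS) -> float:
    """Toy model: chance that an unsupervised run of `run_hours` succeeds.

    Assumes a constant hazard rate (our assumption, not the eval's method),
    so the probability halves every `horizon_hours`.
    """
    if run_hours < 0:
        raise ValueError("run_hours must be non-negative")
    return 0.5 ** (run_hours / horizon_hours)


if __name__ == "__main__":
    # Compare a few run lengths against the quoted horizon.
    for hours in (1, 4, 8, 14.5):
        print(f"{hours:>5} h run: {success_probability(hours):.0%} estimated success")
```

Under this toy model even runs well short of the horizon carry real failure risk, which is why the number reads better as a budget than as a promise.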
This is exactly where most teams get burned. They see "runs for 14 hours," hand an agent an overnight task with no checkpoints, and wake up to a branch that compiles, passes a thin test suite, and quietly broke three things no one specced. The capability outran the verification layer.
The teams that ship treat the horizon as a budget, not a guarantee. Scoped work units. A quality gate at every checkpoint. Cost-per-run tracking and a rollback path wired in before the agent starts. The model raised the ceiling; your CI, review, and instrumentation decide whether you can actually live up there.
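One way to picture that discipline is a small checkpointed runner. This is a minimal sketch, not any vendor's API: WorkUnit, RunReport, and run_with_checkpoints are hypothetical names, and the run, gate, and rollback callables stand in for whatever your agent harness, CI, and version control actually provide.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkUnit:
    """One scoped piece of agent work (hypothetical structure)."""
    name: str
    run: Callable[[], float]      # performs the work, returns its cost in dollars
    gate: Callable[[], bool]      # quality gate: True if the checkpoint passes
    rollback: Callable[[], None]  # undoes this unit's changes


@dataclass
class RunReport:
    completed: List[str] = field(default_factory=list)
    spent: float = 0.0
    stopped_reason: str = "finished"


def run_with_checkpoints(units: List[WorkUnit], budget: float) -> RunReport:
    """Run units in order; stop on a failed gate or once the budget is reached."""
    report = RunReport()
    for unit in units:
        # Do not start a new unit once spend has reached the budget.
        if report.spent >= budget:
            report.stopped_reason = "budget exhausted"
            break
        report.spent += unit.run()
        # Gate every checkpoint; on failure, roll back this unit and halt.
        if not unit.gate():
            unit.rollback()
            report.stopped_reason = f"gate failed at {unit.name}"
            break
        report.completed.append(unit.name)
    return report


if __name__ == "__main__":
    # Illustrative stand-ins: the second unit fails its gate and is rolled back.
    demo = [
        WorkUnit("refactor-module", lambda: 1.50, lambda: True, lambda: None),
        WorkUnit("update-tests", lambda: 2.25, lambda: False, lambda: print("rolled back")),
        WorkUnit("write-docs", lambda: 0.75, lambda: True, lambda: None),
    ]
    print(run_with_checkpoints(demo, budget=10.0))
```

The design choice is that the run halts at the first failed gate instead of pressing on, so a bad checkpoint costs one unit of work, not the rest of the night.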
Autonomy is a capability. Trusting it is an engineering decision.