Field notes
The NeuroX Blog
What we learn shipping production AI — agents, RAG, growth automation, and the things AI prototypes get wrong.
ai-agents · prototype-to-production · benchmarks
Agents Just Hit 66% — and 89% Still Never Ship
The 2026 Stanford AI Index shows agents leapt from 12% to 66% on OSWorld in a single year, within 6 points of human performance. Capability isn't the blocker anymore. 89% of enterprise agents still never reach production.
Jun 16, 2026
swe-bench · ai-agents · production-reliability
Both Top Models Tied at 88.6% — Then the Harder Benchmark Failed a Third of the Time
On SWE-bench Verified, Opus 4.8 and GPT-5.5 are now in a dead heat at ~88.6%. On SWE-bench Pro — the messier, multi-file version — the leader scores 69.2%. The benchmark that saturated tells you nothing; the one that didn't tells you where your agent breaks.
Jun 16, 2026
ai-agents · orchestration · prototype-to-production
Enterprises Now Run 12 AI Agents. Half of Them Work Alone.
A new 2026 report puts the average enterprise at 12 deployed agents — but half operate in complete isolation, and only 11% of last year's planned agent projects ever reached production. The gap isn't the model. It's orchestration.
Jun 15, 2026
mcp · ai-agents · supply-chain
Your Agent's MCP Config Is a Supply-Chain Blind Spot. Perplexity Just Shipped the Scanner
Bumblebee reads the messy local state every other tool ignores — including the MCP configs that feed your AI agents. It crossed 4,400 GitHub stars in three weeks because almost nothing else looks there.
Jun 13, 2026
claude-code · engineering-discipline · ai-agents
A 70-Line File Just Passed 220,000 GitHub Stars
It contains no code — just four rules for how an AI agent should behave. That it's now one of the most-starred repos on GitHub tells you exactly where the bottleneck moved.
Jun 12, 2026
claude-code · agent-sdk · cost-optimization
On June 15, Your Automated Agents Stop Being Free
Anthropic is splitting programmatic Claude usage into a separate, metered credit pool. The CI agent that ran for free on your subscription now bills at API list price — and the credit doesn't roll over.
Jun 11, 2026
anthropic · claude-code · ai-agents
AI's Task Horizon Now Doubles Every 4 Months — Down From 7
The cadence of progress is itself accelerating: the time an AI can work autonomously is doubling every 4 months instead of 7. The reason is uncomfortable — Claude is now building Claude.
Jun 10, 2026
claude-code · ai-agents · production
67% vs 25%: The Coding-Agent Gap GitHub Stars Don't Show
June 2026's dev-tool rankings show the coding-agent field is crowded and cheap. But in blind reviews, engineers preferred Claude Code's output 67% of the time and Codex's 25%. Adoption metrics measure hype. They don't measure what ships.
Jun 9, 2026
anthropic · claude-code · ai-agents
Opus 4.6 Runs Unsupervised for 14.5 Hours — Half of Those Runs Fail
Claude Opus 4.6 now sustains autonomous work for 14.5 hours before its success rate drops to a coin flip. No competitor has published a comparable number. That ceiling is real — and so is the discipline it demands.
Jun 8, 2026
ai-agents · agentic-coding · production
OpenClaw: 100 Agents, $1.3M in Tokens, 30 Days
The fastest-growing open-source project in GitHub history was built by ~100 AI agents running in parallel — at a $1.3M monthly token bill. The viral story hides the real lesson: orchestration and cost discipline, not raw model speed.
Jun 6, 2026
anthropic · claude-code · ai-agents
80% of Anthropic's Production Code Is Now Written by Claude
In May 2026, most code merged at Anthropic was AI-authored, not human. The surprise isn't the volume — it's what didn't change: every line still ships through review, tests, and a merge gate.
Jun 5, 2026
ai-agents · production · anthropic
5 Hours to 7 Minutes: What Real Agent Deployments Look Like in 2026
eSentire just compressed threat analysis from 5 hours to 7 minutes with 95% alignment to senior experts. Across 500+ technical leaders, 80% report measurable economic returns. The pattern is clear — and repeatable.
Jun 4, 2026
anthropic · ai-agents · production
SWE-bench 87%: The Score That Made Infrastructure the New Bottleneck
Opus 4.7 hit 87% on SWE-bench Verified — up from 62% a year ago. Anthropic's Code with Claude 2026 event didn't celebrate the benchmark. It shipped managed infrastructure, because that's where the work actually stalls.
Jun 3, 2026
claude-code · ai-agents · anthropic
Claude Code Now Writes Its Own Agent Orchestration
Dynamic Workflows just shipped in research preview — Claude generates its own orchestration scripts on the fly, runs subtasks in parallel, and verifies results before surfacing them. 86% of teams were already running agents in production when this landed.
Jun 2, 2026
ai-agents · production · engineering
22,900 Stars: The 12-Factor Checklist Every Agent Team Is Saving
Humanlayer's 12-factor-agents reached 22.9k GitHub stars by naming what production teams already know: 80% quality with a framework is easy. The last 20% — customer-facing, on-call-worthy — requires owning your prompts, your context window, and your control flow.
May 31, 2026
claude-agents · anthropic · production
Netflix Is Already Running Claude's New Multiagent Orchestration
Anthropic shipped Dreaming, Outcomes, and Multiagent Orchestration to Claude Managed Agents this week. Netflix deployed the orchestration feature on its platform team before the ink was dry.
May 29, 2026
ai-agents · production · github
73,000 New GitHub Stars in 7 Days Point to One Gap
The week of May 21, GitHub's top-10 trending repos added 73,000 stars — and 9 of 10 shared a single focus: infrastructure for running agents in production. The experimentation phase is over.
May 26, 2026
claude-code · ai-agents · engineering
Code with Claude 2026: Half of Devs Ship PRs They Never Read
At Anthropic's May event, nearly 50% of attendees reported shipping Claude-written pull requests without reading the code first. SWE-bench is at 87%. The model is no longer the bottleneck — discipline is.
May 23, 2026
claude-code · ai-agents · enterprise
MCP Tunnels Ship: Your Agent Can Now Reach Internal Systems Without a Public Endpoint
Anthropic just shipped MCP tunnels and self-hosted sandboxes for Claude Managed Agents. For the first time, an agent can reach your internal Postgres, private APIs, and ticketing systems through a single encrypted outbound connection — no inbound firewall rules, no data leaving your perimeter.
May 22, 2026
ai-agents · anthropic · ai-engineering
Karpathy Called Agents Slop. Now He's Running 700 Overnight at Anthropic.
Andrej Karpathy publicly called agentic output 'slop' in October 2025. This week he joined Anthropic to build overnight research loops that run 700 experiments per two-day run — and logged an 11% training speedup. The critique wasn't wrong. The scaffolding was.
May 21, 2026
claude-code · ai-agents · production
Mercado Libre Is Betting 23,000 Engineers on 90% Autonomous Coding by Q3
At Code w/ Claude 2026, Anthropic put a number on the next phase: Mercado Libre is targeting 90% autonomous coding across 23,000 engineers by Q3. The new Routines feature is the primitive that makes it sane.
May 20, 2026
ai-agents · production · ai-engineering
46% of AI Teams Say Integration Is the Bottleneck. Not the Model.
The 2026 State of AI Agents survey ranked the top three reasons agents stall in production. None of them are model capability. 46% point at integration, 42% at data, 40% at security. The wiring is the work.
May 19, 2026
claude-opus · ai-agents · production
Opus 4.7 Hit 64.3% on SWE-bench Pro. The Real Story Is a Third of the Tool Errors.
Everyone quoted the +10.9 SWE-bench Pro jump when Anthropic shipped Opus 4.7. The number production teams should care about is buried two paragraphs in: a third of the tool errors compared to Opus 4.6. Tool errors are the production failure mode.
May 18, 2026
claude-opus-4-8 · ai-agents · production-reliability
Anthropic Shipped the Best Coding Model — Then Published 5 Transcripts of It Failing
Opus 4.8 tops SWE-bench at 88.6% and now writes about 10% of public GitHub commits. The most useful page in its system card is the one where Anthropic shows it failing at ordinary work — always the same way.
May 16, 2026
anthropic · claude-code · ai-agents
Claude Fable 5 Hits 80% on SWE-bench Pro — and Wants to Run for Days
Anthropic shipped Fable 5 on June 9, jumping SWE-bench Pro from 69% to 80% and built explicitly for multi-day autonomous sessions. The model crossed a line. Most pipelines haven't.
May 16, 2026
ai-agents · production · case-studies
From 5 Hours to 7 Minutes: What AI in Production Actually Looks Like in 2026
Anthropic's 2026 enterprise report dropped four shipping case studies with real numbers — eSentire, Doctolib, L'Oréal, Thomson Reuters. None are pilots. All are the wiring around the model, not the model.
May 5, 2026
claude-code · ai-agents · anthropic
Claude Managed Agents Just Killed the 3-Month Setup Tax on Production AI
Anthropic shipped Managed Agents to public beta on April 8, removing the sandbox / state / credential plumbing every team used to spend a quarter building. Runtime is $0.08/hour. The interesting question is what teams build with that quarter back.
May 4, 2026
claude-code · ai-agents · engineering
Anthropic's 2026 Agentic Coding Report: 60% AI Usage, But Only 0–20% Fully Delegated
The new Agentic Coding Trends Report names the gap most teams are still pretending isn't there: AI writes most of the code, humans still own the last mile. The teams winning have stopped trying to remove engineers and started orchestrating them.
May 1, 2026
case-study · prototype-to-production · fintech
Case Study: From Broken AI Prototype to Production Fintech in 6 Weeks
A Series A fintech with a Bolt-built MVP couldn't onboard their first paying enterprise customer. Here's what was broken under the hood — and what we shipped to fix it.
Apr 30, 2026
claude-code · skills · ai-engineering
mattpocock/skills Just Hit #2 on GitHub Trending: Engineering Discipline as a Claude Skill
Matt Pocock open-sourced his personal .claude directory and it picked up 7,000+ stars in a day. The skills aren't about generating code faster — they're about not breaking the codebase while you do.
Apr 30, 2026
ui · tooling · claude-code
The 2026 Stack for AI-Assisted UI: 21st.dev + UI/UX Pro Max + Motion
Three tools that turn 'I need a marketing site' into 'this is live by Friday.' Here's the stack we use, and how each piece slots in.
Apr 29, 2026
prototype-to-production · next.js · vibe-coding
From Bolt to Production: What AI Prototypes Get Wrong
30 minutes in, you have a working app. Auth, dashboard, even a Stripe modal. It looks done. It's not. Here's the punch list of what's actually broken under the hood.
Apr 29, 2026
aeo · growth-marketing · seo
Answer Engine Optimization: How to Get Cited by ChatGPT, Perplexity, and Claude
In 2026, half your buyers ask ChatGPT instead of Google — and never click through. If you're not in the answer, you didn't lose a click. You lost the conversation.
Apr 28, 2026
ai-agents · pricing · cost-optimization
What an AI Agent Actually Costs to Build and Run
Most AI agency quotes hide three big costs. Build, inference, operate — here's the honest breakdown of what you'll pay in year one and what's missing from the quote.
Apr 27, 2026
Contact
Send us a brief.
Tell us about the problem in 2-3 sentences. We reply within one business day.
Or skip the form: book a Calendly slot directly.
admin@neuroxai.com · +91 70149 99768
Remote-first team across India · US · EU · HQ in Udaipur, India