Agentic Coding Research Digest — May 2026

The papers I pulled this week have an obvious through-line: nobody is excited about model weights anymore. The conversation has moved one layer up. Five papers — two new benchmarks, one harness framework, one critique of agent productionization, and a methodology paper on rollouts — all argue, in their own way, that the model is now the easy part. What surrounds it is what fails.

Engineering Robustness into Personal Agents with the AI Workflow Store

Geambasu et al. (2605.10907) goes after the dominant pattern in personal agents: plan-and-act loops that synthesize a workflow on the fly for every request. The authors argue that this short-circuits the very thing that makes software reliable in the first place — iterative design, testing, staged deployment — and that we should be bottling agent capabilities into hardened, reusable workflows instead of regenerating them per prompt.

Their proposal is an AI Workflow Store: a registry of pre-engineered workflows the agent invokes the way a developer imports a vetted library, rather than improvising. They frame the central tension as flexibility versus robustness, and they’re explicit that current agents have been sliding hard toward the flexible end.
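To make the "import a vetted library" analogy concrete, here is a minimal sketch of what such a registry could look like. Everything in it (`Workflow`, `WorkflowStore`, `handle_request`, the `tested` flag) is a hypothetical name I made up for illustration, not the paper's design or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Workflow:
    """A pre-engineered workflow: a name, a version, and a callable (all hypothetical)."""
    name: str
    version: str
    run: Callable[[dict], dict]
    tested: bool = False  # assumed to be set True only after a test suite has passed


class WorkflowStore:
    """Hypothetical registry that only accepts workflows marked as tested."""

    def __init__(self) -> None:
        self._workflows: Dict[str, Workflow] = {}

    def publish(self, workflow: Workflow) -> None:
        if not workflow.tested:
            raise ValueError(f"{workflow.name} has not passed testing")
        self._workflows[workflow.name] = workflow

    def lookup(self, name: str) -> Optional[Workflow]:
        return self._workflows.get(name)


def handle_request(store: WorkflowStore, intent: str, args: dict) -> dict:
    """Prefer a vetted workflow; report a miss instead of improvising a plan."""
    workflow = store.lookup(intent)
    if workflow is None:
        return {"status": "no_vetted_workflow", "intent": intent}
    return {"status": "ok", "version": workflow.version, "result": workflow.run(args)}
```

The design choice worth noticing is the miss path: when no vetted workflow exists, the sketch surfaces that fact rather than falling back to on-the-fly synthesis, which is the flexible end of the trade-off the authors describe.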

Why it matters: This lines up uncomfortably well with where production tooling is already going — Claude Skills, OpenAI’s Apps SDK, MCP-shipped tool packs. The vibe-coded plan-and-act loop is not how anyone running an agent at scale actually wants it to work. The paper is essentially a theoretical foundation for what practitioners are converging on under commercial pressure.

Rollout Cards: A Reproducibility Standard for Agent Research

Masters, Liu & Albrecht (2605.12131) audited 50 popular agent repos and found that none reported failed, errored, or skipped runs alongside their headline scores. They then identified 37 cases where a change in the reporting rule alone (not the agent, not the model, just how you count) measurably shifted task-success rates, cost accounting, or timing.

The magnitude is the part that should stop you. Re-grading across benchmarks under different reporting rules moved reported scores by up to 20.9 absolute percentage points, and occasionally flipped model rankings. Their fix is a “rollout card”: a publication bundle that ships the raw rollout records and the exact reporting rules applied, with a reference implementation in the Ergon RL framework.
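A toy example shows how the same raw records can yield different headline numbers. The records and both rules below are invented for illustration; this is not data from the paper and not the Ergon reference implementation.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Illustrative rollout records (made up for this sketch).
ROLLOUTS: List[Record] = [
    {"task": "t1", "status": "passed"},
    {"task": "t2", "status": "passed"},
    {"task": "t3", "status": "failed"},
    {"task": "t4", "status": "errored"},
    {"task": "t5", "status": "skipped"},
]


def rate_all_runs(records: List[Record]) -> float:
    """Count every attempted task; errored and skipped runs count as failures."""
    passed = sum(r["status"] == "passed" for r in records)
    return passed / len(records)


def rate_completed_only(records: List[Record]) -> float:
    """Drop errored and skipped runs from the denominator."""
    completed = [r for r in records if r["status"] in ("passed", "failed")]
    passed = sum(r["status"] == "passed" for r in completed)
    return passed / len(completed)


REPORTING_RULES: Dict[str, Callable[[List[Record]], float]] = {
    "all_runs": rate_all_runs,
    "completed_only": rate_completed_only,
}


def rollout_card(records: List[Record], rule_name: str) -> dict:
    """Bundle the raw records with the named rule and the score it yields."""
    score = REPORTING_RULES[rule_name](records)
    return {"records": records, "rule": rule_name, "score": score}


if __name__ == "__main__":
    for rule in REPORTING_RULES:
        print(rule, round(rollout_card(ROLLOUTS, rule)["score"], 3))
```

On these five made-up records the two rules give 0.4 and about 0.667, with no change to the agent or the model. Shipping the records next to the rule name is what lets a reader re-grade under a different rule.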

Why it matters: If you’ve ever picked a model for your coding agent based on a SWE-bench number, this paper is telling you that you may have been comparing different units. A swing of up to 20.9 points from accounting rules alone means the public leaderboard is doing less work than we pretend. The cost of publishing rollout records is near zero; the cost of not having them is that the field can’t distinguish improvement from reporting drift.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

Guan et al. (2605.13139) builds a benchmark of 489 rigorously filtered instances that evaluates coding agents across three isolated subtasks — environment reconstruction, code implementation, and verification test generation — plus an integrated FullCycle task where the agent has to do all three in a bare repository without human assistance. SWE-Judge, their verification harness, combines static analysis with dynamic testing.

The headline result is the gap between isolated and integrated execution: agents that handle each subtask competently in isolation show a sharp drop in solve rate on FullCycle. The failure isn’t in any one phase, it’s in cross-phase dependency handling and the cumulative degradation of code quality across steps.

Why it matters: This is the second benchmark in two weeks (after RoadmapBench) pointing at the same crack: composing subtasks that the model handles fine individually is where the system actually fails. For anyone building a coding agent, this is a reminder that the harness has to glue those handoffs together explicitly. If you rely on the model to keep cross-phase state coherent, you’re going to lose points exactly where the user notices most.
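One way to read "glue the handoffs together explicitly" is to have the harness own the cross-phase state and check it at every boundary. The sketch below is my own illustration under that assumption; `CycleState`, `run_full_cycle`, and the check functions are hypothetical names and are not part of SWE-Cycle or SWE-Judge.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CycleState:
    """Explicit cross-phase state, owned by the harness rather than the model."""
    env_ready: bool = False
    patch: str = ""
    tests: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


Phase = Callable[[CycleState], None]
Check = Callable[[CycleState], bool]


def run_full_cycle(state: CycleState, phases: List[Tuple[str, Phase, Check]]) -> bool:
    """Run phases in order and stop at the first failed handoff check."""
    for name, phase, check in phases:
        phase(state)
        if not check(state):
            state.log.append(f"{name}: handoff check failed")
            return False
        state.log.append(f"{name}: ok")
    return True
```

In use, each of the three subtasks (environment reconstruction, implementation, test generation) would be a `Phase` paired with a `Check` on the artifact it is supposed to leave behind, so a bad handoff is caught at the boundary instead of degrading later phases.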

AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

Zhong & Zhu (2605.13357) names the thing the other papers in this digest all dance around. They argue that software-engineering capability emerges from a model-harness-environment system, not from the model alone, and they formalize eleven component responsibilities of the harness — task specification, context selection, tool access, project memory, failure attribution, verification, and so on.

They also propose an H0–H3 progression for how much runtime support a harness exposes, and show that the structure of evidence in test episodes varies systematically with that level. H0 produces only a final patch; higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports. The reframing is: stop asking “can the model generate a patch?” and start asking “can the full system produce a verifiably correct, attributed, and maintainable change?”
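As a rough picture of what "structure of evidence" could mean in practice, the sketch below models an episode record whose optional fields are the artifact types listed above. The field names are my own, and because this summary does not say which artifacts belong to H1, H2, or H3, the sketch only distinguishes a patch-only episode from one that carries more.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EpisodeEvidence:
    """Hypothetical record of what one agent episode leaves behind."""
    patch: str
    reproduction_log: Optional[str] = None
    failure_attribution: Optional[str] = None
    requirement_checks: Optional[List[bool]] = None
    verification_report: Optional[dict] = None


def evidence_present(episode: EpisodeEvidence) -> List[str]:
    """Names of the optional artifacts this episode actually carries."""
    optional = {
        "reproduction_log": episode.reproduction_log,
        "failure_attribution": episode.failure_attribution,
        "requirement_checks": episode.requirement_checks,
        "verification_report": episode.verification_report,
    }
    return [name for name, value in optional.items() if value is not None]


def is_patch_only(episode: EpisodeEvidence) -> bool:
    """True when the episode produced a final patch and nothing else."""
    return not evidence_present(episode)
```

Even this much is enough to audit your own agent runs: count how many episodes are patch-only and you have a first estimate of how far above the bottom of the progression your harness sits.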

Why it matters: The H0–H3 framing is genuinely useful even just as vocabulary for talking about your own internal tooling. Most coding agents I’ve seen in production are stuck somewhere between H1 and H2 — they have logs, but failure attribution is hand-rolled and verification is whatever the model felt like running. If you’re building one, treating the harness as a software system with its own surface area, instead of glue code, is the upgrade that compounds.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Lam et al. (2605.14415) assembles 12 upgrade chains across 9 real Python packages — 155 version transitions and 1,660 grounded upgrade requirements derived by aligning release notes with the actual code changes that implemented them. The benchmark tests whether an agent can carry a downstream codebase across consecutive package versions without breaking it.

Average performance sits at 44.8% resolve, 65.4% precision, 50.2% F1. The best model in the cohort, Claude-Opus-4.7, lands at 60.8% resolve, 80.6% precision, and 68.5% F1 — strong but nowhere near the bar that would justify running upgrades unattended. The chained structure also surfaces a failure mode that flat upgrade benchmarks miss: compound drift accumulating across versions until the build can’t recover.
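Recall is not quoted above, but it can be backed out from precision and F1 if we assume F1 is the plain harmonic mean of precision and recall computed on the same aggregate. That assumption may not hold if the reported F1 is an average of per-chain scores, so treat the results as rough. Rearranging F1 = 2PR / (P + R) gives R = F1 * P / (2P - F1).

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, assuming F1 is a plain harmonic mean."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denominator


if __name__ == "__main__":
    # Figures quoted above: best model, then cohort average.
    print(round(implied_recall(0.806, 0.685), 3))  # ≈ 0.596
    print(round(implied_recall(0.654, 0.502), 3))  # ≈ 0.407
```

Under that assumption the best model's implied recall is about 60%, roughly 20 points below its precision, and the cohort average is about 41%.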

Why it matters: Package upgrades are exactly the kind of task I had mentally written off as “agents should already own this end-to-end.” The numbers say I was wrong. A 60.8% resolve rate on the best model means roughly two in five cases still go unresolved, and the gap between precision and F1 (recall is not quoted here, but an F1 below precision implies recall trails it) suggests the failures are silent rather than loud: the agent thinks it succeeded. If you’re shipping an upgrade agent, gate on actual dependent-test execution, not on the agent’s own claim of success.
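A gate of that kind can be very small. The sketch below is one possible shape, assuming the downstream project has a test command whose exit status is meaningful; `gate_upgrade` and its return keys are hypothetical names, and the command is whatever your project actually runs.

```python
import subprocess
import sys
from typing import Dict, List


def gate_upgrade(agent_claims_success: bool, test_command: List[str], cwd: str = ".") -> Dict[str, bool]:
    """Run the dependent tests and accept the upgrade only on exit status 0."""
    result = subprocess.run(test_command, cwd=cwd, capture_output=True, text=True)
    tests_passed = result.returncode == 0
    return {
        "accepted": tests_passed,  # the agent's claim never overrides the tests
        "silent_failure": agent_claims_success and not tests_passed,
    }


if __name__ == "__main__":
    # Stand-in for a real test runner: a child process that exits non-zero.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(gate_upgrade(agent_claims_success=True, test_command=failing))
```

Tracking `silent_failure` separately is the useful part: it counts exactly the cases where the agent reported success and the tests disagreed.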

The Common Thread

The model is no longer the bottleneck — the surrounding system is. Harness engineering, workflow stores, and runtime substrates are all naming the same shift in different vocabularies.
Single-shot benchmarks overstate real capability. SWE-Cycle and SWE-Chain both show performance dropping once agents have to compose steps they handle fine individually, whether across the phases of one issue or across consecutive package versions.
We’re measuring agents wrong. Rollout Cards is the dry, methodological version of what every other paper here implies: without reproducible records and consistent reporting, the field can’t tell improvement from accounting drift.
