This week's reading list converged on the same realization from four different angles: the interesting questions about coding agents are no longer about raw generation capability. They're about the control structures around the agent: what code-shaped infrastructure surrounds it, what process gates its work, what runtime patterns hold it together in production, and what verifier closes the loop. Four arXiv papers from the past ten days each take a different layer of that stack.
Code as Agent Harness
The Code as Agent Harness survey (Ning, Tieu, Fu et al., arXiv:2605.18747, May 18) argues that code in agentic systems is no longer just an output artifact — it's the operational substrate. The 42-author taxonomy organizes the agent stack into three layers: harness interfaces (how agents reason and act), mechanisms (planning, memory, tool integration), and scaling (multi-agent coordination).
The framing matters because once you accept that code IS the harness, evaluation stops being about "does the output run" and starts being about "is the harness verifiable, reusable, composable across agents." The paper traces this across coding assistants, GUI/OS automation, embodied agents, scientific discovery, DevOps, and enterprise workflows, and names the open challenges that should sit near the top of any production roadmap: reliable verification, multimodal environments, multi-agent state consistency.
Why it matters: If you build agentic tools for a living, this survey is the cleanest current articulation of what you're actually building — not a model wrapper, but a code-based execution environment for stochastic actors. The taxonomy is also a useful sanity check for internal architecture docs: if your design only addresses one of the three layers, that's where the next incident is going to come from.
Agentic Agile-V: Process Control Over Prompt Engineering
Christopher Koch's Agentic Agile-V (arXiv:2605.20456, May 19) attacks the same problem from the process side. The paper synthesizes the increasingly mixed evidence on agentic coding ROI — productivity wins in some enterprise tasks, slowdowns in mature open-source repos, persistent failures at repository setup and hardware verification — and concludes that the bottleneck is no longer prompts, but engineering process control.
The proposed SCOPE-V loop (Specify, Constrain, Orchestrate, Prove, Evolve, Verify) treats the conversation-to-contract gate as the critical seam: above it, exploratory dialogue; below it, structured engineering artifacts with acceptance evidence. The paper catalogs minimum input artifacts for software, firmware, and hardware work, defines risk-adaptive workflows scaled to the change type (feature vs. fix vs. test vs. hardware), and proposes an evidence-bundle acceptance model for agent-generated artifacts.
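To make the evidence-bundle idea concrete, here is a minimal sketch of an acceptance gate. The artifact names, the `REQUIRED_EVIDENCE` table, and the `EvidenceBundle` class are all hypothetical illustrations, not the paper's actual artifact catalog; the only thing taken from the paper is the shape of the idea, namely that the required evidence scales with the change type and that an artifact is accepted only when the bundle is complete.

```python
from dataclasses import dataclass, field

# Hypothetical evidence requirements per change type (assumed for
# illustration; the paper's real minimum-artifact lists are not shown here).
REQUIRED_EVIDENCE = {
    "feature": {"requirements", "constraints", "tests", "review"},
    "fix": {"reproduction", "tests", "review"},
    "test": {"review"},
    "hardware": {"requirements", "constraints", "tests",
                 "bench_verification", "review"},
}


@dataclass
class EvidenceBundle:
    change_type: str
    # Maps an evidence name to where it lives (a path, URL, or ticket id).
    evidence: dict[str, str] = field(default_factory=dict)


def missing_evidence(bundle: EvidenceBundle) -> set[str]:
    """Return the evidence items still required before acceptance."""
    required = REQUIRED_EVIDENCE[bundle.change_type]
    provided = {name for name, ref in bundle.evidence.items() if ref.strip()}
    return required - provided


def accept(bundle: EvidenceBundle) -> bool:
    """The gate: accept only when nothing required is missing."""
    return not missing_evidence(bundle)


bundle = EvidenceBundle("fix", {"reproduction": "issue-123", "tests": "ci-run-9"})
print(missing_evidence(bundle))  # the human review is still outstanding
```

The useful property is that the gate reports what is missing rather than returning a bare yes or no, so the agent or the human knows which artifact to produce next.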
Why it matters: The core claim — "agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval" — is the one I want printed above every coding-agent dashboard. If your agentic tooling doesn't have an explicit conversation-to-contract gate, you're shipping vibes with extra steps.
Runtime Architecture Patterns for Production LLM Agents
Vasundra Srinivasan's methodology paper (arXiv:2605.20173, May 19) takes the production-systems angle. It introduces the stochastic-deterministic boundary (SDB) as the central concept — the seam where stochastic model output has to become deterministic system action — and catalogs six runtime patterns organized around it: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human-in-the-loop.
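One way to picture the stochastic-deterministic boundary is as a single function that free-form model output must pass through before anything downstream acts on it. The sketch below is an assumption-laden illustration rather than code from the paper: the action names, the `amount_cents` field, and the limit are invented for the example.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary, assumed for illustration.
ALLOWED_ACTIONS = {"refund", "escalate", "noop"}


@dataclass(frozen=True)
class Action:
    name: str
    amount_cents: int


class BoundaryError(ValueError):
    """Raised when model output cannot be turned into a safe action."""


def cross_boundary(raw_model_output: str, max_amount_cents: int = 10_000) -> Action:
    """Convert stochastic text into a validated, typed action or refuse."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise BoundaryError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise BoundaryError("expected a JSON object")

    name = data.get("action")
    amount = data.get("amount_cents", 0)
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        raise BoundaryError(f"unknown action: {name!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise BoundaryError("amount_cents must be an integer")
    if not 0 <= amount <= max_amount_cents:
        raise BoundaryError("amount_cents out of range")
    return Action(name, amount)
```

Everything after `cross_boundary` can be ordinary deterministic code, because it only ever sees an `Action`, never raw model text.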
A five-step selection methodology and a diagnostic procedure tie production failures back to architectural weaknesses across five real workloads. The paper also names a failure mode worth knowing: replay divergence, where LLM consumers of deterministic event logs produce different outputs after a model or prompt change, breaking idempotency assumptions other parts of the stack depend on. The reliability framework decomposes overall behaviour into per-call model variance and architectural momentum.
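Replay divergence is easy to reproduce in miniature. The sketch below uses two stand-in functions, `model_v1` and `model_v2`, in place of real LLM calls (both are hypothetical), and shows one common mitigation: record the model's output the first time an event is processed and serve the recording on replay. This is a general technique offered as an assumption, not necessarily the remedy the paper recommends.

```python
from typing import Callable


def model_v1(payload: str) -> str:
    """Stand-in for an LLM call before a model or prompt change."""
    return payload.upper()


def model_v2(payload: str) -> str:
    """Stand-in for the same call after the change."""
    return payload.lower()


class RecordingConsumer:
    """Event consumer that stays idempotent across model changes."""

    def __init__(self, model: Callable[[str], str]):
        self.model = model
        # event id -> model output captured at first processing
        self.recorded: dict[str, str] = {}

    def handle(self, event_id: str, payload: str) -> str:
        if event_id not in self.recorded:
            self.recorded[event_id] = self.model(payload)
        return self.recorded[event_id]


events = [("e1", "Refund order 17"), ("e2", "Close ticket 9")]

consumer = RecordingConsumer(model_v1)
first_run = [consumer.handle(eid, p) for eid, p in events]

consumer.model = model_v2  # simulate a model or prompt upgrade
recorded_replay = [consumer.handle(eid, p) for eid, p in events]
naive_replay = [model_v2(p) for _, p in events]  # re-invokes the model

print(recorded_replay == first_run)  # True: the replay is stable
print(naive_replay == first_run)     # False: the replay diverged
```

The trade-off is that recorded outputs never benefit from the newer model, so a real system would also need a deliberate re-processing path.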
Why it matters: This is the missing manual for anyone past the prototype stage. The variance-vs-momentum split explains why some teams' systems quietly degrade as their underlying models get more deterministic: their architecture was carrying variance for them, and they never noticed. As model variance shrinks, runtime pattern choice becomes more, not less, important.
Trustworthy Software Project Generation with an ITP
Fang and Xiong's RISC-V case study (arXiv:2605.26017, May 26) is the concrete empirical complement to the other three papers. Their fully automatic agent develops a complete RISC-V RV32I CPU interpreter — all 47 instructions — end-to-end in 30 minutes, with no human in the loop after requirements are supplied. The architectural trick is to separate effectful code from pure logic: pure semantics are proved in Rocq (formerly Coq), effects are implemented in C++, and the agent generates and reconciles both.
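The effect/logic split can be sketched in a few lines. This is a Python stand-in for illustration only: the paper's pure side lives in Rocq and its effectful side in C++, and nothing here is proved. The sketch handles a single RV32I instruction (ADDI), and the names `State`, `step`, and `run` are invented for the example.

```python
from typing import NamedTuple

MASK32 = 0xFFFFFFFF


class State(NamedTuple):
    pc: int
    regs: tuple[int, ...]  # 32 registers; x0 always reads as zero


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def step(state: State, instr: int) -> State:
    """Pure semantics: no I/O, no mutation, same input gives same output.

    Only ADDI (opcode 0x13, funct3 0) is supported in this sketch.
    """
    opcode = instr & 0x7F
    rd = (instr >> 7) & 0x1F
    funct3 = (instr >> 12) & 0x7
    rs1 = (instr >> 15) & 0x1F
    imm = sign_extend(instr >> 20, 12)
    if opcode != 0x13 or funct3 != 0:
        raise ValueError(f"unsupported instruction: {instr:#010x}")
    regs = list(state.regs)
    if rd != 0:  # writes to x0 are discarded
        regs[rd] = (state.regs[rs1] + imm) & MASK32
    return State(pc=(state.pc + 4) & MASK32, regs=tuple(regs))


def run(program: list[int]) -> State:
    """Effectful shell: fetches instructions and reports; holds no semantics."""
    state = State(pc=0, regs=(0,) * 32)
    while state.pc // 4 < len(program):
        state = step(state, program[state.pc // 4])
    print(f"halted at pc={state.pc}, x1={state.regs[1]}")
    return state


# addi x1, x0, 5 followed by addi x1, x1, -1
final = run([0x00500093, 0xFFF08093])
```

Because `step` is a pure function over an immutable state, it is the part a proof assistant can reason about; the loop in `run` is where fetching, printing, and other effects stay quarantined.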
Results: 1,859 lines of verified Rocq, 2,848 lines of extracted C++, 100% pass on 265 LLM-generated tests, zero crashes and zero hangs across 12 hours of AFL++ fuzzing. The most interesting ablation: swapping Rocq for Dafny breaks the loop. The authors' claim is that Rocq's failure messages expose concrete proof state — goals, assumptions, subgoals — that the agent can act on, whereas Dafny's counterexample model is harder to repair against.
Why it matters: When you build an agent loop around a verifier, evaluate the granularity and actionability of its failure messages, not just its acceptance criteria. Verifiers are reward shapers, not just gates. The effect/logic split also generalizes well beyond CPUs — parsers, query planners, state machines, and permission evaluators are all candidates for the same architectural prior.
The Common Thread
Code is infrastructure, not output. Three of the four papers explicitly reframe code as the substrate agents reason and act through, not the artifact they emit. Production tooling that treats the agent as a code generator will keep underperforming tooling that treats it as a runtime.
Verifiers are reward shapers. The verified-CPU case study and the SCOPE-V loop both land on the same point: the verifier's failure messages, not its acceptance criteria, are what drive agent improvement. Pick verifiers by repair-signal quality.
The interesting work has moved up the stack. Raw generation capability is treated as roughly solved across all four papers — the bottleneck is process control, runtime architecture, and verification design. The mental model that fits best right now is distributed-systems engineering with stochastic actors, not prompt engineering.