Seven papers crossed my feed this week that I think every practitioner building or deploying coding agents should read. This isn't a listicle — I'm going to tell you what each one actually means for the work.
1. Harness Engineering Is the Leverage Point
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (Lin et al., April 28 2026)
Most of the conversation about improving coding agents focuses on the model. This paper focuses on the harness — the scaffolding that connects the LLM to repos, tools, and execution environments — and argues it's the primary performance lever that's still being built by hand.
The AHE framework automates harness evolution using three observability pillars: every editable element has a file-level representation so the action space is explicit and revertible; raw trajectories are distilled into an evidence corpus the agent can actually consume; and every edit is a self-declared prediction verified against the next round's outcomes. That last point is the key idea — it turns every harness change into a falsifiable contract. The result is a lift from 69.7% to 77.0% pass@1 on Terminal-Bench 2 without manual intervention.
Why it matters: If you're spending time manually tuning your agent scaffold, this paper is the roadmap for automating that. The observability-first framing is directly actionable.
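To make the "falsifiable contract" idea concrete, here is a minimal sketch of an edit that carries its own prediction and is reverted when the prediction fails. Everything in it is my own illustration, not the paper's code: `HarnessEdit`, `apply_if_verified`, and the toy `toy_evaluate` scorer are hypothetical names, and a real evaluator would run benchmark tasks rather than inspect a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical representation: harness files as path -> contents, so every
# edit is explicit and trivially revertible. Not taken from the AHE paper.
Harness = Dict[str, str]


@dataclass
class HarnessEdit:
    path: str                  # which harness file the edit touches
    new_content: str           # proposed replacement contents
    predicted_min_gain: float  # self-declared prediction, in score points


def apply_if_verified(
    harness: Harness,
    edit: HarnessEdit,
    evaluate: Callable[[Harness], float],
) -> Tuple[Harness, bool]:
    """Apply an edit, re-evaluate, and keep it only if the prediction held."""
    baseline = evaluate(harness)
    candidate = {**harness, edit.path: edit.new_content}
    observed_gain = evaluate(candidate) - baseline
    if observed_gain >= edit.predicted_min_gain:
        return candidate, True
    return harness, False  # prediction falsified: keep the old harness


def toy_evaluate(harness: Harness) -> float:
    # Stand-in for a benchmark run; assumed scoring rule for the demo only.
    return 75.0 if "retry" in harness.get("system_prompt.txt", "") else 70.0


harness = {"system_prompt.txt": "You are a coding agent."}
edit = HarnessEdit(
    path="system_prompt.txt",
    new_content="You are a coding agent. On failure, retry once.",
    predicted_min_gain=3.0,
)
harness, kept = apply_if_verified(harness, edit, toy_evaluate)
print("edit kept:", kept)
```

The design choice worth copying is that the prediction is recorded before the evaluation runs, so an edit cannot be rationalized after the fact.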
2. Sandbox Execution Is Not Optional
AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering (Kumar et al., April 13 2026)
LLMs generate plausible code but can't verify correctness internally. AgentForge makes execution-grounded verification a first-class design principle: every code change must survive a sandboxed Docker execution before it propagates to the next agent. Planner, Coder, Tester, Debugger, and Critic agents share memory; execution feedback replaces next-token likelihood as the primary signal.
The benchmark result is 40.0% resolution on SWE-bench Lite, outperforming single-agent baselines by 26–28 percentage points. The ablations confirm that execution feedback and role decomposition each independently drive the gains.
Why it matters: The sandbox execution loop isn't a nice-to-have. If you're building a multi-agent pipeline without mandatory execution verification at each step, you're getting plausible-looking failures. The role decomposition pattern (planner → coder → tester → debugger → critic) is a directly reproducible architecture.
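The execution gate itself is small. The sketch below shows the shape of it under a loud assumption: a plain subprocess in a temporary directory stands in for the Docker sandbox the paper uses, so it provides no real isolation. The function name `survives_execution` and the file names `candidate.py` and `check.py` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def survives_execution(code: str, test_code: str, timeout_s: float = 30.0) -> bool:
    """Return True only if test_code runs cleanly against code.

    Assumption: a local subprocess replaces the Docker sandbox described
    in the paper. Use a real container for untrusted code.
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "candidate.py").write_text(code)
        Path(workdir, "check.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "check.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
        return result.returncode == 0


check = "from candidate import add\nassert add(2, 3) == 5\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

# Only the change that passes execution would be handed to the next agent.
print("good passes:", survives_execution(good, check))
print("bad passes:", survives_execution(bad, check))
```

In a multi-agent pipeline this check would sit between every pair of agents, so a change that fails never reaches the next role.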
3. Your Prompts Are Making Architectural Decisions
Architecture Without Architects: How AI Coding Agents Shape Software Architecture (Konrad et al., April 5 2026)
This one hit differently. The paper identifies five mechanisms by which coding agents make implicit architectural choices — framework selection, infrastructure scaffolding, integration wiring, dependency resolution, and state management — and documents that prompt wording alone produces structurally different systems for the same task. They call this "vibe architecting".
The paper proposes six "prompt-architecture coupling patterns" that map prompt features to the infrastructure they entail. Some couplings (structured output validation) weaken as models improve; others (tool-call orchestration) are fundamental regardless of model capability. The recommendations include architectural decision records (ADRs) and review practices to bring hidden decisions under governance.
Why it matters: Every team using AI coding agents is already making architectural decisions by proxy through their prompts. If you don't have a governance process for that, you're accumulating invisible architectural debt.
4. The Spec-First Inversion Is Empirically Supported
Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants (Piskala, January 30 2026)
The argument here is that we should invert the traditional workflow: specifications become the primary artifact, and code becomes a generated or verified secondary output. The paper lays out three levels of specification rigor (spec-first, spec-anchored, and spec-as-source), with practical guidance on when each applies.
The most interesting workflow is the "self-spec" loop: an LLM authors its own spec from a high-level prompt, a human reviews and refines it, then a second agent implements against the refined spec. This explicitly separates planning from execution and achieves error reductions of up to 50% in controlled studies.
Why it matters: This is the cleanest articulation I've seen of why spec-driven development with AI works. The self-spec loop is something you can implement today, and the reported error reduction of up to 50% for human-refined specs is the number you want when advocating for it internally.
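Structurally, the self-spec loop is three stages with a hard boundary between planning and execution. A minimal sketch, assuming each stage is just a callable: `self_spec_loop` and the three `fake_*` stand-ins are hypothetical names of mine, and in practice the first and third would wrap model calls while the second would block on a human.

```python
from typing import Callable, Dict


def self_spec_loop(
    prompt: str,
    draft_spec: Callable[[str], str],    # stage 1: an LLM authors the spec
    human_review: Callable[[str], str],  # stage 2: a human refines it
    implement: Callable[[str], str],     # stage 3: a second agent implements
) -> Dict[str, str]:
    """Run the three stages in order and return every intermediate artifact."""
    draft = draft_spec(prompt)
    refined = human_review(draft)
    # The implementer receives only the refined spec, never the raw prompt,
    # which keeps planning and execution separate.
    code = implement(refined)
    return {"draft_spec": draft, "refined_spec": refined, "code": code}


# Stand-ins so the sketch runs without any model or reviewer attached.
def fake_draft(prompt: str) -> str:
    return f"SPEC: {prompt}"


def fake_review(spec: str) -> str:
    return spec + " | must reject empty input"


def fake_implement(spec: str) -> str:
    return f"# implementation of: {spec}"


result = self_spec_loop("parse a CSV upload", fake_draft, fake_review, fake_implement)
print(result["code"])
```

Returning the draft and the refined spec alongside the code makes the human's changes auditable later.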
5. Team Structure Beats Pipeline Structure
Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering (Benkovich & Valkov, February 2026)
Agyn models software engineering as an organizational process rather than a pipeline: coordinator, researcher, implementer, and reviewer agents replicate an engineering team structure with explicit role separation and communication. The reported result is 72.2% task resolution on SWE-bench 500, state-of-the-art among systems using a comparable LLM.
Why it matters: The 72.2% figure alone makes this worth reading. But the deeper lesson is that role decomposition borrowed from actual software engineering team structures outperforms task-decomposition pipelines. If you're designing multi-agent systems, start from the org chart, not the flowchart.
6. Agent Code Quality Degrades — Here's the Evidence
Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time (Popescu et al., April 2026)
The first large-scale empirical study of real-world autonomous agent contributions: ~110,000 open-source PRs across five production coding agents (OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, Devin). Key finding: human code quality stays flat over iterative revisions, while agent-generated code quality degrades with each revision. Agent contributions already account for ~10% of public GitHub PRs.
Why it matters: This is ground-truth data, not benchmarks. The code quality degradation finding directly argues against long agentic loops without human checkpoints. If your agent workflow runs more than 2–3 revision cycles before a human reviews the output, this paper is the empirical case for redesigning that flow.
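One way to enforce that limit is to make the review gate part of the loop rather than a team convention. The sketch below is an assumption-heavy illustration: `run_with_review_gates` is a hypothetical name, the cap of 3 is my rule of thumb from above rather than a threshold from the paper, and the integer "draft" is a toy stand-in for a code change.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def run_with_review_gates(
    draft: T,
    agent_revise: Callable[[T], T],
    is_done: Callable[[T], bool],
    human_review: Callable[[T], T],
    max_agent_revisions: int = 3,  # assumed cap, not a number from the paper
    max_steps: int = 50,           # hard stop so the loop always terminates
) -> Tuple[T, List[str]]:
    """Let the agent revise, but force a human review every N revisions."""
    current = draft
    since_review = 0
    log: List[str] = []
    for _ in range(max_steps):
        if is_done(current):
            break
        if since_review >= max_agent_revisions:
            current = human_review(current)  # the gate: a human sees the work
            log.append("human")
            since_review = 0
        else:
            current = agent_revise(current)
            log.append("agent")
            since_review += 1
    return current, log


# Toy run: each revision adds 1, the task is done at 5, the human changes nothing.
final, log = run_with_review_gates(
    draft=0,
    agent_revise=lambda n: n + 1,
    is_done=lambda n: n >= 5,
    human_review=lambda n: n,
)
print(final, log)
```

The returned log makes it easy to check after the fact that no stretch of agent-only revisions exceeded the cap.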
7. Non-Functional Requirements Need Hard Structural Checks
Do AI Coding Agents Log Like Humans? An Empirical Study (Ouatiti et al., April 2026)
First empirical study of how coding agents handle software logging: 4,550 agentic PRs across 81 open-source repositories. Agents change logging less often than humans in 58.4% of repositories. More damning: explicit logging instructions in prompts are largely ineffective — agents fail to comply with constructive logging requests 67% of the time.
Why it matters: Logging is the canary for all non-functional requirements. If agents systematically undertreat logging even when explicitly prompted, the same is almost certainly true for error handling, metrics instrumentation, security annotations, and other non-functional concerns. Build structural checks, not prompt reminders.
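As one example of what a structural check can look like, here is a minimal sketch that flags `except` blocks containing no logging call, using Python's standard `ast` module. The rule itself is my own illustration, not something from the paper, and the detection is a heuristic: it treats any method call named `debug`, `info`, `warning`, `error`, `exception`, or `critical` as logging, whatever object it is called on.

```python
import ast
from typing import List

LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}


def _is_log_call(node: ast.AST) -> bool:
    """Heuristic: a call shaped like <anything>.warning(...), .error(...), etc."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in LOG_METHODS
    )


def unlogged_except_lines(source: str) -> List[int]:
    """Return line numbers of except blocks that contain no logging call."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            if not any(_is_log_call(child) for child in ast.walk(node)):
                missing.append(node.lineno)
    return sorted(missing)


sample = '''
import logging
log = logging.getLogger(__name__)

def load(path):
    try:
        return open(path).read()
    except FileNotFoundError:
        log.warning("missing file: %s", path)
        return ""
    except OSError:
        return ""
'''

# The second handler swallows the error silently, so it gets flagged.
print(unlogged_except_lines(sample))
```

Wired into CI as a failing check on agent-authored PRs, a rule like this does not depend on the agent having read the prompt.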
The Common Thread
Reading these seven papers together, three themes emerge:
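Execution verification beats model confidence. Papers 1 and 2 converge: verify against real outcomes rather than the model's own confidence, whether the change is a harness edit (Paper 1) or a code change (Paper 2). Paper 6's field data points the same way, since agent-generated code quality degrades with each revision. This is table stakes now.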
Human checkpoints at the right granularity. Paper 6's degradation finding and Paper 4's spec-review workflow both point to the same design pattern: agents do better work in bounded, well-defined tasks with human review gates between them, not in long open-ended loops.
Non-functional requirements need structural enforcement. Paper 7 documents the failure mode directly: agents undertreat logging even when explicitly prompted. Paper 3 shows the related problem for architecture, where prompt wording alone decides structure without anyone reviewing it. Build structural checks, not prompt reminders.
If there's a paper you think I should cover next week, reply or find me on the usual channels.