Agentic Coding Paper of the Day — May 26, 2026

What caught my eye in this morning's arxiv list was a paper claiming a fully automatic LLM agent had generated a complete, verified RISC-V interpreter — all 47 instructions, machine-checked — in 30 minutes with no human in the loop. The headline is impressive on its own, but the real interest is in the architectural trick that made it work: an interactive theorem prover, sitting inside the agent loop, producing concrete proof states the agent could iterate on. The thesis of this post: ITP-in-the-loop is a real pattern, and it works because the verifier gives the agent a gradient, not just a verdict.

What it does

The paper, Trustworthy Software Project Generation: a Case Study with an Interactive Theorem Prover by Jian Fang and Yingfei Xiong, tackles a problem most of us have run into: LLM agents are passable at small programming tasks, but generating an entire project from natural-language requirements collapses on subtle bugs that survive compilation and testing. Verified programming is the obvious fix in theory — write a spec, the machine proves the implementation meets it — but prior agentic attempts pointed at Dafny or Frama-C and could not get past trivial scale.

What this paper does differently is split the project along the effects boundary. Pure total functions — the instruction semantics, the register file updates, the ALU operations — go to an interactive theorem prover (Rocq, formerly Coq). Effects — I/O, memory, the runtime shell — live in the target language (C++ in their case). The agent generates both sides, proves the pure side in Rocq, then extracts the proven code to C++ and links it into the effectful scaffold. The whole thing runs end-to-end with no human intervention once requirements are supplied.

The key result

The headline number: the agent produces 1,859 lines of verified Rocq code and 2,848 lines of extracted C++ in 30 minutes, covering all 47 instructions of the unprivileged RISC-V RV32I base. The resulting interpreter passes all 265 LLM-generated tests and survives 12 hours of AFL++ fuzzing with zero crashes and zero hangs. The thing actually works.

The most telling result, though, is the ablation: under the same agent configuration with a Dafny backend instead of Rocq, the system fails to complete verification at all. Same agent, same prompts, same scaffolding — different prover, different outcome. That gap is the part of the paper worth dwelling on.

Why it matters

The reason Rocq works and Dafny doesn't, per the authors, isn't that Rocq is "stronger." It's that Rocq exposes its proof state when a proof attempt fails. The agent sees the partial goal, the assumptions, the remaining subgoals — the equivalent of a stack trace plus a typed hole. Dafny tends to fail with "verification failed" plus a counterexample model, which is much harder to convert into a directed repair. The agent doesn't need an oracle; it needs a verifier that gives it something to do next.

For builders, the practical lesson is about the shape of feedback your agent loop needs. We've been spending a lot of effort on verbose test runners, structured assertion errors, and richer LSP signals — but most of those still tell the agent "your code is wrong" rather than "your code is wrong in this specific way relative to the spec." An ITP-style proof state is what those signals look like when you take them seriously. If you're building anything that asks an agent to satisfy a non-trivial property (safety, semantic equivalence, refactor correctness), the verifier you reach for should be picked on the quality of its failure mode, not just its acceptance rate. That changes the calculus for contract checking inside Claude Code sub-agents, refinement-type pipelines, and the various proof-carrying-code experiments bubbling up in the agent space.

The other thing worth filing: the effect/logic separation is a useful architectural prior for any "verified component" workflow, not just CPUs. If you can carve out the pure semantic core of whatever you're generating — a parser, a query planner, a state machine, a permission evaluator — and hand only that part to the prover, you sidestep the fact that ITPs don't model effects. The effectful glue can be unverified and well-tested. This is roughly the model CompCert and seL4 use, and it's the move that lets a 30-minute agent run actually finish.

The caveats

Scope is narrow and very ITP-friendly. A RISC-V instruction interpreter is almost the platonic ideal of a "pure total function over finite state" — most production code we care about isn't. Whether this generalizes to parsers, schedulers, or anything with non-trivial heap behavior is open.
The 30-minute number is one run on one task. No variance numbers, no failure-mode catalog, no robustness study against requirement perturbation.
The LLM behind the agent isn't named in the abstract. We don't know how much of the result is "ITPs in the loop" versus "frontier model + decent scaffolding," which matters for replication.
The Dafny comparison may not be fully controlled. Dafny's failure here is presented as "different feedback granularity," but other axes (proof tactic library, training-data familiarity, prompt tuning) aren't held constant.

The takeaway

What I'm filing away: the most interesting axis to evaluate a verifier-in-the-loop is the granularity and actionability of its failure messages, not its acceptance criteria. Verifiers are reward shapers, not just gates. Next time I'm staring at a sub-agent flailing against an opaque "wrong" signal, the first question is going to be whether the signal is informative enough — and if not, what the proof-state-equivalent for that domain would look like.

What it does#

The key result#

Why it matters#

The caveats#

The takeaway#

Working on something similar?

What it does

The key result

Why it matters

The caveats

The takeaway