Documentation as IR: RustPrint and the case for spec-anchored agent loops

While I was skimming the morning's cs.SE drop on arXiv, one paper kept pulling me back: a group out of Melbourne and FPT proposing that the right intermediate representation for a C-to-Rust migration isn't an AST, isn't a typed IR, and isn't a `PLAN.md`. It's the documentation the agent writes for itself before it ever touches a Rust file. The numbers they report against a Claude Code baseline are large enough that the idea is worth taking seriously. The thesis I'm walking away with: for repo-scale translation work, documentation isn't a deliverable, it's the migration plan.

What it does

RustPrint, introduced in "Documentation-Guided Agentic Codebase Migration from C to Rust" (Le-Anh, Nguyen Hoang, Le, Bui, May 14), treats architecture-aware documentation as a whole-codebase intermediate representation. Before any translation, a DocGen module clusters the C source by component and lifts each cluster to a feature-oriented summary. The summary emphasizes what each subsystem does and how that behavior should be preserved in Rust, rather than describing the C implementation line by line. That doc, not the C code itself, becomes the spec the agent translates against.

The pipeline runs five stages: doc generation, per-crate planning with compile loops, workspace synthesis, up to five rounds of requirement-driven refinement (where translated-Rust docs are compared against source docs to surface drift), and up to five rounds of execution-aware revision driven by the translated test suite. What separates this from prior project-level work like EvoC2Rust or skeleton-guided approaches is that the loop isn't anchored to syntactic skeletons or function boundaries — it's anchored to a semantic description of the system that the agent itself can compare against after each pass.
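To make the shape of those two bounded loops concrete, here is a minimal Python sketch. It is not RustPrint's code: `bounded_refine`, `find_issues`, and `repair` are hypothetical names, and the toy callbacks stand in for what would really be a doc comparison or a test run. The only detail taken from the description above is the cap of five rounds.

```python
MAX_ROUNDS = 5  # "up to five rounds" per loop, as described above


def bounded_refine(state, find_issues, repair, max_rounds=MAX_ROUNDS):
    """Repair `state` until `find_issues` reports nothing or the budget runs out.

    Returns (final_state, rounds_used). Both callbacks are hypothetical
    stand-ins: in a real harness `find_issues` would compare docs or run
    tests, and `repair` would prompt the agent with the reported issues.
    """
    for round_no in range(max_rounds):
        issues = find_issues(state)
        if not issues:
            return state, round_no  # converged before spending the budget
        state = repair(state, issues)
    return state, max_rounds  # budget exhausted; caller decides what to do


# Toy example: the "state" is the set of features the translation covers.
required = {"parse", "emit", "free"}  # made-up feature names


def find_issues(state):
    return sorted(required - state)  # features still missing


def repair(state, issues):
    return state | {issues[0]}  # pretend each round fixes one issue


final, rounds = bounded_refine(frozenset({"parse"}), find_issues, repair)
```

The same helper could drive both the requirement-driven and the execution-aware loops, with only the `find_issues` callback swapped. That is a design guess on my part, not something the paper states.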

The key result

On eight real-world C repositories ranging from 11.4K to 83.7K LOC (libplist, check, stb, klib, libcbor, Monocypher, libfixmath, libyaml), RustPrint hit 100% compilation success with both Kimi-K2-Instruct and GPT-5.4, while Self-Repair and EvoC2Rust failed to produce end-to-end compilable repositories at this scale. The headline that made me sit up:

feature preservation of 93.26% vs Claude Code's 52.52%, and cross-test pass rate of 95.17% vs Claude Code's 79.85% — both on Kimi-K2, the open-weight model.

With GPT-5.4 those rise to 97.76% and 98.70%, respectively. On safety, where C2Rust manages a 0% unsafe-free rate, RustPrint lands at 99.41% API-level safe with GPT-5.4. Eight repositories is a small N, but the gap against Claude Code on the same task is large enough that benchmark noise isn't a clean explanation.

Why it matters

The interesting thing isn't C-to-Rust specifically — it's the move to make documentation the primary artifact the agent reasons against. If you're building a long-horizon coding agent today, the natural representation between phases is either source code, a structured plan, or scratchpad notes. RustPrint is arguing those are all worse than a system-level description that's specifically engineered to be diffable. The refinement loop works because two documents (source-C doc and translated-Rust doc) can be compared semantically and the deltas surface as repair signals — something you can't easily do with two ASTs or two plan markdowns.
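Here is a rough Python sketch of what "deltas as repair signals" could look like if each doc is reduced to a map from subsystem to feature statements. Everything in it is an assumption for illustration: the `doc_deltas` name, the dict-of-sets shape, and the made-up subsystem and feature strings. It also matches features by exact string, whereas the paper's comparison is semantic, so treat it as the simplest possible stand-in.

```python
def doc_deltas(source_doc, translated_doc):
    """Return a list of (kind, subsystem, feature) repair signals.

    Both arguments map a subsystem name to a set of feature statements.
    Exact string matching is a deliberate simplification; a real harness
    would need a semantic comparison here.
    """
    deltas = []
    # Features the source doc promises but the translated doc does not.
    for subsystem in sorted(source_doc):
        translated = translated_doc.get(subsystem, set())
        for feature in sorted(source_doc[subsystem] - translated):
            deltas.append(("missing", subsystem, feature))
    # Features that appear only in the translated doc (possible drift).
    for subsystem in sorted(translated_doc):
        original = source_doc.get(subsystem, set())
        for feature in sorted(translated_doc[subsystem] - original):
            deltas.append(("unexpected", subsystem, feature))
    return deltas


# Made-up docs for a hypothetical two-subsystem library.
source_doc = {
    "tokenizer": {"handles escape sequences", "reports line numbers"},
    "writer": {"emits canonical output"},
}
translated_doc = {
    "tokenizer": {"reports line numbers"},
    "writer": {"emits canonical output", "emits debug output"},
}

signals = doc_deltas(source_doc, translated_doc)
```

An empty `signals` list is the natural stopping condition for a refinement loop. Each non-empty entry is small and self-describing enough to paste into a repair prompt, which is hard to say of a raw AST diff.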

For Claude Code and sub-agent architectures, the practical takeaway is that a "what each subsystem does" doc, written by the agent in a structured form, is a better long-term anchor than a `PLAN.md` for any task that spans more than a handful of files. The corollary for spec-driven dev is sharper: if the spec is also the IR the agent compares its work against, you get cheap verification of intent preservation on every iteration. That's a workflow change, not a model change: you can do it today with whatever frontier model you're using.

The caveats

N=8 repositories, all C, all with test suites — generalization to heavier FFI usage, non-standard build pipelines, and concurrency-heavy systems is explicitly flagged as future work.
The Claude Code baseline numbers (52.52% feature preservation) deserve scrutiny — was it run with the same harness, the same iteration budget, and the same access to test feedback? The paper doesn't dwell on harness parity.
The five-rounds-each refinement budget is generous. Cost realism (tokens, wall-clock) isn't headlined, and with GPT-5.4 the per-repo bill is likely non-trivial.
Documentation comparison only works "when paired with compilation and translated tests" — strip the test suite and the signal degrades. So this isn't really pure doc-driven; it's doc-as-spine with execution feedback.

The takeaway

Filing this one under: the next interesting design move in agent harnesses is what you put between phases, not what you put in the planning prompt. I'm going to try a doc-first refactor loop on the next non-trivial migration task I run through Claude Code — generate a system description first, translate against the description, then diff the translated description against the source description as the loop's stopping criterion. If it works on something smaller than libyaml, the pattern is worth keeping.
