Code Cleanliness Is a Cost Lever, Not a Capability Lever, for Coding Agents — May 20, 2026

The arXiv listing for May 19 had the usual mix of yet-another-benchmark papers and survey position pieces, but one caught my eye because it asked a question I had not seen asked rigorously before: does the cleanliness of a codebase actually change how an agent performs on it? A SonarSource team built a minimal-pair experiment around exactly that, ran 660 trials with Claude Code, and got a result more interesting than the obvious "yes". The thesis of this post: code cleanliness is a cost lever for coding agents, not a capability lever, and that distinction matters for how you prep a repo for agents.

What it does

"Does Code Cleanliness Affect Coding Agents?" by Priyansh Trivedi and Olivier Schmitt (SonarSource) sets up an experimental protocol that almost no agent paper bothers with: minimal pairs. Same repo, same architecture, same dependencies, same external behaviour; only the cleanliness changes. They build two automated pipelines to construct the pairs. "Slopify" takes a clean codebase and degrades it by inlining helpers, duplicating logic, and introducing SonarQube rule violations. "Vibeclean" does the reverse on a messy codebase. Cleanliness is operationalised as SonarQube violation density, ranging from under one issue per KLOC on the clean side to twenty-plus on the messy side.
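The density metric itself is plain arithmetic. Here is a minimal sketch, assuming you already have an issue count and a line count from a static-analysis export. The function name and the sample numbers are mine, chosen only to land in the ranges quoted above; this is not the authors' pipeline.

```python
def violation_density(issue_count: int, lines_of_code: int) -> float:
    """Return issues per thousand lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return issue_count / (lines_of_code / 1000)


# Hypothetical repositories of equal size, one on each side of a pair.
clean = violation_density(issue_count=12, lines_of_code=40_000)   # 0.3 per KLOC
messy = violation_density(issue_count=900, lines_of_code=40_000)  # 22.5 per KLOC
```

Normalising by KLOC is what lets the two sides of a pair be compared even when inlining or de-duplication changes the raw line count.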

The benchmark itself is 33 tasks across six minimal-pair repositories — three Java, three Python, half public (commons-bcel, genie, ckan) and half internal SonarSource codebases. Every task is graded by hidden tests at the application's public surface (CLI, HTTP, library API) rather than by inspecting agent diffs. Each task is run ten times on both sides of every pair, for 660 trials in total. All trials use Claude Code with its default tool set on Claude Sonnet 4.6; the authors tried Claude Haiku 4.5 and dropped it for being too weak to produce a usable signal.
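The trial count follows directly from the design. A small sketch of the grid, with constant names that are mine rather than the paper's:

```python
from itertools import product

TASKS = 33           # tasks across the six minimal-pair repositories
RUNS_PER_SIDE = 10   # repeats of each task on each side of a pair
SIDES = ("clean", "messy")

# One entry per (task, side, repeat) combination.
trials = list(product(range(TASKS), SIDES, range(RUNS_PER_SIDE)))

total_trials = len(trials)                    # 660
trials_per_side = total_trials // len(SIDES)  # 330
```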

This is the cleanest agent-evaluation protocol I have seen this year. Most "does X matter for agents?" papers cannot actually isolate X. Minimal pairs do.

The key result

The headline finding is that cleanliness barely moves the pass rate. Across all 660 trials, agents working on cleaner code passed 91.3% of tasks. Agents on messier code passed 92.1%. That is a -0.9 percentage point difference and well inside noise. As a capability lever, cleanliness is a non-event.
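A rough way to sanity-check the "inside noise" reading is to compare the gap with the standard error of a difference between two proportions. The sketch below assumes 330 independent trials per side, which is an assumption of mine and optimistic, since repeated runs of the same task are correlated. It also uses the rounded pass rates, so the gap comes out near -0.8 points rather than the quoted -0.9.

```python
import math


def diff_standard_error(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)


p_clean, p_messy = 0.913, 0.921  # pass rates quoted above (rounded)
n_per_side = 330                 # 660 trials split evenly; independence is assumed

se = diff_standard_error(p_clean, p_messy, n_per_side, n_per_side)
gap = p_clean - p_messy
z = gap / se  # gap expressed in standard errors
```

Under these assumptions the standard error is a little over two percentage points, so a gap of under one point is well below one standard error.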

What it does move is the agent's operational footprint. On cleaner code, Claude Sonnet 4.6 used 7.1% fewer input tokens, 8.5% fewer output tokens, 11.1% fewer reasoning characters, and 7.0% fewer conversation turns to reach the same answer. The most striking single number is file revisitation: agents on cleaner code reopened files 33.8% less often. On the multi-module track specifically — the closest stand-in for realistic enterprise repos — that becomes -50.8% revisitations and -10.7% input tokens.
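Percentages like these are relative changes against the messy side. The sketch below shows one way to compute such a figure from per-trial measurements, assuming a simple mean over trials. I do not know how the paper aggregates, and the sample counts are invented.

```python
from statistics import mean


def relative_change(clean_values, messy_values) -> float:
    """Percent change of the clean-side mean relative to the messy-side mean."""
    baseline = mean(messy_values)
    return 100.0 * (mean(clean_values) - baseline) / baseline


# Hypothetical per-trial file revisitation counts for one task pair.
messy_revisits = [8, 6, 10, 8]
clean_revisits = [5, 4, 6, 5]

change = relative_change(clean_revisits, messy_revisits)  # -37.5
```

A negative value means the clean side needed less of that resource.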

In other words: messy code does not break the agent. It just makes the agent flail more on its way to the same answer.

Why it matters

For anyone building or paying for coding-agent infrastructure, this reframes the cleanliness question entirely. If you treat agent performance as a binary "did it solve the task?", cleanliness looks like a non-issue and you can deprioritise it. If you treat agent performance as a unit-cost question (tokens per resolved task, latency per task, dollars per merged PR), cleanliness is suddenly a serious lever. A roughly 7% reduction in input tokens and 8.5% in output tokens adds up fast at the scale real engineering orgs are running these agents, and the 34% revisitation drop suggests the wall-clock improvement may be larger than the token numbers alone imply, since every revisit is another tool round-trip in the agent's loop.
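To make the unit-cost framing concrete, here is a back-of-the-envelope sketch of cost per resolved task. The baseline token counts and per-token prices are placeholders I made up; only the pass rates and the percentage reductions come from the paper as quoted above. It deliberately ignores caching and any other non-linear pricing.

```python
from dataclasses import dataclass


@dataclass
class SideStats:
    pass_rate: float      # fraction of trials that pass the hidden tests
    input_tokens: float   # mean input tokens per trial
    output_tokens: float  # mean output tokens per trial


def cost_per_resolved_task(s: SideStats, usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Expected spend per passing trial: cost per attempt divided by pass rate."""
    per_attempt = (s.input_tokens * usd_per_m_in + s.output_tokens * usd_per_m_out) / 1_000_000
    return per_attempt / s.pass_rate


# Hypothetical messy-side baseline; the clean side applies the quoted reductions.
messy = SideStats(pass_rate=0.921, input_tokens=2_000_000, output_tokens=40_000)
clean = SideStats(
    pass_rate=0.913,
    input_tokens=messy.input_tokens * (1 - 0.071),
    output_tokens=messy.output_tokens * (1 - 0.085),
)
PRICE_IN, PRICE_OUT = 3.0, 15.0  # placeholder USD per million tokens

messy_cost = cost_per_resolved_task(messy, PRICE_IN, PRICE_OUT)
clean_cost = cost_per_resolved_task(clean, PRICE_IN, PRICE_OUT)
saving_pct = 100 * (1 - clean_cost / messy_cost)
```

With these placeholder inputs the saving lands between 6 and 7 percent. That is slightly below the raw token reduction, because the marginally lower clean-side pass rate claws a little back.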

It also changes how you think about prepping a codebase for agentic work. The honest version of the takeaway is: rip-and-replace cleanup migrations are probably overkill, because the agent will still solve the task. But targeted cleanup of high-revisitation hotspots — the files agents keep coming back to in your traces — likely pays for itself purely in token and latency reduction. If you have agent logs, you already know which files those are. The same instinct applies in reverse: when you see an agent in production thrashing on a particular module, "is that module clean enough?" is now a legitimate root-cause question rather than a stylistic complaint.
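Finding those hotspots from traces can be as simple as counting repeat reads. In this sketch the trace format is hypothetical (one `(trial_id, path)` tuple per file-read tool call), and "revisit" is defined as any read of a file after the first within the same trial. That is my working definition and not necessarily the paper's.

```python
from collections import Counter


def revisit_counts(events):
    """Count reopenings per file: every read after the first within one trial."""
    seen = set()
    revisits = Counter()
    for trial_id, path in events:
        key = (trial_id, path)
        if key in seen:
            revisits[path] += 1
        else:
            seen.add(key)
    return revisits


# Hypothetical trace extracted from agent logs.
events = [
    ("t1", "billing/invoice.py"),
    ("t1", "billing/tax.py"),
    ("t1", "billing/invoice.py"),
    ("t1", "billing/invoice.py"),
    ("t2", "billing/invoice.py"),
    ("t2", "api/routes.py"),
    ("t2", "billing/invoice.py"),
]

# Files ranked by how often agents came back to them.
hotspots = revisit_counts(events).most_common(2)
```

Here only `billing/invoice.py` is ever reopened (three times across two trials), so it is the single cleanup candidate this toy trace surfaces.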

The caveats

The paper is from SonarSource, which sells static-analysis tooling. That does not invalidate the methodology — minimal-pair design controls for selection well — but operationalising cleanliness as SonarQube violation density is not neutral, and other cleanliness proxies might give different magnitudes.
The study uses a single model configuration: Claude Sonnet 4.6 inside Claude Code. Whether the roughly 7-8% token gap and the 34% revisitation gap hold for other models or other harnesses is unknown, and the Haiku 4.5 dropout already hints that weaker models might behave differently.
Cost is measured in tokens, not dollars or wall-clock seconds. Real serving costs do not scale linearly with tokens once you account for reasoning, tool calls, and parallel exploration.
No check on whether agents propagate cleanliness. The paper measures how agents respond to cleanliness, not whether they preserve or degrade it in their own output.
Thirty-three tasks is a small set, and the authors curated them, so selection bias is possible even with the minimal-pair scaffold.

The takeaway

What I am filing away: cleanliness is now an instrumentable property of an agent system, not a virtue. The minimal-pair protocol is the more durable contribution — it should become the default way to evaluate any "does X help agents?" claim, because almost no existing study can rule out the alternative explanation that X correlates with something else about the codebase. After reading this, I will look at our agent traces with revisitation-per-file as a first-class signal, and earmark the top-revisited files for cleanup before reaching for a more expensive model.
