When Doing Nothing Is the Right Patch: The Action Bias of Coding Agents

I was skimming this morning's arXiv list when one title stopped me: "Coding Agents Don't Know When to Act." The premise hit close to home: anyone running a fleet of agents on stale issue queues has watched models confidently "fix" things that didn't need fixing. The paper formalises this into a real benchmark with actual numbers, and the numbers are bad enough to change how I write prompts. Thesis: agentic coding has a baked-in action bias, and prompt design is where you fight it.

What it does

The setup is clever. Take 200 tasks from SWE-bench Verified, but instead of giving the agent the broken codebase, hand it the codebase with the golden patch already applied. The bug report is stale. The tests pass. The correct response is to submit an empty patch — maybe touch tests or docs, but otherwise do nothing. They call this benchmark FixedBench, and they run five recent models — Claude Sonnet-4.6, GPT-5.3 Codex, GPT-5.4 mini, Gemini-3 Pro and Qwen3.5-122B — across four agent harnesses (claude-code, Codex, Gemini-CLI, Qwen-Code).
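To make the scoring rule concrete, here is a minimal sketch of how a submitted patch could be classified as an acceptable abstention (empty, or touching only tests and docs). The path heuristics and the names `changed_files`, `is_test_or_doc` and `is_acceptable_abstention` are my own assumptions for illustration; the paper's exact definition of a test or doc file may differ.

```python
import re
from pathlib import PurePosixPath

# Hypothetical heuristics, not taken from the paper.
TEST_DIR_NAMES = {"test", "tests", "testing"}
DOC_DIR_NAMES = {"doc", "docs"}
DOC_SUFFIXES = {".md", ".rst", ".txt"}

# Matches the per-file header of a git-style unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def changed_files(diff_text: str) -> list[str]:
    """Return the post-change path of every file touched by the diff."""
    return [match.group(2) for match in DIFF_HEADER.finditer(diff_text)]


def is_test_or_doc(path: str) -> bool:
    """Guess whether a path is a test or documentation file."""
    p = PurePosixPath(path)
    parent_dirs = {part.lower() for part in p.parts[:-1]}
    name = p.name.lower()
    if parent_dirs & (TEST_DIR_NAMES | DOC_DIR_NAMES):
        return True
    if name.startswith("test_") or name.endswith("_test.py"):
        return True
    return p.suffix.lower() in DOC_SUFFIXES


def is_acceptable_abstention(diff_text: str) -> bool:
    """True for an empty patch or one that only touches tests and docs."""
    return all(is_test_or_doc(path) for path in changed_files(diff_text))
```

An empty diff has no changed files, so it counts as an abstention; a single edit under a source directory is enough to flip the result.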

What makes this different from prior agent benchmarks is the inversion: instead of measuring whether agents can solve problems, they measure whether agents can recognise when there is nothing to solve. That flip exposes a class of failure that every existing leaderboard hides by construction. SWE-bench rewards patches that pass hidden tests. FixedBench rewards the absence of a patch. Same agents, opposite scoring rubric.

The key result

The headline is unambiguous: even frontier agents propose undesirable changes (non-test, non-doc edits to a codebase that doesn't need any) in 35 to 65 percent of FixedBench cases. The numbers are similar across model families. Prompting helps, but only when you frame abstention itself as success. With the default "Issue" prompt, Sonnet-4.6 abstains correctly 65% of the time. Switch to an "Edit" prompt that nudges the agent toward action and GPT-5.4 mini collapses to 36.5%. Switch to an "Abstain or Fix" prompt that explicitly tells the agent abstaining is a valid outcome and Sonnet-4.6 climbs to 80.5%, GPT-5.4 mini to 88.5%. Same models. Same tasks. Prompt framing alone moves Sonnet-4.6 by 15.5 percentage points (Issue to Abstain or Fix) and GPT-5.4 mini by 52 (Edit to Abstain or Fix).

Why it matters

This paper matters more than another benchmark because the failure mode it describes is invisible to most people running agents in production. If you are triaging incoming GitHub issues with a coding agent (exactly the use case Anthropic, OpenAI and Cognition all market), then on somewhere between a third and two-thirds of already-resolved bugs your agent confidently proposes a patch that is pure technical debt. Those patches get reviewed, sometimes merged, sometimes left to rot in stale PRs. The paper's behavioural analysis nails the mechanism: agents that correctly abstained checked git history at almost double the rate (63.8% vs 31.4%) and more often tried to reproduce the issue first (49.2% vs 30.0%). Action bias is not a reasoning failure. It is a missing context-gathering step that the harness does not reward.

The implication for anyone building sub-agent architectures is direct. If you are scaffolding a coding agent — your own claude-code clone, a Claude Code custom agent, an internal Devin replica — the default reward gradient pushes toward edits because that is what training optimised. You have to actively engineer abstention as a first-class outcome. Concretely: a "verify the issue still reproduces" subagent should run before any patch-generation subagent. The system prompt should treat "no action needed" as a successful outcome with the same weight as "patch produced." If you are writing AGENTS.md or CLAUDE.md for your repo, tell the agent that closing a stale issue is as valuable as fixing one. That is the prompt-engineering takeaway, and it is free.
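The ordering described above can be sketched as a small gate in front of patch generation. This is only an illustration under stated assumptions: `still_reproduces` and `generate_patch` are hypothetical callables standing in for whatever subagents your harness actually runs, and the `Outcome` names are mine, not the paper's.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    NO_ACTION_NEEDED = "no_action_needed"
    PATCH_PRODUCED = "patch_produced"
    FAILED = "failed"


# Abstention and patching carry the same weight as successful outcomes.
SUCCESSFUL = {Outcome.NO_ACTION_NEEDED, Outcome.PATCH_PRODUCED}


@dataclass
class TriageResult:
    outcome: Outcome
    patch: Optional[str]
    reason: str

    @property
    def succeeded(self) -> bool:
        return self.outcome in SUCCESSFUL


def triage(
    issue: str,
    still_reproduces: Callable[[str], bool],          # hypothetical verification subagent
    generate_patch: Callable[[str], Optional[str]],   # hypothetical patch subagent
) -> TriageResult:
    """Run verification first; only call the patch step if the issue reproduces."""
    if not still_reproduces(issue):
        return TriageResult(Outcome.NO_ACTION_NEEDED, None, "issue no longer reproduces")
    patch = generate_patch(issue)
    if not patch:
        return TriageResult(Outcome.FAILED, None, "issue reproduces but no patch was produced")
    return TriageResult(Outcome.PATCH_PRODUCED, patch, "issue reproduced and patched")
```

The design choice worth noting is that the patch step is never invoked for a stale issue, so there is no edit for a reviewer to be tempted by, and the result object reports abstention as a success rather than as a missing patch.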

The caveats

FixedBench is built from SWE-bench Verified instances, so it inherits the same Python-heavy, test-driven distribution. Whether action bias generalises identically to greenfield generation, infra-as-code or frontend work is open.
Per-model numbers cluster tighter than the headline 35–65% range suggests; the strongest claims hold across the whole frontier, not against any one model.
The "Abstain or Fix" prompt over-corrects on partially fixed instances, where some work still remains: only 2.9–6.0% of Partial cases get resolved. The cure for action bias brings an inaction bias of its own. There is no free lunch in prompt design here.
Five models, four harnesses, 200 tasks is a real eval but small. Treat the headline numbers as directional, not gospel.
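To put a rough size on "directional", here is a sketch that computes a Wilson score interval for a single reported rate. It assumes each rate is measured over all 200 tasks for one model and one prompt, which is my reading of the setup rather than something the numbers above confirm; `wilson_interval` is an illustrative helper, not part of the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 65% correct abstention on an assumed 200 tasks is 130 successes.
low, high = wilson_interval(130, 200)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption, a 65% rate is compatible with anything from roughly 58% to 71%, so gaps of a few points between models are within noise while the prompt-driven swings are not.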

The takeaway

What I am filing away from this paper: my agent harnesses need an explicit "no action" exit path with the same status as "patch submitted," and my evaluation has to include negative tasks where the right answer is silence. The Vechev group calls this an "overreliance on human guidance implicit in current training objectives" — a polite way of saying RLHF taught models to please humans by producing patches, and that bias survives into deployment. You cannot train it out without a different reward signal, but you can prompt around it today. I am rewriting my issue-triage subagent system prompts this week.
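For the evaluation half of that, a minimal sketch of scoring positive and negative tasks separately might look like the following. The `TaskResult` fields and the `score` function are hypothetical names for illustration, and `agent_patched` is assumed to mean the agent made non-test, non-doc edits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    expects_patch: bool  # False marks a negative task where silence is correct
    agent_patched: bool  # assumed: agent made non-test, non-doc edits
    tests_pass: bool


def _rate(hits: int, total: int) -> Optional[float]:
    """Fraction of hits, or None when there is nothing to measure."""
    return hits / total if total else None


def score(results: list[TaskResult]) -> dict[str, Optional[float]]:
    """Report resolve rate on positive tasks and abstain rate on negative ones."""
    positives = [r for r in results if r.expects_patch]
    negatives = [r for r in results if not r.expects_patch]
    resolved = sum(1 for r in positives if r.agent_patched and r.tests_pass)
    abstained = sum(1 for r in negatives if not r.agent_patched)
    return {
        "resolve_rate": _rate(resolved, len(positives)),
        "abstain_rate": _rate(abstained, len(negatives)),
    }
```

Keeping the two rates apart matters because a single blended score would let an agent that always patches, or one that never does, hide behind whichever task type dominates the suite.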
