Your Coding Agent's PR Merge Rate Is Lying to You

Skimming this morning's arXiv list, I almost scrolled past another empirical study of GitHub PRs, until the headline number stopped me cold: only 35.7% of rejected agentic PRs are actually agent failures. The rest get killed by workflow rules, reviewer policies, or no visible reason at all. If you've been using merge rate to grade your coding agent, that metric is measuring something other than what you think. The thesis of this post: PR outcomes are a noisy proxy for agent capability, bordering on broken.

What it does

Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study (Peralta, Hoshi, Washizaki, Ubayashi, Kondo, Higo, Mukai, Yoshida, Kusama, Tanaka, Fan — Waseda et al., MSR 2026) is the first large-scale decision-oriented audit of what actually causes an agent's PR to live or die. The team collected 11,048 closed PRs submitted by autonomous coding agents to public repos, filtered to 9,799 human-reviewed cases, then manually inspected 717 representative threads to reconstruct the actual reviewer reasoning behind every merge or rejection.

This is different from the usual SWE-bench-style story. SWE-bench scores agents on whether a patch passes hidden tests in a controlled sandbox. This paper scores agents on what humans actually did with the PR in production review. That's a much messier signal — and the messiness is exactly the point. The authors aren't proposing a new agent or a new benchmark; they're auditing the metric most of the field has been quietly relying on.

The key result

Of the rejected PRs, the breakdown is brutal: 35.7% were clear agent failures (bad code, wrong fix, broken patch). 31.2% were rejected for workflow reasons — duplicate of an existing PR, wrong branch, blocked by repo policy, contributor licensing, scope mismatch — things the agent had no realistic way to know. And 33.1% had no observable rationale at all — the PR was just closed, no review comment, no explanation.
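
To make the arithmetic concrete, here is a minimal sketch of how those three shares translate into bounds on agent-attributable failures for a given merge rate. The shares are the ones quoted above; the function name and the 60% merge rate are my own placeholders, not figures from the paper.

```python
from typing import Tuple

# Shares of rejected PRs, as quoted in this post.
REJECTION_SHARES = {
    "agent_failure": 0.357,
    "workflow": 0.312,
    "no_rationale": 0.331,
}


def attributable_failure_bounds(merge_rate: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the share of ALL PRs that are agent failures.

    lower: only rejections with a clear agent-failure rationale count.
    upper: unexplained rejections are also counted as agent failures.
    """
    if not 0.0 <= merge_rate <= 1.0:
        raise ValueError("merge_rate must be between 0 and 1")
    rejection_rate = 1.0 - merge_rate
    lower = rejection_rate * REJECTION_SHARES["agent_failure"]
    upper = rejection_rate * (
        REJECTION_SHARES["agent_failure"] + REJECTION_SHARES["no_rationale"]
    )
    return lower, upper


# Hypothetical merge rate, chosen only for illustration.
low, high = attributable_failure_bounds(0.60)
print(f"agent-attributable failures: {low:.1%} to {high:.1%} of all PRs")
```

The gap between the two bounds is the unexplained third: until someone reads those threads, you can't say which side of the ledger they belong on.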

The merged side is no cleaner. 15.4% of merged PRs required explicit reviewer involvement (a human did the meaningful work after submission), and 5.5% were merged with no visible interaction at all — which sounds like a win until you realize a chunk of those are likely auto-merges by bots or by the agent's own account. Stack the numbers: the share of PRs where the visible outcome cleanly maps to genuine agent performance is much smaller than the raw merge rate suggests.

Why it matters

If you're building or buying agentic coding tools right now, you're almost certainly tracking some form of PR success rate. Devin's launch metrics, Cursor's agent benchmarks, Claude Code's autonomous-mode pilot reports, internal dashboards for your own SWE-agent — they all lean on outcome signals from real repos because those signals are cheap and feel grounded. This paper is the closest thing the field has to a calibration on how much that signal actually means. The answer is: less than half of it, possibly a third, depending on which slice you trust. A rejection isn't a failure; a merge isn't a success. Both numbers are entangled with the social and procedural reality of code review.

The concrete change for builders is to start treating PR outcomes as a two-stage signal: first filter by whether the rejection or merge was actually about the code, then score capability on the filtered set. That requires either log-mining reviewer comments (the paper effectively gives you a taxonomy to do it) or running closed-loop evals where workflow noise is held constant. For anyone running sub-agent architectures or spec-driven workflows where the final artifact is a PR, the same logic applies: your inner-loop evals are probably overweighting capability gaps that rarely decide a real review, and underweighting workflow-handling skills (scope discipline, branch hygiene, duplicate detection) that matter far more than the leaderboards suggest.
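
Here is a rough sketch of that two-stage scoring, assuming each PR has already been labeled with a decision reason (by hand or by mining reviewer comments). The `Reason` categories loosely mirror the buckets described above; the class names, function names, and sample data are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Reason(Enum):
    CODE = "code"          # decision was about the patch itself
    WORKFLOW = "workflow"  # duplicate, wrong branch, policy, licensing, scope
    UNKNOWN = "unknown"    # closed or merged with no visible rationale


@dataclass(frozen=True)
class PullRequest:
    merged: bool
    reason: Reason


def raw_merge_rate(prs: List[PullRequest]) -> Optional[float]:
    """Naive metric: merged PRs over all PRs."""
    if not prs:
        return None
    return sum(pr.merged for pr in prs) / len(prs)


def code_only_merge_rate(prs: List[PullRequest]) -> Optional[float]:
    """Stage 1: keep PRs whose outcome was about the code. Stage 2: score those."""
    code_prs = [pr for pr in prs if pr.reason is Reason.CODE]
    if not code_prs:
        return None
    return sum(pr.merged for pr in code_prs) / len(code_prs)


# Made-up sample of ten PRs, for illustration only.
sample = (
    [PullRequest(True, Reason.CODE)] * 4
    + [PullRequest(True, Reason.UNKNOWN)] * 1
    + [PullRequest(False, Reason.CODE)] * 2
    + [PullRequest(False, Reason.WORKFLOW)] * 2
    + [PullRequest(False, Reason.UNKNOWN)] * 1
)

print("raw merge rate:", raw_merge_rate(sample))
print("code-only merge rate:", code_only_merge_rate(sample))
```

The two numbers diverge as soon as workflow and unexplained outcomes make up a meaningful share of the sample, which is the situation the paper describes. It's also worth tracking the workflow bucket on its own rather than discarding it, since that is where scope and branch discipline show up.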

The caveats

Public open-source repos only — enterprise review dynamics differ; mandated checks, design-doc requirements, and ticket-linking conventions likely shift the workflow-rejection slice up.
Manual inspection covers 717 of 9,799 cases. The taxonomy is representative, not exhaustive; long-tail rejection reasons may be under-counted.
Agent identity is inferred from PR-author signals (bot accounts, signatures). Mixed human-agent workflows where a human polishes an agent draft before submission likely escape the sample.
No breakdown by agent — the paper aggregates across multiple tools. Whether the same ratios hold for, say, Codex vs Claude Code vs Devin specifically isn't answered, and that's exactly the question most teams want answered.
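
On the sampling caveat: any share estimated from a manually inspected subset carries some uncertainty. A minimal sketch of how wide that could be, using a Wilson score interval. This assumes simple random sampling, which the paper's "representative" sample may not be, and the sample size passed in below is a placeholder because the post doesn't say how many of the 717 inspected threads were rejections.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed in n samples.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# n=300 is a hypothetical count of inspected rejections, NOT a figure from the paper.
lo, hi = wilson_interval(0.357, n=300)
print(f"35.7% agent-failure share, plausible range: {lo:.1%} to {hi:.1%}")
```

Plug in the real rejected-sample size from the paper before leaning on the exact bounds.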

The takeaway

What I'm filing away: PR merge rate is closer to a satisfaction metric than a capability metric. It conflates code quality, scope-fit, workflow compliance, and reviewer mood into one number. For headline benchmarking, that's fine; for steering an agent's development, it's actively misleading. From now on when I see a PR-acceptance number quoted for an agentic tool, the first follow-up question is: what fraction of the rejected set was actually about the code? If the answer is "we don't know," assume the rejection count overstates agent failures by something like 2x (between about 1.5x and 2.8x on this study's numbers, depending on how many of the unexplained rejections were really about the code).
