I was scrolling the morning's arXiv list and one number stopped me cold: 39.1%. That's the score of the best-performing model on a brand-new benchmark that asks coding agents to do something closer to what real maintainers do: push an open-source project through a real version upgrade across dozens of files. After two years of SWE-bench numbers creeping into the 80s, this paper is a useful splash of cold water. Thesis: the moment you give an agent a long-horizon, multi-target task with real-world breadth, the field turns out to be much earlier than the leaderboards suggest.
What it does
RoadmapBench (arXiv 2605.15846) is a long-horizon coding benchmark built from real version upgrades of 17 open-source repositories across five programming languages. Where SWE-bench gives an agent a single GitHub issue and grades a patch against tests, RoadmapBench hands the agent a multi-target roadmap (a list of behaviours the next version of the project should have) and asks it to take the codebase from source version to target version on its own. The tasks span ML & Data, Web & RPC, ORM & Validation, Infrastructure & Tooling, and UI & Rendering. The median task changes about 3,700 lines across 51 files and is split into 5 subtasks. Oracle patches range from under 300 lines to over 30,000.
Two things make this different from prior benchmarks. First, the harness is grounded in real upstream history rather than synthetic issue templates, so the task structure is whatever the maintainers actually shipped. Second, evaluation isn't binary pass/fail — each task has weighted subtask-level tests, so partial progress is graded. That matters when no model gets close to finishing the task, because a 0% leaderboard column hides the fact that frontier agents are doing real work along the way.
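To make the partial-credit idea concrete, here is a minimal sketch of how weighted subtask-level grading could work. The exact formula the paper uses isn't reproduced here, so treat everything as an assumption: the `Subtask` fields, the weights, and the choice of a weighted mean of pass fractions are all mine, for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str      # hypothetical label, not taken from the benchmark
    weight: float  # relative importance of this subtask (assumed)
    passed: int    # tests passed for this subtask
    total: int     # tests defined for this subtask


def completion_score(subtasks: list[Subtask]) -> float:
    """Weighted mean of per-subtask pass fractions, in [0, 1]."""
    total_weight = sum(s.weight for s in subtasks)
    if total_weight <= 0:
        raise ValueError("subtask weights must sum to a positive number")
    earned = sum(s.weight * (s.passed / s.total) for s in subtasks if s.total > 0)
    return earned / total_weight


def resolved(subtasks: list[Subtask]) -> bool:
    """Binary view: the task counts only if every subtask test passes."""
    return all(s.total > 0 and s.passed == s.total for s in subtasks)


# Made-up roadmap: one subtask finished, one half done, one untouched.
roadmap = [
    Subtask("new-config-api", weight=2.0, passed=10, total=10),
    Subtask("validator-rewrite", weight=1.0, passed=3, total=6),
    Subtask("cli-flags", weight=1.0, passed=0, total=4),
]
score = completion_score(roadmap)
is_resolved = resolved(roadmap)
```

With those made-up numbers the binary metric records a plain failure, while the weighted score still credits the finished and half-finished subtasks. That is the gap a pass/fail column hides.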
The key result
Under OpenHands, the top score is Claude-Opus-4.7 at 39.1%. Claude-Opus-4.6 follows at 32.2%, GPT-5.4 at 29.6%. The worst of the 13 frontier systems they tested lands at 5.2%. The Completion Score, which gives partial credit for finished subtasks, runs from 0.177 to 0.692 — agents almost always start strong and then stall on the harder slice of the roadmap. The authors' framing is blunt: long-horizon software development remains a largely unsolved problem. After a year of "agents are getting close to senior engineer" energy online, 39.1% is the actual ceiling on real multi-file upgrade work in this evaluation.
Why it matters
This is the right shape of evaluation for the agentic coding stack we're actually building. A real Claude Code or sub-agent system doesn't see one issue at a time — it sees a feature request that touches an ORM, a validator, an API, and a frontend, plus a build pipeline that has to keep working. RoadmapBench's median task — 51 files, 5 subtasks, 3,700 LOC — is closer to that than any benchmark I've used before. If you're building or evaluating an agentic dev tool and you're still reporting SWE-bench Verified as your headline metric, that number is increasingly misleading. The interesting failure modes (cross-file integration, partial completion, subtask interleaving) only show up when tasks have real breadth.
The failure-mode breakdown is the practical takeaway. For Claude-Opus-4.6, Implementation Error accounts for 58% of failures — subtle logic mistakes and bad component integration, not parsing or scaffolding issues. For weaker models, Build Errors dominate (~40%) and Missing Implementations follow (~31%) — they're failing at structural stages well before they get to the interesting code. That maps cleanly onto a builder's question: where should I put the next layer of harness scaffolding? For a frontier-model harness, the answer is integration-time validation — running the build after every meaningful change, surfacing inter-module errors quickly, probably routing them to a dedicated repair sub-agent. For a weaker-model harness, the answer is upstream — better scaffolding, stronger build guards, and tighter spec extraction so the agent doesn't miss whole pieces of the roadmap.
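The integration-time validation idea can be sketched as a small guard loop: apply one change, rebuild, and hand any failure to a repair step before moving on. This is my own illustration, not anything from the paper or from OpenHands. `apply_with_build_guard`, `run_build`, and the `changes`, `build`, and `repair` callables are all hypothetical names, and a real harness would plug in its own build command and sub-agent dispatch.

```python
import subprocess
from typing import Callable, Iterable


def run_build(cmd: list[str], cwd: str = ".") -> tuple[bool, str]:
    """Run a build or test command; return (succeeded, combined output)."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def apply_with_build_guard(
    changes: Iterable[Callable[[], None]],
    build: Callable[[], tuple[bool, str]],
    repair: Callable[[str], None],
    max_repairs: int = 2,
) -> list[str]:
    """Apply each change, rebuild, and route failures to a repair step.

    Returns one log entry per attempted change: "ok", "repaired", or "failed".
    """
    log: list[str] = []
    for change in changes:
        change()
        ok, output = build()
        attempts = 0
        while not ok and attempts < max_repairs:
            repair(output)  # e.g. dispatch the build log to a repair sub-agent
            attempts += 1
            ok, output = build()
        if ok:
            log.append("ok" if attempts == 0 else "repaired")
        else:
            log.append("failed")
            break  # stop before stacking more changes on a broken build
    return log
```

In practice `build` might be something like `lambda: run_build(["make", "test"])`, with the command depending entirely on the repository. The design choice worth noting is the early `break`: once a build stays red after the repair budget is spent, further edits mostly compound the integration errors the breakdown above describes.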
The caveats
Sample size is moderate — 115 tasks is enough to see clear gaps, but variance per repo and per language is going to be wide. The per-domain breakdown is what to read carefully, not the headline percentage.
OpenHands is one harness. A different agent framework (a sub-agent dispatch model, or Claude Code with custom slash commands and persistent context) might land somewhere different. These numbers are scaffolding-dependent.
Version upgrade ≠ all long-horizon work. Greenfield feature work, debugging a production incident, or schema migrations are structurally different. RoadmapBench is one slice, not the whole space.
Cost realism isn't the main focus. A 39.1% score at frontier token spend is qualitatively different from 39.1% at low spend. The benchmark grades capability, not economics.
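On the sample-size caveat, a quick back-of-envelope check shows how wide the uncertainty is at 115 tasks. The sketch below computes a Wilson score interval and rests on two assumptions of mine: that the headline 39.1% is a per-task resolved rate (45 of 115 is my inference from the percentage, not a count quoted from the paper), and that tasks can be treated as independent trials, which tasks drawn from the same repository probably are not.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# 45 / 115 is about 39.1% (assumed count, see the note above).
low, high = wilson_interval(45, 115)
print(f"{low:.3f} to {high:.3f}")
```

Under those assumptions the 95% interval for the top score runs from roughly 31% to 48%. That is wide enough to be cautious about small gaps between neighbouring models, and per-repo or per-language slices will be noisier still.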
The takeaway
What I'm filing away from this one: when I evaluate an agentic coding tool from now on, I want to see a long-horizon, multi-file, partial-credit metric reported alongside whatever SWE-bench number is being shipped. The single-issue benchmarks have served their purpose — they got us to capable patch-level agents — but they're no longer where the interesting capability gap lives. RoadmapBench (or something like it) is the harder, more honest test now. If you're building in this space, this is the kind of benchmark you should be running internally against your own harness, even if you never publish the number.