I was scanning the cs.SE list this morning when one paper from May 25 stopped me cold: a team of Tsinghua-affiliated researchers ran semantics-preserving perturbations over SWE-Bench Verified and watched the average resolve rate fall from 66.8% to 25.3%. That is not a small wobble. That is a 41.5-point hole under the floor of a benchmark practitioners quote in pitch decks. The thesis of this post: SWE-Bench Verified scores are measuring something that is not quite repository-context reasoning, and the gap is now too large to keep ignoring.
What it does
RepoMirage introduces a two-stage evaluation harness layered on top of SWE-Bench Verified. The first stage, RepoMirage-Perturb, applies three semantics-preserving perturbations to the underlying repository: variable renaming across the codebase, file reordering that scrambles the directory layout an agent navigates, and comment removal that strips the natural-language scaffolding around code intent. None of these change what the code does; they only change how the repository is exposed to the agent.
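To make "semantics-preserving" concrete, here is a minimal sketch of two of the three perturbations (renaming and comment removal) applied to a single Python function. This is my own illustration, not the paper's tooling: `RenameIdentifiers`, `perturb`, `load`, the sample function, and the rename mapping are all hypothetical, and it assumes Python 3.9+ for `ast.unparse`. Round-tripping through the AST drops comments as a side effect, so there is no separate comment-stripping step.

```python
import ast

# Hypothetical sample code and rename mapping, for illustration only.
ORIGINAL = '''
def total_price(prices, tax_rate):
    # Sum the prices, then apply tax.
    subtotal = sum(prices)  # pre-tax amount
    return subtotal * (1 + tax_rate)
'''

MAPPING = {
    "total_price": "fn_0",
    "prices": "v_0",
    "tax_rate": "v_1",
    "subtotal": "v_2",
}


class RenameIdentifiers(ast.NodeTransformer):
    """Rename function names, parameters, and variables found in a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # descend into parameters and body
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)  # descend into any annotation
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def perturb(source, mapping):
    """Return source with identifiers renamed and comments dropped."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    # ast.unparse regenerates code from the tree, which carries no comments.
    return ast.unparse(tree)


def load(source, name):
    """Execute source in a fresh namespace and return one object from it."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


PERTURBED = perturb(ORIGINAL, MAPPING)

# Behavior is unchanged: both versions compute the same result.
before = load(ORIGINAL, "total_price")([1.0, 2.0], 0.5)
after = load(PERTURBED, "fn_0")([1.0, 2.0], 0.5)
assert before == after
```

A real repository-wide pass would also have to handle keyword arguments, attribute access, string-based lookups, and cross-file imports, none of which this sketch touches.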
The second stage, RepoMirage-Extend, takes the failure modes that surface under perturbation and converts them into explicit context-reasoning tasks that go beyond the original issue-resolution goal. The sweep covers six models (GPT-4o, Claude Sonnet 3.5, Gemini 3.1 Pro, DeepSeek-R1, Qwen 3-Coder, and MiniMax M27), each driven by three popular harnesses: OpenHands, AutoCodeRover, and Agentless. That cross-product matters. With 18 model/harness pairings, it rules out the easy explanation that one bad harness or one weak model is driving the result.
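If you want to run the same sanity check on your own sweep, the idea can be sketched in a few lines of Python: enumerate the grid, then average the per-configuration drop along each axis and see whether any single model or harness stands out. The `mean_drop_by_axis` helper and the shape of the `drops` dictionary are my assumptions, not anything from the paper, and no per-configuration numbers are filled in because the abstract does not give them.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

MODELS = [
    "GPT-4o",
    "Claude Sonnet 3.5",
    "Gemini 3.1 Pro",
    "DeepSeek-R1",
    "Qwen 3-Coder",
    "MiniMax M27",
]
HARNESSES = ["OpenHands", "AutoCodeRover", "Agentless"]

# Every (model, harness) pairing in the sweep: 6 x 3 = 18 configurations.
GRID = list(product(MODELS, HARNESSES))


def mean_drop_by_axis(drops, axis):
    """Average the resolve-rate drop along one axis of the grid.

    drops maps (model, harness) -> drop in percentage points (you supply it).
    axis is 0 to group by model, 1 to group by harness.
    """
    groups = defaultdict(list)
    for config, drop in drops.items():
        groups[config[axis]].append(drop)
    return {key: mean(values) for key, values in groups.items()}
```

If the drop were down to one weak model or one bad harness, a single group would dominate one of the two summaries while the rest stayed flat.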
The key result
On the original SWE-Bench Verified setup, the agents land where you would expect from a published leaderboard, with an average resolve rate of 66.8%. Push them through RepoMirage-Extend, where the perturbations force the agent to actually reason over repository structure rather than pattern-match on familiar identifiers, and the average collapses to 25.3%. The authors name the failure mode exploration drift: agents do access the broader context (files get opened, snippets get read, calls get traced), but the information never gets converted into structural understanding that survives the rename. The agents look busy in the trajectory, then commit a patch that does not match what the codebase actually is.
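For a sense of scale, the arithmetic on the two headline averages works out as follows. Only the 66.8 and 25.3 figures come from the paper; the variable names are mine.

```python
baseline = 66.8   # average resolve rate on the original setup, in percent
perturbed = 25.3  # average resolve rate after RepoMirage, in percent

# Absolute drop, in percentage points.
absolute_drop = round(baseline - perturbed, 1)                      # 41.5

# Relative drop: the share of the baseline score that disappears.
relative_drop = round(100 * (baseline - perturbed) / baseline, 1)   # 62.1

# Share of the baseline score that survives the perturbations.
retained = round(100 * perturbed / baseline, 1)                     # 37.9
```

Put differently, a little under two thirds of the headline score does not survive the perturbations.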
Why it matters
If you build agentic coding tools, this is the second paper this month (after RoadmapBench) telling you that SWE-Bench Verified is no longer a credible single-number capability signal. The benchmark was always at risk of leaking memorization of public repos, but RepoMirage shows the leakage is sharper than "the model saw this PR during training" — it leaks at the level of variable names and file paths. A coding agent that scores 66 on the unmodified eval and 25 on a renamed copy is not a 66-point agent on your private codebase. It is a 25-point agent that happens to know what astropy and django look like.
The constructive half of the paper points where the field should go next. RepoAnchor — the authors' "structure-first" workflow — separates exploration from problem-solving as two distinct phases with their own success criteria. Build a structural map of the repo, anchor on the semantically significant components, then attempt the fix. If you are designing a sub-agent architecture, this is a clean separation to steal: a Mapper sub-agent whose only job is producing a structural prototype the editing agent can rely on, evaluated independently of whether the final patch lands. The Constraint Decay paper from earlier this month argued the same thing from a different angle — verifiers belong in the inner loop, not as final gates. Pair the two and you get a credible recipe: a structural mapper feeding an editor whose work is checked by a structural verifier, not by the test suite alone.
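Here is roughly how I would sketch that separation in Python. To be clear, this is not RepoAnchor's implementation, which I have not seen. The mapper below is a deterministic `ast` walk standing in for an LLM sub-agent, and `StructuralMap`, `build_map`, and `mapper_recall` are names I made up. What it illustrates is the contract: phase one emits a map that gets scored on its own, before any patch exists.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class StructuralMap:
    """Phase-one output: what each file defines and what it imports."""
    definitions: dict = field(default_factory=dict)  # path -> sorted symbol names
    imports: dict = field(default_factory=dict)      # path -> sorted module names


def build_map(files):
    """Build a structural map from {path: source}.

    A stand-in for a Mapper sub-agent: it only walks the syntax tree.
    """
    result = StructuralMap()
    for path, source in files.items():
        defined, imported = [], []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                defined.append(node.name)
            elif isinstance(node, ast.Import):
                imported.extend(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.append(node.module)
        result.definitions[path] = sorted(defined)
        result.imports[path] = sorted(imported)
    return result


def mapper_recall(structural_map, expected):
    """Score phase one alone: fraction of expected (path, symbol) pairs found."""
    if not expected:
        return 1.0
    found = {
        (path, symbol)
        for path, symbols in structural_map.definitions.items()
        for symbol in symbols
    }
    return len(found & set(expected)) / len(expected)
```

The editing agent would take a `StructuralMap` as input, and a low `mapper_recall` would flag a phase-one failure even on tasks where the final patch happens to pass.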
The caveats
Three perturbation types is a small basis. Variable renaming and comment removal are intuitive proxies for "is the model memorizing names?", but they do not capture every form of distribution shift a private codebase introduces — domain-specific helpers, internal frameworks, weird build systems.
The 25.3% number is an average across multiple model/harness combinations. The drop is uneven; the paper notes variable renaming hits hardest, and not all agents fall by the same amount. The headline number compresses a real spread.
RepoAnchor results are reported as "notable improvements" rather than full headline numbers in the abstract. Until the full tables land in the PDF, treat the proposed fix as directionally promising, not measured.
Perturbations are still synthetic. A real private codebase is harder than a renamed astropy in ways perturbation cannot simulate — bespoke abstractions, missing tests, opaque conventions.
The takeaway
I am filing this paper away as the cleanest evidence yet that the SWE-Bench Verified number on a model card is well under half a capability signal. What I am doing differently after reading it: any new eval I design for a coding agent now needs a rename pass and a layout-scramble pass as table stakes, and any sub-agent architecture I sketch starts with a structural-mapping phase that gets evaluated on its own merits before the editor ever touches a file. Exploration drift is the failure mode to name out loud in design reviews. Once you have a word for it, you start seeing it everywhere in the trajectories.
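In practice, "table stakes" could be as simple as a wrapper that scores the same task set on the unmodified repo and on each perturbed copy, then reports the gap. The sketch below is an assumption about how I would wire it, not a description of any existing harness: `robustness_report` is a name I made up, `evaluate` is a hypothetical callable that runs your agent and returns a resolve rate, and each perturbation is expected to preserve behavior.

```python
from typing import Callable, Dict

Repo = Dict[str, str]                  # path -> file contents
Perturbation = Callable[[Repo], Repo]  # expected to preserve behavior
Evaluator = Callable[[Repo], float]    # hypothetical: runs the agent, returns a resolve rate


def robustness_report(
    repo: Repo,
    evaluate: Evaluator,
    passes: Dict[str, Perturbation],
) -> Dict[str, float]:
    """Score the unmodified repo and every perturbed variant of it."""
    report = {"baseline": evaluate(repo)}
    for name, perturb in passes.items():
        report[name] = evaluate(perturb(repo))
    perturbed_scores = [report[name] for name in passes]
    # The number to watch: how far the worst pass falls below the baseline.
    report["worst_gap"] = (
        report["baseline"] - min(perturbed_scores) if perturbed_scores else 0.0
    )
    return report
```

Called with something like `passes={"rename": rename_pass, "scramble": scramble_pass}`, where both passes are yours to supply, the `worst_gap` entry is the figure I would want next to any headline score.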