Most days the cs.SE list is incremental — yet another SWE-bench variant, yet another agent loop. Today the SWE-bench authors dropped a benchmark that asks something different: not "can your agent fix a bug?" but "can your agent build the program?" The number I'm still chewing on: across nine frontier models, zero tasks were fully resolved. That's not a typo. Building software end-to-end is still wildly out of reach for the same models that crush SWE-bench Verified.
What it does
ProgramBench, from the SWE-bench team (John Yang, Kilian Lieret, Ofir Press et al.), gives an agent a binary and its docs and asks it to rebuild the codebase. No partial scaffold. No surgical patch. Just "here's what the program does — go write it." The 200 tasks span the difficulty spectrum from compact CLI tools up to FFmpeg, SQLite, and the PHP interpreter — programs that took human teams years and tens of thousands of test cases to mature.
The clever bit is the eval. Instead of hand-writing test suites that subtly leak structural hints, the authors generate 248,853 behavioral tests via agent-driven fuzzing — a median of 770 per task — and measure black-box behavioral parity. Their generated test suites hit 79.7% line coverage on the reference implementations, beating the developer-written tests' 56.8%. That matters: this is not a soft eval. It's a stricter functional bar than the projects' own CI uses.
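To make "black-box behavioral parity" concrete, here is a minimal sketch of the idea, not the paper's harness. It assumes parity means "same exit code and same stdout for the same stdin"; the names `observe` and `parity_rate`, and the two toy programs standing in for a reference binary and an agent's rebuild, are all hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What a black-box test can see: exit code and stdout only."""
    exit_code: int
    stdout: bytes


def observe(cmd, stdin_bytes, timeout=10.0):
    """Run a command with the given stdin and record its visible behavior."""
    proc = subprocess.run(
        cmd,
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return Observation(proc.returncode, proc.stdout)


def parity_rate(reference_cmd, candidate_cmd, test_inputs):
    """Fraction of inputs on which the candidate behaves like the reference."""
    if not test_inputs:
        raise ValueError("need at least one test input")
    passed = sum(
        observe(reference_cmd, data) == observe(candidate_cmd, data)
        for data in test_inputs
    )
    return passed / len(test_inputs)


# Toy stand-ins (assumptions for illustration): the "reference" uppercases
# stdin; the "candidate" only does so for inputs shorter than 5 characters.
REFERENCE_CMD = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
CANDIDATE_CMD = [
    sys.executable, "-c",
    "import sys; d = sys.stdin.read(); "
    "sys.stdout.write(d.upper() if len(d) < 5 else d)",
]
INPUTS = [b"abc", b"hello world", b"", b"xy"]

rate = parity_rate(REFERENCE_CMD, CANDIDATE_CMD, INPUTS)
print(rate)  # 0.75: the candidate diverges only on b"hello world"
```

Note that nothing here looks inside the candidate's source, which is the point: the test suite cannot leak structural hints because it never touches structure.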
The key result
"We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks."
Claude Opus 4.7 takes that 3%. Opus 4.6 lands at 2.5%, Sonnet 4.6 at 1.6%, and every other model (Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Flash, GPT 5.4, GPT 5.4 mini, GPT 5 mini) clears the 95% threshold on 0% of tasks. Zero. The frontier of holistic program synthesis is, today, a sliver of the easiest tasks, and only three of the nine models (barely) clear any of them.
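The two headline numbers are different aggregations of the same per-task data, which is easy to blur. A small sketch of how I read them, assuming each task yields a single pass fraction and that "passing 95%" means at least 0.95; the function name `summarize` and the sample fractions are made up for illustration and are not the paper's data.

```python
def summarize(pass_fractions, threshold=0.95):
    """Aggregate per-task pass fractions into the two headline rates."""
    n = len(pass_fractions)
    if n == 0:
        raise ValueError("need at least one task")
    fully_resolved = sum(1 for f in pass_fractions if f == 1.0)
    near_resolved = sum(1 for f in pass_fractions if f >= threshold)
    return {
        "fully_resolved": fully_resolved / n,
        "at_or_above_threshold": near_resolved / n,
    }


# Invented example: 200 tasks, 6 of them at 97% of tests passed, none at 100%.
example_fractions = [0.97] * 6 + [0.60] * 194
summary = summarize(example_fractions)
print(summary)
```

With these invented inputs, `fully_resolved` is 0 while `at_or_above_threshold` is 6 out of 200, the same shape as the reported result: a nonzero near-miss rate sitting on top of a zero resolve rate.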
The structural data is just as damning. Models produce a median of 3 files where humans produce 15. Directory depth: 1 vs. 2. Function count lands at 24-29% of the reference. Average function length: 1.08-1.62× longer. The models aren't just failing to ship — they're shipping a recognizably different shape of code. Monolithic, flat, every responsibility crammed into a single file. The phrase the paper uses is "diverge sharply from human-written code," which is being polite.
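If you want to run the same kind of shape check on your own agent's output, the metrics are cheap to compute. The sketch below is my own approximation, not the paper's tooling: it only handles Python sources via the standard `ast` module, it counts directory depth as the number of folders between the project root and a file (the paper's exact definition may differ), and `structure_metrics` is a hypothetical name.

```python
import ast
import tempfile
from pathlib import Path


def structure_metrics(root):
    """File count, max directory depth, function count, mean function length."""
    root = Path(root)
    files = sorted(root.rglob("*.py"))
    function_lengths = []
    max_depth = 0
    for path in files:
        # Depth 0 means the file sits directly in the project root.
        depth = len(path.relative_to(root).parts) - 1
        max_depth = max(max_depth, depth)
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # end_lineno is available on Python 3.8 and later.
                function_lengths.append(node.end_lineno - node.lineno + 1)
    mean_length = (
        sum(function_lengths) / len(function_lengths) if function_lengths else 0.0
    )
    return {
        "files": len(files),
        "max_depth": max_depth,
        "functions": len(function_lengths),
        "mean_function_length": mean_length,
    }


# Tiny throwaway project to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "main.py").write_text(
        "def run():\n    return 1\n", encoding="utf-8"
    )
    (project / "core").mkdir()
    (project / "core" / "parse.py").write_text(
        "def parse(s):\n    s = s.strip()\n    return s\n\n"
        "def lex(s):\n    return s.split()\n",
        encoding="utf-8",
    )
    demo_metrics = structure_metrics(project)

print(demo_metrics)
```

Comparing these numbers between a reference repo and an agent-written one gives a rough version of the "3 files vs. 15" comparison, with the caveat that line counts and depth are crude proxies for architecture.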
Why it matters
If you build with agentic coding tools, this is the gap you feel and can't quite name. SWE-bench numbers keep climbing. Demos of one-shot apps look incredible. Yet when you actually point Claude Code or Cursor at a cold repo and say "build me X," the output bunches into one fat file, skips the modules a senior engineer would extract, and silently ignores a third of the spec. ProgramBench gives that intuition a number. It's not that frontier models can't write code — they can write a lot of code. It's that they can't yet architect a codebase. Architectural decisions — what's a module, what's a layer, what gets its own file — emerge from understanding a program holistically, and that's exactly the muscle these evaluations expose as underdeveloped.
For builders this should reshape two things. First, your agent loop is doing more architectural lifting than you think. The fact that Claude Code projects feel coherent isn't because the model nails architecture — it's because your scaffolding (CLAUDE.md, file conventions, existing structure) is doing that work. Strip the scaffolding, and you get a 3-file PHP interpreter. Second, the eval gap matters: a model that shows up well on SWE-bench Verified can be 0/200 on holistic synthesis. If you're picking a coding model based on bug-fix benchmarks, you're using the wrong yardstick for the "build from scratch" workflows you're probably also asking it to do. Spec-driven development in particular — write the spec, let the agent implement — assumes an architectural competence the data says isn't there yet.
The caveats
The 0% is a 100%-correctness bar. "Fully resolved" means passing every generated test. The softer metric, passing at least 95% of tests, is still met on only 3% of tasks by the best model. Relax to "compiles and roughly works" and the numbers will look better. But "roughly works" is not a shippable bar.
Fuzz-based eval can be both too strict and too soft. Too strict because real software has tolerable variance the fuzzer flags as failure. Too soft because it can't catch architectural rot the way human review does.
Snapshot of one moment. Nine models in a fast-moving release cadence — Opus 4.7 just shipped and the curves will move. Don't read 0% as a permanent ceiling.
No cost numbers. Building FFmpeg from scratch with Opus 4.7 isn't free. The paper doesn't report token spend per task, which makes the "is this even economical to attempt?" question unanswerable.
The takeaway
ProgramBench is the cleanest articulation I've seen of the gap between editing code and authoring code. Agents are getting genuinely good at the first; the second remains an open problem. What I'm filing away after reading this: when I ship agentic coding tooling, the scaffolding I impose on the agent — file layout, module boundaries, directory conventions — is load-bearing in a way I'd been underweighting. The model isn't going to invent that structure for me. If anything, this paper convinces me to lean harder on spec-and-skeleton patterns: hand the agent the architecture, let it fill cells. The opposite direction — "give the agent a goal, let it design the codebase" — is where the 0% lives.
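For what "spec-and-skeleton" could look like in practice, here is one possible sketch: write the file layout and a one-line responsibility per module before the agent starts, so it fills in files rather than deciding what files exist. Everything in it is an assumption for illustration, including the `SKELETON` layout, the `write_skeleton` helper, and the TODO marker convention.

```python
import tempfile
from pathlib import Path

# Hypothetical architecture: relative path -> the one job that module has.
SKELETON = {
    "cli.py": "Parse command-line arguments and dispatch to core.",
    "core/parser.py": "Turn raw input into an internal representation.",
    "core/evaluator.py": "Execute the internal representation.",
    "tests/test_parity.py": "Behavioral tests against the reference program.",
}


def write_skeleton(root, layout):
    """Create stub files for each module; never overwrite existing work."""
    root = Path(root)
    created = []
    for rel_path, responsibility in sorted(layout.items()):
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.exists():
            continue
        target.write_text(
            f'"""{responsibility}"""\n\n'
            "# TODO(agent): implement here; do not move or merge this file.\n",
            encoding="utf-8",
        )
        created.append(target)
    return created


with tempfile.TemporaryDirectory() as tmp:
    created = write_skeleton(tmp, SKELETON)
    created_names = [p.relative_to(tmp).as_posix() for p in created]
    second_run = write_skeleton(tmp, SKELETON)  # nothing new to create

print(created_names)
```

The design choice worth noting is that the human owns the keys of `SKELETON` and the agent owns the file bodies, which matches where the paper suggests each side is currently competent.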