09 · 07 May 2026 · 5 MIN READ

A 4B Model Just Replaced Frontier LLMs in the Subagent Slot

Skimming this morning's arXiv list, I hit one paper that made me stop and re-read the abstract. It asks a small, blunt question: can a 4B model do the boring half of agentic coding well enough that you don't need to burn frontier tokens on it? The answer is yes, basically, and it lands with concrete numbers I haven't seen elsewhere. The thesis I'm taking away: the subagent is no longer just an architectural pattern, it's a model-tier opportunity.

What it does

The paper is "Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?" (Garg, Nitin, Huang, May 4, 2026). The setup is the now-standard two-tier agent: a frontier main agent that plans, reads files, and writes code, plus a subagent that gets called for noisy, multi-turn terminal work like running tests, diagnosing errors, and compiling. Most production systems already split this way (Claude Code does, and so does pretty much every serious coding agent shipping today), but the subagent is still typically the same frontier model. That's where the cost lives.

What's different here is that the authors actually trained a 4B specialist for the subagent role, end-to-end. They start from Qwen3-4B-Instruct, do supervised fine-tuning on ~3,200 execution tasks distilled from frontier-model rollouts across 2,144 repos in TypeScript, C#, Java, JavaScript, and Python, then run GRPO reinforcement learning with a rubric-based LLM-as-judge reward that scores execution trajectories on seven quality dimensions and four failure modes. The subagent itself is deliberately constrained — single tool (terminal), one call per turn, ten-turn cap, final answer in an XML-delimited block. That tight scope is what makes the small-model bet work: you're not asking it to be a general agent, you're asking it to be a competent test-runner with structured output.
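To make that contract concrete, here is a minimal sketch of what such a constrained loop could look like. It is my own illustration, not the paper's harness: the `<terminal>` and `<answer>` tag names, the `model` callable, and the transcript format are all assumptions; only the single tool, the one-call-per-turn rule, and the ten-turn cap come from the description above.

```python
import re
import subprocess
from typing import Callable, List, Optional

MAX_TURNS = 10  # the ten-turn cap described above

# Hypothetical tag names; the post only says the final answer is XML-delimited.
COMMAND_RE = re.compile(r"<terminal>(.*?)</terminal>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_terminal(command: str, timeout_s: int = 60) -> str:
    """Run one shell command and return its combined stdout and stderr."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout_s}s]"
    return (proc.stdout + proc.stderr).strip()


def run_subagent(
    task: str,
    model: Callable[[List[str]], str],  # hypothetical: transcript in, reply out
    terminal: Callable[[str], str] = run_terminal,
) -> Optional[str]:
    """Drive the subagent until it emits an answer block or hits the turn cap."""
    transcript = [f"TASK: {task}"]
    for _ in range(MAX_TURNS):
        reply = model(transcript)
        transcript.append(reply)

        answer = ANSWER_RE.search(reply)
        if answer:
            return answer.group(1).strip()

        # Only the first command in a reply is executed: one call per turn.
        command = COMMAND_RE.search(reply)
        if command is None:
            transcript.append("ERROR: reply needs one <terminal> call or an <answer> block")
            continue
        transcript.append("OUTPUT: " + terminal(command.group(1).strip()))
    return None  # turn cap reached without a final answer
```

The main agent only ever sees the returned answer string, so the raw terminal output stays inside the subagent's transcript.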

The key result

The headline I keep coming back to: roughly 30% fewer main-agent tokens with no resolution-rate degradation. On SWE-Bench C# with Claude Opus as main agent, Terminus-4B brings token usage from 1,010k down to 693k (a 31% cut), drops main-agent terminal calls from 6.2 to 1.7 (a 73% reduction), and resolution rate holds at 45.3–46.7%. On SWE-Bench Pro the token saving is smaller, 836k down to 730k (about 13%), but terminal calls drop from 3.8 to 1.0 and the resolution rate actually nudges up slightly: 31.5% with Terminus-4B as the subagent versus 30.0% baseline. Across cross-language tasks the small model matches Opus and Sonnet in the subagent role rather than approximating them, which is the surprising bit.
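For anyone who wants to sanity-check the percentages, the arithmetic is short. The before/after pairs below are the figures quoted in this section; the helper name `pct_reduction` is just mine.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# (before, after) pairs as quoted above
figures = {
    "SWE-Bench C# main-agent tokens (k)": (1010, 693),
    "SWE-Bench C# main-agent terminal calls": (6.2, 1.7),
    "SWE-Bench Pro main-agent tokens (k)": (836, 730),
    "SWE-Bench Pro main-agent terminal calls": (3.8, 1.0),
}

for name, (before, after) in figures.items():
    print(f"{name}: {pct_reduction(before, after):.1f}% lower")
```

That works out to 31.4% and 72.6% on SWE-Bench C#, and 12.7% and 73.7% on SWE-Bench Pro, so the roughly-30% token headline is the C# number while the terminal-call reduction is about the same on both benchmarks.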

Why it matters

If you're building agentic coding tooling right now, the binding cost in production is almost always the main-agent context: every frontier-model turn re-ingests history, tool schemas, and accumulated terminal output. The standard fix is to push noisy work into a subagent so its long execution trace stays out of the main loop. But until now you'd typically point the subagent at the same frontier model, because nothing smaller was reliably good enough — vanilla open-weight 4B models actually increased token usage in this paper's ablation by 9.5%, because the main agent kept having to re-do work the subagent botched. Terminus-4B is the first concrete demonstration I've seen that you can train a specialist that's small enough to self-host, fast enough to make the subagent loop snappy, and good enough that the main agent doesn't need to retry.
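A toy model shows why re-ingestion makes raw terminal output so expensive in the main loop. Every number here is made up for illustration (the per-turn overhead, the output sizes, the turn count), and none of it comes from the paper; the only point is that history is re-read on every turn, so cost grows with the running total rather than with the size of each output.

```python
from typing import Sequence


def main_agent_input_tokens(turn_outputs: Sequence[int], overhead: int = 2000) -> int:
    """Total input tokens when each turn re-reads a fixed overhead plus all prior output.

    `overhead` stands in for the system prompt and tool schemas (illustrative value).
    """
    total = 0
    history = 0
    for output_tokens in turn_outputs:
        total += overhead + history  # this turn re-ingests everything so far
        history += output_tokens     # this turn's output joins the history
    return total


# Six turns where raw terminal output (3,000 tokens each) lands in main context...
inline = main_agent_input_tokens([3000] * 6)
# ...versus six turns where only a short subagent summary (300 tokens each) does.
delegated = main_agent_input_tokens([300] * 6)

print(inline, delegated)
```

With these invented numbers the inline case costs 57,000 input tokens against 16,500 for the delegated case. A subagent that botches the job erases that gap, because the main agent ends up pulling the raw output back in anyway.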

The practical move this unlocks is treating the subagent boundary as a model-tier boundary. If you're running anything that fans out execution to subagents (code search, test running, build verification, even tool-call handlers in non-coding agents), there's now a credible recipe for replacing the frontier model in those slots: distill rollouts from a frontier model, fine-tune a small base, then post-train with execution-grounded rubrics rather than pass/fail rewards. The reward design is worth lifting on its own. Pass/fail signals on SWE-Bench-style tasks are sparse and noisy; a rubric-based LLM-as-judge scoring against frontier reference trajectories gave the authors a dense, multi-dimensional reward signal. That's a generalizable pattern for any agent role with a clear structured-output contract.
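Here is a rough sketch of how a rubric score could feed a GRPO-style update. The aggregation rule (mean quality minus a flat penalty per failure mode) and the penalty value are my assumptions, since the post only gives the counts of seven quality dimensions and four failure modes. The group-relative normalization is the standard GRPO idea of scoring each rollout against the other rollouts sampled for the same task; the choice of population standard deviation is also mine.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def rubric_reward(
    quality: Sequence[float],      # seven judge scores, each assumed to lie in [0, 1]
    failures: Sequence[bool],      # four failure-mode flags raised by the judge
    failure_penalty: float = 0.25,  # assumed weight, not from the paper
) -> float:
    """Collapse one judged trajectory into a scalar reward (assumed aggregation)."""
    if len(quality) != 7 or len(failures) != 4:
        raise ValueError("expected 7 quality scores and 4 failure flags")
    return mean(quality) - failure_penalty * sum(failures)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts for the same task: one clean, one that trips a failure mode.
group = [
    rubric_reward([1.0] * 7, [False, False, False, False]),
    rubric_reward([0.5] * 7, [True, False, False, False]),
]
print(group_relative_advantages(group))
```

Because the reward is a graded score instead of a single pass/fail bit, two rollouts that both fail the task can still get different advantages, which is where the denser signal comes from.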

The caveats

The takeaway

What I'm filing away: the frontier-model-everywhere default for agentic systems is finally cracking, and it's cracking first at the subagent boundary because that's where the contract is structured and the trajectories are repetitive enough to distill. If I were architecting a new coding agent today I'd separate "model that decides what to do" from "model that does it," design the subagent's tool surface and output contract first, and assume the do-er can be a fine-tuned small model. One sentence I'm taking into next week's planning: design the subagent contract before you pick the subagent model.

