A 4B Model Just Replaced Frontier LLMs in the Subagent Slot

While I was skimming this morning's arXiv list, one paper made me stop and re-read the abstract. It asks a small, blunt question: can a 4B model do the boring half of agentic coding well enough that you don't need to burn frontier tokens on it? The answer is yes, basically, and it comes with concrete numbers I haven't seen elsewhere. The thesis I'm taking away: the subagent is no longer just an architectural pattern, it's a model-tier opportunity.

What it does

The paper is "Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?" (Garg, Nitin, Huang, May 4, 2026). The setup is the now-standard two-tier agent: a frontier main agent that plans, reads files, and writes code, plus a subagent that gets called for noisy, multi-turn terminal work like running tests, diagnosing errors, and compiling. Many production systems already split this way (Claude Code does, and so do most serious coding agents shipping today), but the subagent is still typically the same frontier model. That's where the cost lives.

What's different here is that the authors actually trained a 4B specialist for the subagent role, end-to-end. They start from Qwen3-4B-Instruct, do supervised fine-tuning on ~3,200 execution tasks distilled from frontier-model rollouts across 2,144 repos in TypeScript, C#, Java, JavaScript, and Python, then run GRPO reinforcement learning with a rubric-based LLM-as-judge reward that scores execution trajectories on seven quality dimensions and four failure modes. The subagent itself is deliberately constrained — single tool (terminal), one call per turn, ten-turn cap, final answer in an XML-delimited block. That tight scope is what makes the small-model bet work: you're not asking it to be a general agent, you're asking it to be a competent test-runner with structured output.
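To make that contract concrete, here is a minimal sketch of what the constrained subagent loop could look like. The ten-turn cap, single terminal tool, and one-call-per-turn rule come from the setup described above; the tag names `<terminal>` and `<final_answer>`, the function names, and the plain-string history format are my own assumptions for illustration, not the paper's actual harness.

```python
import re
import subprocess
from typing import Callable, List, Optional

MAX_TURNS = 10  # ten-turn cap described in the post

# Tag names are assumptions; the post only says "XML-delimited block".
FINAL_RE = re.compile(r"<final_answer>(.*?)</final_answer>", re.DOTALL)
CMD_RE = re.compile(r"<terminal>(.*?)</terminal>", re.DOTALL)


def run_terminal(command: str, timeout: int = 60) -> str:
    """Run one shell command and return its combined stdout and stderr."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return (proc.stdout + proc.stderr).strip()


def run_subagent(
    task: str,
    model: Callable[[List[str]], str],
    terminal: Callable[[str], str] = run_terminal,
) -> Optional[str]:
    """Drive a model through at most MAX_TURNS turns.

    Returns the final answer text, or None if the cap is hit first.
    """
    history = [f"TASK: {task}"]
    for _ in range(MAX_TURNS):
        reply = model(history)
        history.append(reply)

        # A final answer ends the episode, even if a command is also present.
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()

        # Enforce the one-terminal-call-per-turn rule.
        commands = CMD_RE.findall(reply)
        if len(commands) != 1:
            history.append("ERROR: emit exactly one <terminal> call per turn")
            continue

        history.append("OUTPUT: " + terminal(commands[0].strip()))
    return None
```

The `model` argument is any callable that maps the history to the next reply, so a scripted stub works for testing the loop before a real model is wired in. Only the final answer goes back to the main agent; the terminal output stays in the subagent's own history.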

The key result

The headline I keep coming back to: a roughly 30% reduction in main-agent tokens with no resolution-rate degradation. On SWE-Bench C# with Claude Opus as main agent, Terminus-4B brings token usage from 1,010k down to 693k (a 31% cut), drops main-agent terminal calls from 6.2 to 1.7 (a 73% reduction), and resolution rate holds at 45.3–46.7%. On SWE-Bench Pro the resolution rate actually nudges up slightly (31.5% with Terminus-4B as the subagent versus 30.0% baseline), while terminal calls drop from 3.8 to 1.0 and tokens from 836k to 730k, a smaller cut of about 13%. Across languages, the small model matches Opus and Sonnet in the subagent role rather than merely approximating them, which is the surprising bit.
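For anyone who wants to sanity-check those percentages, the arithmetic is short. The inputs below are the before and after figures quoted above; the helper name `pct_reduction` is just mine.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# SWE-Bench C#, figures as quoted in the post
csharp_tokens = pct_reduction(1010, 693)  # main-agent tokens, in thousands
csharp_calls = pct_reduction(6.2, 1.7)    # main-agent terminal calls

# SWE-Bench Pro, figures as quoted in the post
pro_tokens = pct_reduction(836, 730)
pro_calls = pct_reduction(3.8, 1.0)

print(f"C#  tokens: {csharp_tokens:.1f}%  calls: {csharp_calls:.1f}%")
print(f"Pro tokens: {pro_tokens:.1f}%  calls: {pro_calls:.1f}%")
# C#  tokens: 31.4%  calls: 72.6%
# Pro tokens: 12.7%  calls: 73.7%
```

The terminal-call reduction is about the same on both benchmarks, while the token saving on SWE-Bench Pro is well under half of the C# figure.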

Why it matters

If you're building agentic coding tooling right now, the binding cost in production is almost always the main-agent context: every frontier-model turn re-ingests history, tool schemas, and accumulated terminal output. The standard fix is to push noisy work into a subagent so its long execution trace stays out of the main loop. But until now you'd typically point the subagent at the same frontier model, because nothing smaller was reliably good enough. In this paper's ablation, vanilla open-weight 4B models actually increased token usage by 9.5%, because the main agent kept having to redo work the subagent botched. Terminus-4B is the first concrete demonstration I've seen that you can train a specialist that's small enough to self-host, fast enough to make the subagent loop snappy, and good enough that the main agent doesn't need to retry.

The practical move this unlocks is treating the subagent boundary as a model-tier boundary. If you're running anything that fans out execution to subagents (code search, test running, build verification, even tool-call handlers in non-coding agents), there's now a credible recipe for replacing the frontier model in those slots: distill rollouts from a frontier model, fine-tune a small base, and post-train with execution-grounded rubrics rather than pass/fail rewards. The reward design is worth lifting on its own. Pass/fail signals on SWE-Bench-style tasks are sparse and noisy; scoring trajectories with a rubric-based LLM-as-judge against frontier reference trajectories gave the authors a dense, multi-dimensional reward signal instead. That's a generalizable pattern for any agent role with a clear structured-output contract.
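Here is a rough sketch of how a rubric score could feed a GRPO-style update. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. The way I combine rubric dimensions and failure modes into one scalar, the `penalty` value, and the dimension names in the example are assumptions for illustration; the post does not give the paper's exact aggregation.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rubric_reward(
    quality: Dict[str, float],
    failures: Dict[str, bool],
    penalty: float = 0.25,  # assumed penalty per triggered failure mode
) -> float:
    """Collapse judge output into one scalar reward in [0, 1].

    quality:  rubric dimension -> judge score in [0, 1]
    failures: failure mode -> whether the judge flagged it
    """
    base = mean(quality.values())
    triggered = sum(failures.values())  # True counts as 1
    return max(0.0, base - penalty * triggered)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts for the same task, with hypothetical dimension names.
good = rubric_reward({"correct_command": 1.0, "clear_summary": 1.0}, {"hallucinated_output": False})
bad = rubric_reward({"correct_command": 0.5, "clear_summary": 0.5}, {"hallucinated_output": True})
advantages = group_relative_advantages([good, bad])
```

The point of the rubric is visible even in this toy: two rollouts that would both count as "fail" under a pass/fail reward can still get different scores, so the group has a nonzero spread to learn from.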

The caveats

Unix/Bash only. Training and eval skip Windows PowerShell and zsh entirely. Real production fleets will hit shell-specific failure modes this paper can't tell you about.
SWE-Bench-shaped tasks. Both benchmarks are repo-issue-fix patterns in Docker containers. Long-horizon greenfield development, ambiguous user intents, and interactive workflows aren't covered — and "execution subagent" is exactly the role most narrowed by benchmark choice.
Qwen3-4B only. No comparison against other open-weight families or sizes, so we don't know if the result is about the recipe or the specific base model's terminal-following inductive biases.
Cost isn't fully modeled. The 30% main-agent token saving is real, but you're now also paying inference for Terminus-4B; whether that nets out to a wallet-level win depends on how you host it.
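On that last caveat, a back-of-envelope model helps frame the question. The main-agent token counts below are the SWE-Bench C# figures quoted earlier; the subagent token count and both prices are made-up placeholders, since the post gives neither, so treat the output as a template rather than a result.

```python
def net_saving_per_task(
    main_before_k: float,
    main_after_k: float,
    sub_tokens_k: float,
    frontier_price_per_k: float,
    small_price_per_k: float,
) -> float:
    """Frontier spend saved on the main agent minus new small-model spend."""
    saved = (main_before_k - main_after_k) * frontier_price_per_k
    added = sub_tokens_k * small_price_per_k
    return saved - added


def break_even_small_price(
    main_before_k: float,
    main_after_k: float,
    sub_tokens_k: float,
    frontier_price_per_k: float,
) -> float:
    """Small-model price per 1k tokens at which the net saving is zero."""
    return (main_before_k - main_after_k) * frontier_price_per_k / sub_tokens_k


# 1010 and 693 are from the post; everything else here is hypothetical.
HYPOTHETICAL_SUB_TOKENS_K = 500.0
HYPOTHETICAL_FRONTIER_PRICE = 0.01  # per 1k tokens, arbitrary units
HYPOTHETICAL_SMALL_PRICE = 0.001    # per 1k tokens, arbitrary units

net = net_saving_per_task(
    1010, 693, HYPOTHETICAL_SUB_TOKENS_K,
    HYPOTHETICAL_FRONTIER_PRICE, HYPOTHETICAL_SMALL_PRICE,
)
ceiling = break_even_small_price(
    1010, 693, HYPOTHETICAL_SUB_TOKENS_K, HYPOTHETICAL_FRONTIER_PRICE
)
```

This sketch deliberately ignores whatever the subagent cost in the baseline. If the baseline subagent was itself a frontier model, the real saving is larger than this computes. The useful number is the break-even price: as long as your self-hosted cost per token sits below it, the swap pays for itself on main-agent savings alone.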

The takeaway

What I'm filing away: the frontier-model-everywhere default for agentic systems is finally cracking, and it's cracking first at the subagent boundary because that's where the contract is structured and the trajectories are repetitive enough to distill. If I were architecting a new coding agent today I'd separate "model that decides what to do" from "model that does it," design the subagent's tool surface and output contract first, and assume the do-er can be a fine-tuned small model. One sentence I'm taking into next week's planning: design the subagent contract before you pick the subagent model.
