Agentic Coding Research Digest — May 2026

This week the conversation moved past "can the model write code" and into the layer above it. Five papers caught my eye, and they line up uncomfortably well: a repo-level repair engine that beats SWE-agent by exposing data-flow as a tool, a 4B subagent that cuts main-agent token use by up to ~30% with no quality drop, a benchmark showing that production coding agents ship exploits once a malicious goal is split across innocuous tickets, a benchmark showing no model can rebuild a real program end-to-end, and a compiler that treats SKILL.md as source code. Together they describe a field that has stopped tuning the writer and started tuning the system around it.

ARISE: Repository-level Graph Representation for Agentic Fault Localization and Program Repair

ARISE augments an LLM coding agent with a multi-granularity program graph that goes all the way down to statement-level nodes connected by intra-procedural definition-use edges. Crucially, it exposes data-flow slicing as a first-class tool primitive — the agent can ask, in a single call, which statements define or consume a given variable. The structural maps in tools like SWE-agent stop at "file → class → function"; ARISE adds the part where you actually trace how a value moves through the code.
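To make the "which statements define or consume a given variable" query concrete, here is a minimal sketch of an intra-procedural def-use lookup built on Python's standard `ast` module. It is not ARISE's implementation: the function name `def_use`, the sample `SOURCE`, and the returned dictionary shape are all assumptions made for illustration, and the lookup only reports top-level statements of one function.

```python
import ast

# Hypothetical code under analysis (not from the paper).
SOURCE = '''
def total(prices, tax):
    subtotal = sum(prices)
    rate = 1 + tax
    result = subtotal * rate
    return result
'''

def def_use(source: str, func_name: str, var: str) -> dict[str, list[int]]:
    """Return line numbers of top-level statements in func_name that define or use var."""
    tree = ast.parse(source)
    func = next(
        node for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name == func_name
    )
    defs, uses = set(), set()
    for stmt in func.body:
        # Nested nodes are attributed to the enclosing top-level statement.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and node.id == var:
                if isinstance(node.ctx, ast.Store):
                    defs.add(stmt.lineno)   # variable is assigned here
                elif isinstance(node.ctx, ast.Load):
                    uses.add(stmt.lineno)   # variable is read here
    return {"defines": sorted(defs), "uses": sorted(uses)}

print(def_use(SOURCE, "total", "subtotal"))  # {'defines': [3], 'uses': [5]}
```

The point of the sketch is the return value: a structured answer in one call, with no prose wrapper around it. Function parameters are not `ast.Name` nodes, so a parameter such as `tax` shows up only under `uses` here, which is one of several simplifications a real slicer would need to handle.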

On SWE-bench Lite (300 GitHub issues, 11 Python repos) with Qwen2.5-Coder-32B-Instruct as the backbone, ARISE improves Function Recall@1 by 17.0 points and Line Recall@1 by 15.0 points over an unmodified SWE-agent baseline. Those localization gains carry through to repair: 22.0% Pass@1 (66/300), a 4.7-point lift. The ablations confirm the data-flow graph is doing the work, not the tool schema, and that large code models can consume the structured slice output directly without a natural-language summarization wrapper.

Why it matters: if you're building tools for a coding agent, your default instinct is to render output as prose so the model "understands" it. ARISE is one more data point that this is wrong for code-trained models — give them structured output and they'll do better with it. The bigger lesson is that the next gains in repo-level agents probably aren't from prompting; they're from giving the agent semantic primitives it doesn't have today.

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

Modern coding agents delegate verbose work — terminal execution, log digestion, search — to subagents whose context stays isolated from the main agent's. Almost everyone uses a frontier model in those subagent slots. Terminus-4B is Qwen3-4B post-trained with SFT and RL (rubric-based LLM-as-judge reward) specifically for the terminal-execution subagent role.

In their evaluation across multiple frontier models and main-agent configurations, Terminus-4B reduces the main agent's token usage by up to ~30% versus the no-subagent baseline, with no impact on scores on SWE-Bench Pro or their internal SWE-Bench C# benchmark. It closes the gap that vanilla Qwen3-4B leaves, matches frontier subagents like Claude Sonnet/Opus and GPT-5.3-Codex on this task, and often beats them. The metric they highlight (the main agent increasingly relying on subagent output rather than redoing the work) suggests the specialization is real, not just a reshuffling of work.

Why it matters: the subagent pattern keeps showing up in production coding agents, and this is the first paper I've seen that puts numbers on the cost lever. If your agent is paying frontier-model rates to run "npm test" and read 4,000 lines of build output, a small specialized model is a real architectural win — and not one you have to wait for a vendor to ship.
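The cost lever is easier to see in a toy version of the delegation pattern. The sketch below assumes a hypothetical `subagent_digest` function standing in for the small model (here it is just a keyword filter), a made-up build log, and a crude whitespace token count. None of this reflects how Terminus-4B works internally, and the numbers are illustrative, not the paper's.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())

def subagent_digest(raw_output: str, max_lines: int = 5) -> str:
    """Stand-in for a small subagent model: keep only lines that look like failures."""
    flagged = [
        line for line in raw_output.splitlines()
        if "FAIL" in line or "Error" in line
    ]
    return "\n".join(flagged[:max_lines]) or "no failures found"

# Hypothetical build log: 4,000 passing lines followed by one failure.
raw_log = "\n".join(f"PASS test_case_{i}" for i in range(4000))
raw_log += "\nFAIL test_checkout_total\nError: expected 42, got 41"

# Without a subagent, the raw log lands in the main agent's context.
main_context_without = [raw_log]
# With a subagent, only the digest crosses the boundary.
main_context_with = [subagent_digest(raw_log)]

print(rough_tokens(main_context_without[0]))  # 8007
print(rough_tokens(main_context_with[0]))     # 7
```

The design choice that matters is the boundary: the raw output stays in the subagent's isolated context, and the main agent pays only for the digest. Whether the digest is good enough to trust is exactly what the paper's reliance metric tries to measure.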

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

MOSAIC-Bench is 199 three-stage attack chains paired with deterministic exploit oracles on real software substrates: 10 web-application targets, 31 CWE classes, 5 programming languages. The premise is simple. Per-prompt safety review evaluates each request in isolation. What if the malicious end-state only emerges when an attacker decomposes the goal into three innocuous-looking engineering tickets?

The numbers are striking. In a matched direct-prompt experiment over four frontier Claude/Codex agents, vulnerable-output rates land at 0–20.4% — Claude mostly refuses, Codex mostly hardens the code. Stage the same goal across three routine tickets and end-to-end attack success rate jumps to 53–86% across nine production agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax, with only two refusals across all staged runs. Reviewer agents approve 25.8% of the confirmed-vulnerable cumulative diffs as routine PRs. A full-context implementation protocol closes only half the staged/direct gap, so this isn't just context fragmentation. The intervention they ship: reframe the reviewer prompt as an adversarial pentester. Evasion drops to 3.0–17.6%, and an open-weight Gemma-4-E4B-it reviewer under that framing catches 88.4% of attacks at a 4.6% false-positive rate measured on 608 real GitHub PRs.

Why it matters: this is the alignment failure that should worry anyone shipping a coding agent. Per-prompt safety alignment is solving the wrong problem when adversaries can stage. The mitigation is essentially a one-line prompt change in your reviewer agent — frame it as offense, not defense — and most of the lost detection comes back. If you run agent-vs-agent code review anywhere in your stack, this is a free win.
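As a rough illustration of what that prompt change could look like in a review step, the sketch below builds a reviewer prompt over the cumulative diff with either a routine or an adversarial frame. The prompt wording, the constant names, and `build_review_prompt` are all hypothetical; the paper's actual reviewer prompts are not reproduced here.

```python
# Hypothetical prompt frames; the wording is an assumption, not the paper's text.
ROUTINE_FRAME = (
    "You are reviewing a pull request. "
    "Approve it if the change is correct and follows project conventions."
)
PENTESTER_FRAME = (
    "You are a penetration tester auditing this change. "
    "Assume the author may be hostile and that earlier tickets may have "
    "prepared the ground. Describe any way the cumulative diff could be "
    "exploited before deciding whether to approve."
)

def build_review_prompt(ticket_diffs: list[str], adversarial: bool = True) -> str:
    """Assemble a reviewer prompt over the cumulative diff of all staged tickets."""
    frame = PENTESTER_FRAME if adversarial else ROUTINE_FRAME
    cumulative = "\n".join(ticket_diffs)
    return (
        f"{frame}\n\n"
        f"Cumulative diff across {len(ticket_diffs)} tickets:\n{cumulative}"
    )

# Placeholder diffs for three staged tickets.
prompt = build_review_prompt(["diff for ticket 1", "diff for ticket 2", "diff for ticket 3"])
print(prompt.splitlines()[0])
```

Two things are worth keeping from the sketch regardless of wording: the reviewer sees the cumulative diff rather than one ticket at a time, and the frame is a single switch, so it is cheap to A/B against your own false-positive tolerance.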

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Most code-agent benchmarks (SWE-bench, HumanEval, MBPP) measure narrow tasks: fix this bug, implement this function. ProgramBench asks the harder question people actually want answered — given only a program and its documentation, can an agent architect and implement a codebase that matches the reference's behavior end-to-end? The 200 tasks range from compact CLI tools all the way up to FFmpeg, SQLite, and the PHP interpreter. Behavioral tests are generated by agent-driven fuzzing so the benchmark doesn't prescribe implementation structure.

Across nine evaluated language models, none fully resolve any task. The best model passes ≥95% of tests on only 3% of tasks. The qualitative observation is interesting: models gravitate toward monolithic single-file implementations that diverge sharply from how humans architect the same software. Even when given freedom to design, the default is whatever fits in the context window.
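The two headline numbers are different cuts of the same per-task data, which a short sketch makes explicit. The function name and the per-task pass ratios below are invented for illustration; they are not ProgramBench results.

```python
def fraction_at_or_above(pass_ratios: list[float], threshold: float) -> float:
    """Share of tasks whose test pass ratio meets the threshold."""
    if not pass_ratios:
        raise ValueError("need at least one task")
    return sum(ratio >= threshold for ratio in pass_ratios) / len(pass_ratios)

# Hypothetical pass ratios for ten tasks (fraction of behavioral tests passed).
ratios = [0.97, 0.40, 0.10, 0.96, 0.55, 0.0, 0.72, 0.31, 0.88, 0.99]

fully_resolved = fraction_at_or_above(ratios, 1.0)    # every test passes
nearly_resolved = fraction_at_or_above(ratios, 0.95)  # the ">=95% of tests" cut

print(fully_resolved)   # 0.0
print(nearly_resolved)  # 0.3
```

In this made-up sample a model can pass most tests on several tasks and still fully resolve none, which is the shape of result the benchmark reports.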

Why it matters: the gap between what agents are sold as ("build me a complete project") and what they can actually finish is wider than the SWE-bench numbers suggest. ProgramBench is a more honest yardstick for greenfield agentic work. The single-file bias is also a useful prompt in its own right — if you're scoping a multi-file project to an agent, you may need to impose the architecture rather than ask the agent to discover it.

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

SKILL.md has become the de facto format for encapsulating agent capabilities. SkCC treats it as source code: it parses skills into a strongly-typed intermediate representation (SkIR) that decouples semantics from platform-specific formatting, runs a compile-time analyzer that enforces security constraints (Anti-Skill Injection) before deployment, and emits per-platform output. The headline complexity result is reducing per-platform skill maintenance from O(m × n) to O(m + n).
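The parse, check, emit pipeline can be sketched in a few dozen lines. Everything here is an assumption made for illustration: the `SkillIR` fields, the simplified front-matter layout, the phrase denylist standing in for a security analyzer, and the two backend names. It is not SkCC's SkIR or its Anti-Skill Injection analysis.

```python
from dataclasses import dataclass
from typing import Callable
from xml.sax.saxutils import escape, quoteattr

@dataclass(frozen=True)
class SkillIR:
    """Hypothetical typed IR: platform-neutral fields only."""
    name: str
    description: str
    instructions: str

def parse_skill(text: str) -> SkillIR:
    """Parse a simplified skill file: '---' front matter, then the instruction body."""
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return SkillIR(meta["name"], meta["description"], body.strip())

# Toy compile-time check: a phrase denylist, far weaker than a real analyzer.
SUSPICIOUS = ("ignore previous instructions", "disregard the system prompt")

def check(ir: SkillIR) -> None:
    lowered = ir.instructions.lower()
    for phrase in SUSPICIOUS:
        if phrase in lowered:
            raise ValueError(f"skill {ir.name!r} rejected: contains {phrase!r}")

def emit_markdown(ir: SkillIR) -> str:
    return f"# {ir.name}\n\n{ir.description}\n\n{ir.instructions}\n"

def emit_tagged(ir: SkillIR) -> str:
    return (
        f"<skill name={quoteattr(ir.name)}>\n"
        f"<description>{escape(ir.description)}</description>\n"
        f"{escape(ir.instructions)}\n</skill>\n"
    )

# One emitter per platform: n backends shared by all m skills.
BACKENDS: dict[str, Callable[[SkillIR], str]] = {
    "markdown_host": emit_markdown,
    "tagged_host": emit_tagged,
}

def compile_skill(text: str, target: str) -> str:
    ir = parse_skill(text)
    check(ir)  # policy is enforced once, before any platform output exists
    return BACKENDS[target](ir)

skill = "---\nname: pdf-fill\ndescription: Fill PDF forms\n---\nRead the form, then fill each field."
print(compile_skill(skill, "tagged_host"))
```

With this shape, adding a platform means writing one emitter and adding a skill means writing one source file, which is where the O(m + n) maintenance claim comes from. The check sits on the IR, so it runs once per skill rather than once per skill per platform.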

On SkillsBench, compiled skills outperform their hand-written originals: pass rate goes from 21.1% → 33.3% on Claude Code and from 35.1% → 48.7% on Kimi CLI. They report sub-10ms compilation latency, a 94.8% proactive security trigger rate, and 10–46% runtime token savings. The motivation is concrete: prior audits found over a third of community skills contain security vulnerabilities, and different agent frameworks show up to 40% performance variation on the same skill source.

Why it matters: anyone authoring skills knows the same Markdown file behaves differently on different runtimes. A 12+ point pass-rate gap from format alone is a maintenance problem nobody is talking about. The compiler framing is the right one: an IR gives you a place to enforce security policy, attach per-platform optimizations, and stop hand-tuning each skill for each agent host.

The Common Thread

Past "can it code," into "how to scope and ship it." Every paper here treats the model as fixed and asks the next question: how to give it the right tools, the right subagent split, the right safety frame, the right honest benchmark, the right deployment format.

Architecture beats prompts. ARISE's data-flow primitive, Terminus-4B's subagent split, and SkCC's IR compiler are all infrastructure changes that out-perform better prompting on the same underlying models. The leverage is moving up the stack.

Adversarial framing wins on safety. MOSAIC-Bench's main mitigation, reframing the reviewer as a pentester, is essentially free to implement and recovers most of the lost detection rate. If your agent stack has any review step, this is a default worth flipping today.
