llm-ops · agents · context-engineering · subagent · arxiv-research

Subagents as a context-budget primitive

May 15, 2026 · 12 min read

The first time we asked an agent to summarize an arXiv paper inline, it ate ninety thousand tokens of context, returned a confident wrong summary, and could not write the next chapter because the budget was gone. The second time we used a subagent for the read, the parent never saw the paper at all. The summary was crisper. The chapter that followed it was better. Nothing about the model or the prompt or the rubric had changed — just the question of who owned the read.

It took us a while to notice this was not a workflow optimization. It was a budget primitive.

Frame it as an envelope: tokens spent inside the subagent never enter the parent, the dispatch dies with its budget, and the question stops being “do I want to add another agent to my hierarchy” and becomes “do I want to spend this token budget on this read, or on the chapter I am writing”.

This post is the engineering pattern, the recent literature converging on it, and the hazards we hit while shipping it inside LibWit.

The naming problem

The word subagent covers at least four different patterns that get conflated in production:

Pattern	What it is	Budget shape
Hierarchical decomposition	Parent agent breaks a task into pieces and dispatches each to a child agent that completes it.	Parent budget + N child budgets. Cumulative cost grows fast.
Specialist consultation	Parent calls a domain-specialized subagent (research-reader, code-reviewer) for a particular question.	Parent budget + 1 specialist budget. Specialist returns a summary, not the trace.
Verification / debate	Two or more agents argue; an adjudicator picks. The shape from the 170-papers post.	Parent + 2N debate budgets + 1 adjudicator. Quality lift; cost lift.
Context isolation	Subagent exists only to absorb a costly read or computation, returns a compressed summary, exits.	Parent budget unchanged. Subagent budget bounded by its task. This is the one this post is about.

The fourth pattern — subagent as resource boundary — has been quietly winning in production at every team that ships long-horizon agents, while the literature has focused on the first three. The first three are about capability (you want a subagent because it can do something the parent cannot, or do it better); the fourth is about resource (you want a subagent because the parent’s context budget is too valuable to spend on this particular task).

What changes when you frame subagents as budget

Once you accept that a subagent is a budget envelope, three engineering decisions get easier.

1. The dispatch cost is bounded

The dispatch itself is a tool call (small input, small output), and the expensive read happens entirely inside the subagent’s own context window, leaving the parent’s budget intact. Budget-Aware Tool-Use Enables Effective Agent Scaling (Nov 2025) measures this directly: when search agents are given explicit tool-call budgets and asked to plan against them, the cost-quality frontier shifts measurably. The subagent dispatch is the cleanest way to apply that budget — the parent budgets one tool call, the subagent’s whole existence happens behind it.

2. Cleanup is free

When a subagent finishes, its context window evaporates. There is no garbage collection step, no “should the parent retain this scratchpad,” no “did we leak any partial state.” The parent agent never had any of it in the first place. The only thing that crossed the boundary was the structured summary the subagent emitted. This is the shape LCM: Lossless Context Management (May 2026) describes when it names delegation as the natural context compactor. The same paper warns that “unrestricted delegation introduces its own recursion hazard: an agent may delegate its entire task to a sub-agent with an identical context,” which is the failure mode we’ll cover in the hazards section.

3. The budget is composable

If a subagent itself dispatches a sub-subagent for a read, the same property holds. Each level absorbs its own cost. The total budget across the tree can exceed any single agent’s context window many times over, but no individual agent ever blows past its own budget. Towards a Science of Scaling Agent Systems (Dec 2025) frames this as the basic scaling argument for multi-agent systems — context-window, cost, throughput, and latency all compose differently when work is isolated in subtrees.

The composition is what makes long-horizon agents actually long-horizon. A flat agent with a 200k context window runs out of room around the time it has read three big arXiv papers. The same agent with subagent-as-read-primitive can read thirty papers, summarize each one inside the subagent, and never approach its own context limit.
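The arithmetic behind that comparison can be sketched in a few lines. Only the 200k window comes from the text; the per-paper, summary, and dispatch sizes below are illustrative assumptions chosen to be consistent with the "three papers" observation, not measurements.

```python
# Illustrative context accounting: flat agent vs. subagent-as-read.
# Only CONTEXT_WINDOW comes from the post; the other sizes are assumptions.
CONTEXT_WINDOW = 200_000   # parent context window, in tokens
PAPER_TOKENS = 60_000      # assumed cost of reading one large paper inline
SUMMARY_TOKENS = 1_500     # assumed size of one typed summary
DISPATCH_TOKENS = 200      # assumed size of one dispatch request


def papers_before_overflow(cost_per_paper: int, window: int = CONTEXT_WINDOW) -> int:
    """Number of papers whose cost fits inside the parent's window."""
    return window // cost_per_paper


# Inline: the parent pays the full paper body for every read.
inline_capacity = papers_before_overflow(PAPER_TOKENS)

# Delegated: the parent pays only the dispatch request plus the returned summary.
delegated_cost = DISPATCH_TOKENS + SUMMARY_TOKENS
delegated_capacity = papers_before_overflow(delegated_cost)

# Parent-side cost of thirty delegated reads, as a share of the window.
thirty_reads_share = 30 * delegated_cost / CONTEXT_WINDOW

print(f"inline reads fit:    {inline_capacity}")
print(f"delegated reads fit: {delegated_capacity}")
print(f"30 delegated reads use {thirty_reads_share:.1%} of the parent window")
```

Under these assumed sizes, thirty delegated reads cost the parent about a quarter of its window, while the same reads inline would need roughly nine windows. The subagents still pay for the paper bodies; the point is only that the parent does not.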

A case study: book chapter generation

The book agent’s job is to write a long-form chapter that cites arXiv papers correctly. The naive implementation has the parent agent call download_paper and read_paper inline, then write the chapter. This works for short papers and fails for long ones in a specific, predictable way. A typical reference paper runs 60k–100k characters. The parent agent’s Read tool returns the full body, and the body fills the context. The model burns its remaining tokens trying to extract the relevant claims, gets confused, retries with the same approach until the 30-attempt budget is exhausted, and produces a chapter that confabulates citations.

The bug was not that the budget was wrong; it was that the wrong agent was paying it. The parent’s research-call budget was originally 2, set conservatively because we didn’t expect many paper reads per chapter. The actual budget required is 20, raised after we measured that limited-budget runs confabulated on the easiest source paper (REMem at ~103k chars). The write-deadline budget became 24: 20 research calls plus slack for memory writes and light reads.

The fix:

Before	After
Parent calls `download_paper` + `read_paper` inline	Parent calls `dispatch_subagent('arxiv_summarizer', paper_id, claim_context)`
Parent’s context fills with the paper body	Subagent’s context fills with the paper body; parent sees only the request and the typed summary
Parent has 30 attempts to extract claims from the dense source	Subagent has a dedicated budget and a single typed output: structured summary + flagged claims
Failures look like “agent confused, retried 30 times”	Failures look like “subagent returned `summary=null, confidence=low`” — diagnosable, retriable
Cost: 1 parent budget consumed	Cost: 1 dispatch cost + 1 subagent budget — but the parent’s budget is intact for the actual writing

[Figure: two-zone diagram. Left zone, PARENT AGENT: writing chapter 3, working memory, claim outline, draft prose; tagged “context budget intact”. Right zone, SUBAGENT (research-reader): the dense paper body (90k characters: full abstract, sections, references, citations); tagged “budget bounded by dispatch”. Dispatch arrow, left to right: paper id and claim context. Return arrow, right to left: typed summary struct, claim quotes, confidence.]

Caption: The paper body never touches the parent context. Two zones, one arrow each way: the paper body lives inside the subagent, the parent sees only the structured summary, and the budget that paid for the read dies with the subagent.

The parent never sees the paper body: the subagent handles the read behind its own prompt, its own tool list (download_paper, read_paper, mem0 write), and its own typed output contract. It does one thing: take a paper id and a claim context, return a structured summary the parent can paste into its working memory. It never knows what the parent is writing.

The same pattern shows up in the chapter planner, the verification pass that runs before every chapter ships, and the citation-detection job that surfaces cross-chapter quotes. In each case the parent is doing the load-bearing work and the subagent is doing the read. Delegating the read to the subagent is the clean half of the pattern; the delegation also delegates the failure, and those failures have a specific taxonomy that is not visible when the system is working.

The hazards

Three failure modes we have hit, each with a recent paper that documents it independently.

Recursion under unbounded delegation

When delegation is allowed but not constrained, the dispatched subagent may dispatch its own subagent with an effectively identical context. This is the recursion that LCM (May 2026) names: it looks like progress but produces nothing, because each additional level adds latency, cost, and risk while adding no information.

The mitigation in our pipeline is a depth cap baked into the dispatch primitive — three levels deep is the maximum any agent run can recurse, and the runtime refuses to dispatch past it. The cap is set empirically: at depth 4+, every real workload we measured was a recursion bug. Hard refusal is cheap; hard refusal with a clear error message is cheaper than letting the agent keep trying.
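A depth cap of this kind can be sketched as a check inside the dispatch path. The names here (`RunContext`, `DelegationDepthExceeded`) are hypothetical, not LibWit's API, and the sketch assumes the top-level parent sits at depth 0 so that subagents at depths 1 through 3 are allowed and depth 4 is refused.

```python
from dataclasses import dataclass

MAX_DEPTH = 3  # from the post: three levels deep is the maximum


class DelegationDepthExceeded(RuntimeError):
    """Raised by the runtime instead of dispatching past the cap."""


@dataclass(frozen=True)
class RunContext:
    """Hypothetical per-run context; depth 0 is the top-level parent (assumption)."""
    depth: int = 0

    def child(self) -> "RunContext":
        """Return the context for a dispatched subagent, or refuse past the cap."""
        next_depth = self.depth + 1
        if next_depth > MAX_DEPTH:
            # A clear, terminal message so the agent does not keep retrying.
            raise DelegationDepthExceeded(
                f"refusing dispatch at depth {next_depth}: cap is {MAX_DEPTH}; "
                "this usually indicates a recursion bug, do not retry"
            )
        return RunContext(depth=next_depth)


ctx = RunContext()
for _ in range(MAX_DEPTH):
    ctx = ctx.child()          # depths 1, 2, 3 are allowed

try:
    ctx.child()                # depth 4 is refused
except DelegationDepthExceeded as err:
    print(err)
```

Putting the check in the context object rather than in each agent's prompt means no prompt can talk its way past the cap.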

Context pollution when subagents return verbose summaries

Dynamic Attentional Context Scoping (Apr 2026) measures how the benefit can invert: when N concurrent subagents return verbose summaries to the parent, the parent’s context fills with concatenated traces, which is the same overflow problem the subagents were supposed to prevent. The paper names this context pollution.

The fix is the typed-output contract on the subagent. Our arxiv_summarizer subagent does not return its reasoning trace, its tool-call log, or its scratchpad. It returns one struct: title, abstract, three to five claim quotes, a verdict on each claim’s strength relative to the parent’s question, a confidence number. The parent pastes that struct, not the trace. The trace lives and dies inside the subagent’s process. If you debug a subagent, you read its log; you do not read its parent’s context.

The cheapest move that prevents the pollution is the cheapest move that makes the system observable. The two requirements are the same constraint in different costumes.
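The struct described above might look like the following sketch. The field names, the verdict labels, and the 0-to-1 confidence range are assumptions for illustration; the post specifies only the fields' meaning and the three-to-five claim count.

```python
from dataclasses import dataclass

# Assumed verdict labels; the post says only that each claim gets a strength verdict.
ALLOWED_VERDICTS = {"strong", "partial", "weak"}


@dataclass(frozen=True)
class ClaimQuote:
    quote: str    # verbatim quote from the paper
    verdict: str  # strength relative to the parent's question

    def __post_init__(self) -> None:
        if self.verdict not in ALLOWED_VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")


@dataclass(frozen=True)
class PaperSummary:
    """The only thing that crosses the boundary: no trace, no tool log, no scratchpad."""
    title: str
    abstract: str
    claims: tuple[ClaimQuote, ...]  # three to five claim quotes
    confidence: float               # assumed to lie in [0.0, 1.0]

    def __post_init__(self) -> None:
        if not 3 <= len(self.claims) <= 5:
            raise ValueError(f"expected 3-5 claim quotes, got {len(self.claims)}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```

Validating at construction time means a subagent that tries to return twenty quotes fails inside its own process, before anything verbose reaches the parent.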

Principal-agent leakage

The subagent’s reported result may not faithfully reflect the work it actually did. Multi-Agent Systems Should be Treated as Principal-Agent Problems (Jan 2026) is the framing that named this for us, drawing on the contract-theory hazards economists have long studied (moral hazard, adverse selection, information asymmetry) and applying them at the delegation boundary. Delegation is a principal-agent relationship: the parent is the principal and the subagent is the agent. The subagent has private information the principal cannot directly observe: its scratchpad, its intermediate reasoning, its rejected hypotheses.

In production this shows up as follows: the paper was not in the corpus, the subagent did not flag the failure cleanly, and it returned a plausible but hallucinated summary while claiming to have read the paper. The parent treats the summary as evidence, and the chapter ends up grounded on confabulation.

The mitigation is a contract: the subagent’s typed output must include a provenance field that names the actual files or URLs read, and the parent’s verification pass spot-checks them. The mechanism is small. The trust property it buys is large. It is also the same shape Why LLMs Aren’t Scientists Yet (Jan 2026) describes as their failure mode #4 — “overexcitement that declares success despite obvious failures” — applied at the subagent boundary instead of the top-level agent’s report. Each of the three hazards has a call-site mitigation, but recursion raises a structural question the other two do not: how many levels of delegation should the runtime permit before the primitive becomes a liability?
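The provenance contract can be sketched as a field on the result plus a parent-side check. Everything named here (`SubagentResult`, `unverified_sources`, the corpus index as a set of paths) is hypothetical, and for simplicity the sketch checks every claimed source rather than spot-checking a sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentResult:
    """Hypothetical typed output carrying its own provenance."""
    summary: str
    provenance: tuple[str, ...]  # files or URLs the subagent reports having read


def unverified_sources(result: SubagentResult, corpus_index: set[str]) -> list[str]:
    """Return claimed sources the parent cannot confirm; an empty list means pass.

    A result with no provenance at all is treated as a failure, since a summary
    with nothing behind it is exactly the case this check exists to catch.
    """
    if not result.provenance:
        return ["<no provenance reported>"]
    return [src for src in result.provenance if src not in corpus_index]


# Example paths are made up for illustration.
corpus = {"papers/a.pdf", "papers/b.pdf"}
grounded = SubagentResult("summary of a", ("papers/a.pdf",))
ungrounded = SubagentResult("plausible summary", ("papers/missing.pdf",))

print(unverified_sources(grounded, corpus))
print(unverified_sources(ungrounded, corpus))
```

A non-empty return value gives the parent a concrete reason to reject or retry the dispatch instead of citing the summary.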

What’s the right depth?

Depth is not the right axis to bound; total token cost is. The depth-3 cap exists only to fail fast on recursion bugs, not as a principled budget, and other teams will land on different numbers. A single deep subagent that does its own read is fine; a shallow tree where five sibling subagents each pull 100k-character documents is not. We use depth as a guard, not a budget. The actual budget is per-run, measured in tokens across the whole tree.
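A per-run budget measured across the whole tree can be sketched as one shared ledger that every agent in the run charges against. The class and method names are hypothetical and the numbers are illustrative, not our production settings.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a charge would push the run past its total token budget."""


class RunBudget:
    """Hypothetical shared ledger: one instance per run, passed to every agent."""

    def __init__(self, total_tokens: int) -> None:
        self.total = total_tokens
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.total - self.spent

    def charge(self, agent_id: str, tokens: int) -> None:
        """Record token spend for any agent in the tree, at any depth."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.spent + tokens > self.total:
            raise TokenBudgetExceeded(
                f"{agent_id} requested {tokens} tokens, only {self.remaining} remain"
            )
        self.spent += tokens


# A shallow tree can still exhaust the run: five siblings at 25k tokens each
# against an (illustrative) 100k-token run budget.
budget = RunBudget(total_tokens=100_000)
for i in range(5):
    try:
        budget.charge(f"reader-{i}", 25_000)
    except TokenBudgetExceeded as err:
        print(err)
```

Here the first four siblings fit and the fifth is refused, even though no agent is more than one level deep, which is the case a depth cap alone cannot catch.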

The five agent-architecture dimensions named by Architectural Design Decisions in AI Agent Harnesses (Apr 2026) — subagent architecture, context management, tool systems, safety mechanisms, and observability — are coupled: you cannot decide subagent architecture independently of how you manage context, and you cannot decide either without naming a budget. The literature has been writing about each dimension separately, and the engineering question is the joint design.

When subagents are the wrong answer

Three cases where reaching for a subagent makes things worse.

Cheap reads. If the read is small (under a few thousand tokens), the dispatch overhead exceeds the savings. The parent might as well do it inline. Subagents win on big reads, multi-file analyses, or anything that would have eaten more than ~5% of the parent’s context window.

Single-shot synthesis. Dispatch subagents for reads, not for reasoning: Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop (Apr 2026) shows that on multi-hop tasks where the answer requires combining information across steps, a single agent with a clean context often beats a parent-plus-subagent tree, because the cross-step inference happens better when everything is in one head.

Tasks that need the trace. If the parent needs the subagent’s reasoning history (not just the result) to do its own job, the subagent boundary becomes a bottleneck: you spend tokens compressing the trace, then the parent has to expand it back. Single-agent is better in this regime. We hit this in early verification passes where the parent wanted to see how the subagent reached its conclusion and ended up with a worse trace-of-a-trace than a single pass would have produced. Knowing the pattern’s limits is a precondition for building the infrastructure underneath it; the substrate components that make dispatch reliable are also what contain these failure modes before they compound across a run.
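The cheap-reads rule from the first case reduces to a one-line check at the call site. This is a rough sketch: the function name is hypothetical, the 5% default mirrors the approximate figure given above, and a real dispatcher would also have to account for the other two cases, which a size threshold cannot detect.

```python
def should_dispatch(read_tokens: int, parent_window: int, threshold: float = 0.05) -> bool:
    """Dispatch a subagent only when the read would eat more than
    `threshold` of the parent's context window; otherwise read inline."""
    return read_tokens > threshold * parent_window


# With a 200k-token parent window, the cutoff sits near 10k tokens.
print(should_dispatch(3_000, 200_000))    # small read: keep it inline
print(should_dispatch(60_000, 200_000))   # large paper: dispatch
```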

Composing it with the substrate

The subagent dispatch primitive only works because four substrate layers exist underneath it, none of them subagent-specific but all load-bearing. The agent-run substrate handles pause-resume-steering of any agent, including subagents. The context-engine substrate decides which contextual pieces feed any given LLM call, including the subagent dispatch call. The prompt harness registers the prompts each subagent type uses. The CAM memory substrate is what each subagent reads from and writes back into.

The closing observation that took a while to land: a subagent is not a unit of intelligence you bolt onto an agent system. It is a unit of budget. The intelligence comes from the model. The leverage comes from the substrate. The subagent is the place where you decide which budget pays for which read — and once that decision is explicit, the agent system stops blowing past its context limit on every long-horizon task.

If you are building agent code today and you do not have an explicit subagent-as-read primitive, the symptom that catches it first is usually the same: a chapter or report that quietly hallucinates citations because the parent ran out of room halfway through its source material. The fix is small. The systems thinking behind it is not.