llm-ops · agent-routing · capability-profile · confidence-cascade · arxiv-research

Probe before dispatch: the routing pattern we built without knowing it had a name

May 2, 2026 · 14 min read

The first time you watch a chapter-generation worker confabulate citations because it ran out of context on the source paper, the reflex is to bump the retry budget. The second time, you bump the timeout. The third time, you put a bigger Daytona pool behind it. The fourth time, you wonder if you have been reaching for the wrong lever. We were on the fourth time, around January, when the manual workaround started winning consistently: before dispatching any real work, ask the candidate agent whether it has touched this kind of task before, then read the reply for confidence. Strong reply, full-claim dispatch. Hedged or vague reply, demote and try another. Five months of that became this post.

The loop has a name. Three names, actually. The arXiv literature from late 2025 through April 2026 formalizes the same intuition under self-routing, confidence-aware routing, and capability-profile routing: three families of papers, one consensus shape. Agents that know their own limits route work better than agents that do not. What we built is the engineering reduction of that consensus. This post covers what shipped: the seven features that changed, the numbers, and the hazards we hit on the way.

What we were doing wrong

The fast move in any agent system that gets slow or expensive is to reach for the configurable parameters. Bigger pool. Longer timeout. More retries. Bigger context window. None of those are bad on their own. All of them are at the bottom of Donella Meadows’ leverage-points hierarchy — the place a system spends the most effort for the least behavior change. L11 (buffers) and L12 (parameters) are where most production engineering attention goes, because they are the safest tweaks. They are also the ones that produce the least durable improvement.

The literature that pulled us out of L11–L12 was not a single paper. It was a five-family pattern.

Family	What the agent / router does	Anchor paper
Self-routing	The agent decides whether it can handle the task	DiSRouter (2025-10)
Confidence-aware routing	A router scores the agent’s reply (or pre-reply signal)	CARGO (2025-09)
Capability-profile routing	Pre-built per-agent skill map; route via lookup	MasRouter (2025-02), Task-Aware Delegation (2026-03)
Iterative specialist routing	Broadcast, score, aggregate across agents	RIRS (2025-01)
State-aware step routing	Pick agent per turn, not per task	STRMAC (2025-11)

What we had been doing manually — “ask Claude Code agents if they have worked on a thing, then judge the reply” — turned out to sit at the intersection of the first two families. The reply is the probe; the engineer reading it is the confidence-aware router. The literature was not telling us we needed a new agent. It was telling us the manual probing loop was the right shape, and that we should build the substrate to automate it.

The substrate is what this post is about. The substrate turned out to be a single materialized SQL table, and the gap between having the data and reading it at dispatch time was the whole problem.

The information layer that everything else builds on

Figure: two columns, one paradigm shift. The left column is where most production engineering attention goes; the right column is where the behavior change actually lives.

The single change that made the rest possible is small enough to feel like a Tuesday afternoon: materialize per-agent performance data from the operations event stream into a queryable capability profile.

-- agent_capability_profile
(
  agent_id          TEXT,
  topic_embedding   VECTOR(768),  -- task-domain centroid
  task_type         TEXT,         -- chapter-stem, chapter-prose, plan-gen, translate, ...
  pass_rate         FLOAT,
  n_samples         INT,
  last_updated      TIMESTAMPTZ
)

We had this data already. It lived in mo_events and the various job-outcome tables. What we did not have was a read path — at dispatch time, nothing queried it. The brain that picks which agent to assign was cross-referencing memories in prose; the chapter worker was rotating through providers in a list. The data was there. The decisions were not consulting it.

The fix was the materialized view. It is the L6 (“information flow”) lever from Meadows’ hierarchy, and the unflattering observation is that almost every routing paper in 2025–2026 frames its contribution as exactly this: surface the right signal at the right decision point. Task-Aware Delegation (Mar 2026) makes the same argument in human-facing form: render Capability Profiles and Coordination-Risk Cues directly in the user’s chooser UI. The model behind the cue is irrelevant; the bottleneck is whether the chooser sees the signal at the moment of choosing.

Once the profile materializes, the rest of the changes are read paths against it.
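As a rough illustration of what "materialize from the event stream" can mean, here is a minimal sketch using Python's built-in sqlite3. The column layout of mo_events is an assumption (the real table is not shown in this post), and the topic_embedding column is left out because SQLite has no vector type.

```python
import sqlite3

# Hypothetical event layout: one row per finished job. The real mo_events
# columns are not shown in the post, so these names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mo_events (
    agent_id  TEXT,
    task_type TEXT,
    passed    INTEGER,  -- 1 if the job outcome passed review, else 0
    ts        TEXT      -- ISO-8601 timestamp
);
CREATE TABLE agent_capability_profile (
    agent_id     TEXT,
    task_type    TEXT,
    pass_rate    REAL,
    n_samples    INTEGER,
    last_updated TEXT,
    PRIMARY KEY (agent_id, task_type)
);
""")

conn.executemany(
    "INSERT INTO mo_events VALUES (?, ?, ?, ?)",
    [
        ("kimi", "chapter-prose", 1, "2026-04-01T10:00:00Z"),
        ("kimi", "chapter-prose", 0, "2026-04-02T10:00:00Z"),
        ("kimi", "chapter-prose", 1, "2026-04-03T10:00:00Z"),
        ("sonnet", "chapter-prose", 1, "2026-04-01T11:00:00Z"),
    ],
)


def refresh_profile(conn):
    """Rebuild the profile from the event stream (full refresh, not incremental)."""
    conn.execute("DELETE FROM agent_capability_profile")
    conn.execute("""
        INSERT INTO agent_capability_profile
        SELECT agent_id, task_type, AVG(passed), COUNT(*), MAX(ts)
        FROM mo_events
        GROUP BY agent_id, task_type
    """)
    conn.commit()


refresh_profile(conn)

# The read path: one query at dispatch time.
rows = conn.execute(
    "SELECT agent_id, pass_rate, n_samples FROM agent_capability_profile "
    "ORDER BY agent_id"
).fetchall()
```

The point of the sketch is the last query: once the aggregate exists, every dispatch decision can consult it with a single lookup.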

Seven features that changed

The seven shipped changes, in roughly the order we landed them:

1. Plan-sketch model picker

The book-plan sketch agent used to pick its model from a static fast-vs-pro tier. We replaced the static pick with a soft ±2 bias toward the model whose profile centroid (the topic_embedding column) was closest to the current book’s topic embedding. No new LLM call: the sketch itself is the 30-token probe, and we already pay for it. Below 30 samples per (model, topic) the bias defaults off and the static tier wins; above 30, the profile starts steering. The lift was modest in absolute terms (a single-digit improvement in plan-acceptance rate) but it was free in cost terms, which is the right shape for L6 changes.
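One way to read the soft bias is as an additive term on a ranking score. The sketch below assumes exactly that: each model has a base score from the static tier, and cosine similarity between the model's centroid and the book's embedding is scaled into the range -2 to +2. The score scale, the field names, and the use of cosine similarity are all assumptions for illustration; only the 30-sample cutoff and the size of the bias come from the text.

```python
import math

MIN_SAMPLES = 30  # from the post: below this, the bias is off
MAX_BIAS = 2.0    # the "soft 2" bias; the score scale it applies to is assumed


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_model(candidates, book_embedding):
    """candidates: dicts with name, base_score, centroid, n_samples (hypothetical)."""
    def score(c):
        if c["n_samples"] < MIN_SAMPLES:
            return c["base_score"]  # too little data: the static tier wins
        return c["base_score"] + MAX_BIAS * cosine(c["centroid"], book_embedding)

    return max(candidates, key=score)["name"]


candidates = [
    {"name": "fast", "base_score": 10.0, "centroid": [1.0, 0.0], "n_samples": 120},
    {"name": "pro", "base_score": 11.0, "centroid": [0.0, 1.0], "n_samples": 120},
]
cold_candidates = [dict(c, n_samples=5) for c in candidates]

warm_pick = pick_model(candidates, [1.0, 0.0])       # bias can overturn the tier
cold_pick = pick_model(cold_candidates, [1.0, 0.0])  # bias is off, tier decides
```

With enough samples the bias is large enough to overturn a one-point tier gap; with too few, the static ordering is untouched.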

2. Chapter generation confidence cascade

This is the largest dollar-cost change. The chapter loop used to be: brain picks Kimi/GLM/MiniMax via rotation, retries the same model on failure, and only escalates to Sonnet via manual override. The new loop:

draft cheap → factual reviewer scores confidence → if < τ → resume sandbox with Sonnet

The cascade is grounded in SATER (2025-10) and CARGO (2025-09); both papers report 50–80% cost or latency cuts on similar setups. Our measurement on the chapter pipeline so far is a 30–60% cost cut on hard chapters, depending on the threshold we set. The threshold (BOOK_CHAPTER_CASCADE_TAU) is the only L12 parameter the cascade introduces, and it is intentionally one knob: we resisted the temptation to add three.

The cascade is not a fallback. The project’s Zero-Fallback Rule holds. The cascade is a contracted escalation with an explicit confidence gate. Fallback hides the failure; cascade names it.
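The control flow of the cascade is small enough to sketch. The version below is a minimal illustration, not the production loop: the three callables are stand-ins for the cheap model, the factual reviewer, and the Sonnet resume step, and their signatures are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    text: str
    model: str               # "cheap" or "strong"
    cheap_confidence: float  # reviewer score on the cheap draft
    escalated: bool


def run_cascade(
    task: str,
    draft_cheap: Callable[[str], str],
    review: Callable[[str, str], float],
    draft_strong: Callable[[str, str], str],
    tau: float,
) -> CascadeResult:
    draft = draft_cheap(task)
    confidence = review(task, draft)
    if confidence >= tau:
        return CascadeResult(draft, "cheap", confidence, False)
    # Contracted escalation: the low score is recorded on the result,
    # so the failure of the cheap draft is named rather than hidden.
    return CascadeResult(draft_strong(task, draft), "strong", confidence, True)


# Stub callables standing in for real model and reviewer calls.
def cheap(task):
    return f"cheap draft for {task}"


def strong(task, prior_draft):
    return f"strong draft for {task}"


def reviewer(task, draft):
    return 0.9 if "easy" in task else 0.4


easy = run_cascade("easy chapter", cheap, reviewer, strong, tau=0.7)
hard = run_cascade("hard chapter", cheap, reviewer, strong, tau=0.7)
```

Keeping the reviewer score on the result is what separates this from a silent fallback: every escalation carries the evidence that triggered it.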

3. Translation per-pair model picker

The translation pipeline used to pick its model by a static per-target-language map. The new pipeline reads from translation_quality_profile(src_lang, tgt_lang, model_id, pass_rate, n_samples) and chooses the highest-pass-rate model that meets the minimum sample threshold. Below threshold, default to the previous static map.

The shape (src, tgt) matters. The intuition that German→Italian and English→Persian are different problems is in the model picker now, not just in the human’s head. The user-edit telemetry is the ground truth — every paragraph the user revises after a translation is a negative signal against that (model, src, tgt) triple.
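The selection rule reads naturally as a filter followed by a max. A minimal sketch, assuming the profile rows arrive as dictionaries with the column names given above; the threshold value of 30 and the model names are placeholders, since the post does not state them for translation.

```python
MIN_SAMPLES = 30  # assumption: the translation threshold is not stated in the post


def pick_translation_model(profile_rows, static_map, src_lang, tgt_lang):
    """Highest pass rate among models with enough samples for this language pair."""
    eligible = [
        row for row in profile_rows
        if row["src_lang"] == src_lang
        and row["tgt_lang"] == tgt_lang
        and row["n_samples"] >= MIN_SAMPLES
    ]
    if not eligible:
        return static_map[tgt_lang]  # below threshold: previous static map
    return max(eligible, key=lambda row: row["pass_rate"])["model_id"]


# Hypothetical model names and numbers, for illustration only.
profile_rows = [
    {"src_lang": "de", "tgt_lang": "it", "model_id": "model_a",
     "pass_rate": 0.82, "n_samples": 140},
    {"src_lang": "de", "tgt_lang": "it", "model_id": "model_b",
     "pass_rate": 0.95, "n_samples": 4},   # high rate, too few samples
    {"src_lang": "en", "tgt_lang": "it", "model_id": "model_b",
     "pass_rate": 0.90, "n_samples": 200},
]
static_map = {"it": "model_c", "fa": "model_c"}

de_it = pick_translation_model(profile_rows, static_map, "de", "it")
en_fa = pick_translation_model(profile_rows, static_map, "en", "fa")
```

Note that the English-to-Italian row never influences the German-to-Italian pick, which is the whole reason for keying on the pair rather than the target language.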

4. Concept-diagram parallel probe

The Mermaid concept-diagram service used to generate diagrams in a single Gemini call. Validator rejection rate hovered around 15% — high-complexity sections were the consistent failure cluster. The new flow probes two models in parallel on sections that exceed a complexity threshold (currently N > 8 concept nodes), and the validator picks whichever passes validateMermaidNodeIds. For sections below the threshold, the single-shot path is unchanged.

The lift is small (validator pass rate moved up a few points) but the architecture is the RIRS iterative-specialist pattern in miniature — broadcast, score, aggregate. The reinforcing loop happens via the capability profile: the model that passes more often gets a higher pass rate, which biases it to be picked first next time.
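A minimal sketch of the probe, using a thread pool for the parallel calls. The generator callables and the validate function are stand-ins (validate plays the role of validateMermaidNodeIds), and returning the first passing result in priority order is an assumption about how the validator "picks".

```python
from concurrent.futures import ThreadPoolExecutor

COMPLEXITY_THRESHOLD = 8  # concept nodes, from the post


def generate_diagram(section, generators, validate):
    """generators: ordered list of (model_name, callable); first entry is the default."""
    if section["n_nodes"] <= COMPLEXITY_THRESHOLD:
        name, generate = generators[0]
        return name, generate(section)  # single-shot path, unchanged

    # Broadcast: call every candidate model concurrently.
    with ThreadPoolExecutor(max_workers=len(generators)) as pool:
        futures = [(name, pool.submit(generate, section))
                   for name, generate in generators]
        results = [(name, future.result()) for name, future in futures]

    # Score and aggregate: first candidate (in priority order) that validates.
    for name, diagram in results:
        if validate(diagram):
            return name, diagram
    raise RuntimeError("no candidate diagram passed validation")


# Stubs: model_a produces a broken diagram on complex sections, model_b does not.
def model_a(section):
    return "graph TD; A-->B" if section["n_nodes"] <= 8 else "broken"


def model_b(section):
    return "graph TD; A-->B; B-->C"


def validate(diagram):
    return diagram.startswith("graph TD")


generators = [("model_a", model_a), ("model_b", model_b)]
simple = generate_diagram({"n_nodes": 5}, generators, validate)
complex_ = generate_diagram({"n_nodes": 12}, generators, validate)
```

Whichever model wins a probe would then be recorded as a pass in the capability profile, which is where the reinforcing loop described above comes from.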

5. AgentFlow brain — self-report requirement + profile read

The mini-orch brain that picks workers for engineering epics changed in two coupled ways. First, every kickoff now must include a <self_report> block in which the candidate worker emits its own competence estimate on the kickoff’s task taxonomy. If the self-report is missing OR the self-reported competence is below 0.5, the claim is rejected. Second, the brain reads agent_capability_profile at pick-time instead of cross-referencing memories.

The DiSRouter argument is that the agent’s self-assessment is a usable signal when it exists. The Do-LLMs-Know argument (below) is that the signal is biased and needs an outside check. We do both: the self-report is the L5 (rule) gate, the capability profile is the L6 (information) check, and disagreement between them surfaces in the brain’s logs as a noteworthy event.
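The two checks compose as a gate plus an audit. The sketch below assumes the self-report has already been parsed into a float (or None when the block is missing); the disagreement gap of 0.3 and the 30-sample minimum for trusting the profile are hypothetical values chosen for illustration.

```python
import logging

log = logging.getLogger("brain")

MIN_SELF_REPORT = 0.5   # from the post: below this, the claim is rejected
DISAGREEMENT_GAP = 0.3  # hypothetical: gap that counts as noteworthy
MIN_SAMPLES = 30        # hypothetical: profile rows below this are ignored


def evaluate_claim(self_report, profile_row):
    """self_report: float in [0, 1] or None; profile_row: dict or None."""
    # L5 rule gate: a missing or low self-report rejects the claim outright.
    if self_report is None or self_report < MIN_SELF_REPORT:
        return {"accepted": False, "disagreement": False}

    # L6 information check: compare the claim against measured history.
    disagreement = False
    if profile_row is not None and profile_row["n_samples"] >= MIN_SAMPLES:
        gap = self_report - profile_row["pass_rate"]
        disagreement = abs(gap) >= DISAGREEMENT_GAP
        if disagreement:
            log.warning(
                "self-report %.2f vs profile %.2f (n=%d)",
                self_report, profile_row["pass_rate"], profile_row["n_samples"],
            )
    return {"accepted": True, "disagreement": disagreement}


missing = evaluate_claim(None, None)
low = evaluate_claim(0.3, {"pass_rate": 0.9, "n_samples": 80})
overconfident = evaluate_claim(0.9, {"pass_rate": 0.4, "n_samples": 50})
consistent = evaluate_claim(0.6, {"pass_rate": 0.55, "n_samples": 50})
```

In this sketch a disagreement is logged but does not reject the claim, matching the description above of it surfacing as a noteworthy event.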

6. Chat model-picker Capability Cue chips

The smallest engineering change in the batch, and arguably the highest-leverage UX surface. The chat model picker now renders a per-model Capability Cue chip: “Sonnet — 91% on code, 67% on math (last 50 chats).” The numbers come from the same profile table; the rendering is one component change.

Task-Aware Delegation calls these Coordination-Risk Cues and argues their visibility to the user is what makes them useful. Users do not need a learned router; they need to see the calibration the system already has. The chip is the smallest possible interface for that.
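The chip text itself is a one-line formatting job over profile rows. A small sketch, assuming the per-task pass rates have already been read from the profile; the function name and the colon separator are illustrative choices.

```python
def capability_chip(model_name, task_rates, window):
    """task_rates: list of (task_label, pass_rate) pairs, already ordered for display."""
    parts = ", ".join(f"{rate:.0%} on {task}" for task, rate in task_rates)
    return f"{model_name}: {parts} (last {window} chats)"


chip = capability_chip("Sonnet", [("code", 0.91), ("math", 0.67)], 50)
```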

7. Cover-image style→model profile

Cover image generation used to map style to model statically: editorial → OpenRouter’s gpt-5.4-image-2, animal-illustration → Gemini Flash Image Preview. The new mapping is keyed by the historical regen-rate-within-5min — a proxy for “the user rejected the first generation and immediately asked for another.” Models that have low regen rate on a given style get picked first.

Tiny change, immediate cost signal. The L6 read path is identical to the other features: one materialized view, one JOIN at dispatch. The consistency of that pattern is what made the L11 alternatives so easy to reject — once you have a read path, the temptation to bump a timeout or pool size loses its urgency.
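The regen-rate proxy can be computed directly from generation timestamps. The sketch below assumes a hypothetical event layout of (cover_id, style, model, timestamp) and counts a generation as rejected when another generation for the same cover follows within five minutes. It omits any minimum-sample guard for brevity.

```python
from collections import defaultdict
from datetime import datetime, timedelta

REGEN_WINDOW = timedelta(minutes=5)


def regen_rates(events):
    """events: iterable of (cover_id, style, model, datetime). Returns rate per (style, model)."""
    by_cover = defaultdict(list)
    for cover_id, style, model, ts in events:
        by_cover[cover_id].append((ts, style, model))

    totals = defaultdict(int)
    regens = defaultdict(int)
    for generations in by_cover.values():
        generations.sort(key=lambda g: g[0])
        for i, (ts, style, model) in enumerate(generations):
            totals[(style, model)] += 1
            has_next = i + 1 < len(generations)
            # A follow-up generation inside the window counts against this one.
            if has_next and generations[i + 1][0] - ts <= REGEN_WINDOW:
                regens[(style, model)] += 1
    return {key: regens[key] / totals[key] for key in totals}


def pick_cover_model(rates, style, static_map):
    options = {model: rate for (s, model), rate in rates.items() if s == style}
    if not options:
        return static_map[style]  # no history for this style: static mapping
    return min(options, key=options.get)


t0 = datetime(2026, 4, 1, 12, 0)
events = [
    ("c1", "editorial", "model_a", t0),
    ("c1", "editorial", "model_a", t0 + timedelta(minutes=2)),  # quick regen
    ("c2", "editorial", "model_b", t0),
]
rates = regen_rates(events)
choice = pick_cover_model(rates, "editorial", {"editorial": "model_a"})
```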

The mistakes we did not make

It is worth naming the L11 and L12 paths we did not take, because the temptation was real on each one.

Anti-pattern	Why we did not	The L6+ alternative we did instead
Bump chapter agent timeout 5 → 10 minutes	Same retries take longer; quality unchanged	Confidence cascade
Increase Daytona warm pool	Less cold-start, no quality change	Capability-profile pre-filter
Tune BOOK_CHAPTER_CASCADE_TAU weekly	Constant fiddling with one parameter	Build the profile first, then set the threshold once
Add more reviewers to vote on every chapter	More cost, marginal accuracy	One reviewer + one cascade escalation when needed
Train a routing classifier from scratch	Months of work, brittle	Materialize the data we already have, read it at dispatch

The unifying principle in the right column: do not optimize a parameter you do not yet have the information layer to evaluate. Every L12 tweak we used to make was symptom-masking against a missing L6 signal. The literature had anticipated not just the fix but two failure modes that the fix itself introduces, and both are harder to see once the profile is running.

The hazards the literature warned us about

Two failure modes worth carrying.

Overconfidence. Do Large Language Models Know What They Are Capable Of? (Dec 2025) is uncomfortable to read because it confirms what we had been pretending was not true: all current LLMs are overconfident on their own competence. The discrimination signal is better than chance, but the absolute calibration is off. Self-report alone is therefore not a usable gate. The paper argues — and we adopted — that the self-report needs an outside check against historical performance data. In our system that is the capability profile. The two together work; either one alone breaks in opposite directions.

Routing collapse. SkillOrchestra (Feb 2026) documents the failure mode where a learned router converges onto picking one strong model for everything. The cheap models stop getting picked, never accumulate samples, the profile never updates, and the system locks into one provider. We mitigate this by keeping a cost axis explicit in every routing decision and by retaining a non-zero exploration budget — currently 5% of dispatches go through the rotation logic that pre-dated the profile, just so the cheap models stay in the sample stream. Without that exploration budget, capability profiles ossify.
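The exploration budget amounts to a fixed-rate coin flip in front of the profile-based pick. A minimal sketch, with the two pickers passed in as callables; the function names are hypothetical and the cost axis is left out.

```python
import random

EXPLORATION_RATE = 0.05  # from the post: 5% of dispatches use pre-profile rotation


def choose_route(rng):
    """Return "rotation" for the exploration share of dispatches, else "profile"."""
    return "rotation" if rng.random() < EXPLORATION_RATE else "profile"


def dispatch(task, profile_pick, rotation_pick, rng):
    route = choose_route(rng)
    agent = rotation_pick(task) if route == "rotation" else profile_pick(task)
    return agent, route


# Seeded generator so the example is reproducible.
rng = random.Random(0)
routes = [
    dispatch("task", lambda t: "sonnet", lambda t: "kimi", rng)[1]
    for _ in range(10_000)
]
explore_share = routes.count("rotation") / len(routes)
```

Tagging each dispatch with its route also makes it possible to tell later which profile samples came from exploration and which from profile-driven picks.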

A third hazard (MESA-S, Apr 2026) is the theoretical one: cleanly separating self-confidence (what the agent thinks it knows) from source-confidence (what its retrieval brought back) is harder than it looks. We have not solved this. The current self-report block conflates both. The fix is a v2 problem.

Those hazards are implementation-level constraints; the deeper question is what had to change in the system’s basic model of itself for these mitigations to even be articulable.

What the paradigm change actually was

The smallest possible framing of the shift:

Before: “We have nine LLM providers. Pick one per task.”

After: “Each agent has a competence profile. The system asks the right agent and the agent knows whether to take the job.”

That is an L2 paradigm shift in Meadows’ hierarchy — the level at which the system’s basic mental model changes. It is also the kind of shift you cannot make top-down. It happens as a consequence of P0 (the profile table), P1 (the self-report rule), and P2 (the cascade) all landing in production. None of those is itself the paradigm. The paradigm is what falls out when they are stacked.

The unflattering observation that took the longest to internalize: we had been operating at L11 and L12 for over a year. Larger pools. Longer timeouts. More retries. Each was a defensible local optimization. None of them shifted the system’s behavior. The paradigm shift came from a Tuesday afternoon’s worth of materialization SQL and a discipline change about which signals we read at decision time. There are still three things that signal does not cover, which the current implementation handles by falling back to earlier heuristics rather than measuring.

What is still open

Three honest items.

Sample size cold-start. The capability profile is noise below 30 samples per (model, task_type) cell. For new providers or new task types, the system falls back to the previous rotation logic, which is fine. But the threshold is a parameter we set by intuition; we have not measured the true elbow on the noise-vs-signal curve. That measurement is on the backlog.
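To put a rough number on "noise", the binomial standard error of a pass rate shrinks with the square root of the sample count. The sketch below uses the normal approximation, which is itself crude at small n, and an illustrative pass rate of 0.8; it is a back-of-envelope aid, not the measurement described above.

```python
import math


def pass_rate_std_error(pass_rate, n_samples):
    """Binomial standard error of an observed pass rate."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


def approx_95_interval(pass_rate, n_samples):
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    half_width = 1.96 * pass_rate_std_error(pass_rate, n_samples)
    return max(0.0, pass_rate - half_width), min(1.0, pass_rate + half_width)


se_30 = pass_rate_std_error(0.8, 30)    # about 0.073
se_300 = pass_rate_std_error(0.8, 300)  # about 0.023
low_30, high_30 = approx_95_interval(0.8, 30)
```

At 30 samples an observed 80% pass rate is compatible with a true rate anywhere from roughly 66% to 94%, so two models a few points apart cannot be distinguished in that regime.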

Per-section routing inside a chapter. The current chapter pipeline picks one model per chapter. STRMAC (Nov 2025) argues for per-step state-aware routing, and our chapter has obvious step boundaries (planning, drafting, citation insertion, code-fence cleanup). The lift would be real but the contract redesign is invasive. P3 on the backlog.

The brain’s exploration budget. 5% of dispatches go through pre-profile rotation logic specifically to keep cheap models in the sample stream. The number is intuitive, not principled. The right framing is probably some bandit-style decaying exploration rate. We have not built it. These gaps share a shape: they are measurement problems, not missing components — which is itself a sign of how far the system’s readability has improved.

The closing observation

The user intuition that started this work — “when I ask agents if they have worked on a thing, the closeness of their answer to my expectations tells me whom to dispatch” — is, in five-paper distillation, exactly what the 2025–2026 routing literature was converging on. The engineering trick was building the substrate that lets the same intuition operate at dispatch speed instead of human speed. The literature is now ahead of most production deployments. The substrate it requires — a materialized capability profile, a self-report block, an explicit cascade gate — is small enough to ship in a quarter and large enough that the paradigm shift it enables is hard to roll back.

If you are running an LLM application where the dispatch logic is still rotation, retry-counts, or a static per-task model map, the answer is not in another paper. The papers all say the same thing now. The answer is in the dispatch path you are reading from at the moment you choose an agent. If that path does not consult a per-agent profile, the rest of the system cannot route on capability. Once it does, the routing follows from the signal it was always missing.