llm-ops · bandits · thompson-sampling · prompt-routing · arxiv-research

Our prompt canary was lying to us

Apr 24, 2026 · 5 min read

For most of last year, our prompt canary was a single line of code. Roll a uniform random number; if it came up under 0.05, send the request to the new prompt variant; otherwise send it to production. Log the outcome. Eyeball the dashboard next week. It was the simplest thing that could possibly work, and for a while it actually did.
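A minimal sketch of that one-line canary, assuming a two-way split between a candidate and production. The names `CANARY_FRACTION` and `route_static_canary` are illustrative, not taken from the original system.

```python
import random

CANARY_FRACTION = 0.05  # share of traffic sent to the new prompt variant


def route_static_canary(rng: random.Random) -> str:
    """Roll a uniform number in [0, 1) and route on a fixed threshold."""
    return "candidate" if rng.random() < CANARY_FRACTION else "production"
```

Nothing in this function looks at outcomes: the split stays at 5% no matter what the logs say, which is the limitation the rest of the post is about.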

The day it stopped working, the symptom didn’t look like a canary problem. It looked like a billing problem. A new variant we had been quietly rolling at 5% had landed on the team scoreboard with the highest rubric scores in months. Two weeks later the invoice for that surface was up 38%. The quality win was real. The variant was also 1.8× more expensive per call, and our 5% slice had never produced enough data for that cost to surface anywhere a human was looking.

A canary that only ever asks one question (is the new prompt better?) has no way to answer the actually interesting ones. Better at what, and at what cost? Better than what alternative, and on what budget? Once we wrote those questions down, the 5%-random shape stopped feeling like a canary and started feeling like wishful thinking.

What we replaced it with

The fix is small enough to describe in a paragraph. Every prompt variant now keeps a separate score for each objective we care about: rubric quality, latency, cost, user satisfaction. Each score is a probability distribution, not a single number, so it carries its own confidence with it. When a request comes in, the allocator draws one sample from each score for each variant, then asks a different question than before: which variants are on the Pareto front right now? That is, which variants are not beaten by some other variant on every axis at once? It picks uniformly from that set.
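The paragraph above can be sketched in a few dozen lines. This is a sketch under stated assumptions, not the production allocator: it assumes every objective has been reduced to a success/failure outcome (for example "latency within budget") so that each score can be a Beta posterior, and all names here (`BetaScore`, `Variant`, `pareto_front`, `choose`) are hypothetical.

```python
import random
from dataclasses import dataclass, field

OBJECTIVES = ("rubric", "latency", "cost", "satisfaction")


@dataclass
class BetaScore:
    """Beta posterior over a success rate, starting from the uniform prior Beta(1, 1)."""
    successes: float = 1.0
    failures: float = 1.0

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.successes, self.failures)

    def update(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1


@dataclass
class Variant:
    name: str
    scores: dict = field(default_factory=lambda: {o: BetaScore() for o in OBJECTIVES})


def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(a[o] >= b[o] for o in OBJECTIVES) and any(a[o] > b[o] for o in OBJECTIVES)


def pareto_front(draws: dict) -> list:
    """Names of variants whose sampled scores are not dominated by any other variant."""
    return [
        name
        for name, draw in draws.items()
        if not any(dominates(other, draw) for other_name, other in draws.items() if other_name != name)
    ]


def choose(variants: list, rng: random.Random) -> str:
    """One Thompson-style round: sample every score, then pick uniformly from the front."""
    draws = {v.name: {o: v.scores[o].sample(rng) for o in OBJECTIVES} for v in variants}
    return rng.choice(pareto_front(draws))
```

Because the front is recomputed from fresh samples on every request, a variant with wide posteriors lands on the front more often than its mean alone would justify, which is where the exploration comes from.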

The trick that makes this practical instead of theoretical comes from two recent papers we found while looking for prior art. Bouchard’s work on calibration-gated pseudo-observations (2604.14961) names the failure mode where a noisy score looks confident enough to ship, and prescribes a width-of-confidence gate before promoting any candidate. The multi-objective Thompson sampling paper (2512.00930) describes the comparison shape we use to weigh variants across objectives without collapsing them into a single weighted score. Neither paper is long. Both are pragmatic. Together they answered the “but how?” question that had been blocking us from doing this for months. The papers gave us the math; tracing a single request through the allocator shows how that math runs in practice, gate by gate.
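To make the width-of-confidence idea concrete, here is a generic interval-width check. It is not the gate from the cited paper: it assumes Beta posteriors, uses a normal approximation for the interval, and the 0.20 threshold is a made-up placeholder.

```python
import math

MAX_CI_WIDTH = 0.20  # hypothetical threshold, not a value from the post or the papers


def beta_ci_width(successes: float, failures: float, z: float = 1.96) -> float:
    """Approximate 95% interval width for a Beta(successes, failures) posterior.

    Uses a normal approximation to the Beta, which is rough at small counts.
    """
    n = successes + failures
    variance = (successes * failures) / (n * n * (n + 1))
    return 2 * z * math.sqrt(variance)


def confident_enough(posteriors: list) -> bool:
    """True only if every per-objective posterior is narrower than the gate."""
    return all(beta_ci_width(a, b) <= MAX_CI_WIDTH for a, b in posteriors)
```

With these numbers, a posterior built from about 10 observations fails the gate and one built from about 100 passes, so a score has to earn its narrowness before it can influence routing.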

How a request actually flows through it

flowchart TD
  REQ([Prompt request]) --> GATE1{breaker tripped?}
  GATE1 -- yes --> PROD1[serve production]
  GATE1 -- no --> GATE2{cold start<br/>≤100 samples?}
  GATE2 -- yes --> ROLL[uniform 5% canary]
  GATE2 -- no --> GATE3{canary already<br/>over 80% traffic?}
  GATE3 -- yes --> ROLL
  GATE3 -- no --> SAMP[draw one sample per<br/>variant per objective]
  SAMP --> PAR[compute Pareto front<br/>across variants]
  PAR --> GATE4{any CI<br/>too wide?}
  GATE4 -- yes --> PROD2[serve production]
  GATE4 -- no --> PICK[pick uniformly<br/>from front]
  PICK --> SERVE([serve chosen variant])
  ROLL --> SERVE
  PROD1 --> FB([user / eval feedback])
  PROD2 --> FB
  SERVE --> FB
  FB --> UPD[update that variant's<br/>per-objective score]
  UPD --> BRK[update breaker<br/>streak counter]
  UPD -.-> SAMP
  BRK -.-> GATE1

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef fb fill:#4a1f7a,stroke:#fff,color:#fff
  class GATE1,GATE2,GATE3,GATE4 gate
  class SAMP,PAR,PICK,ROLL alloc
  class PROD1,PROD2,SERVE serve
  class FB,UPD,BRK fb

Reading the diagram top-to-bottom is the same as reading the allocator’s day-to-day life. Most requests breeze through the first three orange gates, land in the blue allocator core, where the math happens, and clear the fourth gate on the way out. The dotted feedback arrows are the part that matters most for the long run. Every request eventually produces an outcome, the outcome updates that variant’s score, and the updated score narrows the next round’s sampling. The Pareto front is not a fixed thing. It moves as evidence accumulates.
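The gate order in the diagram translates into a short routing function. The sketch below assumes the per-request facts have already been gathered into a hypothetical `AllocatorState`, and that `sample_front` is any callable returning the current Pareto front; both names are illustrative.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class AllocatorState:
    breaker_tripped: bool
    candidate_samples: int
    candidate_traffic_share: float  # fraction of recent traffic served by the candidate
    any_ci_too_wide: bool


def route(state: AllocatorState, sample_front: Callable[[], list], rng: random.Random) -> str:
    # Gate 1: a tripped breaker sends everything to production.
    if state.breaker_tripped:
        return "production"
    # Gates 2 and 3: cold start (<=100 samples, as drawn in the diagram) and the
    # 80% traffic cap both fall back to the plain 5% canary roll.
    if state.candidate_samples <= 100 or state.candidate_traffic_share > 0.80:
        return "candidate" if rng.random() < 0.05 else "production"
    # Allocator core: sample every score and compute the front.
    front = sample_front()
    # Gate 4: if any interval is still too wide, do not act on the front.
    if state.any_ci_too_wide:
        return "production"
    return rng.choice(front)
```

Keeping the gates as early returns means each guardrail can be tested on its own, without constructing posteriors at all.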

Three guardrails that aren’t in the papers

Three boring guardrails are what let the allocator run unattended — the paper math alone isn’t sufficient.

Cold start. New variants get a fixed 5% random allocation until they reach 100 samples. Below that floor, no score is doing useful work yet. It is the same line of code we started with, just scoped to a much smaller window of the variant’s life.

Calibration cap. If a variant ends up serving over 80% of traffic, we cap it back to 5% for 24 hours. Usually the math is technically right when this happens, but the operator wants the option to override. The cap is a respect-the-human knob, not a math knob.

Circuit breaker. Five consecutive negative outcomes on a candidate variant trip a breaker, and the allocator forces that prompt back to production for a two-minute cooldown. This catches the failure mode where a variant burst-regresses (a prompt parse error, an upstream model outage, a system message that broke yesterday’s contract) before the score has time to update. Without this guard, a five-minute outage on the new variant gets averaged into a score that should have screamed.
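A streak-based breaker like the one described fits in a small class. This sketch uses the five-failure threshold and two-minute cooldown from the text; the class name, method names, and injectable clock are assumptions made for illustration and testing.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Trips after `threshold` consecutive negative outcomes, then cools down."""

    def __init__(
        self,
        threshold: int = 5,
        cooldown_s: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.streak = 0
        self.tripped_at: Optional[float] = None

    def record(self, success: bool) -> None:
        """Feed one candidate outcome; any success resets the streak."""
        if success:
            self.streak = 0
            return
        self.streak += 1
        if self.streak >= self.threshold:
            self.tripped_at = self.clock()
            self.streak = 0

    def is_tripped(self) -> bool:
        """True while the cooldown is running; clears itself once it has elapsed."""
        if self.tripped_at is None:
            return False
        if self.clock() - self.tripped_at >= self.cooldown_s:
            self.tripped_at = None
            return False
        return True
```

The breaker reacts to the raw outcome stream rather than to the posterior, so it fires on a burst of failures long before a score with hundreds of prior observations would move.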

None of these are in the cited papers. They’re the operational floor underneath the math. The allocator without them is dangerous the way a control loop without saturation limits is dangerous: technically correct, occasionally catastrophic.

What this didn’t solve

Two things we expected to fall out of this design didn’t.

The first is choosing among many variants at once: A versus B versus C versus D in parallel. Our implementation still moves variants through promotion stages, one or two at a time. The pairwise comparison between variants scales quadratically with their count, and we made a deliberate call to keep the surface small. Add variants by promotion, not by parallelism.

The second is operator-overridable preferences. The Pareto-front approach is preference-free by design. Every front member is a legitimate trade-off, and the allocator refuses to assume which one matters more. But sometimes you genuinely have a preference: latency under 200ms is non-negotiable, then maximize rubric. The current allocator can’t express that. The right answer is a version of the math that respects hard preferences (a constrained Thompson sampler, in the literature). The constraint-handling work was messy enough that we left it for later and let the circuit breaker catch the worst latency regressions in the meantime. Knowing the design’s limits matters less if you haven’t yet confirmed you have the underlying problem — and the symptom pattern that should send you here is distinctive enough to name.

The detection fingerprint

If your prompt canary is a single static traffic ratio, your dashboard is a single quality score, and the only knob anyone turns is “raise the percentage when the new prompt looks good”, you have the shape of problem we had. The specific symptom that should make you reach for this design is when a variant wins on the metric you watch and quietly loses on a different one, with the loss only surfacing after a billing period or an SLO breach.

The work was naming the gap clearly enough to ship the fix in an afternoon — both papers had already surfaced this exact failure pattern as their motivation. The literature was ahead of us.