llm-ops · prompt-engineering · harness · prompt-versioning · arxiv-research

Our prompts stopped being code

Jan 5, 2026 · 14 min read

Every LLM-powered codebase past month six develops the same accountability gap: six copies of “You are an expert”, each slightly different, none indexed. The team that wrote them does not know which one shipped to which user yesterday. The dashboard that scores them grades one prompt against the answers a different prompt generated. The eval suite that gates promotions runs against a string that was deleted in last Tuesday’s refactor. Nobody is lying. The codebase just does not have a way to ask “what was the prompt at 14:32 UTC on the 16th?” and get an honest answer.

This is the moment most LLM-powered products quietly stop treating prompts as code and start treating them as config that happens to be embedded in code. The string literal is the wrong primitive. What you need is a registry, a version table, a resolver that can be overridden per user or per document or per environment, a ledger that records what actually fired with what hash and what cost, and a feedback loop that ties the answer back to the prompt that produced it. Most teams build the first three on top of each other (registry, then version table, then resolver), realize the ledger is what makes the first three useful, and then spend a quarter wiring observability that should have been there from the start.

This post covers three things: the prompt harness we landed on at SourceShift (four-tier resolver, write-through execution ledger), the 2026 literature that converged on the same shape from different directions, and the one engineering gap still open.

What hurts when you do not have one

The pain comes in five varieties. Most teams hit them in this order:

| # | Pain | When it fires |
|---|------|---------------|
| 1 | Where is this prompt? | Six string literals across `chapterGen`, `chatService`, `briefBuilder`, three with the same first paragraph and different endings. Nobody knows which is canonical. |
| 2 | Who changed it last? | The prompt that worked on staging does not work in prod. `git blame` shows a typo fix from three weeks ago that also dropped a hyphen, and downstream behavior shifted. |
| 3 | Can I A/B test it? | A new variant looks promising in the eval suite. Routing 5% of traffic to it requires a deploy, a feature flag, and three callsite edits, because the prompt is a string literal in three different functions. |
| 4 | Did the user override it? | The product manager wants to test a tone for a specific document type. There is no path between “PM edits a row” and “prompt resolves with that override at call time”. |
| 5 | Why did the cost just jump? | A new feature shipped, the dashboard knows total spend went up, but no per-prompt attribution exists because the cost row records the model and the trace ID but not the prompt’s canonical key. |

If you have shipped any LLM-backed product more complex than one prompt, you have hit at least two of these. The recent literature has converged on calling this whole space prompt engineering as a production discipline. Guinard’s Prompt Readiness Levels paper (arXiv:2603.15044) frames it as a nine-level maturity scale (TRL-inspired) and a scoring framework that lets you ask “where on the scale is this prompt asset?” with a number, not a feeling. The scale is uncomfortable to grade your own production prompts against. Most teams land at PRL 2 or 3, which is roughly “the prompt exists and someone has tested it once.” Moving past PRL 3 requires more than discipline — it requires an architecture, and the one we converged on replaced the string literal with five primitives that answer “which prompt fired for this call?” at any point in time.

The shape we landed on

flowchart TD
  subgraph AUTHORING[Authoring & promotion]
    ADMIN([admin / PM edits<br/>collaborative surface]) --> VER
    CAND[(candidate-variant store)] --> GOV{governance gate}
    GOV -- approve --> VER[(append-only version log<br/>body + model-id + fingerprint)]
    GOV -- reject --> CAND
  end

  subgraph RUNTIME[Runtime resolution]
    REG[(registry: one entry<br/>per prompt key)] --> CANON[(canonical mirror)]
    CALL([feature call site]) --> RESOLVE{4-tier resolver}
    CANON --> RESOLVE
    VER --> RESOLVE
    RESOLVE -- doc override --> CHOSEN[resolved body + key + version]
    RESOLVE -- user override --> CHOSEN
    RESOLVE -- env lane: MOL-TS<br/>multi-objective bandit --> CHOSEN
    RESOLVE -- registered default --> CHOSEN
    CHOSEN --> WIRE[gateway: LLM call<br/>+ OTel span + cost row]
    WIRE --> RESP([response])
  end

  subgraph LEARNING[Evolution loop]
    DRIFT[drift detector:<br/>rolling embedding centroids]
    GEPA[GEPA reflective<br/>promotion engine]
    JOINT[joint-mutation runner:<br/>cross-template coupling]
    LEDGER[(execution ledger:<br/>hash · cost · outcomes)] --> DRIFT
    LEDGER --> GEPA
    LEDGER --> JOINT
    DRIFT -- model-changed event --> CAND
    GEPA -- reflective mutation --> CAND
    JOINT -- coupled-prompt mutation --> CAND
  end

  WIRE -.cost + outcome row.-> LEDGER
  RESP --> FB([user feedback<br/>thumbs · edit-distance · eval])
  FB -.substrate signal.-> LEDGER

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef store fill:#4a1f7a,stroke:#fff,color:#fff
  class RESOLVE,GOV gate
  class CANON,CHOSEN,WIRE,DRIFT,GEPA,JOINT alloc
  class CALL,ADMIN,RESP,FB serve
  class REG,VER,LEDGER,CAND store

Three lanes, one circulation pattern. The authoring lane (top) carries human edits and governed promotions into the version log. The runtime lane (middle) is the path every LLM call follows: registry, resolver, gateway, response, with the version log on the resolver’s left as the source of truth. The learning lane (bottom) is where the system gets smarter: every served call lands an outcome row in the ledger; three readers consume that ledger (drift detector, reflective promotion engine, joint-mutation runner) and propose new candidate variants; candidates wait at the governance gate for promotion. The dotted arrows are the feedback edges that close the loop.

Each labeled component above has an academic anchor, detailed in the sections that follow: GEPA (arXiv:2507.19457) for the promotion engine, multi-objective bandits (arXiv:2605.14553) for the MOL-TS environment lane, ADOPT (arXiv:2512.24933) for the joint-mutation runner, Behavioral Fingerprints (arXiv:2603.19022) for the drift detector, Prompt Migration (arXiv:2507.05573) for the (body + model-id + fingerprint) version-log triple, and LLARS (arXiv:2605.10593) for the collaborative authoring surface. The whole composed system tracks the LLM Readiness Harness shape (arXiv:2603.27355) and meets Prompt Readiness Level 5 on Guinard’s nine-level scale (arXiv:2603.15044).

The architecture rests on five primitives and one strict separation:

A registry is the single source of truth. Every prompt is declared once, colocated with the service that uses it, with a typed schema of placeholders and a default body. The hardcoded string literal disappears as a category from the codebase.
The version log is append-only. Edits land as new entries, never in-place updates; rolling back is selecting an earlier entry. Provenance is implicit in the data model, not in commit messages.
The resolver answers a small architectural question: which version fires for this call? Four tiers, in strict priority order: document-level override, user-level override, environment-pinned version (production / canary / shadow), registered default. This is where 90% of operational control lives, and it is the surface every other system on this blog plugs into.
The execution ledger records what actually fired. The resolved version’s stable hash, the model, the cost, the latency, the trace ID. The row is the unit of accountability and the substrate every downstream signal (eval, canary, evolution) reads.
The feedback loop closes back to candidates, never directly to defaults. Feedback flows into a candidate-variant store; a reflective promotion engine (described later in this post) proposes mutations from that store, and a governance step moves a promoted candidate into the version log. The system never edits the canonical prompt without an explicit promotion decision.
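As a rough illustration of how the first three primitives fit together, here is a minimal in-memory sketch. It is not the production implementation: the names `Harness`, `PromptVersion`, `doc_overrides`, `user_overrides`, and `env_pins` are hypothetical, and a real system would back these with database tables rather than dictionaries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    prompt_key: str
    version_id: int
    body: str


@dataclass
class Harness:
    # Registry: exactly one default body per prompt key.
    defaults: dict = field(default_factory=dict)
    # Append-only log: entries are added, never updated in place.
    versions: list = field(default_factory=list)
    # Override tables map (prompt_key, scope) to a pinned version_id.
    doc_overrides: dict = field(default_factory=dict)
    user_overrides: dict = field(default_factory=dict)
    env_pins: dict = field(default_factory=dict)

    def register(self, prompt_key: str, default_body: str) -> None:
        if prompt_key in self.defaults:
            raise ValueError(f"{prompt_key} is already registered")
        self.defaults[prompt_key] = default_body

    def append_version(self, prompt_key: str, body: str) -> PromptVersion:
        version = PromptVersion(prompt_key, len(self.versions) + 1, body)
        self.versions.append(version)
        return version

    def _lookup(self, prompt_key: str, version_id: int) -> PromptVersion:
        for v in self.versions:
            if v.prompt_key == prompt_key and v.version_id == version_id:
                return v
        raise KeyError((prompt_key, version_id))

    def resolve(self, prompt_key, *, doc_id=None, user_id=None, env="production"):
        """Return (tier, body, version_id) for the first tier that matches."""
        tiers = [
            ("document", self.doc_overrides.get((prompt_key, doc_id))),
            ("user", self.user_overrides.get((prompt_key, user_id))),
            ("environment", self.env_pins.get((prompt_key, env))),
        ]
        for tier, version_id in tiers:
            if version_id is not None:
                v = self._lookup(prompt_key, version_id)
                return tier, v.body, v.version_id
        # Nothing pinned at any tier: fall back to the registered default.
        return "default", self.defaults[prompt_key], None
```

Rolling back in this sketch means pointing an override or environment pin at an earlier `version_id`; nothing in the log is rewritten.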

The strict separation that matters: resolution, rendering, and the wire call own different layers. The harness resolves which prompt body should fire and records that it fired. Whatever you use to render the prompt (BAML, an in-house templater, plain string interpolation) lives above the resolver; whatever you use to actually call the model lives below the ledger. Conflating any two of these three responsibilities is the architectural mistake most prompt-management libraries make, and the reason we have a separate post about adopting BAML in modular mode. The three-layer separation is not an original invention — four independent research threads arrived at the same boundaries from different failure modes, and naming them sharpened every design decision we made.
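The separation can be pictured as three injected callables with the ledger write sitting between them. This is a sketch under assumptions: `run_call`, `LedgerRow`, and `stable_hash` are hypothetical names, SHA-256 is one possible choice of stable hash, and the `call_model` callable is assumed to return the response text together with its cost.

```python
import hashlib
import time
from dataclasses import dataclass


def stable_hash(body: str) -> str:
    """Content hash of the resolved body; identical bodies hash identically."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class LedgerRow:
    prompt_key: str
    body_hash: str
    model: str
    cost_usd: float
    latency_ms: float
    trace_id: str


def run_call(prompt_key, resolve, render, call_model, variables, model, trace_id, ledger):
    """Resolve, render, and call as three separate steps, then record one row.

    resolve(prompt_key) -> body             (the harness)
    render(body, variables) -> prompt text  (the templater)
    call_model(prompt_text) -> (text, cost) (the gateway)
    """
    body = resolve(prompt_key)
    rendered = render(body, variables)
    started = time.perf_counter()
    text, cost_usd = call_model(rendered)
    latency_ms = (time.perf_counter() - started) * 1000.0
    # Hash the resolved template, not the rendered text, so calls with
    # different variables still attribute to the same prompt version.
    ledger.append(
        LedgerRow(prompt_key, stable_hash(body), model, cost_usd, latency_ms, trace_id)
    )
    return text
```

Because each step is passed in, swapping the templater or the gateway does not touch the resolver or the ledger schema.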

Where the literature is converging

Four papers from the last year landed on closely related shapes from very different angles:

Prompt Readiness Levels (arXiv:2603.15044) gives the maturity scale we measure ourselves against. The scoring framework is multidimensional (operational, safety, compliance), and the act of computing it forces the team to write down what “production grade” actually means for a given prompt.
When “Better” Prompts Hurt (arXiv:2601.22025) names the failure mode that makes the ledger non-optional: an apparently better prompt scores higher on the eval suite, ships, and then degrades on a metric nobody was watching because the rollout had no per-prompt cost or latency attribution. Their Define, Test, Diagnose, Fix loop is the operating procedure we adopted for prompt promotions.
LLM Readiness Harness (arXiv:2603.27355) is the closest published cousin to what we ship: registry + OTel observability + CI gates combined into a single readiness score with Pareto-aware promotion. The overall shape matches ours; our addition is the four-tier resolver, which the paper does not address.
LLARS (arXiv:2605.10593) covers the collaboration angle: domain expert and developer co-authoring prompts with version control and instant testing. Our admin UI is a thinner version of the same idea; the LLARS architecture is what we will likely borrow from when we add real co-editing.

A registry-plus-resolver harness is exactly the seam prompt-optimization libraries need — the interface, not the optimizer — which the promptolution paper (arXiv:2512.02840) frames as the architectural primitive most research codebases are still missing.

What we got out of it

Three architectural shifts, all measurable within weeks of the harness landing:

Prompts became reviewable as first-class artifacts. Every prompt edit produces a diff against the prior body in the version log, reviewable by the same workflow as a code change. Prompt changes stop being undocumented config drift.
Cost attribution became per-prompt at the architecture level. Spend jumps decompose along the prompt-key axis automatically, because the ledger row carries the canonical key. Identifying which prompt drove a regression stops being a forensic exercise.
A/B testing became a routing decision, not a deployment. Promoting a candidate to canary is a write to the version log with a label; the resolver’s environment tier handles the rest. The application code does not change between experiments.
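The cost-attribution shift in the list above reduces to a group-by on the ledger. A minimal sketch, assuming ledger rows are available as dictionaries with `prompt_key` and `cost_usd` fields (the field names and both helper functions are hypothetical):

```python
from collections import defaultdict


def cost_by_prompt_key(rows):
    """Sum cost per prompt_key across an iterable of ledger rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["prompt_key"]] += row["cost_usd"]
    return dict(totals)


def biggest_increase(before, after):
    """Return the prompt_key whose spend grew most between two windows."""
    keys = sorted(set(before) | set(after))
    return max(keys, key=lambda k: after.get(k, 0.0) - before.get(k, 0.0))
```

Comparing the totals for the window before a spike with the window after it names the prompt key responsible, which is the question a model-and-trace-ID cost row cannot answer.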

Those three shifts held cleanly because the harness was already the substrate every adjacent system on this blog depended on — and the dependency graph is worth naming explicitly.

How it ties the rest of the blog together

This harness is the layer most other systems on this blog quietly depend on. Each of the following posts assumes a registry-plus-resolver below it, and most of them only work because that layer is in place:

The MOL-TS canary allocator routes traffic through the resolver’s environment tier. The canary lane and the multi-objective Beta posteriors are above the harness; the harness decides which prompt body actually fires for a given call.
The BAML modular-mode integration sits beside the harness, not on top of it. BAML renders and parses typed prompts; the harness still owns the resolver and the ledger. The hybrid call site has both paths and a feature flag that picks one per prompt.
The eight-tier matroid style resolver is a domain-specific resolver that the harness invokes when the prompt template asks for {{style_profile}}. The matroid greedy decides which style profile wins; the harness decides which prompt template wins; both signals compose at render time.
The rolling-summary chat compression ships two of the prompts that live in this registry (prompt_chat_context_compression and prompt_chat_rolling_summary). Splitting the work into two registered prompts only made sense because the registry could enforce typed placeholders and the ledger could attribute cost per prompt key.
The mini-ork orchestrator’s year in review describes a multi-stage pipeline whose per-stage prompts (spec-author, spec-reviewer, mutation-adversary, rubric-prescreen, BDD runner) are all registered through this harness. The stage-cache-reuse win in that post is a cache keyed on the prompt template content, which the registry exposes as a stable hash.

The anthropic-baseURL env-leak incident is the odd one out: the surfaces it covers predate the harness, and the cost attribution that surfaced the leak would have shown up in the execution ledger a day earlier if the harness had been in place. We mention it here because that incident is partly why the ledger exists at all. The ledger that incident motivated is also the substrate that makes five additional layers possible without restructuring any of the code the posts above depend on.

What composes on top of the architecture today

The harness is the foundation; five other layers compose on top of it. Each is shipped, each has an academic anchor that validates the shape we chose, and each plugs in through one of the architectural primitives above without restructuring the others.

GEPA as the promotion engine. Candidate variants do not graduate to canonical versions by hand. A reflective-evolution loop reads execution-ledger outcomes, generates a textual self-critique on a small batch of rollouts, proposes a mutated prompt body, and writes it as a new candidate that the governance step can promote. The shape comes from GEPA (arXiv:2507.19457, 141 citations), which showed reflective natural-language self-critique outperforming RL fine-tuning for prompt adaptation. Our implementation runs as a scheduled cron job; the candidate store is a typed table next to the version log; promotion fires through a separate governance gate so no mutated prompt reaches production without an explicit decision.
Multi-objective Thompson sampling at the resolver’s environment tier. When a candidate is in canary, the resolver does not pick uniformly. A multi-objective bandit samples per-objective Beta posteriors (rubric, latency, cost, user satisfaction) and routes traffic to whatever variant is on the Pareto front for the current call. The shape matches the multi-objective bandit paper (arXiv:2605.14553). The system is described end-to-end in our earlier post on the canary allocator; from the harness’s perspective, it is one of the four resolver tiers that already exist.
Joint mutation across cross-template dependencies. A separate mutation runner sits above the per-prompt loop and proposes coordinated edits when two prompts are upstream-downstream coupled (a brief prompt feeding a body prompt, a plan prompt feeding a chapter prompt). The architectural shape is the one ADOPT (arXiv:2512.24933) proposes for multi-step pipelines: dependency-aware joint optimization with a global textual gradient instead of independent per-step updates. Our joint mutation runner shares the candidate store with GEPA, so per-prompt evolution and joint mutation feed the same governance gate.
Drift detection at the endpoint. Beyond version pinning, an endpoint can drift silently as caches change, routers shift, kernels update, or weights refresh. A drift-detection layer maintains rolling embedding centroids per registered prompt and fires when cosine distance crosses a calibrated threshold. The shape rhymes with Behavioral Fingerprints (arXiv:2603.19022) and the broader endpoint-stability literature: a small set of inputs whose response distribution is the stable signature of effective model identity. The ledger carries the drift signal so downstream evals can join against it.
Cross-model-version stability via the version log. Each row in the version log carries the prompt body, the model identity it was promoted against, and a fingerprint of the embedding-space response distribution at promotion time. When a model update lands (vendor weight refresh, router shift), the drift detector raises an event and the promotion engine reruns the affected candidate against the new model before re-promoting. The named pattern is from Prompt Migration (arXiv:2507.05573), which formalizes the stability problem of prompts surviving LLM model updates. Our version log treats (prompt-body, model-id, fingerprint) as the unit, not a body alone.
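The drift check that the last two layers rely on can be sketched in a few lines. This is an illustration under assumptions, not the shipped detector: the `DriftDetector` class, the window size, and the threshold value are hypothetical, and the baseline is assumed to be the centroid fingerprint stored in the version log at promotion time.

```python
import math
from collections import deque


def cosine_distance(a, b):
    """1 - cosine similarity; raises on zero vectors instead of dividing by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


class DriftDetector:
    def __init__(self, baseline, window=50, threshold=0.15):
        self.baseline = baseline            # centroid recorded at promotion time
        self.recent = deque(maxlen=window)  # rolling window of response embeddings
        self.threshold = threshold          # assumed to be calibrated offline

    def centroid(self):
        """Component-wise mean of the embeddings currently in the window."""
        n = len(self.recent)
        return [sum(col) / n for col in zip(*self.recent)]

    def observe(self, embedding):
        """Add one response embedding; return True when drift is detected."""
        self.recent.append(embedding)
        return cosine_distance(self.baseline, self.centroid()) > self.threshold
```

In this sketch a `True` result corresponds to the model-changed event in the diagram: the affected prompt is sent back to the candidate store instead of being trusted against the new endpoint behavior.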

A collaborative authoring surface closes the human-in-the-loop side of the system. The admin page is where prompts are edited, versions are previewed, candidate mutations are reviewed, and promotion is one click. The shape is LLARS-adjacent (arXiv:2605.10593): domain expert and developer share the same versioning surface, with the diff between adjacent rows in the version log as the review primitive.
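The review primitive itself needs very little machinery. A minimal sketch using the standard library's `difflib`, assuming each version row exposes its body as plain text (the `version_diff` helper and the `key@vN` labels are hypothetical):

```python
import difflib


def version_diff(older_body, newer_body, prompt_key, older_id, newer_id):
    """Unified diff between two adjacent rows of the version log."""
    lines = difflib.unified_diff(
        older_body.splitlines(),
        newer_body.splitlines(),
        fromfile=f"{prompt_key}@v{older_id}",
        tofile=f"{prompt_key}@v{newer_id}",
        lineterm="",
    )
    return "\n".join(lines)
```

Identical bodies produce an empty string, so a reviewer only sees rows where something actually changed.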

Taken together, the honest framing is that the harness pinned a PRL-3 score on Guinard’s nine-level scale when we first wrote it down, and the layers above push the composed system into PRL-5 territory. Each anchor paper above named the architectural primitive we needed; the engineering was small once the primitive had a name.
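To give a sense of how small, here is a sketch of the sampling step at the environment tier. It is an illustration under assumptions rather than the allocator described in the earlier post: every objective is assumed to be binarized into a success flag (for example, latency under budget) so that a Beta posterior applies, higher samples are assumed to be better, and ties on the Pareto front are broken uniformly at random. All function names are hypothetical.

```python
import random


def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(samples):
    """Return the variant ids whose sampled vectors no other variant dominates."""
    return [
        vid
        for vid, vec in samples.items()
        if not any(dominates(other, vec) for oid, other in samples.items() if oid != vid)
    ]


def choose_variant(posteriors, rng):
    """posteriors: {variant_id: [(alpha, beta), ...]}, one pair per objective."""
    samples = {
        vid: [rng.betavariate(a, b) for a, b in params]
        for vid, params in posteriors.items()
    }
    return rng.choice(sorted(pareto_front(samples)))


def update(posteriors, variant_id, outcomes):
    """outcomes: one 0/1 success flag per objective for the served call."""
    posteriors[variant_id] = [
        (a + hit, b + (1 - hit))
        for (a, b), hit in zip(posteriors[variant_id], outcomes)
    ]
```

A call site would invoke `choose_variant(posteriors, random.Random())` at resolve time and `update(...)` once the outcome row for that call lands in the ledger.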

What is still genuinely open, and worth naming as the next year of work, is prompt-evolution substrate density. The promotion engine, the canary lane, the drift detector, and the joint runner all read from the execution ledger’s outcome columns (accepted, edit_distance_pct, per-objective scores). Today only a fraction of registered prompts have enough feedback signal flowing in for the eligibility filter to qualify them; the rest fire correctly but produce no learning. The plumbing exists end-to-end; the substrate the plumbing depends on is the work item. This is an engineering gap, not a research gap, and it does not need new architecture to close. That gap is only a problem once the harness is in place — which raises the prior question of whether a given codebase has crossed the threshold where building one pays.
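Substrate density can be made concrete as a ratio. The sketch below assumes ledger rows are dictionaries carrying the `accepted` outcome column named above, with `None` meaning no feedback arrived; the `min_labeled` threshold and both function names are hypothetical stand-ins for whatever the real eligibility filter uses.

```python
def eligible_prompt_keys(rows, min_labeled=30):
    """Prompt keys with at least `min_labeled` rows that carry an outcome."""
    counts = {}
    for row in rows:
        if row.get("accepted") is not None:
            counts[row["prompt_key"]] = counts.get(row["prompt_key"], 0) + 1
    return sorted(k for k, n in counts.items() if n >= min_labeled)


def substrate_density(rows, registered_keys, min_labeled=30):
    """Fraction of registered prompts that currently qualify for evolution."""
    if not registered_keys:
        return 0.0
    eligible = set(eligible_prompt_keys(rows, min_labeled))
    return len(eligible & set(registered_keys)) / len(registered_keys)
```

Tracking this single number over time shows whether feedback instrumentation is catching up with the registry or falling behind it.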

When to reach for this design

Build a harness when any of the following is true:

You have more than five prompts in production and someone has asked “which version of this prompt did the user actually see last Tuesday?”
An overnight cost spike has happened and the post-mortem could not attribute it to a specific prompt
A product manager wants to A/B test a tone or persona and you discovered the prompt is a string literal in three different files
The eval suite scores prompts that are not the prompts that shipped

Skip the harness if you have one prompt, one model, one user. Add it before the second prompt lands. The cost of building it later (as we discovered) is roughly one calendar quarter of an engineer’s time, paid in the form of dashboards that lie to you in the meantime.

The detection fingerprint is the conversation: when someone in standup says “wait, which prompt is that?”, the harness is what answers them with a prompt_key, a version_id, and a row from the ledger. Everything else is theater.