
Our prompts stopped being code

Every LLM-powered codebase past month six develops the same accountability gap: six copies of “You are an expert”, each slightly different, none indexed. The team that wrote them does not know which one shipped to which user yesterday. The dashboard that scores them grades one prompt against the answers a different prompt generated. The eval suite that gates promotions runs against a string that was deleted in last Tuesday’s refactor. Nobody is lying. The codebase just has no way to ask “what was the prompt at 14:32 UTC on the 16th?” and get an honest answer.

This is the moment most LLM-powered products quietly stop treating prompts as code and start treating them as config that happens to be embedded in code. The string literal is the wrong primitive. What you need is a registry, a version table, a resolver that can be overridden per user or per document or per environment, a ledger that records what actually fired with what hash and what cost, and a feedback loop that ties the answer back to the prompt that produced it. Most teams build the first three on top of each other (registry, then version table, then resolver), realize the ledger is what makes the first three useful, and then spend a quarter wiring observability that should have been there from the start.

This post covers the prompt harness we landed on at SourceShift — four-tier resolver, write-through execution ledger — the 2026 literature that converged on the same shape from different directions, and the one engineering gap still open.

What hurts when you do not have one

The pain comes in five varieties. Most teams hit them in this order:

| # | Pain | When it fires |
| --- | --- | --- |
| 1 | Where is this prompt? | Six string literals across chapterGen, chatService, briefBuilder, three with the same first paragraph and different endings. Nobody knows which is canonical. |
| 2 | Who changed it last? | The prompt that worked on staging does not work in prod. git blame shows a typo fix from three weeks ago that also dropped a hyphen, and downstream behavior shifted. |
| 3 | Can I A/B test it? | A new variant looks promising in the eval suite. Routing 5% of traffic to it requires a deploy, a feature flag, and three callsite edits, because the prompt is a string literal in three different functions. |
| 4 | Did the user override it? | The product manager wants to test a tone for a specific document type. There is no path between “PM edits a row” and “prompt resolves with that override at call time”. |
| 5 | Why did the cost just jump? | A new feature shipped, the dashboard knows total spend went up, but no per-prompt attribution exists because the cost row records the model and the trace ID but not the prompt’s canonical key. |
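
To make pain 5 concrete, here is a minimal sketch of what per-prompt attribution looks like once the cost row carries the canonical key. The `CostRow` shape, the `prompt_key` column name, and the sample keys and amounts are illustrative assumptions, not our actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRow:
    trace_id: str
    model: str
    prompt_key: str  # assumed column: the canonical key the pain-5 cost row lacks
    cost_usd: float


def spend_by_prompt(rows):
    """Sum cost per canonical prompt key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.prompt_key] += row.cost_usd
    return dict(totals)


# Made-up rows for illustration only.
rows = [
    CostRow("t1", "model-a", "chapter.outline", 0.25),
    CostRow("t2", "model-a", "chat.reply", 0.5),
    CostRow("t3", "model-b", "chapter.outline", 0.75),
]
totals = spend_by_prompt(rows)
```

Without the key column the same rows can only be grouped by model or by trace, which is why the dashboard sees the jump but cannot say which prompt caused it.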

If you have shipped any LLM-backed product more complex than one prompt, you have hit at least two of these. The recent literature has converged on calling this whole space prompt engineering as a production discipline. Guinard’s Prompt Readiness Levels paper (arXiv:2603.15044) frames it as a nine-level maturity scale (TRL-inspired) and a scoring framework that lets you ask “where on the scale is this prompt asset?” with a number, not a feeling. The scale is uncomfortable to grade your own production prompts against. Most teams land at PRL 2 or 3, which is roughly “the prompt exists and someone has tested it once.” Moving past PRL 3 requires more than discipline — it requires an architecture, and the one we converged on replaced the string literal with five primitives that answer “which prompt fired for this call?” at any point in time.

The shape we landed on

flowchart TD
  subgraph AUTHORING[Authoring & promotion]
    ADMIN([admin / PM edits<br/>collaborative surface]) --> VER
    CAND[(candidate-variant store)] --> GOV{governance gate}
    GOV -- approve --> VER[(append-only version log<br/>body + model-id + fingerprint)]
    GOV -- reject --> CAND
  end

  subgraph RUNTIME[Runtime resolution]
    REG[(registry: one entry<br/>per prompt key)] --> CANON[(canonical mirror)]
    CALL([feature call site]) --> RESOLVE{4-tier resolver}
    CANON --> RESOLVE
    VER --> RESOLVE
    RESOLVE -- doc override --> CHOSEN[resolved body + key + version]
    RESOLVE -- user override --> CHOSEN
    RESOLVE -- env lane: MOL-TS<br/>multi-objective bandit --> CHOSEN
    RESOLVE -- registered default --> CHOSEN
    CHOSEN --> WIRE[gateway: LLM call<br/>+ OTel span + cost row]
    WIRE --> RESP([response])
  end

  subgraph LEARNING[Evolution loop]
    DRIFT[drift detector:<br/>rolling embedding centroids]
    GEPA[GEPA reflective<br/>promotion engine]
    JOINT[joint-mutation runner:<br/>cross-template coupling]
    LEDGER[(execution ledger:<br/>hash · cost · outcomes)] --> DRIFT
    LEDGER --> GEPA
    LEDGER --> JOINT
    DRIFT -- model-changed event --> CAND
    GEPA -- reflective mutation --> CAND
    JOINT -- coupled-prompt mutation --> CAND
  end

  WIRE -.cost + outcome row.-> LEDGER
  RESP --> FB([user feedback<br/>thumbs · edit-distance · eval])
  FB -.substrate signal.-> LEDGER

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef store fill:#4a1f7a,stroke:#fff,color:#fff
  class RESOLVE,GOV gate
  class CANON,CHOSEN,WIRE,DRIFT,GEPA,JOINT alloc
  class CALL,ADMIN,RESP,FB serve
  class REG,VER,LEDGER,CAND store

Three lanes, one circulation pattern. The authoring lane (top) carries human edits and governed promotions into the version log. The runtime lane (middle) is the path every LLM call follows: registry, resolver, gateway, response, with the version log feeding the resolver as the source of truth. The learning lane (bottom) is where the system gets smarter: every served call lands an outcome row in the ledger; three readers consume that ledger (drift detector, reflective promotion engine, joint-mutation runner) and propose new candidate variants; candidates wait at the governance gate for promotion. The dotted arrows are the feedback edges that close the loop.
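
The four-tier resolver in the runtime lane can be sketched as a first-hit-wins lookup. This is a minimal illustration, not the production resolver: the precedence order (doc override, then user override, then env lane, then registered default) is assumed from the order of the edges in the diagram, the class and field names are hypothetical, and the env lane is a pluggable callable standing in for the MOL-TS bandit rather than an implementation of it.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

Variant = Tuple[int, str]  # (version_id, body)


@dataclass(frozen=True)
class Resolved:
    prompt_key: str
    version_id: int
    body: str
    tier: str  # which of the four tiers produced the body


@dataclass
class Resolver:
    defaults: dict                                      # prompt_key -> Variant
    doc_overrides: dict = field(default_factory=dict)   # (prompt_key, doc_id) -> Variant
    user_overrides: dict = field(default_factory=dict)  # (prompt_key, user_id) -> Variant
    # Stand-in for the env lane; returns a Variant or None to fall through.
    env_lane: Optional[Callable[[str, str], Optional[Variant]]] = None

    def resolve(self, prompt_key, doc_id=None, user_id=None, env="prod"):
        lookups = [
            ("doc_override", lambda: self.doc_overrides.get((prompt_key, doc_id))),
            ("user_override", lambda: self.user_overrides.get((prompt_key, user_id))),
            ("env_lane", lambda: self.env_lane(prompt_key, env) if self.env_lane else None),
            ("registered_default", lambda: self.defaults.get(prompt_key)),
        ]
        for tier, lookup in lookups:
            hit = lookup()  # evaluated lazily, so lower tiers run only on a miss
            if hit is not None:
                version_id, body = hit
                return Resolved(prompt_key, version_id, body, tier)
        raise KeyError(f"unregistered prompt key: {prompt_key}")


resolver = Resolver(
    defaults={"chat.reply": (3, "You are a concise assistant.")},
    user_overrides={("chat.reply", "u42"): (7, "You are a formal assistant.")},
)
default_hit = resolver.resolve("chat.reply")
user_hit = resolver.resolve("chat.reply", user_id="u42")
```

Returning the tier alongside the key and version is the design choice that matters here: the call site never learns why a body was chosen, but the ledger row can.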

Each labeled component above has an academic anchor in the next section: GEPA (arXiv:2507.19457) for the promotion engine, multi-objective bandits (arXiv:2605.14553) for the MOL-TS environment lane, ADOPT (arXiv:2512.24933) for the joint-mutation runner, Behavioral Fingerprints (arXiv:2603.19022) for the drift detector, Prompt Migration (arXiv:2507.05573) for the (body + model-id + fingerprint) version-log triple, and LLARS (arXiv:2605.10593) for the collaborative authoring surface. The whole composed system tracks the LLM Readiness Harness shape (arXiv:2603.27355) and meets Prompt Readiness Level 5 on Guinard’s nine-level scale (arXiv:2603.15044).
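
As a rough sketch of the (body + model-id + fingerprint) triple, the version log can be modeled as a list that only ever grows. One loud assumption: the fingerprint below is a plain SHA-256 over the model id and body, chosen only so the example is runnable. This post does not define how the real fingerprint is computed, and a behavioral fingerprint would be derived differently. All class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionRow:
    version_id: int
    prompt_key: str
    body: str
    model_id: str
    fingerprint: str  # assumption: content hash, not a behavioral fingerprint


class VersionLog:
    """Append-only: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def append(self, prompt_key, body, model_id):
        digest = hashlib.sha256(f"{model_id}\n{body}".encode("utf-8")).hexdigest()
        row = VersionRow(len(self._rows) + 1, prompt_key, body, model_id, digest)
        self._rows.append(row)
        return row

    def history(self, prompt_key):
        """All versions of one prompt, oldest first."""
        return [r for r in self._rows if r.prompt_key == prompt_key]


log = VersionLog()
v1 = log.append("chat.reply", "You are a well-read assistant.", "model-a")
v2 = log.append("chat.reply", "You are a well read assistant.", "model-a")  # dropped hyphen
v3 = log.append("chat.reply", "You are a well read assistant.", "model-b")  # same body, new model
```

Even with this simplified fingerprint, the dropped hyphen from pain 2 and a model swap each produce a new row with a different fingerprint, so neither change can slip through as "just a typo fix".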

The architecture rests on five primitives and one strict separation:

The strict separation that matters: resolution, rendering, and the wire call own different layers. The harness resolves which prompt body should fire and records that it fired. Whatever you use to render the prompt (BAML, an in-house templater, plain string interpolation) lives above the resolver; whatever you use to actually call the model lives below the ledger. Conflating any two of these three responsibilities is the architectural mistake most prompt-management libraries make, and the reason we have a separate post about adopting BAML in modular mode. The three-layer separation is not an original invention — four independent research threads arrived at the same boundaries from different failure modes, and naming them sharpened every design decision we made.
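
The separation can be illustrated with three pieces that only meet in one small composition function. This is a sketch under stated assumptions: `Harness`, `render`, `fake_model_call`, and `run` are hypothetical names, `string.Template` stands in for whatever renderer you use, and the model call is a stub that returns a fixed cost instead of touching a real API.

```python
import hashlib
from string import Template


class Harness:
    """Owns only two things: which body fires, and the record that it fired."""

    def __init__(self, bodies):
        self._bodies = bodies  # prompt_key -> (version_id, body)
        self.ledger = []

    def resolve(self, prompt_key):
        version_id, body = self._bodies[prompt_key]
        return prompt_key, version_id, body

    def record(self, prompt_key, version_id, body, cost_usd):
        self.ledger.append({
            "prompt_key": prompt_key,
            "version_id": version_id,
            "body_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
            "cost_usd": cost_usd,
        })


def render(body, variables):
    """Rendering layer: stand-in for BAML or an in-house templater."""
    return Template(body).substitute(variables)


def fake_model_call(prompt):
    """Wire layer: a stub returning (text, cost) instead of calling a model."""
    return f"echo: {prompt}", 0.01


def run(harness, prompt_key, variables, call_model=fake_model_call):
    key, version_id, body = harness.resolve(prompt_key)  # resolution
    prompt = render(body, variables)                     # rendering
    text, cost = call_model(prompt)                      # wire call
    harness.record(key, version_id, body, cost)          # ledger write
    return text


harness = Harness({"brief.summary": (2, "Summarize $title in one line.")})
output = run(harness, "brief.summary", {"title": "Q3 plan"})
```

Note that the ledger hashes the resolved body, not the rendered prompt. Swapping the renderer or the model client changes one function and leaves the other two, and the ledger schema, untouched.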

Where the literature is converging

Four papers from the last year landed on closely related shapes from very different angles:

A registry-plus-resolver harness is exactly the seam prompt-optimization libraries need — the interface, not the optimizer — which the promptolution paper (arXiv:2512.02840) frames as the architectural primitive most research codebases are still missing.

What we got out of it

Three architectural shifts, all measurable within weeks of the harness landing:

Those three shifts held cleanly because the harness was already the substrate every adjacent system on this blog depended on — and the dependency graph is worth naming explicitly.

How it ties the rest of the blog together

This harness is the layer most other systems on this blog quietly depend on. Each of the following posts assumes a registry-plus-resolver below it, and most of them only work because that layer is in place:

The anthropic-baseURL env-leak incident is the odd one out: it predates the harness in the surfaces it covers, and the cost-attribution that surfaced the leak would have shown up in the execution ledger a day earlier if the harness had been in place. We mention it here because that incident is partly why the ledger exists at all. The ledger that incident motivated is also the substrate that makes five additional layers possible without restructuring any of the code the posts above depend on.

What composes on top of the architecture today

The harness is the foundation; five other layers compose on top of it. Each is shipped, each has an academic anchor that validates the shape we chose, and each plugs in through one of the architectural primitives above without restructuring the others.

A collaborative authoring surface closes the human-in-the-loop side of the system. The admin page is where prompts are edited, versions are previewed, candidate mutations are reviewed, and promotion is one click. The shape is LLARS-adjacent (arXiv:2605.10593): domain expert and developer share the same versioning surface, with the diff between adjacent rows in the version log as the review primitive.

Taken together, the honest framing is that the harness pinned a PRL-3 score on Guinard’s nine-level scale when we first wrote it down, and the layers above push the composed system into PRL-5 territory. Each anchor paper above named the architectural primitive we needed; the engineering was small once the primitive had a name.

What is still genuinely open, and worth naming as the next year of work, is prompt-evolution substrate density. The promotion engine, the canary lane, the drift detector, and the joint runner all read from the execution ledger’s outcome columns (accepted, edit_distance_pct, per-objective scores). Today only a fraction of registered prompts have enough feedback signal flowing in for the eligibility filter to qualify them; the rest fire correctly but produce no learning. The plumbing exists end-to-end; the substrate the plumbing depends on is the work item. This is an engineering gap, not a research gap, and it does not need new architecture to close. That gap is only a problem once the harness is in place — which raises the prior question of whether a given codebase has crossed the threshold where building one pays.
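
Substrate density can be made measurable with a few lines over the ledger. The sketch below is an assumption-heavy illustration: the real eligibility filter is not described in this post, so the rule used here (a prompt qualifies once it has at least `min_outcomes` rows with a non-null `accepted` or `edit_distance_pct`) and the threshold value are hypothetical.

```python
from collections import Counter


def substrate_density(registered_keys, ledger_rows, min_outcomes=30):
    """Return (eligible prompt keys, fraction of registered prompts that are eligible)."""
    # Count only rows that carry some feedback signal.
    with_signal = Counter(
        row["prompt_key"]
        for row in ledger_rows
        if row.get("accepted") is not None or row.get("edit_distance_pct") is not None
    )
    eligible = sorted(k for k in registered_keys if with_signal[k] >= min_outcomes)
    density = len(eligible) / len(registered_keys) if registered_keys else 0.0
    return eligible, density


# Made-up data for illustration only.
registered = ["chat.reply", "chapter.outline", "brief.summary", "brief.title"]
ledger_rows = [
    {"prompt_key": "chat.reply", "accepted": True, "edit_distance_pct": 4.0},
    {"prompt_key": "chat.reply", "accepted": False, "edit_distance_pct": None},
    {"prompt_key": "chapter.outline", "accepted": None, "edit_distance_pct": 12.5},
    {"prompt_key": "brief.summary", "accepted": None, "edit_distance_pct": None},
]
eligible, density = substrate_density(registered, ledger_rows, min_outcomes=2)
```

In this toy data every prompt that fired did so correctly, yet only one of four registered prompts clears the bar, which is the shape of the gap described above.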

When to reach for this design

Build a harness when any of the following is true:

Skip the harness if you have one prompt, one model, one user. Add it before the second prompt lands. The cost of building it later (as we discovered) is roughly one engineer for a quarter, paid in the form of dashboards that lie to you in the meantime.

The telltale sign is a conversation: when someone in standup asks “wait, which prompt is that?”, the harness is what answers them with a prompt_key, a version_id, and a row from the ledger. Everything else is theater.
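
For a sense of what that answer looks like in code, here is a minimal point-in-time lookup over ledger rows. The `LedgerRow` fields and the `last_fired` helper are hypothetical, and the month and year in the sample timestamps are arbitrary placeholders, since the question above only names a time and a day.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LedgerRow:
    fired_at: datetime
    prompt_key: str
    version_id: int
    body_hash: str


def last_fired(rows, prompt_key, at):
    """Most recent row for prompt_key fired at or before `at`, or None."""
    matching = sorted(
        (r for r in rows if r.prompt_key == prompt_key),
        key=lambda r: r.fired_at,
    )
    idx = bisect_right([r.fired_at for r in matching], at)
    return matching[idx - 1] if idx else None


def utc(hour, minute):
    # Arbitrary placeholder date; only the time of day and the 16th come from the text.
    return datetime(2026, 1, 16, hour, minute, tzinfo=timezone.utc)


rows = [
    LedgerRow(utc(14, 0), "chat.reply", 4, "aaa"),
    LedgerRow(utc(14, 30), "chat.reply", 5, "bbb"),
    LedgerRow(utc(15, 0), "chat.reply", 6, "ccc"),
]
hit = last_fired(rows, "chat.reply", utc(14, 32))
```

In a real deployment this would be an indexed query against the ledger table rather than an in-memory sort, but the contract is the same: a key and a timestamp go in, one row comes out.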
