llm-ops · context-engineering · architecture · memory-systems · arxiv-research

We stopped treating context like application logic

Apr 15, 2026 · 11 min read

The eighth time we added a feature that needed conversation history, document scope, prior decisions, and a few pinned facts, we wrote the same composition logic for the eighth time. Each feature picked its own bucket of state, its own truncation heuristic, its own RAG glue. Each one degraded differently as context grew. Each one became someone else’s problem to debug at 2 AM.

The eighth time forced an admission: context is infrastructure — a substrate every block-shaped feature plugs into, not a thing you write again.

The result is a context engine that fits in six tables, three plug-in layers, and one compose call. It draws from four production-grounded 2026 papers and locks the load-bearing decisions on day one. This post is the architecture and the why — including the parts we chose not to build yet.

Steel-manning the alternative

The strongest argument against building a context engine is that you do not need one. Your framework ships with a message-history primitive. Your RAG stack handles document grounding. Your scratchpad pattern handles the rest. You write a per-feature getContext() helper, you ship, you move on.

That argument holds for the first three features. By the fourth, three different getContext() helpers diverge on token budgeting. The fifth feature adds memory and creates a parallel CRUD path; the sixth feature reads from the wrong path and silently shows stale facts. By the eighth, you cannot ship the ninth without copying logic out of the third — and the third has been deleted in a refactor nobody told you about.

The structural mismatch between context (shared across features) and application logic (owned per-feature) is what breaks at the N+1 feature — not any single one. A substrate exists to absorb that mismatch.

First-principles: what a context engine must do

If we forget every existing memory framework and start from need, a context engine must answer a small number of questions consistently:

For this scope (a chat session, a book, an agent run), what is relevant right now?
Under what budget can we deliver it (tokens, time, dollars)?
Who is allowed to read it, write it, delete it?
What is fresh and what is stale, and when something gets updated, what derivative views become invalid?
How does it extend? Adding a new feature should mean declaring its needs, not rewriting the engine.

These reduce to twenty-five must-haves grouped into five buckets: substrate, governance, extensibility, retrieval quality, multimodal. The substrate bucket is mostly mechanical — a block-keyed immutable ledger, append-only writes, token-budgeted layers, async distillation. The interesting parts are the other four, because they are where most context implementations quietly give up.

The 2026 literature now backs every one of them. Mem0 is the production baseline — add/update/delete/noop operation typing with empirical SLOs at p50 148 ms. MEMOREPAIR formalizes cascade repair via valid_from / valid_until / superseded_by columns. SSGM supplies the reconcile-cron pattern. StructMemEval settles the eval-in-stage-A debate (ship-before-baseline is a structural mistake, not a sequencing one).

The newer additions are where the design gets unobvious.

Systems thinking: the three-layer plug-in stack

Adding the next feature should mean adding a YAML file, not adding a code path through the engine. That requires three layers, each independently swappable:

flowchart TB
  subgraph M["Manifests · what each feature needs"]
    CM[chat-session.yaml]
    BM[book-gen.yaml]
    AM[agent-run.yaml]
    NM[any-new-feature.yaml]
  end

  subgraph E["Engine · typed pipeline"]
    Compose[composeBlockContext<br/>intent + scope]
    Permission[T1/T2/T3 gate]
    HyPE[synthetic-query enrich]
    Veto[cross-encoder veto]
    Score[Triage & Bid + FSRS + Kalman]
    Cache[mandatory compose cache]
  end

  subgraph C["CxRI · where memory lives"]
    PG[pgvectorConnector]
    QD[qdrantConnector<br/>cold tier]
    MI[minioConnector<br/>BLOB tier]
    EX[externalKBConnector]
  end

  subgraph S["Storage · 6 block-keyed tables"]
    BL[(blocks · immutable ledger)]
    SM[(subtree_summaries<br/>+valid_from/until)]
    OS[(memory_open_set<br/>atomic facts)]
    ST[(memory_schema_typed<br/>typed properties)]
    LK[(context_links · pinned)]
    CA[(compose_cache · 30s)]
  end

  CM & BM & AM & NM --> Compose
  Compose --> Permission --> HyPE --> Veto --> Score --> Cache
  Compose --> C
  C --> S

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef store fill:#4a1f7a,stroke:#fff,color:#fff
  class Permission,Veto gate
  class Compose,HyPE,Score,Cache alloc
  class CM,BM,AM,NM serve
  class BL,SM,OS,ST,LK,CA,PG,QD,MI,EX store

Manifests — YAML files under server/contextManifests/ — are the only artifact a new feature needs to declare its context requirements; zero engine code changes. They describe what each feature needs — which layers, what token budget, what permission tier, what intent-keyed scoring weights.

The engine runs a deterministic typed pipeline per call: resolve scope → check permissions → enrich with synthetic queries → vector-search → veto with a cross-encoder → score with sim × R^λ × (1 + β·U) → cache. Every layer is observable, every layer is bypassable per manifest.
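The scoring step is compact enough to sketch directly. What follows is a minimal illustration, not the engine's code: the type and function names are hypothetical, the inputs are assumed to lie in [0, 1], and U is simply the third term of the formula, since the post does not define it further. The λ and β values are the ones the chat-session manifest uses.

```typescript
// Hypothetical names; only the formula and the sample weights come from the post.
type ScoringIntent = "fact" | "reasoning" | "temporal";

interface ScoringConfig {
  lambdaByIntent: { [K in ScoringIntent]: number }; // exponent applied to R
  beta: number;                                     // weight on the U term
}

interface ScoredCandidate {
  id: string;
  sim: number; // vector similarity, assumed in [0, 1]
  R: number;   // retrievability, assumed in [0, 1]
  U: number;   // the U term of the formula, assumed in [0, 1]
}

// score = sim * R^lambda * (1 + beta * U)
function scoreCandidate(
  c: ScoredCandidate,
  intent: ScoringIntent,
  config: ScoringConfig,
): number {
  const lambda = config.lambdaByIntent[intent];
  return c.sim * Math.pow(c.R, lambda) * (1 + config.beta * c.U);
}

const scoringConfig: ScoringConfig = {
  lambdaByIntent: { fact: 0.0, reasoning: 1.0, temporal: 0.5 },
  beta: 1.5,
};

// A similar but heavily decayed memory.
const staleFact: ScoredCandidate = { id: "old-fact", sim: 0.8, R: 0.25, U: 0 };

// lambda = 0 ignores decay:       0.8 * 1    * 1
const factScore = scoreCandidate(staleFact, "fact", scoringConfig);
// lambda = 1 applies full decay:  0.8 * 0.25 * 1
const reasoningScore = scoreCandidate(staleFact, "reasoning", scoringConfig);
// lambda = 0.5 applies sqrt decay: 0.8 * 0.5 * 1
const temporalScore = scoreCandidate(staleFact, "temporal", scoringConfig);
```

The point of keying λ by intent shows up in the three results: the same stale row stays fully eligible for a fact lookup and is discounted for reasoning, without the caller changing anything but the intent.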

CxRI (Context Runtime Interface) insulates the engine from knowing how many backends exist: a six-operation contract (connect / query / read / write / subscribe / health) is the entire interface any backing store needs to implement. On day one we have one connector, pgvectorConnector, at about 300 lines.
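To make the six-operation contract concrete, here is one way it could be typed. The post names the operations but not their signatures, so every parameter and return type below is an assumption, and the in-memory connector is a hypothetical test stand-in rather than pgvectorConnector.

```typescript
// Hypothetical record shape and signatures; only the six operation names come from the post.
interface MemoryRecord {
  id: string;
  content: string;
  embedding: number[];
}

interface CxRIConnector {
  connect(): Promise<void>;
  query(embedding: number[], topK: number): Promise<MemoryRecord[]>;
  read(id: string): Promise<MemoryRecord | undefined>;
  write(record: MemoryRecord): Promise<void>;
  subscribe(onWrite: (record: MemoryRecord) => void): () => void; // returns an unsubscribe function
  health(): Promise<{ ok: boolean }>;
}

// In-memory stand-in for tests.
class InMemoryConnector implements CxRIConnector {
  private rows = new Map<string, MemoryRecord>();
  private listeners = new Set<(record: MemoryRecord) => void>();
  private connected = false;

  async connect(): Promise<void> {
    this.connected = true;
  }

  async query(embedding: number[], topK: number): Promise<MemoryRecord[]> {
    // Rank by dot product; a real connector would delegate to its vector index.
    const dot = (a: number[], b: number[]): number =>
      a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
    return Array.from(this.rows.values())
      .sort((a, b) => dot(b.embedding, embedding) - dot(a.embedding, embedding))
      .slice(0, topK);
  }

  async read(id: string): Promise<MemoryRecord | undefined> {
    return this.rows.get(id);
  }

  async write(record: MemoryRecord): Promise<void> {
    this.rows.set(record.id, record);
    this.listeners.forEach((notify) => notify(record));
  }

  subscribe(onWrite: (record: MemoryRecord) => void): () => void {
    this.listeners.add(onWrite);
    return () => {
      this.listeners.delete(onWrite);
    };
  }

  async health(): Promise<{ ok: boolean }> {
    return { ok: this.connected };
  }
}
```

Because the engine would depend only on the interface, a second backend is a new class rather than a new code path through the pipeline.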

The framing is borrowed from Context Kubernetes, which makes the case for declarative context infrastructure analogous to how k8s declares workload infrastructure. The same paper supplies the three-tier agent permission model that the engine enforces — T1 autonomous reads (auth-fenced), T2 soft-approval writes (distiller output enters a quarantine queue until drift telemetry runs green for 30 days), T3 strong-approval cross-scope or delete operations. A three-tier permission model and a six-operation connector interface are easy to name on a diagram; the manifest YAML is where both commitments become concrete and verifiable.
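The three tiers can be read as a small decision function. This is a sketch under stated assumptions: the decision labels are invented for illustration, and treating the tiers as ordered (a T3 profile also covers T2 and T1 operations) is our simplification, not something the post specifies.

```typescript
// Hypothetical types; the post names the tiers, not the enforcement code.
type PermissionTier = "T1" | "T2" | "T3";
type MemoryOperation = "read" | "write" | "delete";

interface AccessRequest {
  op: MemoryOperation;
  crossScope: boolean;
}

type GateDecision = "allow" | "quarantine" | "require_approval" | "deny";

// Assumption: tiers are ordered, so a higher tier covers the lower ones.
const TIER_RANK: { [K in PermissionTier]: number } = { T1: 1, T2: 2, T3: 3 };

// Reads need T1, in-scope writes need T2, cross-scope or delete operations need T3.
function requiredTier(req: AccessRequest): PermissionTier {
  if (req.crossScope || req.op === "delete") return "T3";
  if (req.op === "write") return "T2";
  return "T1";
}

function permissionGate(req: AccessRequest, granted: PermissionTier): GateDecision {
  const needed = requiredTier(req);
  if (TIER_RANK[granted] < TIER_RANK[needed]) return "deny";
  if (needed === "T3") return "require_approval"; // strong approval
  if (needed === "T2") return "quarantine";       // soft approval: held in a queue first
  return "allow";                                  // autonomous read
}
```

Under this reading, a distiller running with a T2 profile can queue a write but is denied a delete outright, which matches the intent of keeping destructive operations behind the strongest gate.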

Show, do not tell

A manifest entry for the chat session looks like this:

apiVersion: contextengine.researcher.local/v1
kind: ContextDomain
metadata:
  name: chat-session
  scope_block_type: chat_session
spec:
  layers:
    recent:       { enabled: true, limit: 12 }
    summary:      { enabled: true, kinds: [rolling, per_doc, lineage] }
    open_set:     { enabled: true, top_k: 8 }
    schema_typed: { enabled: true, collection: chat_user_profile }
    semantic:     { enabled: true, top_k: 5, hype_enabled: true, veto_threshold: 0.10 }
  budget:
    total_tokens: 4000
    min_per_layer: { schema_typed: 400, recent: 800 }
  scoring:
    formula: "sim * pow(R, lambda) * (1 + beta * U)"
    lambda_by_intent: { fact: 0.0, reasoning: 1.0, temporal: 0.5 }
    beta: 1.5
  permission:
    reads: T1
    writes: T2
    cross_scope: T3
  distillation:
    policy: "every_3_turns OR token_pressure_70pct"
    cost_budget_usd_per_day: 0.50

Manifest and engine are fully decoupled: the chat feature owns the policy, the engine reads it, and neither knows the other exists outside this file.
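As one illustration of how the engine might read policy out of the manifest, the distillation policy string can be turned into a predicate. The grammar here is inferred from the single example above, and reading token_pressure_70pct as "used tokens at or above 70% of the budget" is an assumption; all names are hypothetical.

```typescript
// Hypothetical parser; the clause grammar is inferred from one example in the manifest.
interface DistillState {
  turnsSinceLastDistill: number;
  usedTokens: number;
  totalTokens: number;
}

type DistillTrigger = (state: DistillState) => boolean;

function parseDistillClause(clause: string): DistillTrigger {
  const turns = /^every_(\d+)_turns$/.exec(clause);
  if (turns) {
    const n = Number(turns[1]);
    return (s) => s.turnsSinceLastDistill >= n;
  }
  const pressure = /^token_pressure_(\d+)pct$/.exec(clause);
  if (pressure) {
    const ratio = Number(pressure[1]) / 100;
    return (s) => s.usedTokens / s.totalTokens >= ratio;
  }
  // Fail loudly instead of silently never distilling.
  throw new Error(`Unknown distillation clause: ${clause}`);
}

// Clauses joined by OR fire when any one of them fires.
function parseDistillPolicy(policy: string): DistillTrigger {
  const triggers = policy.split(/\s+OR\s+/).map((c) => parseDistillClause(c.trim()));
  return (s) => triggers.some((t) => t(s));
}

const shouldDistill = parseDistillPolicy("every_3_turns OR token_pressure_70pct");
```

With the chat manifest's 4000-token budget, this predicate fires on the third turn after the last distillation, or earlier if the assembled context reaches 2800 tokens.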

The call site, on the other side of the engine, looks like this:

const ctx = await composeBlockContext({
  scope_block_uuid: sessionUuid,
  intent: 'follow_up',          // engine selects layers from this
  query: latestMessage,         // for the semantic layer
  permission_profile: agentProfile,  // T1/T2/T3 enforcement
  trace: false,
});

intent is the load-bearing argument. Asking what the user wants (follow-up, new topic, compare, recall a past decision) lets the engine pick which layers fire instead of always firing all of them. A recall_decision intent fires pinned + summary + schema_typed and skips semantic; a creative intent inverts that. Same compose call, different cost profile, no per-call configuration in the caller. What the compose call abstracts away from callers is the decision that took the most thinking to get right: memory inside the engine requires two storage shapes, not one.
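A sketch of intent-keyed layer selection, assuming a static lookup table. The recall_decision row comes from the paragraph above; the creative row is one reading of "inverts that"; the names are hypothetical, and intersecting with the manifest's enabled layers is an assumption based on layers being bypassable per manifest.

```typescript
// Hypothetical names; only the recall_decision row is taken from the post.
type ContextLayer =
  | "recent"
  | "summary"
  | "open_set"
  | "schema_typed"
  | "semantic"
  | "pinned";

type ComposeIntent = "recall_decision" | "creative";

const LAYERS_BY_INTENT: { [K in ComposeIntent]: ContextLayer[] } = {
  recall_decision: ["pinned", "summary", "schema_typed"], // from the post
  creative: ["semantic"], // assumption: one reading of "inverts that"
};

// A layer fires only if the intent asks for it AND the manifest enables it.
function selectLayers(
  intent: ComposeIntent,
  enabledInManifest: ContextLayer[],
): ContextLayer[] {
  return LAYERS_BY_INTENT[intent].filter(
    (layer) => enabledInManifest.indexOf(layer) !== -1,
  );
}
```

The caller never sees this table. It passes an intent, and the cost profile follows from which layers the table and the manifest jointly allow.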

The dual memory model — why two storage shapes instead of one

The bucket that took the most thinking was memory itself. The natural shape is a single table — (block_uuid, content, embedding) — and the version of this design we shipped first did exactly that.

It was wrong. Governed Memory, the system Personize.ai runs in production with a 74.8% LoCoMo score (the best public number we know of), splits memory across two stores from a single extraction call:

	Open-set	Schema-typed
Shape	Atomic free-form fact + embedding	Typed property value (text / int / date / bool / option / array)
Use	Long-tail observations, qualitative insights, unstructured claims	Queryable structured fields — `user.preferred_language='en'`, `book.target_word_count=80000`
Captured by both	34% of facts	34% of facts
Captured ONLY by this side	38% of facts	12% of facts
Missed by both	16% of facts

The 38% / 12% asymmetry is the headline. A single store loses real signal regardless of which shape you pick. The two stores share an extraction call (one LLM round-trip, two typed outputs) so the cost overhead is sub-linear and the recall gain is consistent across LoCoMo splits. Mem0 published the open-set side first. Governed Memory closed the loop.

The dual store earns its cost: for one extra table and one extra pipeline layer, it recovers the facts that either single-store design would miss.
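The table's percentages imply what each design captures on its own. The arithmetic below uses only the four shares quoted in the table; the variable names are ours.

```typescript
// Shares of facts from the table above, in percent.
const capturedByBoth = 34;
const openSetOnly = 38;
const schemaTypedOnly = 12;
const missedByBoth = 16;

// The four buckets partition all facts.
const totalShare = capturedByBoth + openSetOnly + schemaTypedOnly + missedByBoth; // 100

// What each design would capture.
const openSetAlone = capturedByBoth + openSetOnly;                // 72
const schemaTypedAlone = capturedByBoth + schemaTypedOnly;        // 46
const dualStore = capturedByBoth + openSetOnly + schemaTypedOnly; // 84

// What the second store adds over each single-store choice.
const gainOverOpenSet = dualStore - openSetAlone;         // 12
const gainOverSchemaTyped = dualStore - schemaTypedAlone; // 38
```

Read this way, a single open-set store tops out at 72% of facts and a single schema-typed store at 46%, while the pair reaches 84%.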

Reversibility: what to lock on day one

Jeff Bezos’s Type 1 / Type 2 framing — irreversible doors vs. reversible doors — is the right lens for a substrate. A substrate becomes load-bearing the moment a second feature depends on it. By the time the third feature lands, breaking the substrate is breaking the features.

So the design discipline was: lock the irreversible decisions on day one, defer everything else behind a flag.

Decision	Type	Locked when	Reasoning
`composeBlockContext` signature	Type 1	Stage A	Once 50 features call it, breaking it is breaking them. Ship with `intent` + `permission_profile` even if unused on day one.
Operation-typed writes (ADD / UPDATE / DELETE / NOOP)	Type 1	Stage A	Audit invariant; cannot bolt on without re-classifying every historical row.
`valid_from` / `valid_until` / `superseded_by` columns	Type 1	Stage A	Cascade repair depends on them; the legacy column drop in Stage E is unsafe without them.
CxRI 6-op interface	Type 1	Stage A	Future connectors break if the contract changes.
Manifest schema (`apiVersion v1`)	Type 1	Stage A	Every existing manifest must re-validate against future revisions; pin the shape from day one.
3-tier permission profile in the API	Type 1	Stage A	Only permission tier T1 is active at first; the T2 and T3 schema columns are present and gated off behind flags.
FSRS retrievability decay vs alternatives	Type 2	Stage C	Pluggable scoring function; swap by editing the manifest.
Cross-encoder veto gate threshold	Type 2	Stage C	Per-manifest flag, A/B-testable.
Multimodal layout-aware path	Type 2	Stage D	Flag-gated; researcher-specific.
Eval gate composition	Type 2	Stage A	Start with the LoCoMo subset, grow over time.

The line is not “ship less”; it is “lock signatures, defer concrete implementations.” Stage A ships every Type 1 decision and a working substrate with only the T1 permission tier active. The next four stages ship the Type 2 implementations against the locked signatures.
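Two of the locked rows in the table above, operation-typed writes and the three temporal columns, can be sketched together. This is an illustrative in-memory model rather than the engine's SQL: the row shape beyond the three named columns is hypothetical, and the choice to reject an ADD on a live key is an assumption.

```typescript
type WriteOp = "ADD" | "UPDATE" | "DELETE" | "NOOP";

interface FactRow {
  id: string;
  key: string;                  // hypothetical: what the fact is about
  value: string;
  valid_from: number;           // epoch ms
  valid_until: number | null;   // null means still current
  superseded_by: string | null; // id of the replacing row, if any
}

// Returns a new row list; existing rows are closed out, never removed.
function applyWrite(
  rows: FactRow[],
  op: WriteOp,
  key: string,
  value: string,
  now: number,
  newId: string,
): FactRow[] {
  const current = rows.find((r) => r.key === key && r.valid_until === null);
  const fresh: FactRow = {
    id: newId, key, value, valid_from: now, valid_until: null, superseded_by: null,
  };
  switch (op) {
    case "NOOP":
      return rows;
    case "ADD":
      if (current) throw new Error(`ADD on live key "${key}"; expected UPDATE`);
      return [...rows, fresh];
    case "UPDATE":
      if (!current) throw new Error(`UPDATE on missing key "${key}"`);
      // Close the old row and point it at its replacement.
      return [
        ...rows.map((r) =>
          r === current ? { ...r, valid_until: now, superseded_by: newId } : r,
        ),
        fresh,
      ];
    case "DELETE":
      if (!current) throw new Error(`DELETE on missing key "${key}"`);
      // Tombstone: the row stays in history with an end date.
      return rows.map((r) => (r === current ? { ...r, valid_until: now } : r));
  }
}
```

Because every historical row keeps its validity window and its successor pointer, a later repair pass can walk the chain, which is what makes these columns expensive to add after the fact.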

This matters more for a substrate than for an application. Application code can be rewritten when it bothers you. Substrate signatures are a contract with every feature that already plugged in. Locking those signatures would have meant nothing if the failure modes buried inside them had shipped first — which is why we ran a pre-mortem before a single line of Stage A was committed.

Pre-mortem: the failure modes that nearly shipped

Four pre-mortem failure modes meaningfully changed the Stage A design — out of twelve total, these were the ones that added engine components.

1. Compose latency exceeds Mem0 production. Mem0 publishes p50 search at 148 ms, total 1.44 s. We initially targeted p50 60 ms total, less than half of Mem0’s search latency alone, even though Mem0 has been running in production longer than our system. Softened to p50 ≤ 180 ms on cache miss, ≤ 50 ms on cache hit, p95 ≤ 400 ms. The compose cache became mandatory (not “future optimization”) because the SLOs cannot be hit without it.

2. Distiller-on-distiller drift. Refreshing a summary by reading the previous summary instead of the raw blocks compounds error every cycle. Zhang et al. measure this drift empirically on long sessions. Fix: every distillation pass re-reads the raw block ledger, never the previous summary. Drift telemetry alerts when divergence exceeds threshold per scope per night.

3. Distiller hallucinations pollute memory. A misclassified UPDATE overwrites a correct fact with a confident wrong one. Defense in depth: confidence threshold on write; cross-encoder veto gate on candidates (from MemArchitect); T2 soft-approval on distiller writes for the first 30 days, graduating to T1 only after drift telemetry stays green; two-phase redaction so an invalidated row writes a tombstone instead of disappearing.

4. Schema rot — code adds a field, manifest forgets. Without manifest-as-source-of-truth, the engine and the feature diverge silently. Fix: manifest is parsed and validated at boot; engine refuses unknown placeholders; CI lints for manifest/code drift. Manifest version pinned per feature.
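The boot-time check in item 4 might look roughly like the sketch below. It assumes "unknown placeholders" covers unknown layer names, takes the known-layer list from the layers block of the example manifest, and validates an already-parsed object, so YAML parsing is out of scope. All names are hypothetical.

```typescript
// Assumption: the known layers are the five declared in the example manifest.
const KNOWN_LAYERS: string[] = ["recent", "summary", "open_set", "schema_typed", "semantic"];

interface ManifestSpec {
  layers: { [layer: string]: { enabled: boolean } };
  budget: { total_tokens: number; min_per_layer: { [layer: string]: number } };
}

// Collects every problem instead of stopping at the first one.
function validateManifest(spec: ManifestSpec): string[] {
  const errors: string[] = [];
  for (const layer of Object.keys(spec.layers)) {
    if (KNOWN_LAYERS.indexOf(layer) === -1) errors.push(`unknown layer "${layer}"`);
  }
  let reserved = 0;
  for (const layer of Object.keys(spec.budget.min_per_layer)) {
    if (!Object.prototype.hasOwnProperty.call(spec.layers, layer)) {
      errors.push(`min_per_layer names undeclared layer "${layer}"`);
    }
    reserved += spec.budget.min_per_layer[layer] ?? 0;
  }
  if (reserved > spec.budget.total_tokens) {
    errors.push(`reserved ${reserved} tokens exceeds total ${spec.budget.total_tokens}`);
  }
  return errors;
}

// Called once at boot: refuse to start on an invalid manifest.
function loadManifestOrThrow(name: string, spec: ManifestSpec): ManifestSpec {
  const errors = validateManifest(spec);
  if (errors.length > 0) {
    throw new Error(`manifest "${name}" rejected: ${errors.join("; ")}`);
  }
  return spec;
}
```

Failing at boot rather than at compose time is the design choice that matters here: a misspelled layer name becomes a deploy failure instead of a silently empty layer.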

The pre-mortem doubled as a feature list. Half the engine’s components exist because of failure modes that would have shipped silently in a less paranoid design. Those components now have a place in the rollout sequence, ordered so the decisions that cannot be undone land in Stage A and everything downstream builds against a contract that already holds.

Rollout in five stages

Stage A locks every Type 1 decision and ships the load-bearing substrate in three days: six tables, composeBlockContext with intent and permission_profile in the signature, CxRI with one connector, operation-typed writes, cascade-repair columns, a reconcile cron skeleton, an eval harness with the LoCoMo subset gating every PR, the mandatory cache, and only the T1 permission tier active.

Stage A alone is the load-bearing commit. Stages B through E layer implementations against its locked signatures:

Stage B: chat and book features migrate.
Stage C: agent runs, deep research, and the FE memory drawer.
Stage D: multimodal layout-aware path, Hebbian co-occurrence expansion, JIT cousin walk.
Stage E: legacy column drops, cross-scope reuse with explicit ACL, T3 strong-approval flow.

The full system totals ~21 working days.

What changes when context is a substrate

Three behaviors flip when context stops being per-feature application logic.

Adding a feature changes shape. It used to be a Jira ticket with subtasks for “decide truncation policy”, “wire up retrieval”, “instrument cost”. It is now one YAML file plus the call site. The conversations with the team got shorter.

Failure modes localize. Token-budget regressions used to surface in whichever feature happened to hit the ceiling first; now they surface in the engine’s metrics with a manifest name attached. The MTTR on context-related incidents dropped because there is now a single diagnosis path.

The cost curve flattens. The compose cache hits at 40%+ steady-state because follow-up queries within thirty seconds are common. Distillation is async and capped at $0.50 per user per day. Reconcile is nightly and uses centroid re-embedding (no LLM round-trip). Per-feature cost ledgers got boring, which is the goal.
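The thirty-second compose cache is simple enough to sketch. This version is synchronous and in-memory to keep it testable, whereas the post's cache is a table and the real compose call is async; the key layout and the injectable clock are assumptions.

```typescript
interface ComposeCacheEntry<T> {
  value: T;
  expiresAt: number; // epoch ms
}

class ComposeCache<T> {
  private readonly entries = new Map<string, ComposeCacheEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;
  hits = 0;
  misses = 0;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number = 30000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  getOrCompute(key: string, compute: () => T): T {
    const t = this.now();
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > t) {
      this.hits += 1;
      return entry.value;
    }
    this.misses += 1;
    const value = compute();
    this.entries.set(key, { value, expiresAt: t + this.ttlMs });
    return value;
  }

  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

// Hypothetical key: the same scope, intent, and query within the TTL is a hit.
function composeCacheKey(scopeBlockUuid: string, intent: string, query: string): string {
  return JSON.stringify([scopeBlockUuid, intent, query]);
}
```

For an async compose call, the same structure works with T set to a Promise, which also collapses concurrent identical requests into one computation.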

The science and the production patterns had already shipped. What was new was deciding that context deserved its own substrate, built with the same engineering discipline as the runtime, the database, and the auth layer.

If you are on your fourth or fifth LLM feature and the third getContext() helper is starting to disagree with the second, the moment is already here. Lock the signatures. Defer the implementations. Treat the manifest like a database schema, not a config file. The substrate pays you back the moment a feature plugs in without asking the engine for permission.