source·shift
← all posts

The simplest survivable form of chat memory

“You said earlier that…” is the sentence that breaks naive context management. The user is 30 turns into a chat with an LLM, asks a follow-up that depends on something from turn 4, and the model has no memory of it because the sliding window dropped it eight turns ago. Every other repair attempt (“we’ll just keep more turns”, “we’ll summarize once at the start”, “we’ll vector-search the history”) fixes the demo and breaks differently in production.

Every chat surface past the model’s context window has to drop something. Code assistants, customer support agents, document chat, multi-agent threads, voice assistants doing days of chitchat: all of them face the same wall, and the cheap answers each fail in a specific named way.

The smallest representation that survives a fifty-to-a-hundred-turn document chat session is two prompts, one Postgres column, and a six-message threshold — and that is what this post is about. The chat surface inside LibWit that hit this hardest is a per-document assistant where readers ask the model about a paper, an arXiv preprint, or a translated book chapter, with many conversations spanning fifty to a hundred turns over a week. Less than 200 lines of code, and the recent literature has been converging on roughly the same shape from multiple directions at once.

What naive history management gets you

Three obvious options, all with the same shape of failure.

ApproachWhat it doesHow it fails
Replay full historyAppend every turn to the promptCost climbs linearly with conversation length; quality starts to drift around the 50-turn mark on most models
Sliding windowKeep last N turns, drop the restThe thing the reader is coming back to is almost always before the window
Static one-shot summarySummarize the whole conversation once and prependStale by turn 5; can’t represent things that happened after the summary
Three side-by-side panels showing how naive chat history strategies fail. Panel 1: REPLAY FULL HISTORY with growing message bubbles and a rising cost arrow. Panel 2: SLIDING WINDOW with five recent bubbles inside a fixed box and two faded bubbles falling off the left. Panel 3: STATIC ONE-SHOT SUMMARY with a frozen summary block at the top and progressively lighter newer messages below it.
Three naive strategies, three named failure shapes — the table tells you, the picture shows you.

We hit both of the second and third failure modes that Cruz’s Adaptive Focus Memory paper (arXiv:2511.12712) names explicitly: “recency-based truncation or static summarization often causes early, high-impact user constraints to drift out of effective context.” The “early high-impact constraint” in our use case was usually the user’s preferred level of detail. They’d say “explain this like I haven’t taken topology in five years” at turn 2, and by turn 30 the assistant would forget and reach for the categorical framing.

What we ship instead

flowchart TD
  TURN([new user question]) --> CTX[fetch rolling_summary,<br/>key_qa_pairs,<br/>recent verbatim turns]
  CTX --> ANSWER[answer with full working context]
  ANSWER --> COUNT{message count > 6<br/>and divisible by 4?}
  COUNT -- no --> INCR[incremental update:<br/>fold new Q/A into rolling_summary,<br/>append to key_qa_pairs]
  COUNT -- yes --> COMPRESS[full compression:<br/>collapse old verbatim turns into<br/>updated rolling_summary]
  INCR --> SAVE[(persist:<br/>rolling_summary + key_qa_pairs)]
  COMPRESS --> SAVE
  SAVE -.next turn.-> CTX

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef store fill:#4a1f7a,stroke:#fff,color:#fff
  class COUNT gate
  class CTX,INCR,COMPRESS alloc
  class TURN,ANSWER serve
  class SAVE store

The working context for any turn has three pieces:

Stacked diagram showing the three layers that feed every chat turn's working context. Top band: ROLLING SUMMARY, a single block representing a 300-word string. Middle band: KEY Q AND A PAIRS, four small paired blocks with keyword tags. Bottom band: RECENT VERBATIM TURNS, a row of message bubbles. A converging arrow from all three bands meets at a CURRENT TURN CONTEXT box at the bottom, with a plus-new-question chip attached.
Three layers feed every turn. The mermaid above shows how each one gets updated; this is what the model receives when the next question arrives.

A concrete turn

Imagine a reader nine turns into a chat about Mixture-of-Experts at Scale (arxiv:2401.04088). Compression has just fired on turn 8 (the > 6 && % 4 === 0 gate). Here is what the three layers literally hold when the tenth question arrives:

rolling_summary (≤300 words):

The reader is working through Mixture-of-Experts at Scale. We have covered the router (noisy top-2 gating with a learnable softmax over expert IDs), expert capacity (C = (T/E) × cap_factor, with cap_factor=1.25), and overflow handling (residual passthrough, no drop). Load-balance loss is auxiliary at α=0.01 of cross-entropy main. Training stability came up — router-z loss prevents logit blow-up; bf16 over fp8 for the router because of dynamic range. The reader has been comparing this to Switch-Transformer’s top-1 router and DeepSeek-MoE’s shared-expert variant.

key_qa_pairs (the four kept after pruning):

[
  { "q": "How is expert capacity computed exactly?",
    "a": "C = (T/E) × cap_factor, with cap_factor=1.25.",
    "tags": ["capacity", "routing", "formula"] },
  { "q": "What happens to tokens that overflow capacity?",
    "a": "Residual passthrough — they bypass the MoE layer; no drop.",
    "tags": ["overflow", "routing"] },
  { "q": "Why bf16 and not fp8 for the router?",
    "a": "Router logits are mean-zero with high dynamic range; bf16's wider exponent kept gradients clean. fp8 was reserved for forward activations.",
    "tags": ["precision", "bf16", "training-stability"] },
  { "q": "What is the load-balance loss weight α?",
    "a": "α = 0.01 of main cross-entropy. Higher α destabilized routing entropy in their ablation.",
    "tags": ["load-balance", "auxiliary-loss"] }
]

Last two verbatim turns (kept whole for short-range coherence):

User: What is the load-balance loss weight α?
Assistant: They use α = 0.01 of the main cross-entropy loss. The paper's
  ablation shows higher α destabilized routing entropy — experts collapsed
  toward uniform usage and the model lost specialisation. Lower α (0.001)
  underweighted balance and one or two experts saturated.

User: Did the auxiliary loss interact with the router-z regularizer?
Assistant: Yes — the two stack additively but with different targets.
  α-loss pushes toward uniform expert usage; z-loss caps logit magnitude.
  Their ablation only tuned them independently, so we do not know if a
  joint sweep would surface better Pareto points.

Turn 10 arrives: “Can the auxiliary loss weight be scheduled rather than constant? Did they try that?”

The model receives the summary + the four pairs + the two verbatim turns + the new question — roughly 2–3K tokens total, regardless of how many turns the conversation has had. The previous nine turns occupy under 350 words of summary plus four compressed pairs; only the last two turns are verbatim.

The new question lands on the load-balance tag in the key pair from turn 8 and on the verbatim turn 9 reference to z-loss interaction. The summary anchors the broader paper context (top-2 routing, expert capacity formula). Enough specificity to answer “did they try scheduling?” without re-reading the previous nine turns.

The incremental path runs on every turn. It folds the new Q/A into the rolling summary, appends a key pair, and saves. Cheap. Sub-second on a fast model.

The compression path runs every fourth turn past the sixth. It collapses old verbatim turns into the rolling summary in a heavier pass, and prunes key_qa_pairs back to the recent N. The threshold is empirical, not deep. We tried > 10 && % 5 === 0 first; the smaller numbers reduced “you said earlier” failures by enough that we left them.

That threshold governs when each path fires, but the real design question is what each path actually does — which comes down to the shape of two short registered prompts.

The two prompts that do the work

Just two registered prompts. Both short.

Update this conversation summary with the new Q/A:

CURRENT SUMMARY:
{{currentSummary}}

NEW QUESTION: {{question}}
NEW ANSWER: {{answer}}

Create an updated summary (under 250 words) that integrates the new
information naturally.

That’s the incremental updater. The compression prompt is similar shape (under 300 words, takes the previous summary plus a batch of new messages instead of one Q/A pair). Both run on a fast cheap model since the work is mechanical.

The two-prompt design is deliberate. A single “always summarize the entire conversation” prompt would either be too expensive per turn or too lossy. Splitting the work into a cheap incremental updater plus a rarer heavier compressor means most turns pay sub-cent cost, and the occasional heavier pass amortizes across four turns.

That cost profile is low enough that the same two-prompt pattern is worth reaching for across every long-lived chat surface that hits the same context-window wall — and there are more of those surfaces than the document-chat case makes obvious.

Where this problem shows up

The shape repeats anywhere a long-lived agent interacts with a user past the model’s window. Familiar shows-up-everywhere table:

SurfaceThe conversationA framework facing the same shape
Code assistant chatMulti-day debugging session with the same repoCursor, GitHub Copilot Chat, Claude Code, Cody
Customer support agentTriage across multiple sessions with one userIntercom Fin, Zendesk AI, in-house support copilots
Personal voice assistantMonths of recurring chitchat plus task requestsLetta (formerly MemGPT) is built around exactly this
Document chatReader asks 30 questions about a 90-page paperOur case; also Anthropic’s projects feature, NotebookLM
Multi-agent coordinationLong-running agent-to-agent threadsLangGraph supervisor patterns; CrewAI hierarchical chats

The frameworks that handle this best (Letta, NotebookLM) ship something close to rolling summaries by default. Most ship the naive approaches and let the user feel the failure mode first.

The academic literature has been arriving at the rolling-summary correction from multiple independent directions — and the convergence is recent enough that most frameworks haven’t caught up yet.

Where the literature is converging

Four papers from the last year that landed on something close to our shape:

What we ship is a stripped-down version of MemGPT’s pattern (2310.08560): one rolling string instead of the main-memory / archival-memory split, because most of our conversations are bounded by a single document and don’t need archival lookup. The MemGPT paper is the older ancestor — it added recursive summarization to LLM agents and named the architecture.

Why three layers — the math underneath

The three-layer split is not arbitrary. It is a tractable approximation of the information bottleneck (Tishby, Pereira, Bialek 1999): given a bounded memory budget BB, find the compression MM of the full conversation history HtH_t that preserves the bits most predictive of the next ideal answer AA^{*}.

Mt  =  argminM:MB[I(M;Ht)    βI(M;A)]M_t^{*} \;=\; \arg\min_{M\,:\,|M|\le B}\Big[\,I(M;\,H_t) \;-\; \beta \cdot I(M;\,A^{*})\,\Big]

Read in plain English: keep the bits of history most predictive of the right next answer; throw away the rest. The number β\beta is just a knob — turn it down to compress more aggressively (fewer tokens kept, more lossy), turn it up to preserve more of the original information (more tokens, less lossy). Picking a value of β\beta picks where on the trade-off you land. Solving this exactly requires knowing AA^{*} in advance, which we never do. The escape is to notice that queries arrive in modes, and each mode has a different rate–distortion curve:

The figure below plots the three rate-distortion curves and their lower envelope. Picking any single codec leaves one query mode badly served. The three-layer mixture takes the minimum of all three, so each region of look-back distance is dominated by the codec built for it.

Two-panel figure on cream paper with deep-ink axes. Left panel: rate-distortion curves per layer plotted against look-back distance d (turns ago). The verbatim curve (blue) stays near zero distortion for d ≤ 3 then jumps to 1.0. The rolling-summary curve (sage green) is a smooth log-degradation from about 0.2 to 0.45. The key-pairs curve (peach) stays near 0.1 for d ≤ 6 then ramps to 0.6 by d=12. A dashed red curve traces the lower envelope of all three. Shaded background bands mark the verbatim region (d ≤ 3), key-pair region (3 < d ≤ 9), and summary region (d > 9). Right panel: bar chart of the empirical reference-distance distribution Pr[d] proportional to exp(-lambda d) with lambda ≈ 0.30, with the bars coloured by the layer that covers them — blue for d ≤ 3 (70% of mass), peach for 3 < d ≤ 9 (25% of mass), sage for d > 9 (5% of mass). A dotted black line overlays the exponential curve.
Left: each layer's rate-distortion curve over look-back distance; the dashed envelope is the mixture's distortion (the minimum of all three). Picking any single layer leaves one query mode underserved. Right: empirical reference-distance distribution. Verbatim covers ~70% of references with a 2–3 turn window; key pairs cover the next ~25% past that window; the summary handles the long tail (~5%). Each layer's token budget is proportional to the mass it covers.

A few load-bearing pieces of math fall out of this picture:

Our actual token budget — about 600 tokens of verbatim, 1200 of key pairs, 300 of summary — lands close to the optimum because at that split, spending one more token on any single layer would give roughly the same improvement as spending it on either of the other two. None of the layers is starved; none is gluttonous. That balance is why the split stays stable across very different conversations.

Two recent papers push back on this whole framing — Beyond Static Summarization (2601.04463) and From Lossy to Verified (2602.17913) both argue the same thing: we cannot really know which bits of history to keep until the next question arrives, and the rolling summary commits to a compression too early. They are right. The three-layer mixture is not a proof that this is optimal — it is the cheapest workaround that empirically holds up against the question distribution we see.

Two caveats worth taking seriously

Two papers raise objections to the rolling-summary genre that we paid attention to:

The right move with either caveat is to ship the simple version first, measure where it loses, and add the next layer only when you have data that says the loss is happening.

When this is the right tool

If your chat surface has these properties, the simple rolling summary is probably enough:

If any of those flip, you need more machinery. Letta-style tiered memory if conversations are unbounded; proactive typed extraction if losing one fact is unacceptable; vector search over the conversation history if cross-session lookup matters.

For us, none of those flipped, so the simple version is what’s shipping. Two prompts, one column, a six-message threshold. The literature took two years to converge on this shape; the implementation took an afternoon once we stopped trying to be clever.

Comments

Sign in with GitHub to leave a comment. Threads live on SourceShift/blog-comments — moderated.