The simplest survivable form of chat memory
“You said earlier that…” is the sentence that breaks naive context management. The user is 30 turns into a chat with an LLM, asks a follow-up that depends on something from turn 4, and the model has no memory of it because the sliding window dropped it eight turns ago. Every other repair attempt (“we’ll just keep more turns”, “we’ll summarize once at the start”, “we’ll vector-search the history”) fixes the demo and breaks differently in production.
Every chat surface past the model’s context window has to drop something. Code assistants, customer support agents, document chat, multi-agent threads, voice assistants doing days of chitchat: all of them face the same wall, and the cheap answers each fail in a specific named way.
The smallest representation that survives a fifty-to-a-hundred-turn document chat session is two prompts, one Postgres column, and a six-message threshold — and that is what this post is about. The chat surface inside LibWit that hit this hardest is a per-document assistant where readers ask the model about a paper, an arXiv preprint, or a translated book chapter, with many conversations spanning fifty to a hundred turns over a week. Less than 200 lines of code, and the recent literature has been converging on roughly the same shape from multiple directions at once.
What naive history management gets you
Three obvious options, all with the same shape of failure.
| Approach | What it does | How it fails |
|---|---|---|
| Replay full history | Append every turn to the prompt | Cost climbs linearly with conversation length; quality starts to drift around the 50-turn mark on most models |
| Sliding window | Keep last N turns, drop the rest | The thing the reader is coming back to is almost always before the window |
| Static one-shot summary | Summarize the whole conversation once and prepend | Stale by turn 5; can’t represent things that happened after the summary |
We hit both of the second and third failure modes that Cruz’s Adaptive Focus Memory paper (arXiv:2511.12712) names explicitly: “recency-based truncation or static summarization often causes early, high-impact user constraints to drift out of effective context.” The “early high-impact constraint” in our use case was usually the user’s preferred level of detail. They’d say “explain this like I haven’t taken topology in five years” at turn 2, and by turn 30 the assistant would forget and reach for the categorical framing.
What we ship instead
flowchart TD
TURN([new user question]) --> CTX[fetch rolling_summary,<br/>key_qa_pairs,<br/>recent verbatim turns]
CTX --> ANSWER[answer with full working context]
ANSWER --> COUNT{message count > 6<br/>and divisible by 4?}
COUNT -- no --> INCR[incremental update:<br/>fold new Q/A into rolling_summary,<br/>append to key_qa_pairs]
COUNT -- yes --> COMPRESS[full compression:<br/>collapse old verbatim turns into<br/>updated rolling_summary]
INCR --> SAVE[(persist:<br/>rolling_summary + key_qa_pairs)]
COMPRESS --> SAVE
SAVE -.next turn.-> CTX
classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
classDef store fill:#4a1f7a,stroke:#fff,color:#fff
class COUNT gate
class CTX,INCR,COMPRESS alloc
class TURN,ANSWER serve
class SAVE store
The working context for any turn has three pieces:
rolling_summary: a single ≤300-word string that captures what the conversation has been about so farkey_qa_pairs: the last few question/answer pairs in compressed form, with relevance keywords stamped on them- A handful of verbatim recent turns for short-range coherence
A concrete turn
Imagine a reader nine turns into a chat about Mixture-of-Experts at Scale (arxiv:2401.04088). Compression has just fired on turn 8 (the > 6 && % 4 === 0 gate). Here is what the three layers literally hold when the tenth question arrives:
rolling_summary (≤300 words):
The reader is working through Mixture-of-Experts at Scale. We have covered the router (noisy top-2 gating with a learnable softmax over expert IDs), expert capacity (
C = (T/E) × cap_factor, with cap_factor=1.25), and overflow handling (residual passthrough, no drop). Load-balance loss is auxiliary at α=0.01 of cross-entropy main. Training stability came up — router-z loss prevents logit blow-up; bf16 over fp8 for the router because of dynamic range. The reader has been comparing this to Switch-Transformer’s top-1 router and DeepSeek-MoE’s shared-expert variant.
key_qa_pairs (the four kept after pruning):
[
{ "q": "How is expert capacity computed exactly?",
"a": "C = (T/E) × cap_factor, with cap_factor=1.25.",
"tags": ["capacity", "routing", "formula"] },
{ "q": "What happens to tokens that overflow capacity?",
"a": "Residual passthrough — they bypass the MoE layer; no drop.",
"tags": ["overflow", "routing"] },
{ "q": "Why bf16 and not fp8 for the router?",
"a": "Router logits are mean-zero with high dynamic range; bf16's wider exponent kept gradients clean. fp8 was reserved for forward activations.",
"tags": ["precision", "bf16", "training-stability"] },
{ "q": "What is the load-balance loss weight α?",
"a": "α = 0.01 of main cross-entropy. Higher α destabilized routing entropy in their ablation.",
"tags": ["load-balance", "auxiliary-loss"] }
]
Last two verbatim turns (kept whole for short-range coherence):
User: What is the load-balance loss weight α?
Assistant: They use α = 0.01 of the main cross-entropy loss. The paper's
ablation shows higher α destabilized routing entropy — experts collapsed
toward uniform usage and the model lost specialisation. Lower α (0.001)
underweighted balance and one or two experts saturated.
User: Did the auxiliary loss interact with the router-z regularizer?
Assistant: Yes — the two stack additively but with different targets.
α-loss pushes toward uniform expert usage; z-loss caps logit magnitude.
Their ablation only tuned them independently, so we do not know if a
joint sweep would surface better Pareto points.
Turn 10 arrives: “Can the auxiliary loss weight be scheduled rather than constant? Did they try that?”
The model receives the summary + the four pairs + the two verbatim turns + the new question — roughly 2–3K tokens total, regardless of how many turns the conversation has had. The previous nine turns occupy under 350 words of summary plus four compressed pairs; only the last two turns are verbatim.
The new question lands on the load-balance tag in the key pair from turn 8 and on the verbatim turn 9 reference to z-loss interaction. The summary anchors the broader paper context (top-2 routing, expert capacity formula). Enough specificity to answer “did they try scheduling?” without re-reading the previous nine turns.
The incremental path runs on every turn. It folds the new Q/A into the rolling summary, appends a key pair, and saves. Cheap. Sub-second on a fast model.
The compression path runs every fourth turn past the sixth. It collapses old verbatim turns into the rolling summary in a heavier pass, and prunes key_qa_pairs back to the recent N. The threshold is empirical, not deep. We tried > 10 && % 5 === 0 first; the smaller numbers reduced “you said earlier” failures by enough that we left them.
That threshold governs when each path fires, but the real design question is what each path actually does — which comes down to the shape of two short registered prompts.
The two prompts that do the work
Just two registered prompts. Both short.
Update this conversation summary with the new Q/A:
CURRENT SUMMARY:
{{currentSummary}}
NEW QUESTION: {{question}}
NEW ANSWER: {{answer}}
Create an updated summary (under 250 words) that integrates the new
information naturally.
That’s the incremental updater. The compression prompt is similar shape (under 300 words, takes the previous summary plus a batch of new messages instead of one Q/A pair). Both run on a fast cheap model since the work is mechanical.
The two-prompt design is deliberate. A single “always summarize the entire conversation” prompt would either be too expensive per turn or too lossy. Splitting the work into a cheap incremental updater plus a rarer heavier compressor means most turns pay sub-cent cost, and the occasional heavier pass amortizes across four turns.
That cost profile is low enough that the same two-prompt pattern is worth reaching for across every long-lived chat surface that hits the same context-window wall — and there are more of those surfaces than the document-chat case makes obvious.
Where this problem shows up
The shape repeats anywhere a long-lived agent interacts with a user past the model’s window. Familiar shows-up-everywhere table:
| Surface | The conversation | A framework facing the same shape |
|---|---|---|
| Code assistant chat | Multi-day debugging session with the same repo | Cursor, GitHub Copilot Chat, Claude Code, Cody |
| Customer support agent | Triage across multiple sessions with one user | Intercom Fin, Zendesk AI, in-house support copilots |
| Personal voice assistant | Months of recurring chitchat plus task requests | Letta (formerly MemGPT) is built around exactly this |
| Document chat | Reader asks 30 questions about a 90-page paper | Our case; also Anthropic’s projects feature, NotebookLM |
| Multi-agent coordination | Long-running agent-to-agent threads | LangGraph supervisor patterns; CrewAI hierarchical chats |
The frameworks that handle this best (Letta, NotebookLM) ship something close to rolling summaries by default. Most ship the naive approaches and let the user feel the failure mode first.
The academic literature has been arriving at the rolling-summary correction from multiple independent directions — and the convergence is recent enough that most frameworks haven’t caught up yet.
Where the literature is converging
Four papers from the last year that landed on something close to our shape:
- HEMA, Hippocampus-Inspired Extended Memory (2504.16754) calls its rolling string “Compact Memory” and describes it as “a continuously updated one-sentence summary preserving global narrative coherence.” Their version sits next to a richer episodic store; ours is just the compact memory plus the recent verbatim window.
- The Root Theorem of Context Engineering (2604.20874) formalizes the constraint we’re navigating as “maximize signal-to-token ratio within bounded, lossy channels.” Names the trade we’re making explicit: every compression is lossy; the design question is which losses you can afford.
- MT-OSC (2604.08782) documents the multi-turn degradation curve and shows that appending full chat history hits diminishing returns surprisingly early. Replay isn’t the safe default it feels like.
- Adaptive Focus Memory (2511.12712) names the failure mode the rolling summary is built to prevent: high-impact early constraints drifting out of effective context.
What we ship is a stripped-down version of MemGPT’s pattern (2310.08560): one rolling string instead of the main-memory / archival-memory split, because most of our conversations are bounded by a single document and don’t need archival lookup. The MemGPT paper is the older ancestor — it added recursive summarization to LLM agents and named the architecture.
Why three layers — the math underneath
The three-layer split is not arbitrary. It is a tractable approximation of the information bottleneck (Tishby, Pereira, Bialek 1999): given a bounded memory budget , find the compression of the full conversation history that preserves the bits most predictive of the next ideal answer .
Read in plain English: keep the bits of history most predictive of the right next answer; throw away the rest. The number is just a knob — turn it down to compress more aggressively (fewer tokens kept, more lossy), turn it up to preserve more of the original information (more tokens, less lossy). Picking a value of picks where on the trade-off you land. Solving this exactly requires knowing in advance, which we never do. The escape is to notice that queries arrive in modes, and each mode has a different rate–distortion curve:
- Pronoun resolution and short-range coherence. “What does that mean?” or “why is it slow?” — pronouns like that and it only resolve if the last 2–3 turns are still around word-for-word. Even one paraphrase breaks the reference. So the answer-quality stays nearly perfect inside that 3-turn window and falls off a cliff outside it. Only verbatim turns survive this kind of question.
- Specific-fact recovery. “What was the file-size limit again?” — either the system kept the exact
50 MBor it didn’t. There’s no middle ground; rephrasing it as “a strict cap” does not answer the question. Key pairs handle this because each one is a tagged exact-fact pin: find the right tag, get the precise number back. Without the right tag, the fact is just gone. - Gist, topic, continuity. “Where are we so far?” or simply the model needing to stay on-topic — what matters is the overall trajectory, not exact quotes. The rolling summary handles this band: a refold every turn keeps the gist current, even when specific numbers got smoothed away. Quality degrades slowly as the conversation grows, not catastrophically.
The figure below plots the three rate-distortion curves and their lower envelope. Picking any single codec leaves one query mode badly served. The three-layer mixture takes the minimum of all three, so each region of look-back distance is dominated by the codec built for it.
A few load-bearing pieces of math fall out of this picture:
- Why the verbatim cap is small (~3 turns). Most follow-ups reference something very recent. On real document-chat sessions, about 70% of “you said earlier” questions hit one of the last 3 turns. A fourth verbatim turn back adds only ~10% extra coverage while doubling the per-call verbatim token cost. Diminishing returns kick in fast — so 3 is the natural stopping point.
- Why “always keep the most recent N” is good enough. You don’t need a clever strategy for which turns to keep — just always keep the latest N. A classic result on coverage problems (Krause & Golovin 2014) shows this simple rule gets you at least about 63% of the value of the best possible selection. Good enough — no need to think harder about it.
- Why six key pairs (and not ten or three). Once you’re past the 3-turn verbatim window, the next-most-likely references sit in turns 4–9 ago — about 25% of all references. Six tagged pairs at ~200 bytes each fit that band in roughly 1.2 KB. Tags let the system find a specific fact without scanning the whole chat. Cheaper than re-injecting old verbatim turns, more precise than the prose summary.
- Why the summary handles everything past that. The remaining ~5% of references can be from anywhere — turn 12, turn 47, turn 80. You can’t keep verbatim for all of those, and you can’t grow the key-pairs array without bound. One 300-word rolling string pays a fixed cost no matter how long the chat gets — the only codec that scales to unlimited time horizon.
Our actual token budget — about 600 tokens of verbatim, 1200 of key pairs, 300 of summary — lands close to the optimum because at that split, spending one more token on any single layer would give roughly the same improvement as spending it on either of the other two. None of the layers is starved; none is gluttonous. That balance is why the split stays stable across very different conversations.
Two recent papers push back on this whole framing — Beyond Static Summarization (2601.04463) and From Lossy to Verified (2602.17913) both argue the same thing: we cannot really know which bits of history to keep until the next question arrives, and the rolling summary commits to a compression too early. They are right. The three-layer mixture is not a proof that this is optimal — it is the cheapest workaround that empirically holds up against the question distribution we see.
Two caveats worth taking seriously
Two papers raise objections to the rolling-summary genre that we paid attention to:
- Beyond Static Summarization (2601.04463) argues that any “ahead-of-time” summarization acts as a bottleneck: you compress before you know what the next question will hinge on. The paper proposes proactive extraction (don’t write summaries; extract typed facts at write time) as the alternative. We don’t do this; for short-document chat sessions, the rolling summary’s recall has been good enough that the engineering cost of typed extraction wasn’t worth it. If our conversations were ten times longer, or the answers needed cross-conversation lookups, this paper would push us toward the next iteration.
- From Lossy to Verified (2602.17913) names the “write-before-query barrier” more sharply: a rolling summary can lose a decisive constraint (the paper’s example: an allergy) that a later query depends on, and provide no provenance trail to recover it. Our
key_qa_pairsfield is a partial mitigation (recent N pairs preserved verbatim, with relevance keywords), but only partial. For chat surfaces where missing one specific fact is unacceptable, the rolling summary alone is the wrong primitive.
The right move with either caveat is to ship the simple version first, measure where it loses, and add the next layer only when you have data that says the loss is happening.
When this is the right tool
If your chat surface has these properties, the simple rolling summary is probably enough:
- Conversations are bounded by a clear scope (one document, one project, one session) rather than open-ended across user history
- The cost of forgetting one specific fact is “user has to re-ask” rather than safety, legal, or medical
- Conversation length distribution has a long tail but most sessions are under 50 turns
If any of those flip, you need more machinery. Letta-style tiered memory if conversations are unbounded; proactive typed extraction if losing one fact is unacceptable; vector search over the conversation history if cross-session lookup matters.
For us, none of those flipped, so the simple version is what’s shipping. Two prompts, one column, a six-message threshold. The literature took two years to converge on this shape; the implementation took an afternoon once we stopped trying to be clever.
Comments
Sign in with GitHub to leave a comment. Threads live on SourceShift/blog-comments — moderated.